chasewnelson / SNPGenie

Program for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data
GNU General Public License v3.0
102 stars 37 forks source link

Input error gtf #13

Closed HirokiTomida closed 5 years ago

HirokiTomida commented 5 years ago

Hi. Thank you so much for developing scripts for Biologist!.

I have issue with gtf file input. I am trying to calculate dN/dS ratio between particular strain of Japanese rice fish with samples which were gathered in wild. However I got error message several times and have no idea to fix the problem. (I am totally new to bioinformatics)

I obtained gtf file from here (Ensembl) And genome fasta file from same ensembl page. For VCF file, I got the file from [here](url https://www.ebi.ac.uk/ena/data/view/PRJEB5473)

and I retrieve only CDS data from that gtf file by following command. `awk 'BEGIN{FS="\t"}$3 ~ /CDS/{print $0}' Medaka.gtf

Then, I conducted following command and got error message for 5 or 6 times. `Perl snpgenie.pl --vcfformat=4 --snpreport=.vcf --fastafile=.fa --gtffile=.gtf

I eliminated eliminated the error genes by my hand through searching it to confirm whether it is fixed or not, But, it did not fix. Do you have any idea to fix this problem?

WARNING: The CDS coordinates for gene ENSORLG00000003078 in the gtf file do not yield a set of complete codons,

or are absent from the file. The number of nucleotides must be a multiple of 3.

SNPGenie terminated.

Sincerely Hiroki Tomida

singing-scientist commented 5 years ago

Thanks very much for using SNPGenie, @HirokiTomida ! Apologies for the slight delay. This warning means that the number of sites in the gene do not add up to a multiple of 3, i.e., (LENGTH % 3) != 0. You will have to remove this gene from your GTF file.

If you remove this gene and the problem persists, please let me know. If you need me to look at a particular file, please send a link to the exact file.

Thank you!

HirokiTomida commented 5 years ago

Thank you so much for you kind response! I am still not sure how to solve this input file error. May I ask your help? I obtained gtf file from here ftp://ftp.ensembl.org/pub/release-95/gtf/oryzias_latipes/

I tried to eliminate the CDS with not multiple of 3. By following command. awk '($5 - $4) % 3 == 0' new_Oryzias_latipes.ASM223467v1.95.chr.gtf > new_corrected_Oryzias_latipes.ASM223467v1.95.chr.gtf

And I still got an error message as you can see below. `gunzip Oryzias_latipes.ASM223467v1.95.chr.gtf.gz

awk 'BEGIN{FS="\t"}$3 ~ /CDS/{print $0}' Oryzias_latipes.ASM223467v1.95.chr.gtf > new_Oryzias_latipes.ASM223467v1.95.chr.gtf--

awk '($5 - $4) % 3 == 0' new_Oryzias_latipes.ASM223467v1.95.chr.gtf > new_corrected_Oryzias_latipes.ASM223467v1.95.chr.gtf

Perl snpgenie.pl --vcfformat=4 --snpreport=.vcf --fastafile=.fa --gtffile=.gtf`

WARNING: The CDS coordinates for gene ENSORLG00000014201 in the gtf file do not yield a set of complete codons,

or are absent from the file. The number of nucleotides must be a multiple of 3.

SNPGenie terminated.

I am not sure what is the problem behind this error message. Maybe, I commit mistake, if so, I am so sorry for bothering you.

Sincerely Hiroki Tomida

singing-scientist commented 5 years ago

Most genes have more than one exon. Using awk, you have only excluded individual exons which are not a multiple of 3. However, it is very common that individual exons are not multiples of 3, but the sum of all exons (i.e., the CDS length sum) is a perfect multiple of 3. What is needed is to only exclude whole genes (CDS) which are not a multiple of 3. This means you will first need to use the original GTF file, then if there are any warnings, manually delete the problematic genes (i.e., every line for the problematic gene).

For example, for the first error you sent, you might generate a new GTF using grep by removing all lines for the problematic gene:

cat original.gtf | grep -v ENSORLG00000003078 > modified.gtf

Does this make sense?

HirokiTomida commented 5 years ago

Thank you again!. @cwnelson88 Yes, It make sense and now, I realized that I committed mistake there by removing all single CDS with not multiple of 3. I suppose that I have to remove all whole gene line with not multiple of 3 manually. and I did it for over 30times and the error message still pop up....

I understand this is not your script's problem. But may I borrow your wisdom? Is there any quick way to remove all problematic gene line which is not multiple of 3?

I really need to run SNPGenie for my research. I feel so sorry for bothering you. Thank you so much for your kind response.

Sincerely Hiroki Tomida

singing-scientist commented 5 years ago

No problem, @HirokiTomida . Please send me one GTF file, and I will inspect it to see if there is an obvious problem or solution. It does seem strange to have so many genes without complete codons.

HirokiTomida commented 5 years ago

@cwnelson88 Thank you so much! It is very helpful!!! yeah it seems strange that this error happen multiple times. I hope this gtf file is appropriate. Here is link of my gtf file. Oryzias_latipes.ASM223467v1.95.gtf.gz

Sincerely Hiroki Tomida

singing-scientist commented 5 years ago

No problem, @HirokiTomida . I have looked at the GTF, and I believe your problem is due to having multiple transcripts for the same gene. Unfortunately, as described in the documentation, SNPGenie groups by CDS gene_id, and so it does not allow multiple transcripts for the same gene. (Apologies for this limitation — this was originally written for viruses, and I have not had time to make a better version yet.) You have two options: (1) filter your GTF to keep only one transcript_id for each unique gene_id; or (2) re-name the gene_id values as the transcript_id values, which could be accomplished by deleting everything before the transcript_id, and then replacing transcript_id with gene_id. That way, each gene/transcript will have a unique gene_id. Let me know what you think.

HirokiTomida commented 5 years ago

@cwnelson88 Thank you so much for prompt reply! Oh. I see... I am so sorry for not reading your documentation carefully. I would like to take (1) option. Is there any useful command for such file trimming?

I am really grateful for spending time for my research. Again, Thank you so much.

singing-scientist commented 5 years ago

@HirokiTomida no problem. Unfortunately, although a simple solution may exist (e.g., in R or bioawk), I do not know a simple way to do this. I would write a script that stores each {gene_id}->{transcript_id} and then determines which transcript_id is the one you want to keep for each gene_id. For example, you might decide to keep the longest transcript for each gene. You could then loop the GTF file again and keep (print) only the lines containing a 'keeper' transcript_id. Does this make sense?

HirokiTomida commented 5 years ago

@cwnelson88 Yes it make sense. I would like to keep longest transcript for each genes. Perhaps, I will learn how to manipulate gtf file manipulation in few months later by myself. But, now I want to run SNPgenie script at this time for test run.

singing-scientist commented 5 years ago

@HirokiTomida in that case, I suggest approach (2), which can be accomplished quickly using Find & Replace in a text editor. (Delete everything before 'transcript_id', then replace 'transcript_id' with 'gene_id'.)

HirokiTomida commented 5 years ago

@cwnelson88 OK I see. I will try it. Thank you so much for everything!

singing-scientist commented 5 years ago

No problem, @HirokiTomida . Let me know if you need anything else.

singing-scientist commented 5 years ago

I will close the thread now. If you have any future issues, please feel free to contact again.