chasewnelson / SNPGenie

Program for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data
GNU General Public License v3.0

error: gtf does not contain any sense (+) strand products. SNPGenie terminated. #23

Closed katarinabraun closed 4 years ago

katarinabraun commented 4 years ago

Hi, Chase.

I have been trying to work with SNPGenie and can't figure out how to troubleshoot this error: "CDS_EXAMPLE.gtf does not contain any sense (+) strand products. SNPGenie terminated."

I am trying to run snpgenie.pl using the following command:
perl snpgenie.pl --minfreq=0.01 --vcfformat=4 --snpreport=A_GUANGDONG_17SF003_2016_H7.vcf --fastafile=A_GUANGDONG_17SF003_2016_H7.fasta --gtffile=A_GUANGDONG_17SF003_2016_H7.gtf --slidingwindow=50

I assume this is a GTF formatting error, but I can't identify it. Additionally, I tried running snpgenie.pl using your example files and got the same error.

I am attaching my GTF, fasta, and VCF for your reference.

Any help would be greatly appreciated! Thanks in advance.

files.zip

singing-scientist commented 4 years ago

Greetings, Katarina @katarinabraun ! Thanks a lot for using SNPGenie, and sorry for the trouble.

How odd; I ran your example on my machine and it worked: see SNPGenie_Results.zip.

This suggests the problem is OS-related. The most likely culprit is Windows vs. Mac vs. Linux line endings. SNPGenie does its best, but it can get confused when the line endings aren't Unix (\n), as often happens when saving from Microsoft Excel. See the SNPGenie Troubleshooting for how to convert line endings to Unix. If that doesn't work, please write back and we'll take it from there.
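For anyone wanting to rule out line endings quickly, here is a minimal Python sketch. It is not part of SNPGenie; the function names count_line_endings and to_unix_line_endings are hypothetical, and the conversion overwrites the file in place, so work on a copy.

```python
from pathlib import Path


def count_line_endings(path):
    # Count Windows (CRLF), classic Mac (lone CR), and Unix (lone LF) endings.
    data = Path(path).read_bytes()
    crlf = data.count(b"\r\n")
    cr = data.count(b"\r") - crlf  # carriage returns not followed by a line feed
    lf = data.count(b"\n") - crlf  # line feeds not preceded by a carriage return
    return {"crlf": crlf, "cr": cr, "lf": lf}


def to_unix_line_endings(path):
    # Rewrite the file in place so every line ends with a single LF.
    target = Path(path)
    data = target.read_bytes()
    data = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    target.write_bytes(data)
```

If count_line_endings reports zero for both "crlf" and "cr", the file already uses Unix line endings and the cause lies elsewhere.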

Note that none of the SNPs in the VCF actually fall within the length of the FASTA sequence, so no variation is reported.
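A rough way to check this yourself is sketched below in Python. It assumes a FASTA file holding a single sequence and a tab-delimited VCF whose second column is the 1-based position; the function names fasta_length and positions_outside_sequence are hypothetical and not part of SNPGenie.

```python
def fasta_length(path):
    # Sum the lengths of all non-header lines (assumes one sequence per file).
    total = 0
    with open(path) as handle:
        for line in handle:
            if not line.startswith(">"):
                total += len(line.strip())
    return total


def positions_outside_sequence(vcf_path, seq_length):
    # Collect VCF positions (column 2, 1-based) that fall outside 1..seq_length.
    outside = []
    with open(vcf_path) as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip header and blank lines
            pos = int(line.split("\t")[1])
            if pos < 1 or pos > seq_length:
                outside.append(pos)
    return outside
```

An empty list from positions_outside_sequence means every record lies within the reference sequence.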

katarinabraun commented 4 years ago

@cwnelson88 thanks for your speedy reply!

I am working on a Mac (OS = Mojave) and the GTF was saved from the text editor VS Code, never Excel. I confirmed that my line endings are Unix (\n) and not (\r) and am unfortunately still having trouble getting this to run. I am encouraged that the files ran on your machine, but am a bit confused why no variation was reported -- the SNPs in the VCF should fall within the FASTA sequence. I'll keep troubleshooting this, but would be grateful to know if you have any other thoughts/suggestions.

Thanks again for your help!

singing-scientist commented 4 years ago

@katarinabraun thanks so much for bearing with me as I track down the source of this issue.

It turns out that, for previous versions of this VCF format I've encountered, the AD column must contain FIRST the allele depth (read count) of the reference nucleotide, THEN the allele depth of any variant nucleotides. Thus, for sites with one SNP, the AD field for the sample should hold two comma-separated values (reference depth, then variant depth); for two SNPs, three values (reference depth, then one depth per variant); and so on.
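The AD layout described above can be checked per record with a short Python sketch. It assumes a tab-delimited VCF data line with a FORMAT column followed by at least one sample column, and it inspects only the first sample. The function name ad_field_problem is hypothetical and not part of SNPGenie.

```python
def ad_field_problem(vcf_line):
    # Return a description of the AD problem, or None if the field looks fine.
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 10:
        return "record has no FORMAT and sample columns"
    alt_alleles = fields[4].split(",")      # ALT column, comma-separated variants
    format_keys = fields[8].split(":")      # FORMAT column, e.g. GT:AD
    sample_values = fields[9].split(":")    # first sample column
    if "AD" not in format_keys:
        return "FORMAT column has no AD key"
    index = format_keys.index("AD")
    if index >= len(sample_values):
        return "sample column has no value for AD"
    depths = sample_values[index].split(",")
    expected = len(alt_alleles) + 1         # reference depth plus one per variant
    if len(depths) != expected:
        return f"AD has {len(depths)} value(s), expected {expected}"
    return None
```

For example, a record with one variant allele and a sample AD of "10" is flagged, while "90,10" passes.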

Thus, my questions are as follows: (1) was this VCF file output from a standard SNP calling program? If so, I'll want to incorporate this VCF format; (2) is it possible for you to output the format I described (i.e., one extra value in the AD field) instead?

Either way, I'll need to update the documentation to make this clearer. Thanks so much for pointing out this issue!

Let me know...

katarinabraun commented 4 years ago

@cwnelson88 thanks again for your ongoing help --

To address your questions: (1) This VCF originated from Varscan. However, I am sequencing in technical replicate and have a script that combines the replicate files into a single VCF with average SNP frequencies. I was pretty sure the formatting of my VCF met v.4 and SNPGenie criteria. (2) I could easily modify the script to write the extra value to the AD field.

I modified the VCF according to your suggestions and really carefully checked to make sure this VCF meets other version 4 criteria, but I am still getting the same error: ## WARNING: A_GUANGDONG_17SF003_2016_H7.gtf does not contain any sense (+) strand products. SNPGenie terminated. I have attached the modified VCF along with the GTF and fasta below.

Apologies this has been so much trouble!

files.zip

singing-scientist commented 4 years ago

I tried your files and they (seem to) work perfectly with no modifications.

Command: snpgenie.pl --minfreq=0.01 --vcfformat=4 --snpreport=GD3_ferret1_day1_averaged_H7.vcf --fastafile=A_GUANGDONG_17SF003_2016_H7.fasta --gtffile=A_GUANGDONG_17SF003_2016_H7.gtf --slidingwindow=50

Results: SNPGenie_Results.zip

Here are the only possibilities I can think of: (1) You're using an old version of SNPGenie -- when did you download your working version? (2) You're running SNPGenie from a directory other than the one containing the input files. For example, if you simply provide the GTF file as 'A_GUANGDONG_17SF003_2016_H7.gtf' but you're in a different directory, SNPGenie will find nothing to read because the file isn't there at all. (I should really add more informative error messages.)

Let me know if it works with the newest version of SNPGenie, calling it from the folder containing the data! If not, we'll take it from there.
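The working-directory possibility can be ruled out before launching a run with a small Python sketch like the one below. The function name missing_inputs is hypothetical and not part of SNPGenie; it only checks that each file name resolves to an existing file relative to a directory, which defaults to the current working directory.

```python
import os


def missing_inputs(filenames, directory=None):
    # Return the file names that do not exist relative to the given directory.
    base = directory if directory is not None else os.getcwd()
    return [name for name in filenames
            if not os.path.isfile(os.path.join(base, name))]
```

Calling missing_inputs with the names passed to --snpreport, --fastafile, and --gtffile should return an empty list when the command is run from the folder containing the data.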

katarinabraun commented 4 years ago

@cwnelson88 it works!!

I thought I was working with the most recent version of SNPGenie, but apparently I was not. I downloaded a new version and it ran perfectly. I apologize I didn't think of that sooner. It looks like it is still important to format the AD column in the VCF with ref allele depth and then variant allele depth so I'll modify my VCFs accordingly.

Thank you very much for your help working through this -- I really appreciate it.

singing-scientist commented 4 years ago

I'm so glad to hear that! Please let me know if there is anything else I can help with. Until then, I'll close this issue.