chasewnelson / SNPGenie

Program for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data
GNU General Public License v3.0
100 stars 37 forks source link

Coverage warning #60

Closed dinahparker closed 2 years ago

dinahparker commented 2 years ago

Hello @singing-scientist, I'm using SNPgenie on some haploid fungi and I've run it before without much hassle, but on this round I'm getting a few warnings like the following:

## WARNING: In JAGRQK010000001.1.vcf, at site 7424049,
## the coverage (10.000) does not equal the nucleotide sum (18.000).
## WARNING: In JAGRQK010000001.1.vcf, J3458_000061, the codon at site 330108,
## the variant data imply a negative number of T nucleotides: -4. This most often results from
## variants which are assigned to the wrong site in the SNP Report, or multiple copies of the same genes/exons in the GTF file. Results at this site
## are unreliable; T count set to 0; proceed with caution.

I just wanted some clarity about what nucleotide sum refers to and what does it mean by negative numbers of nucleotides? My first thought was that something was wrong with my vcf file or perhaps I've chosen the wrong format, so I'm attaching a subset of my vcf file for reference.

Thanks in advance! -Dinah

subset.vcf.zip

singing-scientist commented 2 years ago

Greetings, @dinahparker — and so sorry I missed this. Nucleotide sum here refers to coverage (depth), which should be equal to the sum of all difference nucleotides (reads) at a site. Unfortunately the site which gives rise to the error is not present in the example you sent. Thus, it's difficult for me to know what to look for given this subset. If you could send the exact command and files used which give rise to the error, I'd be more than happy to take a look!

dinahparker commented 2 years ago

Hi! No worries, thanks for the reply! I'm attaching a folder of my the first scaffold that I ran SNPGenie on, including the results. There are plenty of instances of these errors across the scaffold, so I assume there must be some sort of fundamental problem with my files. This is the code that I used to run it on the plus strand of this scaffold:

/path/to/SNPGenie-master/snpgenie.pl --vcfformat=1 --snpreport=JAGRQK010000001.1.vcf --fastafile=JAGRQK010000001.fasta --gtffile=JAGRQK010000001.1.gtf

And this is the first error of each type that pops up in this scaffold (among many), although all that info is recorded in the log file.

## WARNING: In JAGRQK010000001.1.vcf, J3458_000061, the codon at site 330108,
## the variant data imply a negative number of T nucleotides: -4. This most often results from
## variants which are assigned to the wrong site in the SNP Report, or multiple copies of the same genes/exons in the GTF file. Results at this site
## are unreliable; T count set to 0; proceed with caution.
## WARNING: In JAGRQK010000001.1.vcf, at site 330108,
## the coverage (10.000) does not equal the nucleotide sum (14.000).

Thanks for your help !

JAGRQK010000001.zip

singing-scientist commented 2 years ago

Greetings, @dinahparker ! Sorry again for the delay. This issue appears to be due to the GTF containing multiple overlapping isoforms with the same gene name. SNPGenie (unfortunately) expects a set of disjoint (non-overlapping) sites for each gene_id. Although it does throw this error if that's not true, there is (unfortunately) no way to get around this requirement. (This is one of many things to be made more flexible in the next version.)

For example, in the present data, gene_id J3458_000061 has:

JAGRQK010000001.1 FUN CDS 327823 328112 . + 0 gene_id "J3458_000061"; JAGRQK010000001.1 FUN CDS 329905 331840 . + 1 gene_id "J3458_000061"; JAGRQK010000001.1 FUN CDS 329825 331840 . + 0 gene_id "J3458_000061";

Here, the variants from the VCF at sites 330108, 330295, 330705, 330772, and 331768 fall within BOTH the second (329905-331840) and third (329825-331840) exon listed, because those exons substantially overlap. SNPGenie will then subtract the ALT alleles from the coverage TWICE for each of those sites, leading to conflicting nucleotide numbers and the error.

Thus, I'm afraid the only solution is to choose one and only one non-overlapping isoform for each gene_id in your GTF file. In the present case, that probably means choosing between exon 2 or 3. Always choosing the longest seems like a good rule of thumb, so no data are eliminated.

Please let me know if that makes sense, and sorry for the inconvenient requirement!

Yours, Chase

dinahparker commented 2 years ago

Hi Chase,

Thanks for getting back to me and for clarifying the issue, that completely makes sense. It did seem strange to me that we had so many genes with overlapping CDS windows, but it turns out that most of these are actually different isoforms of gene products, so I think I'll go with your suggestion of choosing the longest isoform to pass onto SNPGenie. Thanks again for the help!

Cheers! -Dinah