Closed dinahparker closed 2 years ago
Greetings, @dinahparker — and so sorry I missed this. Nucleotide sum here refers to coverage (depth), which should be equal to the sum of all difference nucleotides (reads) at a site. Unfortunately the site which gives rise to the error is not present in the example you sent. Thus, it's difficult for me to know what to look for given this subset. If you could send the exact command and files used which give rise to the error, I'd be more than happy to take a look!
Hi! No worries, thanks for the reply! I'm attaching a folder of my the first scaffold that I ran SNPGenie on, including the results. There are plenty of instances of these errors across the scaffold, so I assume there must be some sort of fundamental problem with my files. This is the code that I used to run it on the plus strand of this scaffold:
/path/to/SNPGenie-master/snpgenie.pl --vcfformat=1 --snpreport=JAGRQK010000001.1.vcf --fastafile=JAGRQK010000001.fasta --gtffile=JAGRQK010000001.1.gtf
And this is the first error of each type that pops up in this scaffold (among many), although all that info is recorded in the log file.
## WARNING: In JAGRQK010000001.1.vcf, J3458_000061, the codon at site 330108,
## the variant data imply a negative number of T nucleotides: -4. This most often results from
## variants which are assigned to the wrong site in the SNP Report, or multiple copies of the same genes/exons in the GTF file. Results at this site
## are unreliable; T count set to 0; proceed with caution.
## WARNING: In JAGRQK010000001.1.vcf, at site 330108,
## the coverage (10.000) does not equal the nucleotide sum (14.000).
Thanks for your help !
Greetings, @dinahparker ! Sorry again for the delay. This issue appears to be due to the GTF containing multiple overlapping isoforms with the same gene name. SNPGenie (unfortunately) expects a set of disjoint (non-overlapping) sites for each gene_id. Although it does throw this error if that's not true, there is (unfortunately) no way to get around this requirement. (This is one of many things to be made more flexible in the next version.)
For example, in the present data, gene_id J3458_000061 has:
JAGRQK010000001.1 FUN CDS 327823 328112 . + 0 gene_id "J3458_000061"; JAGRQK010000001.1 FUN CDS 329905 331840 . + 1 gene_id "J3458_000061"; JAGRQK010000001.1 FUN CDS 329825 331840 . + 0 gene_id "J3458_000061";
Here, the variants from the VCF at sites 330108, 330295, 330705, 330772, and 331768 fall within BOTH the second (329905-331840) and third (329825-331840) exon listed, because those exons substantially overlap. SNPGenie will then subtract the ALT alleles from the coverage TWICE for each of those sites, leading to conflicting nucleotide numbers and the error.
Thus, I'm afraid the only solution is to choose one and only one non-overlapping isoform for each gene_id in your GTF file. In the present case, that probably means choosing between exon 2 or 3. Always choosing the longest seems like a good rule of thumb, so no data are eliminated.
Please let me know if that makes sense, and sorry for the inconvenient requirement!
Yours, Chase
Hi Chase,
Thanks for getting back to me and for clarifying the issue, that completely makes sense. It did seem strange to me that we had so many genes with overlapping CDS windows, but it turns out that most of these are actually different isoforms of gene products, so I think I'll go with your suggestion of choosing the longest isoform to pass onto SNPGenie. Thanks again for the help!
Cheers! -Dinah
Hello @singing-scientist, I'm using SNPgenie on some haploid fungi and I've run it before without much hassle, but on this round I'm getting a few warnings like the following:
I just wanted some clarity about what nucleotide sum refers to and what does it mean by negative numbers of nucleotides? My first thought was that something was wrong with my vcf file or perhaps I've chosen the wrong format, so I'm attaching a subset of my vcf file for reference.
Thanks in advance! -Dinah
subset.vcf.zip