chasewnelson / SNPGenie

Program for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data
GNU General Public License v3.0
109 stars 37 forks source link

WARNING: AD must list numbers #42

Open ireneIta opened 3 years ago

ireneIta commented 3 years ago

Hi Chase, thank you for this useful tool! I am having some issues with a set of data. I got a warning, and SNPGenie terminates. I am running SNPGenie with --vcfformat=4; I got the vcf by merging individual vcfs with bcftools. The warning is

WARNING: the line is:

NC_005967.2 4009 . T A,G 687.24 . BaseQRankSum=-0.486;Dels=0;FS=0;HRun=1;HaplotypeScore=1.6229;MQ=13.77;MQ0=7;MQRankSum=4.129;QD=7.29;ReadPosRankSum=-1.301;DP=1156;AF=1,0.5;AN=64;AC=38,21 GT:AD:DP:GQ:PL 1/1:36,25:61:9.09:477,9,0,.,.,.

For vcfformat==4, AD must list numbers of reads for two SNPs as follows: ,,. SNPGenie TERMINATED.

AD is actually a list of numbers (in the warning= 36,25), so I don't get the message... Could you please help me find out what's the problem?

I have another question: in the SNPGenie_parameters.txt file that is generated over the run, it is reported "COMPLEMENT MODE", which is "Yes" when running the software on "fw" data and is "No" when running it on revcom data. However, to my understanding (and according to the results of the few run), SNPGenie does not reverse-complement the data, so the results are only relative to one strand. Am I correct?

thank you very much!! best wishes Irene

singing-scientist commented 3 years ago

Greetings @ireneIta and thanks for using SNPGenie! Sorry for the issue. Indeed the AD entry seems correct, so it is likely an issue with file encoding. I'd be happy to give it a try myself if you can send a reproducible example (reference / gtf / VCF file combo) — here or private email is fine.

It is correct that SNPGenie does not reverse complement the data, and one run of SNPGenie gives results for one strand only — although it is strand-AWARE in the sense that, if there are minus-strand entries in the GTF file, SNPGenie will correctly label those sites as genic and codons from two genes overlapping on opposite strands will still be flagged as such. It's a bit of a clunky fix, and an artifact of having been originally design for viral genomes, but the practical result is that SNPGenie must be run once for each strand with separate input for each strand.

Let me know if that helps!

ireneIta commented 3 years ago

Dear Chase, thank you very much for your quick reply I have attached here one of the sets of data that is giving me the problem. Please let me know if there are issues with the attachment, I'll send them by email. I ran the same analysis on a different set of data, using the same gtf and reference fasta but a different vcf, and it worked perfectly. I guess it could be an issue with the vcf but I cannot find the problem... I really appreciate your help!!!

thank you also for the clarification on the fw/revcom and for confirming that SNPGenie must be run separately for the two strands.

test_SNPGenie.zip

singing-scientist commented 3 years ago

Ah, I see! This appears to be due to the merging of the VCF files — if you look at the "sample" columns at the right-most side of the output, the merging have forced each sample to have a record for each site that is variable in even one sample, so that the majority of records are "blank" in the form of "./.:.:.:.:.". Apparently I did not deal with this case for triallelic sites. If it it possible to run SNPGenie in a different mode on every individual VCF, that would circumvent the problem. Another quicker fix worth trying is to manually FIND/REPLACE, on this line only, as follows:

FIND: ./.:.:.:.:. REPLACE: ./.:.,.:.:.:.

I can try to code for this case, but it will be several days before I can get around to it. Please let me know!

ireneIta commented 3 years ago

thank you for the hints! I have tried the FIND/REPLACE fix, but the warning still shows, unchanged. I'll give it a go with the other option and let you know if it solves the issue. I had a look at the original vcf of the problematic sample: in that sample the locus is monoallelic, whereas in other samples it is biallelic... and this may generate confusion in the merging with bcftools and following "decomposition" by SNPGenie (is this what you meant with triallelic sites)? maybe it's not related, but I am sure we can find it out with the approach you suggested.

I'll keep you posted. thank you again for your great support!

singing-scientist commented 3 years ago

Please forgive my dropping the ball on this! Did you find a solution that works? If you'd like to to give it a go, I did receive your files but please also provide the exact command-line arguments you ran with them. But perhaps it's already been solved!

Best, Chase

ireneIta commented 3 years ago

Dear Chase, I am sorry for the delay, I am still working on it. I ran once the script and, after three days of analysis with no errors... I realized there was a typo in the name of the reference fasta file! I am running it again. It will take some time because of the number of samples and chromosomes. However, the results look ok, so far, so the issue seems to be solved. Thank you for your support!!

best, Irene