chasewnelson / SNPGenie

Program for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data
GNU General Public License v3.0
106 stars 37 forks source link

reverse compliment #29

Closed harpur closed 4 years ago

harpur commented 4 years ago

Hey Chase, this is copied from our email exchange:

I was running into two issues last night and was hoping you could help.

First, I am running from VCFs of individuals sequenced individually (format 1). I have separated out the fasta, gtf, and vcfs into chromosomes. When I run snpgenie on each chromosome separately, it begins by stating ''There are antisense ('-') strand records in the gtf file. COMPLEMENT MODE activated..." it then proceeds and creates output for all genes in the gtf I have provided (- and +). Are the - stranded gene estimates here believable or do I need to also run snpgenie again for the same data reverse-complemented?

Second, if we do need to run the reverse-complemented data, I am struggling a bit with the vcf2revcom.pl pipeline. I'm not entirely sure how I need to trim my data for this script to work. It also takes some time to figure out which sites are throwing errors because the offending sites are reported in the reverse-complement orientation. My first round of this revealed that vcf2revcom.pl doesn't like missing data (./.). Right now, I am getting the following error and I am not sure what the issue is, I'd love any advice:

seq length is 27754200

## WARNING: In NC_037638.1.S.AFAN.vcf site 26819924,
## the reads total should equal the coverage (29) but is instead: 0.
## The reads total has been used. Please verify your data.
Illegal division by zero at vcf2revcom.pl line 998, <ORIGINAL_SNP_REPORT> line 20695.

Here (I think) is the offending SNP (VCF v4.2):

NC_037638.1 934277 . T C 541.33 PASS AF=0.0909091;AN=22 GT:AD:DP:GQ:PGT:PID:PL:PS 0/0:0,0:29:99:.:.:0,138,1298:. 0/0:20,0:20:51:.:.:0,51,765:. 0|1:11,12:23:99:1|0:934263_TA_T:394,0,423:934263 0/0:35,0:35:81:.:.:0,81,1215:. 0/0:22,0:22:60:.:.:0,60,900:. 0/0:29,0:29:60:.:.:0,60,900:. 0/0:33,0:33:90:.:.:0,90,1350:. 0/0:32,0:32:72:.:.:0,72,1080:. 0|1:11,8:19:99:0|1:934277_T_C:105,0,466:934277 0/0:16,0:16:33:.:.:0,33,495:. 0/0:15,0:15:36:.:.:0,36,540:.

Thanks so much for building the software, it's really a great resource.

singing-scientist commented 4 years ago

Greetings, @harpur ! Thanks for using SNPGenie, and apologies for the trouble. I have received the example data. Could you now please provide the exact command-line arguments used, so that I may replicate and diagnose the problem? Many thanks.

harpur commented 4 years ago

Sure thing First the forward direction (which also outputs stats for the negative): snpgenie.pl --minfreq=0.01 --snpreport=NC_037638.1.S.AFAN.vcf --vcfformat=1 --fastafile=_NC_037638.fasta --gtffile=NC_037638.1.gtf --outdir NC_037638

and then the negative direction (which should throw the error above): vcf2revcom.pl _NC_037638.fasta NC_037638.1.gtf NC_037638.1.S.AFAN.vcf

singing-scientist commented 4 years ago

Thanks very much for your patience during this troubling time.

First, I tried the snpgenie command and it appears to me to run correctly — although I quit after an hour since this was obviously an example not meant for my laptop. Keep in mind that, as described in the documentation, there must only be one transcript per unique gene id.

Second, vcf2revcom.pl, please trying using the script available at this (snpgenie) repository, NOT the EBT repository. As described in the documentation, you should be able to run the following, where 27754200 is the length of your sequence:

vcf2revcom.pl NC_037638.1.S.AFAN.vcf 27754200

Please let me know if that works as expected!

I apologize for the "hack-iness" of these solutions, as SNPGenie was originally written for single-stranded RNA viruses, and this is all I've had time to implement so far. I hope to write an entirely new version in Python in the future, but that's some time away!

singing-scientist commented 4 years ago

Also, given a quick pass I was not able to find any antisense (-) product results when running for the sense (+) -- indeed, those should not occur. If you could provide any specific examples of antisense product results that are being reported, I'd be glad to look into it.

harpur commented 4 years ago

Hey Chase, thanks for your response. I pulled the repo and re-ran everything. It worked correctly. I think we must've pulled an older version that was throwing errors at us. I am happy to close this issue. Thanks for your help. Stay safe!

singing-scientist commented 4 years ago

So glad this worked out! Please don't hesitate to raise any subsequent questions and I'll do my best to get back promptly. Wishing you and yours stay safe and well!