Closed karenbobier closed 4 years ago
Greetings, Karen! Thanks for your interest in using SNPGenie.
I regret that I have not yet programmed a version of SNPGenie that uses the 'pooled' (SNP report/VCF-based) approach for estimating the between-group values—and will likely not get to this for some time.
However, I did recently write a Python script (generate_seqs_from_VCF.py) that randomly generates a user-specified number of sequences in a FASTA file, randomly distributing variants specified in a VCF at the appropriate frequencies. Two or more sets of sequences thus generated can then be used as input for the current SNPGenie between-group script.
The script, generate_seqs_from_VCF.py
, requires the Python libraries SeqIO, os, random, re, and sys. It can be run like this:
generate_seqs_from_VCF.py [sequence file].fasta [SNP report].vcf [num sequences]
For the VCF file, the script simply uses the "AF" entry, and it should work as long as AF is present; for example, the following format:
##METADATA ROWS
#CHROM POS ID REF ALT QUAL FILTER INFO
NC_045512.2 643 . C A 148 PASS DP=6102;AF=0.003114;SB=0;DP4=55,6026,0,20
NC_045512.2 689 . T C 170 PASS DP=6604;AF=0.003028;SB=7;DP4=66,6509,1,19
NC_045512.2 716 . A G 173 PASS DP=7833;AF=0.002936;SB=7;DP4=72,7736,1,22
NC_045512.2 769 . T C 203 PASS DP=8319;AF=0.003005;SB=5;DP4=101,8192,1,24
NC_045512.2 778 . T C 167 PASS DP=8091;AF=0.002719;SB=0;DP4=100,7964,0,22
...
For example, for a reference sequence in a file named reference.fasta, a VCF-format SNP report in a file named variants.vcf, and to create 1,000 sequences, one would run the following:
generate_seqs_from_VCF.py reference.fasta variants.vcf 1000
The script and full documentation have been added to Evolutionary Bioinformatics Toolkit. Please let me know if this addresses your request, and if you encounter any problems, as I have not tested this too extensively!
Best, Chase
Hi Chase,
Thanks for the quick reply. Your toolkit has been super useful.
I don't actually have pooled sequences, I have seperate variant data for each individual, currently all in one vcf but I can split it by population or individual. I do have the AF information in my vcfs but would you still recommend this for non-pooled data?
my vcf files look like this:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ARG10 ARG18 ARG19 ARG21 ARG25 ARG27 ARG29 ARI10 ARI102 ARI11 ARI12 ARI4 ARI6
##contig=<ID=J9347376|m.109621,length=300>
J9347376|m.109621 72 . G T 307.47 . AC=4;AF=0.222;AN=18;BaseQRankSum=0.871;ClippingRankSum=0.00;DP=66;ExcessHet=0.8432;FS=6.239;MLEAC=4;MLEAF=0.222;MQ=58.81;MQRankSum=0.967;QD=20.50;ReadPosRankSum=2.41;SOR=1.304 GT:AD:DP:GQ:PL ./.:0,0:0:.:0,0,0 0/0:1,0:1:3:0,3,42 0/0:5,0:5:15:0,15,159 0/1:4,8:12:40:213,0,40 1/1:0,3:3:9:128,9,0 0/0:14,0:14:24:0,24,360 0/0:2,0:2:6:0,6,76 0/1:9,1:10:12:12,0,283 ./.:1,0:1:.:0,0,0 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 0/0:2,0:2:6:0,6,49 0/0:1,0:1:3:0,3,25
J9347376|m.109621 84 . G A 43.39 . AC=1;AF=0.056;AN=18;BaseQRankSum=-9.670e-01;ClippingRankSum=0.00;DP=38;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.056;MQ=60.00;MQRankSum=0.00;QD=14.46;ReadPosRankSum=0.967;SOR=0.223 GT:AD:DP:GQ:PL ./.:0,0:0:.:0,0,0 0/0:1,0:1:3:0,3,42 0/0:3,0:3:9:0,9,115 0/0:8,0:8:24:0,24,342 0/1:1,2:3:36:75,0,36 0/0:9,0:9:27:0,27,362 0/0:2,0:2:6:0,6,76 0/0:9,0:9:15:0,15,225 0/0:1,0:1:3:0,3,38 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 0/0:1,0:1:3:0,3,25 ./.:0,0:0:.:0,0,0
J9347376|m.109621 129 . C A 230.12 . AC=2;AF=0.125;AN=16;DP=31;ExcessHet=0.1703;FS=0.000;MLEAC=2;MLEAF=0.125;MQ=60.00;QD=27.00;SOR=1.329 GT:AD:DP:GQ:PGT:PID:PL ./.:0,0:0:.:.:.:0,0,0 0/0:2,0:2:6:.:.:0,6,87 0/0:2,0:2:3:.:.:0,3,45 1/1:0,6:6:18:1|1:129_C_A:270,18,0 0/0:4,0:4:0:.:.:0,0,90 0/0:7,0:7:15:.:.:0,15,225 0/0:2,0:2:6:.:.:0,6,76 0/0:6,0:6:15:.:.:0,15,225 ./.:0,0:0:.:.:.:0,0,0 ./.:0,0:0:.:.:.:0,0,0 0/0:2,0:2:6:.:.:0,6,72 ./.:0,0:0:.:.:.:0,0,0 ./.:0,0:0:.:.:.:0,0,0
J9347376|m.109621 135 . C G 1377.62 . AC=16;AF=1.00;AN=16;DP=32;ExcessHet=3.0103;FS=0.000;MLEAC=16;MLEAF=1.00;MQ=60.00;QD=30.82;SOR=0.693 GT:AD:DP:GQ:PGT:PID:PL ./.:0,0:0:.:.:.:0,0,0 1/1:0,2:2:6:.:.:87,6,0 1/1:0,2:2:6:.:.:80,6,0 1/1:0,6:6:18:1|1:129_C_A:270,18,0 1/1:0,6:6:21:.:.:295,21,0 1/1:0,7:7:21:.:.:287,21,0 1/1:0,1:1:3:.:.:42,3,0 1/1:0,6:6:18:.:.:253,18,0 ./.:0,0:0:.:.:.:0,0,0 ./.:0,0:0:.:.:.:0,0,0 1/1:0,2:2:6:.:.:87,6,0 ./.:0,0:0:.:.:.:0,0,0 ./.:0,0:0:.:.:.:0,0,0
J9347376|m.109621 144 . G C 121.76 . AC=1;AF=0.071;AN=14;BaseQRankSum=-9.210e-01;ClippingRankSum=0.00;DP=30;ExcessHet=3.3579;FS=0.000;MLEAC=1;MLEAF=0.071;MQ=53.67;MQRankSum=0.00;QD=15.22;ReadPosRankSum=0.732;SOR=0.799 GT:AD:DP:GQ:PL ./.:0,0:0:.:0,0,0 0/0:2,0:2:6:0,6,74 0/0:1,0:1:3:0,3,42 0/0:3,0:3:0:0,0,39 0/1:3,5:8:99:150,0,101 0/0:5,0:5:15:0,15,202 ./.:2,0:2:.:0,0,0 0/0:7,0:7:0:0,0,211 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 0/0:2,0:2:6:0,6,76 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0
J9347376|m.109621 210 . G A 3468.25 . AC=9;AF=0.500;AN=18;BaseQRankSum=1.22;ClippingRankSum=0.00;DP=199;ExcessHet=7.7512;FS=5.817;MLEAC=11;MLEAF=0.611;MQ=52.40;MQRankSum=-1.360e+00;QD=18.16;ReadPosRankSum=0.203;SOR=0.415 GT:AD:DP:GQ:PL ./.:0,0:0:.:0,0,0 0/0:3,0:3:0:0,0,28 1/1:1,8:9:17:224,17,0 1/1:3,54:57:99:1469,111,0 0/1:2,11:13:9:330,0,9 0/1:2,7:9:22:221,0,22 ./.:1,0:1:.:0,0,0 0/1:30,33:63:99:850,0,787 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0 0/0:3,0:3:0:0,0,41 0/1:4,4:8:46:46,0,46 0/1:12,20:32:99:373,0,149
J9347376|m.109621 265 . A G 41.82 . AC=1;AF=0.050;AN=20;BaseQRankSum=0.126;ClippingRankSum=0.00;DP=23;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=-0.1793;MLEAC=1;MLEAF=0.050;MQ=60.00;MQRankSum=0.00;QD=8.36;ReadPosRankSum=-5.240e-01;SOR=0.223 GT:AD:DP:GQ:PL 0/0:1,0:1:3:0,3,27 0/0:1,0:1:3:0,3,38 0/0:3,0:3:9:0,9,117 0/0:2,0:2:6:0,6,80 0/1:2,3:5:37:73,0,37 0/0:3,0:3:6:0,6,90 0/0:1,0:1:3:0,3,32 0/0:3,0:3:9:0,9,117 0/0:3,0:3:9:0,9,58 ./.:0,0:0:.:0,0,0 0/0:1,0:1:3:0,3,27 ./.:0,0:0:.:0,0,0 ./.:0,0:0:.:0,0,0
J9347376|m.109621 279 . G C 49.24 . AC=1;AF=0.050;AN=20;BaseQRankSum=0.967;ClippingRankSum=0.00;DP=24;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=-0.1578;MLEAC=1;MLEAF=0.050;MQ=60.00;MQRankSum=0.00;QD=16.41;ReadPosRankSum=0.967;SOR=1.179 GT:AD:DP:GQ:PGT:PID:PL 0/0:1,0:1:3:.:.:0,3,27 0/0:1,0:1:3:.:.:0,3,38 0/0:3,0:3:9:.:.:0,9,117 0/0:2,0:2:6:.:.:0,6,80 0/0:4,0:4:12:.:.:0,12,158 0/1:1,2:3:75:0|1:279_G_C:81,0,75 0/0:1,0:1:3:.:.:0,3,32 0/0:3,0:3:9:.:.:0,9,117 0/0:4,0:4:12:.:.:0,12,93 ./.:0,0:0:.:.:.:0,0,0 0/0:2,0:2:6:.:.:0,6,61 ./.:0,0:0:.:.:.:0,0,0 ./.:0,0:0:.:.:.:0,0,0
J9347376|m.109621 282 . C A 49.24 . AC=1;AF=0.050;AN=20;BaseQRankSum=0.431;ClippingRankSum=0.00;DP=24;ExcessHet=3.0103;FS=0.000;InbreedingCoeff=-0.1578;MLEAC=1;MLEAF=0.050;MQ=60.00;MQRankSum=0.00;QD=16.41;ReadPosRankSum=0.967;SOR=1.179 GT:AD:DP:GQ:PGT:PID:PL 0/0:1,0:1:3:.:.:0,3,27 0/0:1,0:1:3:.:.:0,3,38 0/0:3,0:3:9:.:.:0,9,117 0/0:2,0:2:6:.:.:0,6,80 0/0:4,0:4:12:.:.:0,12,158 0/1:1,2:3:75:0|1:279_G_C:81,0,75 0/0:1,0:1:3:.:.:0,3,32 0/0:3,0:3:9:.:.:0,9,117 0/0:4,0:4:12:.:.:0,12,93 ./.:0,0:0:.:.:.:0,0,0 0/0:2,0:2:6:.:.:0,6,61 ./.:0,0:0:.:.:.:0,0,0 ./.:0,0:0:.:.:.:0,0,0
Thanks, Karen
Thanks for these additional details! I am not certain, but I believe you're simply saying that your VCF file is summarizing multiple individual sequences, and that the AF value here in the INFO columns represents a true allele frequency (I think, the allele frequency among all the samples in the columns that come after INFO). If that is true, and the AF represents the allele frequency in a group/sample/population of your organism of interest, then that is all I mean by a "pooled" approach, i.e., the VCF represents multiple individuals. In other words, it doesn't matter whether the sequence data were generated in a pooled fashion or summarized "after the fact" for multiple individual sequences. In summary, if the AF value represents the frequency you want represented in the FASTA, it should work!
It's entirely possible I misunderstood — please let me know!
Chase
Hi Chase,
Is it possible to run the snpgenie_between_group.pl with vcf files rather than fasta files? I have two populations with variant data for 6 diploids per population and would like to estimate dn/ds between populations. If if cannot be run with vcfs do you have a tool you recommend for converting diploid vcf data to fasta format?
Thanks, Karen