chasewnelson / SNPGenie

Program for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data
GNU General Public License v3.0
108 stars 37 forks source link

Conflicting SNP Report formats detected. #75

Open bturne48 opened 1 month ago

bturne48 commented 1 month ago

Hi i am trying to run snpgenie.pl with this command:

 snpgenie.pl
--vcfformat=4
--minfreq=0.01
--outdir results/7_Snps/SnpGenie_Dyak_Tai18E2/NW_025048829.1
--snpreport=results/7_Snps/Dyak_Tai18E2.test.NW_025048829.1.vcf
--fastafile=results/7_Snps/Dyak_Tai18E2.test.NW_025048829.1.fa 
--gtffile=/scratch/bturne48/GCF_016746365.2_Prin_Dyak_Tai18E2_2.1_genomic.gtf 

and am confused why i am receiving this error:

## WARNING: Conflicting SNP Report formats detected. Does the file extension match expectation? If not, please contact author. ## SNPGenie TERMINATED.

I have made sure to include the AD and DP tags in both the format and sample columns, and am unsure what to do.

Here are 4 lines from my vcf as an example. I can provide more information! Any help would be amazing thank you

#CHROM          POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  (all my sample names are the rest of the columns)
NW_025048829.1  335 .   C   .   20.051  .   DP=151;AD=149;VDB=0.42;SGB=-1.25312;RPBZ=-2.0839;MQBZ=0.977156;MQSBZ=3.49015;BQBZ=-2.21411;SCBZ=1.99133;MQ0F=0.211921;AN=104;DP4=65,84,0,2;MQ=24    GT:DP:AD    ./.:0:0 0/0:2:2 0/0:1:1 0/0:1:1 0/0:2:2 0/0:2:2 0/0:3:3 ./.:0:0 0/0:4:4 0/0:4:4 0/0:7:7 0/0:1:1 0/0:1:1 0/0:2:2 0/0:2:1 0/0:1:1 0/0:6:6 0/0:1:1 ./.:0:0 ./.:0:0 ./.:0:0 0/0:7:7 0/0:5:5 0/0:13:13   0/0:6:6 ./.:0:0 0/0:4:4 0/0:7:7 0/0:5:5 0/0:3:3 0/0:10:10   0/0:1:1 0/0:5:5 0/0:2:2 0/0:1:1 0/0:2:2 0/0:3:3 0/0:3:3 ./.:0:0 ./.:0:0 0/0:1:0 0/0:2:2 0/0:4:4 ./.:0:0 0/0:2:2 ./.:0:0 0/0:1:1 0/0:1:1 0/0:2:2 ./.:0:0 ./.:0:0 0/0:1:1 ./.:0:0 0/0:1:1 0/0:4:4 0/0:1:1 ./.:0:0 0/0:2:2 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 0/0:1:1 0/0:1:1 0/0:1:1 ./.:0:0 ./.:0:0 0/0:2:2 0/0:1:1 ./.:0:0 ./.:0:0 0/0:2:2 0/0:2:2 ./.:0:0 ./.:0:0 0/0:2:2 ./.:0:0
NW_025048829.1  336 .   G   A   76.1591 .   DP=149;AD=143,6;VDB=0.101965;SGB=7.02578;RPBZ=-0.912743;MQBZ=1.82338;MQSBZ=3.63272;BQBZ=1.44056;SCBZ=0.656512;MQ0F=0.194631;AC=3;AN=104;DP4=60,83,4,2;MQ=24 GT:PL:DP:AD:GQ  ./.:0,0,0:0:0,0:0   0/0:0,6,38:2:2,0:16 0/0:0,3,27:1:1,0:13 0/0:0,3,8:1:1,0:13  0/0:0,6,40:2:2,0:16 0/0:0,6,66:2:2,0:16 0/0:0,9,79:3:3,0:19 ./.:0,0,0:0:0,0:0   0/0:0,12,43:4:4,0:22    0/0:0,12,95:4:4,0:22    0/0:0,21,85:7:7,0:31    0/0:0,3,29:1:1,0:13 0/0:0,3,4:1:1,0:13  0/0:0,6,53:2:2,0:16 0/0:0,6,41:2:2,0:16 0/0:0,3,12:1:1,0:13 0/0:0,18,109:6:6,0:28   0/0:0,3,4:1:1,0:13  ./.:0,0,0:0:0,0:0   ./.:0,0,0:0:0,0:0   ./.:0,0,0:0:0,0:0   0/0:0,18,149:6:6,0:28   0/0:0,15,126:5:5,0:25   0/0:0,36,198:12:12,0:46 0/0:0,18,123:6:6,0:28   ./.:0,0,0:0:0,0:0   0/1:104,12,0:4:0,4:5    0/0:0,21,100:7:7,0:31   0/0:0,15,114:5:5,0:25   0/0:0,9,83:3:3,0:19 0/0:0,30,157:10:10,0:40 0/1:37,3,0:1:0,1:12 0/0:0,15,58:5:5,0:25    0/0:0,6,57:2:2,0:16 0/0:0,3,37:1:1,0:13 0/0:0,3,4:1:1,0:13  0/0:0,9,90:3:3,0:19 0/0:0,9,84:3:3,0:19 ./.:0,0,0:0:0,0:0   ./.:0,0,0:0:0,0:0   0/0:0,3,25:1:1,0:13 0/1:16,0,23:2:1,1:6 0/0:0,12,117:4:4,0:22   ./.:0,0,0:0:0,0:0   0/0:0,6,58:2:2,0:16 ./.:0,0,0:0:0,0:0   0/0:0,3,37:1:1,0:13 0/0:0,3,27:1:1,0:13 0/0:0,9,67:3:3,0:19 ./.:0,0,0:0:0,0:0   ./.:0,0,0:0:0,0:0   0/0:0,3,4:1:1,0:13  ./.:0,0,0:0:0,0:0   0/0:0,3,29:1:1,0:13 0/0:0,12,68:4:4,0:22    0/0:0,3,29:1:1,0:13 ./.:0,0,0:0:0,0:0   0/0:0,6,66:2:2,0:16 ./.:0,0,0:0:0,0:0   ./.:0,0,0:0:0,0:0   ./.:0,0,0:0:0,0:0   ./.:0,0,0:0:0,0:0   ./.:0,0,0:0:0,0:0   0/0:0,3,37:1:1,0:13 0/0:0,3,29:1:1,0:13 0/0:0,3,15:1:1,0:13 ./.:0,0,0:0:0,0:0   ./.:0,0,0:0:0,0:0   0/0:0,6,53:2:2,0:16 0/0:0,3,15:1:1,0:13 ./.:0,0,0:0:0,0:0   ./.:0,0,0:0:0,0:0   0/0:0,6,47:2:2,0:16 0/0:0,6,56:2:2,0:16 ./.:0,0,0:0:0,0:0   ./.:0,0,0:0:0,0:0   0/0:0,6,56:2:2,0:16 ./.:0,0,0:0:0,0:0
NW_025048829.1  337 .   G   T   474.423 .   DP=146;AD=117,28;VDB=0.00484544;SGB=14.0114;RPBZ=-2.03359;MQBZ=1.67476;MQSBZ=3.34009;BQBZ=1.10604;SCBZ=0.0465449;MQ0F=0.184932;AC=17;AN=102;DP4=49,68,13,16;MQ=24   GT:PL:DP:AD:GQ  ./.:0,0,0:0:0,0:0   0/1:31,6,0:2:0,2:4  0/1:29,3,0:1:0,1:6  0/0:0,3,29:1:1,0:6  0/0:0,6,40:2:2,0:8  0/0:0,6,66:2:2,0:8  0/1:30,0,28:3:1,2:27    ./.:0,0,0:0:0,0:0   0/0:0,12,58:4:4,0:14    0/0:0,12,84:4:4,0:14    0/0:0,18,62:6:6,0:19    0/0:0,3,29:1:1,0:6  ./.:0,0,0:0:0,0:0   0/0:0,6,47:2:2,0:8  0/0:0,6,41:2:2,0:8  0/1:12,3,0:1:0,1:4  0/0:0,18,124:6:6,0:19   0/0:0,3,4:1:1,0:5   ./.:0,0,0:0:0,0:0   ./.:0,0,0:0:0,0:0   ./.:0,0,0:0:0,0:0   0/0:0,18,149:6:6,0:19   0/0:0,15,126:5:5,0:16   0/0:0,36,198:12:12,0:37 0/0:0,18,123:6:6,0:19   ./.:0,0,0:0:0,0:0   1/1:104,12,0:4:0,4:5    0/0:0,21,100:7:7,0:22   1/1:114,15,0:5:0,5:7    1/1:83,9,0:3:0,3:3  0/0:0,30,157:10:10,0:31 0/1:37,3,0:1:0,1:6  0/0:0,12,58:4:4,0:14    0/1:57,6,0:2:0,2:4  0/0:0,3,37:1:1,0:6  0/0:0,3,4:1:1,0:5   0/0:0,9,90:3:3,0:11 1/1:84,9,0:3:0,3:3  ./.:0,0,0:0:0,0:0   ./.:0,0,0:0:0,0:0   0/0:0,3,27:1:1,0:6  0/0:0,6,47:2:2,0:8  0/0:0,12,109:4:4,0:14   ./.:0,0,0:0:0,0:0   0/0:0,6,58:2:2,0:8  ./.:0,0,0:0:0,0:0   0/0:0,3,37:1:1,0:6  0/0:0,3,29:1:1,0:6  0/0:3,9,52:3:2,0:8  ./.:0,0,0:0:0,0:0   ./.:0,0,0:0:0,0:0   0/0:0,3,4:1:1,0:5   ./.:0,0,0:0:0,0:0   0/0:0,3,29:1:1,0:6  0/0:0,12,65:4:4,0:14    0/0:0,3,29:1:1,0:6  ./.:0,0,0:0:0,0:0   0/0:0,6,63:2:2,0:8  ./.:0,0,0:0:0,0:0   ./.:0,0,0:0:0,0:0   ./.:0,0,0:0:0,0:0   ./.:0,0,0:0:0,0:0   ./.:0,0,0:0:0,0:0   0/0:0,3,37:1:1,0:6  0/0:0,3,29:1:1,0:6  0/1:15,3,0:1:0,1:5  ./.:0,0,0:0:0,0:0   ./.:0,0,0:0:0,0:0   0/1:53,6,0:2:0,2:4  0/1:15,3,0:1:0,1:5  ./.:0,0,0:0:0,0:0   ./.:0,0,0:0:0,0:0   0/0:0,6,47:2:2,0:8  0/0:0,6,58:2:2,0:8  ./.:0,0,0:0:0,0:0   ./.:0,0,0:0:0,0:0   0/0:0,6,51:2:2,0:8  ./.:0,0,0:0:0,0:0
NW_025048829.1  338 .   G   .   20.7671 .   DP=143;AD=141;VDB=0.32;SGB=-1.25312;RPBZ=-1.97735;MQBZ=0.0718816;MQSBZ=3.35261;BQBZ=-2.3969;SCBZ=1.89094;MQ0F=0.188811;AN=102;DP4=60,81,0,2;MQ=24   GT:DP:AD    ./.:0:0 0/0:2:2 0/0:1:1 0/0:1:1 0/0:2:2 0/0:2:2 0/0:3:3 ./.:0:0 0/0:4:4 0/0:4:4 0/0:6:6 0/0:1:1 ./.:0:0 0/0:2:2 0/0:2:1 0/0:1:1 0/0:6:6 0/0:1:1 ./.:0:0 ./.:0:0 ./.:0:0 0/0:6:6 0/0:5:5 0/0:12:12   0/0:6:6 ./.:0:0 0/0:4:4 0/0:7:7 0/0:5:5 0/0:3:3 0/0:10:10   0/0:1:1 0/0:4:4 0/0:2:2 0/0:1:1 0/0:1:1 0/0:3:3 0/0:3:3 ./.:0:0 ./.:0:0 0/0:1:1 0/0:2:2 0/0:3:3 ./.:0:0 0/0:2:2 ./.:0:0 0/0:1:1 0/0:1:1 0/0:3:3 ./.:0:0 ./.:0:0 0/0:1:1 ./.:0:0 0/0:1:1 0/0:4:4 0/0:1:1 ./.:0:0 0/0:1:1 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 ./.:0:0 0/0:1:1 0/0:1:1 0/0:1:1 ./.:0:0 ./.:0:0 0/0:2:2 0/0:1:1 ./.:0:0 ./.:0:0 0/0:2:1 0/0:1:1 ./.:0:0 ./.:0:0 0/0:2:2 ./.:0:0
singing-scientist commented 1 month ago

Greetings @bturne48 ! Thanks very much for the question.

The first issue I see is that the FORMAT/sample data differs from line to line. For example, the first record here has GT:DP:AD, whereas the second has GT:PL:DP:AD:GQ. SNPGenie expects all lines to be formatted identically, and I suspect that is the first reason (but possibly not the only reason) for an error.

After ensuring that all lines have identically formatted data, perhaps give it another try and let me know if it works as expected? Happy to troubleshoot further.

Yours, Chase

bturne48 commented 1 month ago

Hey Chase,

Thanks so much for the help. I parsed the VCF to exclude invariant sites, so every line should have identical tags now.

While this did not fix my og error, I did notice that my sample names have backslashes "/" to their corresponding bam paths. and I believe that this was causing the issue. Example:

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT results/sample1.vcf results/sample2.vcf

When snpgenie goes to create temp vcf files for each sample from my multi-sample vcf, it was turning them into temp directories due to the backslashes in the sample names.

Creating temp_vcf4_results/sample1.vcf...
Creating temp_vcf4_results/sample2.vcf...

Sorry for the trouble. Might have more questions later, but I can open a new ticket for those if that is preferable!

singing-scientist commented 1 month ago

Oh wow! Great catch, I haven't run into that before. Agreed, just do something like remove the base path (i.e., 'results/' here) manually (FIND/REPLACE) if you can, or something similar. Let me know how it goes and don't hesitate to reach out if there are other questions!

Chase

bturne48 commented 1 month ago

Hi Chase,

Just throwing this question on this thread since it isn't closed yet. Let me know if you'd prefer to close this and I can communicate with you somewhere else.

If I have 50 pooled samples from one population, is it appropriate to run SNPgenie on each of these pooled samples individually, and merge results into one population summary report at the end?

I have separated out input by chromosome (as required), however runtime is still pretty long, so I have separated the VCF input by sample as well, and just wanted to make sure that works with the assumptions of SNPgenie.

Thanks again!

singing-scientist commented 1 month ago

Hello! Totally fine to continue here as the conversation may benefit others in the future; contrarily if the questions deal with a confidential study, please feel free to email.

If I have 50 pooled samples from one population, is it appropriate to run SNPgenie on each of these pooled samples individually, and merge results into one population summary report at the end?

Well, there is nothing inappropriate about it — it all depends on the question or hypothesis you're addressing. A summary of all populations would result in a metric describing, if you will, the meta-population, and may or may not be what you're interested in characterizing.

I have separated out input by chromosome (as required), however runtime is still pretty long, so I have separated the VCF input by sample as well, and just wanted to make sure that works with the assumptions of SNPgenie.

Great question! Yes, that is totally appropriate, i.e., results for each population (sample) are independent. Thus, the problem is an 'embarrassingly parallel' one, and running SNPGenie separately for each sample and chromosome is both an effective and appropriate way to speed up the runtime.

Let me know if that helps! Chase

bturne48 commented 1 month ago

Great. Thanks for all the help! I think I should only have one more question.

I have combined my results files for all my samples into one combined file (population_summary.txt), however I now see that I need to run snpgenie twice, if I want to use the '-" strand from the gtf.

If I wanted to combine the results of these two runs from the + and - , do you have any suggestions? Or is that not really possible? Or if certain fields like PiN/PiS could be combined, while others could not, that info would be useful.

Thanks!

singing-scientist commented 1 month ago

You bet! This is where a deeper understanding of pi becomes necessary, and some downstream data science (e.g. with R) may be handy. If you have results from both strands of the same sample, they can indeed be combined into a single summary statistic. For example, the overall value of piN for the sample would be (N_diffs_strand1 + N_diffs_strand2) / (N_sites_strand1 + N_sites_strand2). Likewise for piS. This simple calculation works for individual codons, whole ORFs, or even the full protein-coding genome. There are virtually limitless analyses or comparisons one can do, so it's important to form precise hypotheses ahead of time and focus on what comparisons/metrics are appropriate for addressing them. Let me know if that makes sense!

Chase

bturne48 commented 4 weeks ago

Hey Chase, thanks for all your help so far. Was wondering if I could run this past you, totally cool if not.

I had a question regarding my results. I merged my results (+ and - strand) and was working through them. Below I have both strands for 2 of my samples.

Do you also find it suspect that there are far more nonsyn mutations than there are syn mutations. I am worried that I did something wrong with the input because of the magnitude of difference between the two. From literature I thought that syn mutations should be far more common, but I am unsure.

Thank you so much for your help

file    sites   sites_coding    sites_noncoding pi  pi_coding   pi_noncoding    N_sites S_sites piN piS mean_dN_vs_ref  mean_dS_vs_ref  mean_gdiv_polymorphic   mean_N_gdiv mean_S_gdiv mean_gdiv   sites_polymorphic   mean_gdiv_coding_poly   sites_coding_poly   mean_gdiv_noncoding_poly    sites_noncoding_poly

temp_vcf4_Bata6.vcf 24674056    3828975 20845081    0.000509543234000816    0.000242324893686526    0.00055862782832813 1484236.91090886    468064.411646331    0.000156770247136192    0.000565384430333997    0.00576503943164924 0.038129645360846   0.00105072377206131 0.372450067620597   0.355449746228464   0.000461104533970691    433605  0.016210227842936   52542   0.0276216827708886  381063

temp_vcf4_Bata6.vcf.rev 24674056    3828975 20845081    0.00050954323400082 0.000242324893686526    0.000558627828328135    1426704.39721135    449851.959931503    0.000120107506698998    0.000576139612300811    0.00538052207093723 0.0388692743082501  0.000913547376388415    0.378523020565642   0.357159370777168   0.000461104533970693    433605  0.016210227842936   52542   0.0276216827708884  381063

temp_vcf4_Cameroon_115.vcf  24674056    3828975 20845081    0.000373630307395286    0.00022380637324866 0.000401151049495019    1484262.19714368    468075.001419281    0.000106276274383141    0.000550039710172113    0.00178287420560013 0.017125707390797   0.0020162201341968  0.389704341931958   0.388225560362  0.000339883886651677    191050  0.0365187197546601  21761   0.0448442024712771  169289

temp_vcf4_Cameroon_115.vcf.rev  24674056    3828975 20845081    0.000373630307395287    0.00022380637324866 0.000401151049495023    1426727.8508098 449846.32828684 0.000104598622285654    0.000650252158137782    0.00162771477762549 0.0175616054801894  0.00214333946057503 0.398018828353776   0.398470411902559   0.000339883886651676    191050  0.0365187197546602  21761   0.0448442024712769  169289
singing-scientist commented 4 weeks ago

Greetings @bturne48 — no worries! I'm not exactly sure how you're concluding there are more nonsynonymous mutations, are you counting them somehow? Regardless, 3nonsyn:1syn is actually approximately the neutral expectation, i.e., ~75% of random mutations will be nonsynonymous given the nature of the genetic code. One place this is displayed is Graur & Li, Fundamentals of Molecular Evolution, 2nd Edition, Table 1.5 "Relative frequencies of different types of mutational substitutions in random protein-coding genes". I think of 75% nonsynonymous as being approximately dN/dS = 1. Let me know if that helps :->

bturne48 commented 4 weeks ago

Thanks for the speedy reply Chase. I was using the fields "N_sites" and "S_sites" for the example above, and noticing that there was (as you said) about 3x as many non-syn sites. Is this a misinterpretation of what those fields are reporting?

sample: ['temp_vcf4_Bata6.vcf', 'temp_vcf4_Cameroon_115.vcf', 'temp_vcf4_CY01A.vcf']
nonsyn: [1484236, 1484262, 1484244]
syn: [468064, 468075, 468087]

sample: ['temp_vcf4_Bata6.vcf.rev', 'temp_vcf4_Cameroon_115.vcf.rev', 'temp_vcf4_CY01A.vcf.rev']
nonsyn: [1426704, 1426727, 1426701]
syn: [449851, 449846, 449862]

Also thank you for the literature, incredibly helpful!!

singing-scientist commented 4 weeks ago

You're most welcome! Yes, that is a misinterpretation. N_sites is the number of nonsynonymous sites in the sequence, not the number of nonsynonymous changes. Sites can be thought of as possible changes. Strongly recommend to familiarize with background material on molecular evolution in general and dN/dS in particular before interpreting analysis results, or enlist a molecular evolution collaborator. Great resources are the text I mentioned, or anything by Wen-Hsiung Li, Austin Hughes, Masatoshi Nei, Zvelebil & Baum, Asher D. Cutter, etc. Hope that helps!