Open bturne48 opened 1 month ago
Greetings @bturne48 ! Thanks very much for the question.
The first issue I see is that the FORMAT/sample data differs from line to line. For example, the first record here has GT:DP:AD, whereas the second has GT:PL:DP:AD:GQ. SNPGenie expects all lines to be formatted identically, and I suspect that is the first reason (but possibly not the only reason) for an error.
After ensuring that all lines have identically formatted data, perhaps give it another try and let me know if it works as expected? Happy to troubleshoot further.
Yours, Chase
Hey Chase,
Thanks so much for the help. I parsed the VCF to exclude invariant sites, so every line should have identical tags now.
While this did not fix my og error, I did notice that my sample names have backslashes "/" to their corresponding bam paths. and I believe that this was causing the issue. Example:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT results/sample1.vcf results/sample2.vcf
When snpgenie goes to create temp vcf files for each sample from my multi-sample vcf, it was turning them into temp directories due to the backslashes in the sample names.
Creating temp_vcf4_results/sample1.vcf...
Creating temp_vcf4_results/sample2.vcf...
Sorry for the trouble. Might have more questions later, but I can open a new ticket for those if that is preferable!
Oh wow! Great catch, I haven't run into that before. Agreed, just do something like remove the base path (i.e., 'results/' here) manually (FIND/REPLACE) if you can, or something similar. Let me know how it goes and don't hesitate to reach out if there are other questions!
Chase
Hi Chase,
Just throwing this question on this thread since it isn't closed yet. Let me know if you'd prefer to close this and I can communicate with you somewhere else.
If I have 50 pooled samples from one population, is it appropriate to run SNPgenie on each of these pooled samples individually, and merge results into one population summary report at the end?
I have separated out input by chromosome (as required), however runtime is still pretty long, so I have separated the VCF input by sample as well, and just wanted to make sure that works with the assumptions of SNPgenie.
Thanks again!
Hello! Totally fine to continue here as the conversation may benefit others in the future; contrarily if the questions deal with a confidential study, please feel free to email.
If I have 50 pooled samples from one population, is it appropriate to run SNPgenie on each of these pooled samples individually, and merge results into one population summary report at the end?
Well, there is nothing inappropriate about it — it all depends on the question or hypothesis you're addressing. A summary of all populations would result in a metric describing, if you will, the meta-population, and may or may not be what you're interested in characterizing.
I have separated out input by chromosome (as required), however runtime is still pretty long, so I have separated the VCF input by sample as well, and just wanted to make sure that works with the assumptions of SNPgenie.
Great question! Yes, that is totally appropriate, i.e., results for each population (sample) are independent. Thus, the problem is an 'embarrassingly parallel' one, and running SNPGenie separately for each sample and chromosome is both an effective and appropriate way to speed up the runtime.
Let me know if that helps! Chase
Great. Thanks for all the help! I think I should only have one more question.
I have combined my results files for all my samples into one combined file (population_summary.txt), however I now see that I need to run snpgenie twice, if I want to use the '-" strand from the gtf.
If I wanted to combine the results of these two runs from the + and - , do you have any suggestions? Or is that not really possible? Or if certain fields like PiN/PiS could be combined, while others could not, that info would be useful.
Thanks!
You bet! This is where a deeper understanding of pi becomes necessary, and some downstream data science (e.g. with R) may be handy. If you have results from both strands of the same sample, they can indeed be combined into a single summary statistic. For example, the overall value of piN for the sample would be (N_diffs_strand1 + N_diffs_strand2) / (N_sites_strand1 + N_sites_strand2). Likewise for piS. This simple calculation works for individual codons, whole ORFs, or even the full protein-coding genome. There are virtually limitless analyses or comparisons one can do, so it's important to form precise hypotheses ahead of time and focus on what comparisons/metrics are appropriate for addressing them. Let me know if that makes sense!
Chase
Hey Chase, thanks for all your help so far. Was wondering if I could run this past you, totally cool if not.
I had a question regarding my results. I merged my results (+ and - strand) and was working through them. Below I have both strands for 2 of my samples.
Do you also find it suspect that there are far more nonsyn mutations than there are syn mutations. I am worried that I did something wrong with the input because of the magnitude of difference between the two. From literature I thought that syn mutations should be far more common, but I am unsure.
Thank you so much for your help
file sites sites_coding sites_noncoding pi pi_coding pi_noncoding N_sites S_sites piN piS mean_dN_vs_ref mean_dS_vs_ref mean_gdiv_polymorphic mean_N_gdiv mean_S_gdiv mean_gdiv sites_polymorphic mean_gdiv_coding_poly sites_coding_poly mean_gdiv_noncoding_poly sites_noncoding_poly
temp_vcf4_Bata6.vcf 24674056 3828975 20845081 0.000509543234000816 0.000242324893686526 0.00055862782832813 1484236.91090886 468064.411646331 0.000156770247136192 0.000565384430333997 0.00576503943164924 0.038129645360846 0.00105072377206131 0.372450067620597 0.355449746228464 0.000461104533970691 433605 0.016210227842936 52542 0.0276216827708886 381063
temp_vcf4_Bata6.vcf.rev 24674056 3828975 20845081 0.00050954323400082 0.000242324893686526 0.000558627828328135 1426704.39721135 449851.959931503 0.000120107506698998 0.000576139612300811 0.00538052207093723 0.0388692743082501 0.000913547376388415 0.378523020565642 0.357159370777168 0.000461104533970693 433605 0.016210227842936 52542 0.0276216827708884 381063
temp_vcf4_Cameroon_115.vcf 24674056 3828975 20845081 0.000373630307395286 0.00022380637324866 0.000401151049495019 1484262.19714368 468075.001419281 0.000106276274383141 0.000550039710172113 0.00178287420560013 0.017125707390797 0.0020162201341968 0.389704341931958 0.388225560362 0.000339883886651677 191050 0.0365187197546601 21761 0.0448442024712771 169289
temp_vcf4_Cameroon_115.vcf.rev 24674056 3828975 20845081 0.000373630307395287 0.00022380637324866 0.000401151049495023 1426727.8508098 449846.32828684 0.000104598622285654 0.000650252158137782 0.00162771477762549 0.0175616054801894 0.00214333946057503 0.398018828353776 0.398470411902559 0.000339883886651676 191050 0.0365187197546602 21761 0.0448442024712769 169289
Greetings @bturne48 — no worries! I'm not exactly sure how you're concluding there are more nonsynonymous mutations, are you counting them somehow? Regardless, 3nonsyn:1syn is actually approximately the neutral expectation, i.e., ~75% of random mutations will be nonsynonymous given the nature of the genetic code. One place this is displayed is Graur & Li, Fundamentals of Molecular Evolution, 2nd Edition, Table 1.5 "Relative frequencies of different types of mutational substitutions in random protein-coding genes". I think of 75% nonsynonymous as being approximately dN/dS = 1. Let me know if that helps :->
Thanks for the speedy reply Chase. I was using the fields "N_sites" and "S_sites" for the example above, and noticing that there was (as you said) about 3x as many non-syn sites. Is this a misinterpretation of what those fields are reporting?
sample: ['temp_vcf4_Bata6.vcf', 'temp_vcf4_Cameroon_115.vcf', 'temp_vcf4_CY01A.vcf']
nonsyn: [1484236, 1484262, 1484244]
syn: [468064, 468075, 468087]
sample: ['temp_vcf4_Bata6.vcf.rev', 'temp_vcf4_Cameroon_115.vcf.rev', 'temp_vcf4_CY01A.vcf.rev']
nonsyn: [1426704, 1426727, 1426701]
syn: [449851, 449846, 449862]
Also thank you for the literature, incredibly helpful!!
You're most welcome! Yes, that is a misinterpretation. N_sites is the number of nonsynonymous sites in the sequence, not the number of nonsynonymous changes. Sites can be thought of as possible changes. Strongly recommend to familiarize with background material on molecular evolution in general and dN/dS in particular before interpreting analysis results, or enlist a molecular evolution collaborator. Great resources are the text I mentioned, or anything by Wen-Hsiung Li, Austin Hughes, Masatoshi Nei, Zvelebil & Baum, Asher D. Cutter, etc. Hope that helps!
Hi i am trying to run snpgenie.pl with this command:
and am confused why i am receiving this error:
## WARNING: Conflicting SNP Report formats detected. Does the file extension match expectation? If not, please contact author. ## SNPGenie TERMINATED.
I have made sure to include the AD and DP tags in both the format and sample columns, and am unsure what to do.
Here are 4 lines from my vcf as an example. I can provide more information! Any help would be amazing thank you