Question about calculating pi within a window

deborahmleigh commented 3 years ago

Hi! Thanks for making such a a great tool. I am using SNPGenie to examine some intra-host virus SNV data. I was hoping you might be able to help me with two issues I am having.

I am looking to calculate pi but only in specific regions and not across the genome as a whole. As I understand it the population_summary value is genome wide. Is there a way to do this with the outputs of the classic 'within pool' analyses? I have the codon, product, site and population results files.

I also seem to be able to get the sliding window analyses running. Can that run on within pool data? I would like to look at a the general direction of selection at a finer scale than each ORF. Is it best to call consensus sequences and look at the consensus level? Or is there another approach that SNPGenie can do but I have missed?

Thanks for the help!

singing-scientist commented 3 years ago

Greetings! Thanks for these questions. pi (π) is defined as the mean number of pairwise differences per site. Thus, it can be calculated as the sum of pairwise differences at all sites, divided by the total number of sites. This is true even of arbitrarily large (or small) regions of your genome.

Suppose a particular region contains three codons with nonsynonymous differences (N_diffs>0), say 0.1, 0.2, and 0.3. (The N_diffs column of the codon results file reports the mean number of nonsynonymous differences per site in that codon.) Next suppose the region contains 267 sites, of which 200 are nonsynonymous sites (N_sites). πN could then be calculated as (0.1+0.2+0.3)/200 = 0.003. πS can be calculated similarly using S_diffs and S_sites. Finally, if you want π for all sites (not partitioned by nonsynonymous vs. synonymous), it can be calculated as π = (N_diffs+S_diffs) / (N_sites+S_sites). This works for sliding windows, too.

For sliding windows, do you mean you can or can't seem to get it working? Unfortunately, it is not possible for me to answer questions about scripts without examples and knowing exactly which script was run, and with what arguments. It should run on any codon results file (pooled or not), but it's old and there have been several versions. I recently uploaded (but have not documented, it seems) an R script for this purpose. Please let me know exactly what you mean and need, and we can take it from there. Finally, regarding what's best for consensus vs. pooled calculations, etc., that entirely depends on what question you want to ask—analysis of consensus sequences answers questions about the consensus sequences, while analysis of pooled sequences answers questions about the pooled sequences.

Let me know! Chase

deborahmleigh commented 3 years ago

Hi Chase, thanks for the reply. Sorry about not being clear for the window analyses. I ran the SNPGenie pipeline for the within group analyses but I can't get the sliding window approach to work after this. It appears I might be missing a column my codon file. So the first command I ran was:

snpgenie.pl --vcfformat=4 --snpreport=pool_file.vcf --fastafile=Ref.fa --gtffile=Ref.gtf

Followed by: snpgenie_within_group_processor.pl --codon_file=SNPGenie_Results/codon_results.txt --sliding_window_size=30 --sliding_window_step=25 --num_bootstraps=1000

Which fails with the command: DIE: the header name variability is not present in the codon results file.

This is the header of the codon file:

file product site codon num_overlap_ORF_nts N_diffs S_diffs N_sites S_sites N_sites_ref S_sites_ref N_diffs_vs_ref S_diffs_vs_ref gdiv N_gdiv S_gdiv temp_vcf4_default.vcf ORFA 495 ATG 0 0 0 3 0 3 0 0 0 0 temp_vcf4_default.vcf ORFA 498 TCT 0 0 0 2 1 2 1 0 0 0 temp_vcf4_default.vcf ORFA 501 TGT 0 0 0 2.5 0.5 2.5 0.5 0 0 0 temp_vcf4_default.vcf ORFA 504 CTT 0 0 0 2 1 2 1 0 0 0 temp_vcf4_default.vcf ORFA 507 AGA 0 0 0 2.16666666666667 0.833333333333333 2.16666666666667 0.833333333333333 0 0 0

I'd be happy to try the R script. What I am trying to do is to look at pi values at windows across the genome to understand the general direction of selection (purifying or positive) across the virus genome. It was recommended that we bootstrap the pi values to obtain p values too. Thanks! Debbie

singing-scientist commented 3 years ago

Dear Debbie:

Thanks very much for your patience. Indeed, the script snpgenie_within_group_processor.pl was meant only for within-group results. Thus, I have updated the script "SNPGenie_sliding_windows.R", with new documentation provided here: https://github.com/chasewnelson/SNPGenie#sliding-windows-bootstrapping

Would you kindly give it a try and let me know if it works for you? The major reason for moving to R is that it has built-in packages that make bootstrapping much more flexible and fast (as opposed to the from-scratch Perl monstrosity I created previously).

The results you're after should be the columns: sw_dN | sw_dS | sw_dNdS | sw_boot_dN_m_dS_P (the P-value of dN-dS=0, Z test).

I look forward to hearing back! Chase

Chongli-90 commented 3 years ago

Hi Chase!

I am using the script "SNPGenie_sliding_windows.R" to calculate the pi within a window and the script is working perfectly for me! However, I have a question on the way to do the sliding window analyses. I am using the "codon_results" which generated from the "snpgenie.pl" (input is SNP report, not alignment fasta file) to do the sliding window analyses. I want to see the direction of selection on an influenza protein based on five samples from the same treatment group. So is it better to pool the mutations detected from the five samples as a single SNP report for the sliding window analyses, or do the sliding window analyses based on the mutations from each sample and then averaging the dN/dS or dN-dS for each codon? For this circumstance, does the PiN/PiS equal dN/dS?

Thank you so much for your help! Regards,

singing-scientist commented 3 years ago

Good question! Both approaches could be appropriate; it depends entirely on exactly what you want to test.

I suggest first running SNPGenie separately for each sample, because they should be biologically independent and their signal would have more power.

Then, take the mean of N and S diffs and sites for each codons across all samples. Then, each codon in each product will have one mean value for: N_diffs, N_sites, S_diffs, S_sites. This could be done in R (summarize/group_by), Python, or whatever is your favorite method for wrangling data.

Then running the sliding windows on those (mean) results. There will need to be columns for codon, N_diffs, N_sites, S_diffs, and S_sites (and possibly a few others; if you decide to go that route, let me know if you have any further questions.) This will use the mean values of N_diffs, N_sites, S_diffs, and S_sites to calculate piN and piS for each window, and will test the null hypothesis that piN-piS=0.

This approach would have maximum power (because it uses variation from all samples). It would be appropriate if you expect the same codons to be undergoing the same selection pressures in all five samples. I think that is true, because they're all the same "treatment group".

The difference between piN/piS and dN/dS is largely semantic/philosophical. pi refers to variation within a population or species; d refers to divergence between different species. Because species are much more different than members of the same population, d (between-species) is sometimes further 'corrected' for the possibility of multiple mutations having occurred at the same site ('multiple hits'), e.g., using a Jukes-Cantor correction. Because this is all influenza, I'd call it π, and there is almost certainly no need to correct anything (the variation is too low).

Let me know if that helps! Chase

Chongli-90 commented 3 years ago

Thank you Chase! That's definitely the route I am looking for! If I go that way, how do I prepare my "codon_result" file? Can I just replace the values of N_diffs, N_sites, S_diffs, and S_sites columns in any sample's condon_result file with the mean values and use it to run the "SNPGenie_sliding_windows.R" script? Do I need to change the values of the other columns? Thanks!

deborahmleigh commented 3 years ago

Hi Chase,

Thanks for your help. I have got it running locally but have some dependency issues on our cluster. It is a long weekend here, so I will chat with the cluster IT team on Monday. Seems to be working locally though. Can I just confirm that I am right, I think I should only be running it on one ORF or one gene product at a time? Is that correct? I also only see the DnDs values but like Chongli this data is intraspecific, so this is just semantics and it is the same thing? Just to be 100% I am not missing something.

Thanks again for all your help, Debbie

singing-scientist commented 3 years ago

Hello, @Chongli-90 and @deborahmleigh ! Sorry for the delay.

To prepare a codon results file as input for the script, I am fairly certain the following columns (with these exact names) will be sufficient:

codon_num (or 'codon')
N_diffs
S_diffs
N_sites
S_sites
num_defined_seqs

A bit of explanation for 'num_defined_seqs': dN and dS values are typically not considered reliable for fewer than 6 sequences (or fewer than 100, depending on your standards!). However, alignments can sometimes introduce codons that are insertions in very few sequences in an alignment. Thus, this column represents how many sequences were defined (non-gap) in your alignment, as a way of ignoring sites that are unreliable. If you feel comfortable that all your codons and variants are not false-positives and are reliable, you can set this to something like 100 and just proceed.

Regarding number of ORFs or gene products, the BOOTSTRAP script should indeed be performed on only one at a time.

Regarding dN/dS vs. πN/πS, the differences are indeed largely semantic (π is the mean for all within-population comparisons; d for all between-species comparisons). However, between species divergence can be high (say, d>0.1), in which case it becomes important to use a multiple hits correction like Jukes-Cantor.

Let me know! C

Chongli-90 commented 3 years ago

Hi Chase，

Thanks for all your help and explanations! Now I am using the "SNPGenie_sliding_windows.R" for the slide window analysis with the input file generated from the "snpgenie.pl" script. As your instructions, I am using the mean values of N_diffs, S_diffs, N_sites and S_sites across all the samples within the same treatment group as the input parameters in "codon results file". While I found the values in other columns of the input file named "N_diffs_vs_ref" and "S_diffs_vs_ref" could influence the outcome values named "sw_boot_dN_m_dS_P" and "sw_boot_dN_m_dS_SE" (slightly). Therefore, I am not quite sure about the input values I use for the columns "N_diffs_vs_ref" and "S_diffs_vs_ref". Should I also use the average values for these two columns across all the samples? Just the same way I calculate the input values for N_diffs, S_diffs, N_sites, and S_sites. Is that correct? Thank you so much for your help!

Regards,

singing-scientist commented 3 years ago

Hello, @Chongli-90 ! Thanks for the kind words and for using this tool!

I found the values in other columns of the input file named "N_diffs_vs_ref" and "S_diffs_vs_ref" could influence the outcome values named "sw_boot_dN_m_dS_P" and "sw_boot_dN_m_dS_SE" (slightly)

That should definitely not be the case — the way I've coded it, it should look for *_diffs, and it should be impossible to use a value in a column with the name *_diffs_vs_ref. Could you share more details, or ideally an example file so I can Troubleshoot? (Note that it WILL look to see if an *_diffs_vs_ref column is present — but it doesn't use the value.)

Whether you "should" use this value is entirely up to you and the hypothesis you want to test — for your analysis, do you want to calculate πN (πS), or do you want to calculate mean dN (dS) vs. the reference sequence?

Yours, Chase

Chongli-90 commented 3 years ago

Hi Chase!

Thanks for the prompt reply! I attached the input and output files below. It's just for testing so I just change the values in "N_diffs_vs_ref" between two input files (codon_results_1 and codon_results) and you can see there is little difference in "sw_boot_dN_m_dS_SE“ and "sw_boot_dN_m_dS_P" between the output files corresponding output files (codon_results_WINDOWS_dNds and codon_results_1_WINDOWS_dNds).

For my project, I am looking for the results of "sw_dN_m_dS", ”sw_boot_dN_m_dS_SE“ and ”sw_boot_dN_m_dS_P“ for each codon. And as you suggested, I am going to treat the dN or dS as the πN or πS. Thank you!

Regards,

Testing file.zip

Chongli-90 commented 3 years ago

Hi Chase！

I just realized the slight differences in the values of "sw_boot_dN_m_dS_P" and ”sw_boot_dN_m_dS_SE“ on two runs of "SNPGenie_sliding_windows.R" are not due to the input values changed in "*_diffs_vs_ref". It is because of the bootstraps on the two runs and it's almost impossible to get the exact same values on "sw_boot_dN_m_dS_P" and ”sw_boot_dN_m_dS_SE“ on each run due to the bootstraps. Am I right? I just want to confirm that I use this script in the completely right way and thanks so much for your time!

Regards,

singing-scientist commented 3 years ago

Thanks @Chongli-90 ! Now I see — I misunderstood and thought the *_vs_ref values were somehow involved. You're exactly right, the P-values will change almost every time because they depend on the random pseudosamples generated by bootstrapping. So, this is not unexpected and not a problem.

To check this, I ran the following command twice:

SNPGenie_sliding_windows.R codon_results.txt N S 40 1 1000 100 NONE 6 > SNPGenie_sliding_windows_results.tsv

Indeed, the point estimates are identical (e.g., dN=) in both, but all *_boot_* values including the P-values differ slightly due to differences between the random pseudosamples.

Let me know! Chase

Chongli-90 commented 3 years ago

Thanks Chase! That's awesome! Now I can run the tests with full confidence. I really appreciate all your knowledge and help! Thank you!

Regards, Chong

singing-scientist commented 3 years ago

Glad to help! Sorry for the delays in my responses (have been very overwhelmed). Stay well, and feel free to re-open the issue if you run into further problems!

deborahmleigh commented 3 years ago

Hi Chase,

Thanks for all your help. I have finally managed to get this running across my data. But I was wondering if you had written a guide to interpret the output columns? I am not sure which column to pull out to look at the patterns of selection across the sequence.

Thanks, Debbie

singing-scientist commented 3 years ago

No problem! Sorry for the piecemeal process by which I'm adding this feature. Would you mind sharing one sample output file? I can then very quickly add descriptions to the documentation.

Yours, Chase

deborahmleigh commented 3 years ago

HI Chase,

Thanks that would be great. I have attached a header from one of my output files!

Thanks, Debbie example.dNds.txt

deborahmleigh commented 3 years ago

Hi Chase, just following this up. I checked the manual to see if there had been any changes but can't spot them. I was wondering what columns I should focus on when plotting out the patterns of selection across the sequence.

Thanks!

singing-scientist commented 3 years ago

Dear Debbie:

Please accept my sincere apologies for the delay, and thanks for the reminder and your patience — things have been a bit tumultuous in Taiwan.

I just added descriptions of each output column for the bootstrapping script, and the end of the relevant section: https://github.com/chasewnelson/SNPGenie#sliding-windows-bootstrapping

Please let me know if that is sufficient and if there's anything that could be better clarified. Thank you very much!

Yours, Chase

deborahmleigh commented 3 years ago

Thanks Chase! This looks great and really thorough. Thank you for taking the time when things must be very stressful!

I really like this tool :)

Best, Debbie

chasewnelson / SNPGenie

Question about calculating pi within a window #48

Regards,