chasewnelson / SNPGenie

Program for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data
GNU General Public License v3.0
100 stars 37 forks source link

How to join the output for a whole genome analysis #64

Open pabloangulo7 opened 1 year ago

pabloangulo7 commented 1 year ago

Hello,

I am using SNPGenie for a Genome-wide SNPs dataset of 70 individuals of a haploid organism consisting of 14 chromosomes, using as input a VCFfile. For the requirements of the tool, I have split the fasta, gtf and vcf files into 14 parts, one per chromosome and applied the command. Also, this process has to be repeated again for the minus strand (-) genes. My question is whether the output for the minus strand can be joined directly to the output of the plus strand and finally, once the whole process is completed, how can all the data be put together to calculate the dn/ds ratio of the whole genome in the population.

Best regards and thanks in advance

Pablo

singing-scientist commented 1 year ago

Greetings, Pablo! Thanks very much for the question. That all sounds correct; the requirement to run twice (once for each strand) is a bit of a hack, unfortunately necessary because SNPGenie was originally written for one strand only. The genes of each strand can simply be combined as you say -- but be aware that, if there are any overlapping antisense genes (i.e., genes in reverse complement reading frames that overlap the same genome positions), then some genome positions/regions will be represented more than once. Also, some of the individual SITES results will be redundant. Thus, the answer to how the data should be combined depends in part on how you want to treat overlapping genes from separate strands (if they exist). Certainly for non-protein-coding sites, you'll only want to include them once, and results will be redundant between the strands.

For the whole (coding) genome, the question is whether you want the mean value for all sites (codon is the unit) or of all genes (gene is the unit). For the first, you could simply sum numerators and denominators, i.e., mean_dN = sum(N_diffs)/sum(N_sites) and mean_dS = sum(S_diffs)/sum(S_sites). This would reflect the "typical protein-coding site". Contrarily, if you're looking for an average of all genes, you could simply take the means of genes, e.g., mean_dN = mean(vector of all genes' dN). This would reflect the "typical gene". Let me know if that makes sense!

Yours, Chase

pabloangulo7 commented 1 year ago

Thank you very much, sorry for the late reply but my data is very large and it took me a long time to finish the analysis. I already have the results for each of the chromosomes and for the minus and plus strand. Then following the instructions you gave me in the previous answer, I was about to calculate the dN/dS ratio for each chromosome and for the whole genome, but to make sure I do it correctly, to do the calculation, for example, for all the sites, I would have to use the file called "codon_results.txt" and use the columns of N_diffs, S_diffs, N_Sites and S_Sites, right? And to do the calculation by genes, would it be the same but with the file "product_results.txt"?

Thanks in advance

singing-scientist commented 1 year ago

Hello! Both codon_results and product_results should yield the same results, i.e., product_results are just the sum of codon_results, for each gene. The question is whether you want to add all the N_diffs, S_diffs, N_Sites and S_Sites together, then calculate dN/dS (mean of all coding sites); or first calculate dN/dS for each product, and take the mean of those values (mean of all genes). Let me know if that helps!

C

pabloangulo7 commented 1 year ago

Thank you very much, I think I understand it better now. On the other hand, I was reviewing the results and there is one thing that doesn't add up. In one of the "site_results.txt" files, one of the SNPs is missing, which for some reason the program has filtered out, but I don't quite know why. In the temporary file of the sample (generated by SNPGenie) "temp_snp_report_VY02.txt" this variant does appear. So, there would be a total of 37 SNPs although only 36 SNPs are analyzed in the results. I attach the files, and specifically it is the variant at site 1580099 (line 27 in the file). Let me know if you need more info.

Thanks in advance temp_snp_report_VY02.txt site_results.txt

singing-scientist commented 1 year ago

Greetings, Pablo! Indeed I see what you mean and confirm the SNP at 1580099 is present in temp but not the sites file. It is not really possible for me to troubleshoot this without the input; if you'd like to supply the 3 files and command that give rise to this result I'd be happy to check it out and determine what's happening.

Regarding dN or dS equal to 0, ideally one would perform a statistical test to determine if dN=dS can be rejected with significance. Normally this is done with bootstrapping, but when dS=0 there's no point. Thus as a rule of thumb, I generally consider that if dS (which reflects mutation rate) is 0, there's not enough variation to have statistical power to detect a difference between dN and dS. I'd report the ratio as undefined (certainly don't infer dN>dS or positive selection usually). If dN is 0 that's a different story (again because dS reflects mutation rate) and one can infer strong purifying selection given a proper statistical test. Let me know if that helps!