chasewnelson / SNPGenie

Program for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data
GNU General Public License v3.0

Advice on multi-sample, multi-species, multi-contig analyses for both πN/πS and dN/dS #39

Open andreaswallberg opened 3 years ago

andreaswallberg commented 3 years ago

Dear @cwnelson88 ,

Thanks for implementing these solutions in these scripts! These are hugely time-saving for anyone interested in accomplishing gene-centered population genomics!

I have some questions which I think were partly explained in #34 and #35 but I will take the liberty to ask again, just to be sure.

  1. Does SNPGenie consider every "SAMPLE" in a VCF file to be a separate pool/population for the within-pool analyses?

Let's say I work with discrete libraries from individuals sampled from one population (one library, one read-group sample, and one column per individual following the "FORMAT" column on every line with a variant) and produce a VCF with a tool like Freebayes (VCFv4.2). The intention is to compute πN/πS across all these individuals, i.e., to treat them as a single population.

If run with --vcfformat=4, will these samples be treated as a single population or as separate populations?

Freebayes also outputs the global DP and AF tags in the INFO column. Will SNPGenie use these instead of the "SAMPLE" columns? Can I specify --vcfformat=2 to force the use of these tags and effectively parse the multi-sample VCF as a single-population VCF?

  2. When looking across species to compute dN/dS, what is a practical way to accomplish these computations using the same gene models as for a within-population scan?

I suspect one answer to this is to map data from other species to the same reference, and generate one VCF per species. Depending on genetic distance, this will contain quite a few gene regions with variants fixed for a non-reference alternate allele. Using the script from #34, species-specific sequences can then be exported and easily concatenated. This mapping approach side-steps the issue of orthology-detection using external tools.

Does this sound reasonable to you?

For dN/dS, is there any correction for multiple substitutions when estimating dN and dS?

Corrected dN and dS could possibly be computed separately using PAML. Would it be possible for you to implement a way of optionally side-loading those stats from a simple table, such that they can replace or complement the internal computation of dN/dS and be incorporated in the standard summary statistics computed by the program?

  3. Let's say that I work with a highly fragmented reference sequence, spanning many thousands of contigs. Are there any special considerations that need to be taken into account? For instance:

Must all included contigs have a gene model? Will the reverse-complement script handle multiple accessions as is, or will I need to use alternative solutions?

Cheers!

singing-scientist commented 3 years ago

Thanks a lot for your interest in SNPGenie, @andreaswallberg! Regarding the questions:

  1. Yes, each VCF file is treated as its own pooled sample from a population. The only exception to this is when using SNPGenie's --vcfformat=4, where each additional SAMPLE column is treated as its own (pooled) sample. This is the only format which supports multiple samples in the same VCF file.

I think this speaks to your question, as it sounds like your "sample" columns each represent one individual. Because SNPGenie expects each to represent a pool, this approach will not work. It will instead be necessary to bioinformatically "pool" the individuals you've sequenced, e.g., summarize them in a single VCF file for use with whichever SNPGenie VCF format is most convenient.

Ah! Looks like I got ahead of myself. Indeed, if Freebayes outputs the bioinformatically pooled "summary" data in this way, using --vcfformat=2 should ignore the "sample" columns altogether and just use the DP and AF tags, so you should be good to go!
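
For illustration, here is a minimal sketch (plain Perl, not part of SNPGenie) of what such bioinformatic pooling could look like for simple biallelic sites with diploid GT fields. It counts REF/ALT alleles across the per-individual genotype columns and rewrites each record with pooled DP and AF tags in the INFO column; the field handling is deliberately simplified and the exact tags SNPGenie expects should be checked against the README.

```perl
#!/usr/bin/env perl
# pool_vcf_samples.pl -- illustrative sketch only, not part of SNPGenie.
# Collapses the per-individual GT fields of a multi-sample VCF into
# population-level DP and AF tags in the INFO column, so the file could be
# parsed as a single pooled sample (e.g., with --vcfformat=2).
# Assumes biallelic sites and diploid genotypes written as 0/0, 0/1, 1/1, etc.
use strict;
use warnings;

while (my $line = <>) {
    chomp $line;

    # Keep header lines unchanged
    if ($line =~ /^#/) {
        print "$line\n";
        next;
    }

    my @fields = split /\t/, $line;
    my ($chrom, $pos, $id, $ref, $alt, $qual, $filter, $info, $format, @samples) = @fields;

    # Locate the GT subfield within the FORMAT column
    my @format_keys = split /:/, $format;
    my ($gt_index) = grep { $format_keys[$_] eq 'GT' } 0 .. $#format_keys;
    next unless defined $gt_index;

    my ($ref_count, $alt_count) = (0, 0);
    for my $sample (@samples) {
        my @subfields = split /:/, $sample;
        my $gt = $subfields[$gt_index] // './.';
        for my $allele (split /[\/|]/, $gt) {
            next if $allele eq '.';
            $allele == 0 ? $ref_count++ : $alt_count++;
        }
    }

    # NOTE: DP here is the number of called chromosomes, used as a stand-in
    # for pool depth; substitute true read depth if that is preferred.
    my $dp = $ref_count + $alt_count;
    next if $dp == 0;
    my $af = sprintf("%.6f", $alt_count / $dp);

    # Replace the INFO column with pooled DP and AF; drop the genotype columns
    print join("\t", $chrom, $pos, $id, $ref, $alt, $qual, $filter, "DP=$dp;AF=$af"), "\n";
}
```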

  2. To me, this is a surprisingly complicated issue. First of all, SNPGenie computes raw "mean p" distances, as is appropriate for π (i.e., low levels of diversity). Depending on how distant the other species are, this may or may not be appropriate, i.e., a multiple-hits correction (Jukes-Cantor, Kimura 2-parameter, or more complex) may be required. If the two species have a raw p distance of more than about 0.1 (10% divergence), then I recommend just using Jukes-Cantor. If divergence is much higher than that, the between-species comparison might instead benefit from other approaches, such as Bayesian tree/site-based models in HyPhy. If you're already generating "pseudo-alignments" of your samples using #34 anyway, this should not be too difficult.
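
To make the correction concrete, the Jukes-Cantor adjustment is d = -(3/4) ln(1 - (4/3) p), where p is the raw (uncorrected) proportion of differences per site. A small sketch of applying it separately to hypothetical raw nonsynonymous and synonymous distances (the values below are made up for illustration):

```perl
use strict;
use warnings;

# Jukes-Cantor correction for multiple hits: d = -(3/4) * ln(1 - (4/3) * p),
# where p is the raw (uncorrected) per-site p-distance.
sub jukes_cantor {
    my ($p) = @_;
    die "p-distance must be < 0.75 for Jukes-Cantor\n" if $p >= 0.75;
    return -0.75 * log(1 - (4 / 3) * $p);
}

# Hypothetical raw per-site distances; replace with values from your own output
my $pN = 0.02;
my $pS = 0.12;
my $dN = jukes_cantor($pN);
my $dS = jukes_cantor($pS);
printf "dN = %.4f, dS = %.4f, dN/dS = %.4f\n", $dN, $dS, $dN / $dS;
```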

I now see your question about PAML, etc. — that is fine, too. What you suggest would be ideal, and if I'd been a better coder when writing SNPGenie I surely would have used a modular approach that would allow another dN/dS engine to be "plugged in", as it were. Unfortunately, SNPGenie is anything but modular and such a change would require a major rewrite that I (sadly) currently don't have time for. I do hope to take a crack at this whole thing again in coming years, using Python this time, and will make modularization a priority!

About the question of orthology, I think that your approach of mapping reads from different species to the same reference is a clever one. Depending on how distant the species are, you might run into problems with indels and rearrangements, etc. In my ideal world, it would be best to first align the references of all species together using a pipeline like Enredo/Pecan/Ortheus to determine which sites are homologous; exclude any sites with multiple hits in the alignment (i.e., exclude any sites that are even possibly duplications/paralogs); map reads separately to each reference; and finally convert the coordinates from each reference to one of the species' references using the multi-species alignment. This is of course too perfectionistic and terribly tedious; it's probably fine to use your approach so long as you somehow try to correct for possible biases, e.g., read-mapping bias that could occur due to regions of divergence or rearrangements.

  3. Each FASTA reference (whether it's a chromosome or a small contig) must have its own VCF and its own GTF file for use with SNPGenie, and the coordinates of the three should correspond. That is, one SNPGenie run involves a single FASTA/GTF/VCF trio. SNPGenie isn't smart and can't figure out what you mean; it simply believes that site 1 of the FASTA is site 1 in the GTF and VCF files. However, it should throw an error if anything is out of bounds.

Could you explain what you mean by "accessions" when you ask whether the revcom script will "handle multiple accessions as is"?

Hope at least some of this is helpful!

andreaswallberg commented 3 years ago

Thanks for the feedback!

I fully appreciate the complexity and time needed to add new functions to old but working code :-)

With "handle multiple accessions as is", I mean if the script would work on files that include many separate contigs, but I think you answered my question. Basically, I need one set of trio files per contig (or scaffold). My current draft assembly spans thousands of contigs/scaffolds in the FASTA file, as do the variation data in the VCF file. So I better split these into separate files when running the program?

singing-scientist commented 3 years ago

Ah, understood now! Yes, unfortunately, it will be necessary to put each contig in its own FASTA file, and the SNPs and genes within each contig in their own corresponding VCF and GTF files, respectively. The positions of the SNPs (in the VCF) and the CDS entries (in the GTF) should be given with respect to that contig's own (1-based) coordinates. Let me know if that makes sense!
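
In case it helps, a rough sketch (untested plain Perl, not part of SNPGenie) of the kind of per-contig splitting described above; it writes one FASTA, VCF, and GTF per contig, keyed on the sequence name in column 1 of the VCF and GTF. Each resulting trio could then be fed to its own SNPGenie run using the invocation described in the README.

```perl
#!/usr/bin/env perl
# split_by_contig.pl -- illustrative sketch only, not part of SNPGenie.
# Splits a multi-contig FASTA, VCF, and GTF into one file of each type per
# contig, so that each contig can be analyzed as its own FASTA/GTF/VCF trio.
# Usage: perl split_by_contig.pl assembly.fasta variants.vcf genes.gtf
# Caveat: with very many contigs, the number of simultaneously open output
# handles in split_tabular() may hit system limits.
use strict;
use warnings;

my ($fasta_file, $vcf_file, $gtf_file) = @ARGV;
die "Usage: $0 <fasta> <vcf> <gtf>\n" unless $gtf_file;

# --- FASTA: one file per record, named after the sequence ID ---
open my $fasta_fh, '<', $fasta_file or die "Cannot open $fasta_file: $!";
my $out_fh;
while (my $line = <$fasta_fh>) {
    if ($line =~ /^>(\S+)/) {
        my $contig = $1;
        close $out_fh if $out_fh;
        open $out_fh, '>', "$contig.fasta" or die "Cannot write $contig.fasta: $!";
    }
    print {$out_fh} $line if $out_fh;
}
close $out_fh if $out_fh;
close $fasta_fh;

# --- VCF and GTF: dispatch each record on its first column ---
split_tabular($vcf_file, 'vcf');
split_tabular($gtf_file, 'gtf');

sub split_tabular {
    my ($file, $suffix) = @_;
    my %handles;
    my @header;

    open my $in_fh, '<', $file or die "Cannot open $file: $!";
    while (my $line = <$in_fh>) {
        # Collect header/comment lines and repeat them in every output file
        if ($line =~ /^#/) {
            push @header, $line;
            next;
        }
        my ($contig) = split /\t/, $line;
        if (!exists $handles{$contig}) {
            open my $fh, '>', "$contig.$suffix" or die "Cannot write $contig.$suffix: $!";
            print {$fh} @header;
            $handles{$contig} = $fh;
        }
        print { $handles{$contig} } $line;
    }
    close $in_fh;
    close $_ for values %handles;
}
```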

andreaswallberg commented 3 years ago

Makes a lot of sense!

I set up a pipeline for dN/dS in PAML using high-quality pairwise alignments a few years ago. Perl is my main language, so perhaps I can dust off what I did back then, and see if it can be put to use in SNPGenie. I will let you know if I start playing around with your code.