chasewnelson / SNPGenie

Program for estimating πN/πS, dN/dS, and other diversity measures from next-generation sequencing data
GNU General Public License v3.0
109 stars 37 forks source link

SNPGenie for bacterial samples from the same lineage #43

Closed arredondo23 closed 3 years ago

arredondo23 commented 3 years ago

Hi @cwnelson88 ,

Thanks for the fantastic software and the detailed wiki page. I would like you to ask for advice on my setup, just to see if it makes sense. I am quite a newbie in this field.

I am analysing 138 bacterial genomes (E. coli) from the same lineage (closely-related). I used snippy which uses FreeBayes to compute which SNPs are present in these genomes against a reference genome. My first question is more theoretical: if my final goal is to identify which genes are under positive or purifying selection, should I select a reference genome from the same lineage or from a distinct (ancestral) lineage?

Following the advice given at the issue #35, I am planning to merge the 138 individual vcf files into a single vcf file that represents the entire lineage. Is there any other alternative that I should consider instead?

Thanks in advance! and it is super handy the conda installation :)

Sergio

singing-scientist commented 3 years ago

Greetings, Sergio! Thanks very much for the question and for using SNPGenie.

First, I am a little confused on the nature of your analysis — it seems as if you may have sequenced 138 individual bacterial genomes and want to compare each one to a single reference genome. If that's the case, then snpgenie.pl is overkill, and will not be appropriate, because it assumes that each sample is a POOL of multiple individuals, and that SNPs are called for sample containing multiple genotypes representative of your population. You'd instead want to simply create FASTA files of each genome and use either SNPGenie's within-group or between-group scripts. Other programs can also do standard one-to-one comparisons for dN/dS.

However, if your samples DO represent deep sequencing to detect SNPs among multiple genomes (e.g., each sample is a bunch of E. coli pooled together), and you're interested in the variation WITHIN each sample, then snpgenie.pl might be appropriate. For the reference genome, you could simply use the reference sequence against which sequencing reads were aligned. This is in fact necessary, because the variants in the SNP report (e.g., VCF file) must be numbered according to the positions within your reference genome. In other words, if I understand your question, the reference genome which made the most sense for read mapping also makes the most sense for SNP calling.

Finally, if you're sequencing 138 distinct E. coli isolates, calling one genotype for each isolate, and want to summarize the variation among those isolates, then it might make sense to summarize the variation in a VCF file, depending on your goal.

Please let me know if that helps, and if you have any further questions as a result!

Yours, Chase

arredondo23 commented 3 years ago

Hi @cwnelson88,

Thanks for the quick feedback, much appreciated.

You guessed right! So my samples do not represent a pool of multiple individuals in which I expect multiple genotypes. I will follow the advice on using the SNPGenie's within-group script.

One quick question, since my genomes have a different size should I perform an alignment of orthologous genes and pass that to the script, so for each orthologous gene, should I have a fasta file alignment? Can I pass a core genome alignment instead? In both cases, how should I parse the gtf file?

Once again thanks for all the help provided!

Sergio

singing-scientist commented 3 years ago

Hello @arredondo23! The explanation of the within-group script is here: https://github.com/chasewnelson/SNPGenie#snpgenie-within

In short, the sequences within the fasta file will need to be aligned, correct. If the fasta contains one gene, the GTF file will have one line corresponding to that one gene, with the start and end coordinates corresponding to the start and end positions of the alignment. Alternatively, if the alignment is a full genome (perhaps a virus) or chromosome, the GTF may contain multiple records for many genes. You'll just want to make sure that your alignment method was codon-aware, i.e., gaps in coding regions are multiples of 3 and do not interrupt reading frame.

Let me know if that helps! Chase

arredondo23 commented 3 years ago

Hi @cwnelson88 ,

Thanks for the explanation, it is very clear the way to go now :)

I close the issue, thanks for the quick feedback!

Sergio

jianshu93 commented 6 months ago

Hello All, I am still confused about input for within group. I have about 10 genes on a contig, I have 10 such contigs that all have those 10 genes, so for each contig I need a gff file an concatenate all of them to just one gff file. When align the contig, there might be gaps (-) added for each gene, while when we predict genes from the contig, it does not have the gap in it. so the position in gff file about genes may not be consistent with positions in the aligned contig fasta file (10 contigs aligned). Or it just assume no gaps will be introduced for the alignemnt. Or we can predict gene on aligned contig fasta files (never heard of this).

Thank you,

Jianshu

singing-scientist commented 6 months ago

Greetings Jianshu! You can think of it this: each time you run SNPGenie, only one coordinate system can be used. (You can think of this as a distinct 'chromosome', 'genome', 'genome map', or 'alignment'.) Thus, if you have multiple contigs with different genes (or different locations of the same genes), then you must treat each one like a separate alignment. Simply run SNPGenie separately once for each of them, each time with the appropriate GTF file. Let me know if that helps, or if I've misunderstood!

Chase