AstrobioMike / GToTree

A user-friendly workflow for phylogenomics
GNU General Public License v3.0

Create a nucleotide alignment #77

Closed (Sidduppal closed this issue 1 year ago)

Sidduppal commented 1 year ago

Hey, thanks for creating an easy-to-use phylogenetic tool. I was wondering if there's a way to generate a nucleotide alignment. It seems that even if I input nucleotide gene sequences (FFN files produced by Prokka), GToTree still runs Prodigal to predict the amino-acid sequences, which then leads to a protein alignment.

As a roundabout way of creating a nucleotide alignment, I tried to find the IDs of the protein sequences hit by the HMMs (using FAA files as input) and then extract the corresponding ORFs from my FFN file. However, it seems that GToTree renames each FASTA header with the genome ID.

Is there a way I can create a nucleotide alignment or identify the sequence headers of the proteins identified by HMM (when using FAA files as input)? Thanks
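
For reference, the extraction step of that workaround could be sketched as below, assuming a list of the original FFN sequence IDs were available (which GToTree's header renaming currently prevents). The file names `ids.txt`, `genes.ffn`, and `subset.ffn` are hypothetical, and the demo records are made up purely so the sketch runs on its own.

```shell
#!/bin/sh
set -eu

# Toy inputs (made-up records, hypothetical file names).
printf '>gene_1 hypothetical protein\nATGAAATAA\n>gene_2 ribosomal protein\nATGCCCTAA\n' > genes.ffn
printf 'gene_2\n' > ids.txt

# Pass 1 (ids.txt): remember the wanted IDs.
# Pass 2 (genes.ffn): on each header line, decide whether to keep the record,
# then print every line while inside a kept record.
awk 'NR == FNR { want[$1] = 1; next }
     /^>/      { keep = (substr($1, 2) in want) }
     keep' ids.txt genes.ffn > subset.ffn
```

The ID is taken as the first whitespace-delimited token of each header, which matches how Prokka FFN headers are usually laid out, but that is worth checking against the actual files.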

AstrobioMike commented 1 year ago

Hi there, @Sidduppal :)

Thanks for the kind words and interest.

Yes indeed, GToTree is built from the ground up to get to and work with amino-acid sequences (unfortunately for your goal here). And I don't think there is a straightforward way to have it help out without a lot of code refactoring on GToTree's side first. As you found, I change the headers of everything as soon as possible, to eliminate potential problems with odd characters that could break any given program along the way. The downside of this added robustness is that it's not built to be backtracked to the original headers; the expectation is that if that's of interest, GToTree just isn't the tool for the job. Sorry you've spent time trying to find a potentially helpful path that I don't think really exists here :/

Without more demand for this, I don't think I'll be able to make the time to implement an avenue for doing it either (making nucleotide trees). I could be overlooking something, but I don't think there is typically that much need for a nucleotide phylogenomic tree over an amino-acid one. This is because I think amino acids work better at the levels of resolution typically investigated via phylogenomics (from across domains all the way down to within one species). And once the input genomes are so closely related that an amino-acid tree doesn't provide the resolution needed, I think it makes sense to go to a single-nucleotide variant (SNV)/SNP tree (I've used Parsnp for this before).

But again, maybe I'm overlooking, or just naive to, a use case in this realm. Is your scenario one where an SNV/SNP tree might be helpful/appropriate? Or is there more going on in your case that might help me see more of a motivation to eventually build this functionality in?

sorry again it's not more helpful for your situation!

Sidduppal commented 1 year ago

Hey @AstrobioMike thanks for the informative reply.

Even though I'm dealing with strain-level genomes, they are pretty different in size (sometimes varying by 3x) and genetic composition. A tree based on nucleotide sequences is useful in this case as it gives better resolution (since each codon contributes three alignment positions instead of a single amino-acid position). I tried using Parsnp; however, it didn't work well, as the genomes are too different from each other for it to detect any meaningful SNPs.

I understand that this might not be a "much" needed feature and I won't blame you for not spending too much of your time on it.

Are you aware of any other phylogenetic tools that can create a nucleotide phylogeny? I have used PhyloPhlAn in the past, but I'm looking for some other tools to get a robust tree.

AstrobioMike commented 1 year ago

Heya, @Sidduppal,

Sorry for the delay. Yeah, I recognize the increased resolution of nucleotides vs amino acids; it's just been my experience that going to a SNP/SNV tree works when that level of resolution is needed. But it sounds like there can be cases when we're still in between, as you highlight here :)

I don't know of another. I did, however, start working on implementing a nucleotide option tonight. I hope to have enough time on Sunday to finish it up. It might be too late to be useful for you in this situation, as you might have done your own workaround or found something else already, but it'll be there for the future 👍

Thanks for requesting it :)

I'll let you know when it's updated, or if I hit problems and it would take longer than I expect.

Sidduppal commented 1 year ago

@AstrobioMike that sounds great 🎉 I'm working on different solutions and I might still be around by the time you implement it 😄

AstrobioMike commented 1 year ago

hey there, @Sidduppal

This is implemented as of v1.8.1. You just need to add -z to the call to tell it to run in nucleotide mode (this is in the help menu at the top of "General run settings").

This mode only accepts NCBI accessions and genome nucleotide FASTA files as input. It doesn't take amino-acid inputs (because we can't adequately reverse-translate, of course) and doesn't take GenBank files (because they are a nightmare to parse given all the varied forms they can come in, and I'm not going to worry about that unless it becomes a real need for folks).

I'm not sure when bioconda will be updated, so if you want to try it anytime soon, I'd just install directly from my conda page like so:

mamba create -n gtotree -c astrobiomike -c conda-forge -c bioconda -c defaults "gtotree>=1.8.1"
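
Once that's installed, a nucleotide-mode run might look something like the sketch below. The only part taken from this thread is the `-z` flag; the `genomes` directory, the file and output names, and the choice of the `Bacteria` gene set are assumptions for illustration, so check `GToTree -h` for what fits your data. The script falls back to printing the command if GToTree isn't on the PATH or no genome files are found.

```shell
#!/bin/sh
set -eu

# Hypothetical locations; adjust to your own layout.
genomes_dir="${GENOMES_DIR:-genomes}"
fasta_list="fasta_files.txt"
out_dir="gtotree_nt_out"

# -f expects a file listing one genome nucleotide fasta path per line.
: > "$fasta_list"
if [ -d "$genomes_dir" ]; then
    find "$genomes_dir" -type f \( -name '*.fa' -o -name '*.fna' -o -name '*.fasta' \) | sort > "$fasta_list"
fi

# -H: single-copy gene set (assumed "Bacteria" here); -z: nucleotide mode (v1.8.1+);
# -o: output directory.
set -- GToTree -f "$fasta_list" -H Bacteria -z -o "$out_dir"

if command -v GToTree >/dev/null 2>&1 && [ -s "$fasta_list" ]; then
    "$@"
else
    # Dry run: show the command that would have been executed.
    echo "Would run: $*"
fi
```

NCBI accessions could be supplied instead of (or alongside) FASTA files via `-a` with a file of accessions, since those are the two input types nucleotide mode accepts.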

If you do get around to trying it out, please let me know if you hit any issues :)

thanks!

Sidduppal commented 1 year ago

Hey, @AstrobioMike, that's amazing news 🎉. I'm just overwhelmed with some other stuff at the moment, but I will try it soon. Your fast response and assistance were much appreciated 😄