biobakery / phylophlan

Precise phylogenetic analysis of microbial isolates and genomes from metagenomes
https://huttenhower.sph.harvard.edu/phylophlan
MIT License

New feature of Phylophlan3 #14

Closed by jliu-nj 4 years ago

jliu-nj commented 4 years ago

Hi, thanks for your help with installing PhyloPhlAn 3.

We are able to run the pipeline. I have some more questions about a new feature of PhyloPhlAn 3.

One feature of PhyloPhlAn 3 is integrating new genomes into the tree of life, and we want to integrate a set of genomes this way. The tree-of-life tutorial contains 17,000 genomes, which is too big for us; the 3,171 genomes seem a good size. How can this be done, and where can we download these 3,171 genomes?

https://huttenhower.sph.harvard.edu/phylophlan

FEATURES

Completely automatic, as the user needs only to provide the (unannotated) protein sequences of the input genomes (as multifasta files of peptides – not nucleotides)

The possibility of integrating new genomes in the already reconstructed most comprehensive tree of life (3,171 microbial genomes)

Thank you! Jie

fasnicar commented 4 years ago

Dear Jie,

I'm sorry, but I think the 3,171 genomes you are referring to are the ones selected for the first PhyloPhlAn implementation. Sorry for the confusion.

In PhyloPhlAn 3.0 there is no way to integrate a small set of genomes into an already computed phylogeny. We saw that such an integration is heavily biased by the step that merges the MSAs (multiple sequence alignments) and, in general, does not work well: the biases it introduces result in a wrong phylogeny structure.

So, with PhyloPhlAn 3.0 you cannot integrate a set of genomes, but you can rebuild a tree-of-life phylogeny in roughly 10 days.

The way to do this is to follow the Prokaryotes Tree of life reconstruction tutorial.

If computation time is an issue, you can apply more aggressive trimming by adding the following parameters to the phylophlan command line:

--subsample tenpercent --not_variant_threshold 0.75 --gap_perc_threshold 0.67

I also suggest setting the number of cores (--nproc parameter) as high as your machine allows, so that more of the analysis is done in parallel.
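
A minimal sketch of how the options above could be appended to an existing command, assuming Python 3.8+ is available. The helper name `with_fast_options` and the bare `phylophlan` base command are illustrative only; the input, output, and configuration arguments from the tutorial still have to be supplied, and only the flags quoted in this reply are used.

```python
import os
import shlex

# Trimming flags exactly as quoted in the reply above.
TRIMMING_FLAGS = [
    "--subsample", "tenpercent",
    "--not_variant_threshold", "0.75",
    "--gap_perc_threshold", "0.67",
]


def with_fast_options(base_command, nproc=None):
    """Return a copy of base_command with the trimming flags and --nproc appended.

    with_fast_options is a hypothetical helper, not part of PhyloPhlAn.
    If nproc is None, all cores reported by the operating system are used.
    """
    if nproc is None:
        nproc = os.cpu_count() or 1  # cpu_count() may return None
    return list(base_command) + TRIMMING_FLAGS + ["--nproc", str(nproc)]


if __name__ == "__main__":
    # Assumption: extend this list with the arguments from the tutorial you follow.
    base = ["phylophlan"]
    # Print the command for review instead of launching a multi-day run.
    print(shlex.join(with_fast_options(base)))
```

Printing the command first makes it easy to check the flags before starting a run that may take days.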

I hope this helps you.

Many thanks, Francesco

jliu-nj commented 4 years ago

Thanks for your prompt explanation! I appreciate all the time you spent helping us with PhyloPhlAn 3 installation and usage.

Jie