biobakery / phylophlan

Precise phylogenetic analysis of microbial isolates and genomes from metagenomes
https://huttenhower.sph.harvard.edu/phylophlan
MIT License

Discrepancy in the generated tree of life? #68

Open nlmjacquemin opened 3 years ago

nlmjacquemin commented 3 years ago

Hello,

I tried to rebuild a tree of life that includes my own metagenomes (~488) using the approach you provide in the examples (log file: phylophlan_tol.log, config file: config_file_and_command_line.txt). However, in the final tree the Firmicutes are split into two branches, and there are other discrepancies. As I am fairly new to phylogenetic analysis, I don't see why this would happen. The only things I noticed, without knowing whether they are relevant, are that many of my metagenomes fall within the Firmicutes and that I used the updated database. I was expecting results similar to the 2020 publication, so I was wondering whether you have any idea what might have happened. I would be very grateful for any suggestions.

Thanks for the amazing work, Nicolas

My tree: tree_cir_complete_p_2_reroot_2 Expected tree: image

fasnicar commented 3 years ago

Dear Nicolas, thanks for reporting this.

From the visualization of your tree, the branches close to the root look a bit "weird". So my question is: how did you visualize the tree? Is it rooted between Archaea and Bacteria?

In the meantime, I'm looking at your files and I'll follow up here if I spot something there.

Many thanks, Francesco

nlmjacquemin commented 3 years ago

Dear Francesco,

Thank you for looking into this. I used the R package ggtree and rerooted the tree between Archaea and Bacteria (which causes the "weird" look of the branches).

Here are the raw tree files (input_genomes.nwk.txt, input_genomes.tre.iqtree.txt, input_genomes.tre.treefile.txt) and the file I used for the annotation (taxa2genomes_cpa201901_up201901.txt).

Many thanks, Nicolas

fasnicar commented 3 years ago

Dear Nicolas,

Thank you for sending over the raw tree file. I re-rooted it between Archaea and Bacteria using Archaeopteryx. Then I took the GraPhlAn annotations file I made for the tree of life in the PhyloPhlAn paper to visualize your tree.

input_genomes reroot annot

Please ignore the bin.### in the right legend as I forgot to remove them.

The phylogeny is still a bit different from the one in the paper, but it is very similar to the one you posted (apart from the counter-clockwise arrangement and the "weird" branches close to the root). One thing I can say is that for the phylogeny in the PhyloPhlAn paper I used the representative genomes of the ~17k species-level genome bins (SGBs), which can often be MAGs rather than reference genomes. Also, the phylum annotations I used come from the phylum-level taxonomic label assigned to each SGB (which should be consistent for SGBs that contain reference genomes but might differ for SGBs composed of MAGs only).

I'm not sure this explains the differences between the two trees, and I'll be happy to keep investigating. One thing to check is whether there are major differences in the software versions (the config file you sent looks consistent with what I used) and in the MSA (for instance, the number of final positions used for each genome).

Many thanks, Francesco

nlmjacquemin commented 3 years ago

Dear Francesco,

Did I not also use the representative genomes from the ~17k species-level genome bins (SGBs) with the pipeline I sent you? I think I have roughly those 17k species. Could you provide these phylum annotations so that I can make a fair comparison?

Also, given that we are at the phylum level, I have trouble explaining those drastic changes.

So you suggest checking the differences between PhyloPhlAn version 3.0.60 (27 November 2020) and the version you used? To check the number of final positions for each genome, should I compare this file, input_genomes.tre.uniqueseq.phy.txt, with yours? Also, in case it contains more information, here is this file: input_genomes.tre.log.

Many thanks, Nicolas

fasnicar commented 3 years ago

Dear Nicolas,

Thank you for your reply and for sending over the files to check the differences. I'll try to answer your points below.

> Did I not also use the representative genomes from the ~17k species-level genome bins (SGBs) with the pipeline I sent you? I think I have roughly those 17k species.

The representative genomes you retrieved could be a bit different from those I used in the paper, because the approach described in the tutorial retrieves one representative genome for each species-level taxon in NCBI. The representative genomes I used in the paper instead come from the dereplication of several hundred thousand genomes and MAGs organized into species-level clusters (SGBs), as described here.

> Could you provide these phylum annotations so that I can make a fair comparison?

Yes, you can find the annotations in the SGB table files listed in the phylophlan_metagenomic.txt file. If you open it with a text editor, you'll see that there are three entries for each database release, for instance:

https://zenodo.org/record/4005775/files/SGB.Jan19.md5?download=1    SGB.Jan19.md5
https://zenodo.org/record/4005775/files/SGB.Jan19.tar?download=1    SGB.Jan19.tar
https://zenodo.org/record/4005775/files/SGB.Jan19.txt.bz2?download=1    SGB.Jan19.txt.bz2

Here, SGB.Jan19.txt.bz2 is the SGB table, which describes the SGBs and their assigned taxonomy. In your case, since you didn't use one representative for each SGB, I think you can search for your genomes in the SGB table and retrieve their assigned taxonomic labels (from which you can keep only the phylum-level label).
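A minimal sketch of that lookup in Python, under assumptions that are not confirmed in this thread: the table is tab-separated with a header on its first line, one column lists the genomes of each SGB separated by commas, and the taxonomy is a string of the form k__...|p__...|c__.... The column names are passed in as parameters because the actual ones need to be checked in the decompressed file.

```python
import bz2
import csv


def phylum_of(taxonomy):
    """Return the phylum from a 'k__...|p__...|...' string, or None if absent."""
    for level in taxonomy.split("|"):
        if level.startswith("p__"):
            return level[3:]
    return None


def read_table(path):
    """Yield rows of a bz2-compressed, tab-separated table as dicts.

    Assumes the first line of the file is the header (hypothetical layout).
    """
    with bz2.open(path, "rt", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


def phyla_for_genomes(rows, genome_ids, genomes_col, taxonomy_col, sep=","):
    """Map each requested genome ID to the phylum of the row that lists it.

    genomes_col and taxonomy_col are placeholders for the real column names.
    Genomes that are not found in any row are simply left out of the result.
    """
    wanted = set(genome_ids)
    found = {}
    for row in rows:
        members = {g.strip() for g in row[genomes_col].split(sep)}
        for genome in wanted & members:
            found[genome] = phylum_of(row[taxonomy_col])
    return found
```

With the real column names filled in, a call would look like phyla_for_genomes(read_table("SGB.Jan19.txt.bz2"), my_ids, genomes_col, taxonomy_col), where my_ids is the list of genome identifiers used in the tree.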

> Also, given that we are at the phylum level, I have trouble explaining those drastic changes.

I think the changes are not that drastic. What I mean is that the placement of the "root" of the phyla "moved", but I think the leaves are fairly coherent. Of course, that is not enough of an explanation (at least to me), but there is no single right answer (see "Bacterial phyla" on Wikipedia and the several different branching orders: Woese (1987), Gupta (2001), Cavalier-Smith (2002), Rappé and Giovannoni (2003), Battistuzzi et al. (2004), and Ciccarelli et al. (2006)).

> So you suggest checking the differences between PhyloPhlAn version 3.0.60 (27 November 2020) and the version you used? To check the number of final positions for each genome, should I compare this file, input_genomes.tre.uniqueseq.phy.txt, with yours? Also, in case it contains more information, here is this file: input_genomes.tre.log.

It is true that the version I used in the paper was 3.0.43, so there have been edits and fixes since then, but the main phylogenetic pipeline did not change substantially. I checked the two files you sent me, and I think there is indeed something that doesn't match the MSA I had for the tree of life I built in the paper. It seems that you have only 471 AA positions in your alignment for each genome, while I think you should have about 4.5k AA positions (so about 9 to 10 times more!).
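For anyone who wants to repeat this check, here is a rough Python sketch that reads the alignment length from a PHYLIP file and counts the non-gap positions per genome. It assumes a sequential PHYLIP layout (a header line with the number of taxa and sites, then one "name sequence" line per genome), which may not match every file; the function name is only illustrative.

```python
def alignment_lengths(phylip_text):
    """Return (n_sites, {name: non_gap_count}) for sequential PHYLIP text.

    Assumes one line per sequence: a name, whitespace, then the residues.
    The characters '-' and '?' are treated as gaps or missing data.
    """
    lines = [ln.strip() for ln in phylip_text.splitlines() if ln.strip()]
    n_taxa, n_sites = (int(x) for x in lines[0].split()[:2])
    counts = {}
    for line in lines[1:1 + n_taxa]:
        name, seq = line.split(None, 1)
        seq = seq.replace(" ", "")
        counts[name] = sum(1 for c in seq if c not in "-?")
    return n_sites, counts
```

The first returned value is the total number of alignment columns (471 in the case above), and the per-genome counts show how many of those columns are actually filled for each input.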

Can you retrieve the latest PhyloPhlAn version (3.0.64) from the GitHub repository and re-run your tree? Note: you don't need to re-run it from scratch. You can go to <output_folder>/tmp and remove these: rm -r markers* msas trim* *.pkl sub. That way you keep all the mappings of the database against your inputs, and when you re-launch PhyloPhlAn it will re-extract the markers for all genomes, build the MSA, and then build the phylogeny. Please let me know if you are unsure about this cleanup and I'll be happy to help.
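If it helps to double-check before deleting anything, the following Python sketch only lists what the rm -r patterns above would match inside the tmp folder; it removes nothing. The function name is hypothetical, and the patterns are copied from the command above.

```python
import glob
import os

# Same patterns as in: rm -r markers* msas trim* *.pkl sub
PATTERNS = ("markers*", "msas", "trim*", "*.pkl", "sub")


def cleanup_targets(tmp_dir):
    """Return the sorted paths in tmp_dir that match the cleanup patterns."""
    found = set()
    for pattern in PATTERNS:
        # glob.escape protects special characters in the folder name itself
        found.update(glob.glob(os.path.join(glob.escape(tmp_dir), pattern)))
    return sorted(found)
```

Printing the result of cleanup_targets on <output_folder>/tmp gives a preview of the files and folders that would be removed, so anything unexpected can be spotted before running the actual rm command.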

Many thanks, Francesco