davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics
https://davidemms.github.io/
GNU General Public License v3.0

Details about how species tree inferred #354

Closed MinjieHu closed 4 years ago

MinjieHu commented 4 years ago

Hi, @davidemms

Thanks for this awesome software. I have used it to infer a species tree containing our newly sequenced species and submitted the paper recently. One reviewer came back asking for some details about how the species tree was inferred. Could you give me a hand clarifying it, as I do not quite understand how OrthoFinder works?

Here are the questions:

Did you use a subset of those Orthogroups that were present in nearly all or all species included?
What criteria did you use to select orthogroups for species tree inference?
Were the selected orthogroups single-copy in all species?
Please provide the final alignment matrix after MAFFT alignment as a supplemental file.
State the data occupancy in the matrix. 
Also state whether the alignment was trimmed in any way.

I guess the SpeciesTreeAlignment.fa file is the MAFFT alignment file. But I can't find exact answers to the other questions, especially for "data occupancy" and "alignment trimming".

Here's the command I used to generate the species tree:

orthofinder -S diamond -t 22 -M msa -f new_species

Attached is the run log file: orthofinder_20191009.log

Thanks so much for all the help!

Minjie

davidemms commented 4 years ago

Hi Minjie

I think you should be able to find most of the info in the Species Tree section of the latest OrthoFinder paper. There are also answers to some common questions about species tree inference in the github issues. Have a look through those and let me know if any questions are still unanswered and I will get back to you next week.

All the best David

MinjieHu commented 4 years ago

Hi David,

Thanks for the response. I found the STAG paper very helpful for the details of the orthogroup selection. #153 also helped a lot.

However, as I am not familiar with phylogenetic algorithms, I still have no idea what "final alignment matrix" and "data occupancy in the matrix" mean. Can I find this information in the results output folder? Could you give me some clues or point me to where I can find it?

As you mentioned in #153, for the MSA method you apply a trimming step to the alignment. What method was used for trimming? Where can I find more details about the MSA method?

Thanks so much!

Minjie

davidemms commented 4 years ago

Hi Minjie

I think the reviewer of your paper is asking what percentage of the data in your multiple sequence alignment is amino acid characters and what percentage is gap characters. OrthoFinder doesn't calculate it, but it should be easy to count it in the file itself. I'll put some information up about the MSA method in the github documentation. If f is the minimum fraction of species for your single-copy orthogroups (OrthoFinder prints this fraction in its terminal output when you run it) then it allows 50% of these to have gaps, i.e. it trims any column in which more than a fraction (1-0.5*f) of the characters are gaps.

All the best David
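
Note: the counting described above (share of amino acid characters versus gap characters in the alignment) can be sketched in a few lines of Python. This is only a minimal illustration, not part of OrthoFinder: the function names `read_fasta` and `data_occupancy` are made up for this example, and it assumes the alignment is a FASTA file in which gaps are written as "-".

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA lines into {header: sequence}, joining wrapped sequence lines."""
    chunks: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {header: "".join(parts) for header, parts in chunks.items()}


def data_occupancy(seqs: Dict[str, str], gap_chars: str = "-") -> float:
    """Return the fraction of alignment cells that are not gap characters."""
    total = sum(len(s) for s in seqs.values())
    if total == 0:
        raise ValueError("alignment is empty")
    gaps = sum(s.count(c) for s in seqs.values() for c in gap_chars)
    return (total - gaps) / total


# Tiny in-memory alignment: 8 cells, 3 of them gaps.
demo = read_fasta([">sp1", "MK-L", ">sp2", "M--L"])
print(f"occupancy = {data_occupancy(demo):.1%}")
```

To use it on a real run, open the alignment file (for example the SpeciesTreeAlignment.fa mentioned earlier in the thread, wherever it sits in your results folder) and pass the file handle to `read_fasta`.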

davidemms commented 4 years ago

Hi Minjie

I've written a summary of the species tree inference methods on the README page: https://github.com/davidemms/OrthoFinder#species-tree-inference

All the best David

MinjieHu commented 4 years ago

Hi David,

Thanks so so so so much! It's much clearer to me now. Below is my response; I hope I haven't made any obvious mistakes.

Briefly, orthofinder -S diamond -t 22 -M msa -f fasta_files was used to generate the result. With this command, Diamond (v0.9.21) was used for sequence search and OrthoFinder grouped 308348 genes (83.8% of total) into 19244 orthogroups. Following a previously reported method [50], 1601 orthogroups with single-copy genes in at least 10 species were used to infer the species tree. These orthogroups were subjected to multiple sequence alignment by MAFFT (v7.407), and columns with more than 8 gaps were trimmed. The trimmed alignment, with 73.6% data occupancy (see source data for fig1d), was used to infer the unrooted maximum likelihood species tree with FastTree (v2.1.10) using the default configuration in OrthoFinder. This species tree was then rooted with the STRIDE algorithm, which has been demonstrated to correctly root species trees spanning a wide range of time scales and taxonomic groups [51].

Minjie
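
Note: the trimming rule quoted earlier in the thread (drop a column when its gap fraction exceeds 1-0.5*f) can be checked against the "more than 8 gaps" figure with a small Python sketch. This is an illustration of the rule as stated in the thread, not OrthoFinder's own code. The thread does not give the number of species in the run; the value 13 below is an assumption, chosen because it makes the 10-species minimum and the 8-gap threshold agree. The names `max_gap_fraction` and `trim_columns` are hypothetical.

```python
from fractions import Fraction
from typing import List


def max_gap_fraction(f: Fraction) -> Fraction:
    """Largest gap fraction a column may have before it is trimmed: 1 - 0.5*f."""
    return 1 - Fraction(1, 2) * f


def trim_columns(seqs: List[str], f: Fraction, gap: str = "-") -> List[str]:
    """Drop every column whose gap fraction exceeds max_gap_fraction(f)."""
    if not seqs:
        return []
    if any(len(s) != len(seqs[0]) for s in seqs):
        raise ValueError("all aligned sequences must have the same length")
    n = len(seqs)
    limit = max_gap_fraction(f)
    keep = [
        i for i in range(len(seqs[0]))
        if Fraction(sum(s[i] == gap for s in seqs), n) <= limit
    ]
    return ["".join(s[i] for i in keep) for s in seqs]


n_species = 13        # ASSUMPTION: species count is not stated in the thread
min_single_copy = 10  # from the thread: single-copy in at least 10 species
f = Fraction(min_single_copy, n_species)

# Columns with more gaps than this are trimmed.
max_gaps_per_column = max_gap_fraction(f) * n_species
print(max_gaps_per_column)
```

Exact fractions are used instead of floats so that a column sitting exactly on the threshold is not trimmed by rounding error.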

davidemms commented 4 years ago

Hi Minjie

Sorry for the slow response. Yes that all sounds correct.

I can imagine a reviewer asking you to use something like IQTREE or RAxML to infer the species tree; that is easy to do now that you have the alignment. If you need to run OrthoFinder again with the new species tree, you can do that with the "-ft PREVIOUS_RESULTS_DIRECTORY" and "-s NEW_SPECIES_TREE_FILENAME" options, and it is pretty quick. You only need to do this if the topology or root of the species tree changes; otherwise it won't make any difference. And if you do need to do this, the only difference it can make is to the orthologs and gene duplication events, not the orthogroups.

All the best David
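
Note: for anyone scripting the re-run described above, a minimal Python sketch of assembling the command might look like this. It uses only the "-ft" and "-s" options named in the reply; the helper name `rerun_command` is hypothetical and the two arguments are the same placeholders as in the reply, to be replaced with real paths.

```python
import shlex
from typing import List


def rerun_command(previous_results_dir: str, species_tree_file: str) -> List[str]:
    """Build the argument list for re-running OrthoFinder with a user-supplied species tree."""
    return ["orthofinder", "-ft", previous_results_dir, "-s", species_tree_file]


# Placeholders from the reply; substitute your own results directory and tree file.
cmd = rerun_command("PREVIOUS_RESULTS_DIRECTORY", "NEW_SPECIES_TREE_FILENAME")
print(shlex.join(cmd))

# To actually launch it (assumes orthofinder is on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than one string avoids quoting problems if a path contains spaces.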