davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics
https://davidemms.github.io/
GNU General Public License v3.0

OrthoFinder - Part 2 is not deterministic #766

Open raufs opened 1 year ago

raufs commented 1 year ago

Hi David,

Thank you a ton for developing and maintaining OrthoFinder and various related software! They have been instrumental in my research.

You may already be aware of this, but I noticed that the second part of OrthoFinder does not seem to be reproducible in the way the first part is (everything up to OrthoGroups.tsv).

Between two replicate runs, I get identical results for OrthoGroups.tsv but Phylogenetic_Hierarchical_Orthogroups/N0.tsv appears to differ.

It appears differences are small between two replicate N0.tsv files so this is not a major issue but perhaps, if possible, something to resolve in later versions.
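A comparison like this can be made independent of row order and HOG identifiers by comparing the gene sets themselves. The following is a minimal sketch, assuming N0.tsv is tab-separated with a header row, three leading metadata columns (HOG, OG, Gene Tree Parent Clade) and one column per species holding comma-separated gene names. The function names and the `META_COLUMNS` constant are illustrative and not part of OrthoFinder.

```python
import csv

# Assumption: the first three columns are HOG, OG and Gene Tree Parent Clade.
META_COLUMNS = 3


def read_hogs(handle, meta_columns=META_COLUMNS):
    """Return the HOGs in a table, each as a frozenset of (species, gene) pairs."""
    reader = csv.reader(handle, delimiter="\t")
    species = next(reader)[meta_columns:]  # header row gives the species columns
    hogs = set()
    for row in reader:
        members = frozenset(
            (name, gene.strip())
            for name, cell in zip(species, row[meta_columns:])
            for gene in cell.split(",")
            if gene.strip()
        )
        if members:
            hogs.add(members)
    return hogs


def compare_hogs(handle_a, handle_b):
    """Return (HOGs only in A, HOGs only in B), ignoring row order and HOG IDs."""
    hogs_a, hogs_b = read_hogs(handle_a), read_hogs(handle_b)
    return hogs_a - hogs_b, hogs_b - hogs_a


# Example usage (paths are placeholders):
# with open("run1/N0.tsv") as a, open("run2/N0.tsv") as b:
#     only_a, only_b = compare_hogs(a, b)
#     print(len(only_a), "HOGs only in run 1;", len(only_b), "only in run 2")
```

If both returned sets are empty, the two runs grouped the genes identically and any textual difference between the files is only a matter of ordering or naming.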

Rauf

davidemms commented 1 year ago

Hi Rauf

Thanks, and I appreciate the feedback. I think this is probably due to the tree inference and/or the MSA inference and therefore best resolved by providing a fixed seed for the random number generation, where possible. There are two overall options for tree inference with OrthoFinder:

default: tree inference using a distance matrix and fastme. The fastme command line says

    -z seed, --seed=seed
        Use this option to initialize randomization with seed value.
        Only helpful when bootstrapping.

so I'm not 100% sure whether it's deterministic or not, but unfortunately this parameter won't have an effect for OrthoFinder, as it doesn't rely on bootstrapping.

-M msa: By default uses mafft and FastTree. I can't find references for providing a seed for the random number generator for these (FastTree has one for the support values, but again these don't affect OrthoFinder).

Which options did you see the non-determinacy with?

I think if you wanted deterministic behaviour you'd need a tree inference program and an MSA inference program that allow you to specify the seed. I know RAxML and IQTREE both do; if you found an MSA program that was also deterministic, you could use that alongside one of them. You can edit the options of any programs used in the OrthoFinder config.json file: https://github.com/davidemms/OrthoFinder#configjson--adding-addtional-programs-for-tree-inference-local-alignment-or-msa

All the best

David

raufs commented 1 year ago

Hi David,

Thank you for your reply!

I observed the behavior with just default settings, so DendroBLAST distance matrices + FastME.

Using a small set of bacterial proteomes, the two corresponding gene trees in the Gene_Trees/ folder differ, as expected given that Phylogenetic_Hierarchical_Orthogroups/N0.tsv differs too. However, the distance matrices in /WorkingDirectory/Distances_SpeciesTree/ for the orthogroup in question were identical.

To test whether FastME itself was producing differently formatted gene trees, I ran it twice on one of the distance matrices in /WorkingDirectory/Distances_SpeciesTree/ (for an orthogroup that is split into HOGs differently between two identical runs). Oddly, the output was reproducible. I ran FastME as you appear to run it in the orthologues.py program, with options -N -w O -s.
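A repeat-run check of this kind can be scripted by hashing the output of each run. The sketch below is generic: `output_digests` runs any command several times and returns a SHA-256 digest per output file. The `fastme_cmd` helper is an assumption about the invocation (`-i` and `-o` for input and output, plus the options quoted above), not a copy of what orthologues.py does, and both function names are made up for illustration.

```python
import hashlib
import subprocess
import tempfile
from pathlib import Path


def output_digests(make_cmd, runs=2):
    """Run a command `runs` times and return the SHA-256 digest of each output file.

    make_cmd takes an output path and returns the argument list to execute.
    """
    digests = []
    with tempfile.TemporaryDirectory() as tmp:
        for i in range(runs):
            out = Path(tmp) / f"run{i}.out"
            subprocess.run(make_cmd(out), check=True, capture_output=True)
            digests.append(hashlib.sha256(out.read_bytes()).hexdigest())
    return digests


def fastme_cmd(matrix_path):
    """Build a FastME command for one distance matrix (assumed invocation)."""
    return lambda out: [
        "fastme", "-i", str(matrix_path), "-o", str(out),
        "-N", "-w", "O", "-s",  # options as quoted in this thread
    ]


# Example usage (requires fastme on PATH; the matrix path is a placeholder):
# digests = output_digests(fastme_cmd("Distances_SpeciesTree/matrix.phy"))
# print("identical" if len(set(digests)) == 1 else "different")
```

Identical digests mean the output files are byte-for-byte the same, which is a stricter test than topological equality.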

Differences for Phylogenetic_Hierarchical_Orthogroups/N0.tsv between two replicate runs also appear when -M msa is used.

Hope this is helpful, and that getting deterministic behavior is just a matter of sorting some list, or of not changing the gene tree format/order when reading and rewriting the trees with proper names!

Rauf
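One way to tell a real topological difference from a mere difference in child order or rooting is to compare the two gene trees by their sets of leaf bipartitions. This is a minimal sketch, not OrthoFinder code: it assumes plain Newick without quoted labels, ignores branch lengths and support values, and treats the trees as unrooted. All function names are illustrative.

```python
def parse_newick(text):
    """Parse Newick into nested lists of leaf names, dropping lengths and internal labels."""
    s = text.strip().rstrip(";")
    pos = 0

    def read_label():
        nonlocal pos
        start = pos
        while pos < len(s) and s[pos] not in ",()":
            pos += 1
        return s[start:pos].split(":")[0]  # drop any ":length" suffix

    def parse_node():
        nonlocal pos
        if s[pos] == "(":
            pos += 1  # consume "("
            children = [parse_node()]
            while s[pos] == ",":
                pos += 1
                children.append(parse_node())
            pos += 1  # consume ")"
            read_label()  # internal label or support value, discarded
            return children
        return read_label()

    return parse_node()


def leaf_set(node):
    """Return the frozenset of leaf names below a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def splits(tree):
    """Return the non-trivial bipartitions of the leaves induced by internal edges."""
    all_leaves = leaf_set(tree)
    found = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        other = all_leaves - below
        if len(below) > 1 and len(other) > 1:
            found.add(frozenset([below, other]))  # unordered pair of sides
        return below

    visit(tree)
    return found


def same_topology(newick_a, newick_b):
    """True if both trees have the same leaves and the same bipartitions."""
    a, b = parse_newick(newick_a), parse_newick(newick_b)
    return leaf_set(a) == leaf_set(b) and splits(a) == splits(b)
```

If `same_topology` returns True for a pair of gene trees whose files differ, the difference is in ordering, rooting or branch lengths rather than in the branching pattern. Since HOG inference works on rooted gene trees, a rooting difference alone could still matter, so this check narrows down the cause rather than ruling it out.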