davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics
https://davidemms.github.io/
GNU General Public License v3.0
716 stars 189 forks source link

Error with MSA/RAxML Species Tree Inference #253

Closed Cronzo closed 5 years ago

Cronzo commented 5 years ago

Hi everyone,

The OrthoFinder run I was performing encountered an error at the species tree inference step. The only logged error message was:

ERROR: An error occurred ERROR: Species tree inference failed

Context: I was trying to construct a more 'accurate' species tree (of 50 species) using MSA and RAxML. The input came from a previous OrthoFinder run (-fg option using the BLAST output files). This error appeared after (last I checked, the gene tree inference step).

It appears that the MSA and RAxML results (RAxML_info/parsimony/results; OG0000000.fa etc.) were generated in the working directory, but all other directories (GeneTrees, MultipleSequenceAlignments, Orthologues) are empty.

Question: What could be the cause of the error? Due to constraints of time is there anyway I can continue the Orthofinder run (via RAxML or what not) using the the existing files in the working directory?

Thank you very much.

davidemms commented 5 years ago

Hi

It looks like RAxML has failed on the species tree inference step. Have a look at the file WorkingDirectory/Alignments_ids/RAxML_info.SpeciesTree and see what that says.

All the best David

p.s. I'd be interested to know what went wrong with the species tree.

davidemms commented 5 years ago

In terms of continuing the run, a few contrary suggestions to consider first.

To finish the trees manually and the get OrthoFinder to do the orthologue inference step from the trees, you'd need to set everything up:

  1. Once you've identified the problem with RAxML, run OrthoFinder with RAxML on the example dataset so you know what you're aiming for in terms of what files should be where.
  2. Complete any MSA or RAxML runs that hadn't completed. These are all in the WorkingDirectoryAlignments_ids/ directory. The command OrthoFinder uses by default is like this: raxmlHPC-AVX -m PROTGAMMALG -p 12345 -s Alignments_ids/OG0000010.fa -n OG0000010 -w Alignments_ids
  3. Copy the orthogroup trees (e.g. RAxML_result.OG0000010) to the Trees_ids/ directory with the names in the format "OG0000010_tree_id.txt"
  4. Root the species tree on your outgroup (you can see how the IDs correspond to species in the SpeciesIDs.txt file) and save the rooted species tree to WorkingDirectory/SpeciesTree_rooted_ids.txt
  5. You should be able to use the "-ft" command to get OrthoFinder to identify orthologues from these trees now: "python orthofinder.py -ft Results_Dir/".

All the best David

Cronzo commented 5 years ago

Hi David,

Thank you for the response and suggestions!

Regarding the error message: There was no 'RAxML_info.SpeciesTree' generated; I could only find the respective 'RAxML_info.OG00*' files.

I performed a run using the test data with the same settings (from orthogroups), and had no issue generating a species tree. I should also mention that since I am working with plant genomes, the run did take a good amount of time (450-500 hours) before I encountered this error.

I believe I also failed to mention that my Tree_ids/ directory is already filled with the respective "OG00*_tree_id.txt" files.

In any case, I did follow your instructions and could see that approximately 20,000~/28,000 orthogroups have a RAxML_result file, and when I tried to run RAxML manually, I received the error message:

TOO FEW SPECIES

Since the later orthogroups usually contain >10~ species I figured this might mean that RAxML has finished processing the files (whatever it could process)?

If you do not mind me asking one more question: I am a little confused regarding (4); how do you root the species tree onto the outgroup? I see that there are several rooting algorithms in RAxML (pardon my unfamiliarity with RAxML): does OrthoFinder use the default rooting algorithm -f d (new rapid hill-climbing)? And if so, how do I generate just one rooted tree with all the Tree_IDs?

Thank you again.

davidemms commented 5 years ago

Hi

That's strange that it would say there are too few species. It should be able to do it as long as there are 4 or more genes present. Would you mind sending me an example?

As for rooting the species tree. First find the species tree file, I think it will be WorkingDirectory/SpeciesTree_unrooted_ids.txt if it was successfully inferred from RAxML. If RAxML wasn't successful then I think there will be the input multiple sequence alignment file for it here: WorkingDirectory/SpeciesTreeAlignment.fa

If you know the outgroup for your species (i.e. which individual species or group of species diverged earliest from the rest) then you can do it yourself, Dendroscope is a good tool for that. Use it to open the tree file, select the branch that should be the root, click the button which has a little red triangle pointing to a branch of the tree (this sets the root), then go to File > Export... and save it as a newick file.

If you don't know the outgroup then you can use STRIDE which is the method OrthoFinder uses. It's on my github page and takes as input the unrooted species tree and the Trees_ids/ directory.

All the best David

Cronzo commented 5 years ago

Hi David,

Thanks for the response again. You are right, those files had less than 4 genes.

OG0019353.zip

This was the next file (by numerical sequence) from the last result file, so RAxML likely completed.

In any case, I managed to generate the rooted species tree using the WorkingDirectory/SpeciesTreeAlignment.fa (as you suggested) and the rooting option in RAxML. I got the species tree, and renamed it to SpeciesTree_rooted_ids.txt in the WorkingDirectory as you described above.

However, when I tried to continue the OrthoFinder run (-ft), I receive this strange error:

2019-04-05 06:34:23 : Starting OrthoFinder
5 thread(s) for highly parallel tasks (BLAST searches etc.)
1 thread(s) for OrthoFinder algorithm
Traceback (most recent call last):
 File "/opt/OrthoFinder/orthofinder/orthofinder.py", line 1592, in <module>
    scripts.files.InitialiseFileHandler(options, fastaDir, continuationDir, resultsDir_nonDefault, pickleDir_nonDefault)
  File "/opt/OrthoFinder/orthofinder/scripts/files.py", line 825, in InitialiseFileHandler
    FileHandler.CreateOutputDirectories(options, pfl, base_dir, fastaDir)
  File "/opt/OrthoFinder/orthofinder/scripts/files.py", line 188, in CreateOutputDirectories
user_name=options.name)
  File "/opt/OrthoFinder/orthofinder/scripts/files.py", line 153, in StartFromTrees
    shutil.copy(speciesTreeFN, self.GetSpeciesTreeIDsRootedFN())
  File "/usr/lib64/python2.7/shutil.py", line 119, in copy
    copyfile(src, dst)
  File "/usr/lib64/python2.7/shutil.py", line 68, in copyfile
     if _samefile(src, dst):
  File "/usr/lib64/python2.7/shutil.py", line 58, in _samefile
    return os.path.samefile(src, dst)
  File "/usr/lib64/python2.7/posixpath.py", line 162, in samefile
    s1 = os.stat(f1)
TypeError: coercing to Unicode: need string or buffer, NoneType found

Command was: python orthofinder.py -ft ./OrthoFinder/Results_Mar02/ -t 5

It looks as if OrthoFinder was not able to locate some of the required files. My current results folder has the following files from the previously terminated OrthoFinder run: Gene_Trees/ (Empty) Orthologues/ (Empty) MultipleSequenceAlignments/ (Empty) Orthogroup_Sequences/ (Contain the FASTA files of the various orthogroups) WorkingDirectory (Contains Alignment_ids/, Tree_ids/, Sequences_ids/, SpeciesTree_rooted_ids.txt)

What are the files that OrthoFinder need to read for the -ft command? I have tried moving the files from Tree_ids/ into Gene_Trees/ but the error still persists.

Thank you again for the help you have given thus far. I really appreciate it.

Cheers

davidemms commented 5 years ago

Hi

I think what's probably happened is that because the trees didn't finish, OrthoFinder didn't get to the stage where it writes out to the Log.txt file where the trees can be found. I think it may be as simple as putting a line in the OrthoFinder/Results_Mar02/Log.txt file like this WorkingDirectory_Trees: OrthoFinder/Results_Mar02/ WorkingDirectory/ but with the full path to the directory rather than a relative path.

Also, I think it's probably worth checking if the tree from RAxML is rooted and how accurate the root it. I don't think RAxML will really have available any good info to use to work out where the tree root is as virtually all substitution models are time-reversible. The STRIDE method will be a lot more accurate because it looks for gene duplication events in the gene trees instead. These are time-irreversible so give a good indication of the root. The '-f d' option looks like it's just for the tree search, it's not to do with the root. Better still, if you know what your outgroup is then it's easiest just to manually re-root it on this using Dendroscope.

All the best David