davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics
https://davidemms.github.io/
GNU General Public License v3.0
693 stars 188 forks source link

Multiple sequence alignment failed, producing empty (0 kb) orthogroup file and "IndexError: list index out of range" error #427

Closed cgmayers closed 4 years ago

cgmayers commented 4 years ago

Hi David,

I have encountered what seems to be the same problem as this thread and this more recent thread. OrthoFinder fails during MSA, producing (in addition to many normal orthogroup files) an empty orthogroup file and halting the program. First, I'll paste the log with error, then follow up with some more information below.

(I've bolded my text to make it stand out from log snippets, since I have some comments in the middle of two big chunks. This is my first time posting on GitHub so my formatting's clumsy.)

orthofinder -f RemoveIfHaveStop/ -S diamond -M msa -A mafft -T fasttree

OrthoFinder version 2.3.12 Copyright (C) 2014 David Emms

2020-07-13 22:29:39 : Starting OrthoFinder
8 thread(s) for highly parallel tasks (BLAST searches etc.)
1 thread(s) for OrthoFinder algorithm

Checking required programs are installed
----------------------------------------
Test can run "mcl -h" - ok
Test can run "mafft /home/chase/OrthoFinder/Endobacteria/AllFrames/CombinedFiles/Linearize/RemoveIfHaveStop/OrthoFinder/Results_Jul13/WorkingDirectory//_dependencies_check/SimpleTest.fa" - ok
Test can run "FastTree /home/chase/OrthoFinder/Endobacteria/AllFrames/CombinedFiles/Linearize/RemoveIfHaveStop/OrthoFinder/Results_Jul13/WorkingDirectory//_dependencies_check/SimpleTest.fa" - ok

Dividing up work for BLAST for parallel processing
--------------------------------------------------
2020-07-13 22:29:42 : Creating diamond database 1 of 6
2020-07-13 22:29:42 : Creating diamond database 2 of 6
2020-07-13 22:29:42 : Creating diamond database 3 of 6
2020-07-13 22:29:42 : Creating diamond database 4 of 6
2020-07-13 22:29:42 : Creating diamond database 5 of 6
2020-07-13 22:29:42 : Creating diamond database 6 of 6

Running diamond all-versus-all
------------------------------
Using 8 thread(s)
2020-07-13 22:29:42 : This may take some time....
2020-07-13 22:29:43 : Done 0 of 36
2020-07-13 22:30:22 : Done 10 of 36
2020-07-13 22:30:35 : Done 20 of 36
2020-07-13 22:30:47 : Done all-versus-all sequence search

Running OrthoFinder algorithm
-----------------------------
2020-07-13 22:30:47 : Initial processing of each species
2020-07-13 22:30:48 : Initial processing of species 0 complete
2020-07-13 22:30:51 : Initial processing of species 1 complete
2020-07-13 22:30:52 : Initial processing of species 2 complete
2020-07-13 22:30:53 : Initial processing of species 3 complete
2020-07-13 22:30:53 : Initial processing of species 4 complete
2020-07-13 22:30:54 : Initial processing of species 5 complete
2020-07-13 22:30:58 : Connected putative homologues
2020-07-13 22:30:59 : Written final scores for species 0 to graph file
2020-07-13 22:31:01 : Written final scores for species 1 to graph file
2020-07-13 22:31:01 : Written final scores for species 2 to graph file
2020-07-13 22:31:02 : Written final scores for species 3 to graph file
2020-07-13 22:31:02 : Written final scores for species 4 to graph file
2020-07-13 22:31:02 : Written final scores for species 5 to graph file
2020-07-13 22:31:10 : Ran MCL

Writing orthogroups to file
---------------------------
OrthoFinder assigned 7471 genes (91.9% of total) to 2204 orthogroups. Fifty percent of all genes were in orthogroups with 4 or more genes (G50 was 4) and were contained in the largest 748 orthogroups (O50 was 748). There were 127 orthogroups with all species present and 115 of these consisted entirely of single-copy genes.

2020-07-13 22:31:22 : Done orthogroups

Analysing Orthogroups
=====================
2020-07-13 22:31:22 : Starting MSA/Trees
Species tree: Using 707 orthogroups with minimum of 66.7% of species having single-copy genes in any orthogroup

Inferring multiple sequence alignments for species tree
-------------------------------------------------------
2020-07-13 22:31:32 : Done 0 of 707
2020-07-13 22:32:45 : Done 100 of 707
2020-07-13 22:33:45 : Done 200 of 707
2020-07-13 22:34:54 : Done 300 of 707
2020-07-13 22:36:16 : Done 400 of 707
2020-07-13 22:37:15 : Done 500 of 707
2020-07-13 22:38:01 : Done 600 of 707
Traceback (most recent call last):
  File "/home/chase/anaconda3/envs/py37/bin/orthofinder", line 7, in <module>
    main(args)
  File "/home/chase/anaconda3/envs/py37/bin/scripts_of/__main__.py", line 1731, in main
    GetOrthologues(speciesInfoObj, options, prog_caller)
  File "/home/chase/anaconda3/envs/py37/bin/scripts_of/__main__.py", line 1511, in GetOrthologues
    options.name)
  File "/home/chase/anaconda3/envs/py37/bin/scripts_of/orthologues.py", line 905, in OrthologuesWorkflow
    seqs_alignments_dirs = treeGen.DoTrees(ogSet.OGs(qInclAll=True), ogSet.OrthogroupMatrix(), ogSet.Spec_SeqDict(), ogSet.SpeciesDict(), ogSet.speciesToUse, nHighParallel, qStopAfterSeqs, qStopAfterAlign or qPhyldog, qDoSpeciesTree=qDoMSASpeciesTree)
  File "/home/chase/anaconda3/envs/py37/bin/scripts_of/trees_msa.py", line 357, in DoTrees
    CreateConcatenatedAlignment(iOgsForSpeciesTree, ogs, self.GetAlignmentFilename, concatenated_algn_fn, fSingleCopy)
  File "/home/chase/anaconda3/envs/py37/bin/scripts_of/trees_msa.py", line 222, in CreateConcatenatedAlignment
    alignment = ReadAlignment(alignment_filename_function(iOg))
  File "/home/chase/anaconda3/envs/py37/bin/scripts_of/trees_msa.py", line 214, in ReadAlignment
    return MSA(msa)
  File "/home/chase/anaconda3/envs/py37/bin/scripts_of/trees_msa.py", line 189, in __init__
    self.length = len(list(msa_dict.values())[0])
IndexError: list index out of range

To explain some of the directories here, I've activated the conda environment "py37" which is running Python 3.7. Further system info is at the bottom of this post.

Like you predicted in the previous threads, an empty orthogroup file was produced in /Alignments_ids. I've attached the file from /Sequences_ids that has the same name (OG0000725.fa) as the empty 0KB file from /Alignments_ids (OG0000725.fa). It's had ".txt" added to it to allow uploading here.

OG0000725.fa.txt

Like you suggested in the previous thread, I ran $ mafft --localpair --maxiterate 1000 --anysymbol OG0000725.fa > OG0000725.aln.fa on this exact file, getting the following (apparently problem-free) result:

inputfile = orig
4 x 261 - 204 p
outputhat23=16
treein = 0
compacttree = 0
stacksize: 8192 kb
rescale = 1
All-to-all alignment.
tbfast-pair (aa) Version 7.471
alg=L, model=BLOSUM62, 2.00, -0.10, +0.10, noshift, amax=0.0
0 thread(s)

outputhat23=16
Loading 'hat3.seed' ...
done.
Writing hat3 for iterative refinement
rescale = 1
Gap Penalty = -1.53, +0.00, +0.00
tbutree = 1, compacttree = 0
Constructing a UPGMA tree ...
    0 / 4
done.

Progressive alignment ...
STEP     3 /3
done.
tbfast (aa) Version 7.471
alg=A, model=BLOSUM62, 1.53, -0.00, -0.00, noshift, amax=0.0
1 thread(s)

minimumweight = 0.000010
autosubalignment = 0.000000
nthread = 0
randomseed = 0
blosum 62 / kimura 200
poffset = 0
niter = 16
sueff_global = 0.100000
nadd = 16
Loading 'hat3' ... done.
rescale = 1

    0 / 4
Segment   1/  1    1- 264
STEP 003-001-0  identical.
Converged.

done
dvtditr (aa) Version 7.471
alg=A, model=BLOSUM62, 1.53, -0.00, -0.00, noshift, amax=0.0
0 thread(s)

Strategy:
 L-INS-i (Probably most accurate, very slow)
 Iterative refinement method (<16) with LOCAL pairwise alignment information

If unsure which option to use, try 'mafft --auto input > output'.
For more information, see 'mafft --help', 'mafft --man' and the mafft page.

The default gap scoring scheme has been changed in version 7.110 (2013 Oct).
It tends to insert more gaps into gap-rich regions than previous versions.
To disable this change, add the --leavegappyregion option.

In the most recent thread linked at the top, this was all fixed by simply running the analysis on another computer, but I don't have that luxury at the moment. I wonder if there's any clues in common between my data and the previous poster's? If it's a per-machine thing, it must be something about the local environment, not the data itself.

For full information, I have OrthoFinder 2.3.12 and I'm running Ubuntu 18.04.4 LTS in "Windows Subsystem for Linux" compatibility layer in Windows 10 64-bit (ver. 10.0.18362 Build 18362), with 32 GB RAM; cpu is i5-8350U; machine is Lenovo Thinkpad T480.

For good measure, I've zipped the entire folder (original input files and Orthofinder results) and attached it here, too.

RemoveIfHaveStop2_DelLastLine.zip

In the most recent thread linked at the top of this post, you mentioned looking for Species_Tree/Orthogroups_for_concatenated_alignment.txt, however this folder doesn't exist for me anywhere.

Is there anything else useful I could provide?

Thank you for your time! Chase

davidemms commented 4 years ago

Hi Chase

Thanks for the details, from the looks of things this should be resolved by the fix that was submitted for the previously reported issue. I've pushed out a new release yesterday with this fix in so would you be able to try updating to that and letting me know how it goes? I'd also be interested to know whether OG0000725 is successful or not this time. OrthoFinder should be fine in either case but it'd be interesting to know if there is an occasional & intermittent problem with the alignments.

All the best David

cgmayers commented 4 years ago

David,

This was extremely fortuitous timing! After updating to 2.4, the analysis completed successfully with no errors.

Thank you! Chase

davidemms commented 4 years ago

That's great, thanks for letting me know!