davidemms / OrthoFinder

Phylogenetic orthology inference for comparative genomics
https://davidemms.github.io/
GNU General Public License v3.0

OF failure with -b option #9

Closed claumer closed 8 years ago

claumer commented 8 years ago

Hi David,

I'm trying to run OF on a large dataset. Because of the size (191 spp) I performed the all-by-all blast manually as a series of job arrays on our research cluster, and am now using the -b option to run the actual orthofinder algorithm.

When I do so, however, it fails, with the following errors:


Process Process-1:
Traceback (most recent call last):
  File "/nfs/research2/marioni/claumer/anaconda2.7/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/nfs/research2/marioni/claumer/anaconda2.7/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "orthofinder.py", line 800, in AnalyseSequences
    wfAlg.RunWaterfallMethod(graphFilename)
  File "orthofinder.py", line 572, in RunWaterfallMethod
    Bij = self.thisBfp.GetBLAST6Scores(iSpecies, jSpecies)
  File "orthofinder.py", line 544, in GetBLAST6Scores
    if score > B[sequence1ID, sequence2ID]:
  File "/nfs/research2/marioni/claumer/anaconda2.7/lib/python2.7/site-packages/scipy/sparse/lil.py", line 246, in __getitem__
    i, j)
  File "_csparsetools.pyx", line 58, in scipy.sparse._csparsetools.lil_get1 (scipy/sparse/_csparsetools.c:3299)
  File "_csparsetools.pyx", line 81, in scipy.sparse._csparsetools.lil_get1 (scipy/sparse/_csparsetools.c:2944)
IndexError: row index (18054) out of bounds
_ [mcxIO] cannot open file </nfs/research2/marioni/claumer/metazoa/Blastout2/OrthoFinder_v0.3.0graph.txt> in mode r
[mcl] no jive
___ [mcl] failed
Traceback (most recent call last):
  File "orthofinder.py", line 1146, in <module>
    MCL.ConvertSingleIDsToIDPair(speciesStartingIndices, clustersFilename, clustersFilename_pairs)
  File "orthofinder.py", line 230, in ConvertSingleIDsToIDPair
    with open(clustersFilename, 'rb') as clusterFile, open(newFilename, "wb") as output:
IOError: [Errno 2] No such file or directory: '/nfs/research2/marioni/claumer/metazoa/Blastout2/clusters_OrthoFinder_v0.3.0_I1.5.txt'


I've had this error with v. 0.2.8 and v. 0.3.0. The test suite runs just fine. I'm using python 2.7, with scipy etc installed using the Anaconda suite.

Any ideas on the source of this? A true bug, or could it be a format error in my input data? It's dimly possible that a few of the BLAST outputs were truncated a bit early due to stochastic node failure, but I'd be a bit surprised if that caused the entire algorithm to fail.

Grateful for what attention you can give to this,

Regards, Chris L

davidemms commented 8 years ago

Hi Chris

It looks like there's a problem with the input data. Specifically, there seems to be a sequence in one of the BLAST results files that isn't in the fasta file for one of the two corresponding species. It's probably just a mislabeling of something in the files.

I've added some output to the python script to try and identify where the problem lies and have attached it to this message: orthofinder.txt. Try running it and let me know how you get on. There should be a message that starts "Error in input files, expected only..." which should help identify which files the problem is in. If you let me know what it says I should be able to help you track it down.

All the best David
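The mismatch described above (a sequence ID in a BLAST results file that is absent from the corresponding fasta file) can also be looked for by hand. The following is a minimal sketch, not part of OrthoFinder: the function names `read_fasta_ids` and `find_unknown_ids` are hypothetical, the toy IDs are invented, and it assumes tab-separated BLAST output with the query ID in column 1 and the subject ID in column 2.

```python
import io


def read_fasta_ids(handle):
    """Collect the first whitespace-delimited token of every FASTA header line."""
    ids = set()
    for line in handle:
        if line.startswith(">") and line[1:].strip():
            ids.add(line[1:].split()[0])
    return ids


def find_unknown_ids(blast_handle, query_ids, subject_ids):
    """Yield (line_number, column, sequence_id) for IDs missing from the FASTA files."""
    for line_number, line in enumerate(blast_handle, start=1):
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # blank or badly truncated line; a separate problem
        if fields[0] not in query_ids:
            yield line_number, "query", fields[0]
        if fields[1] not in subject_ids:
            yield line_number, "subject", fields[1]


# Toy data standing in for one species pair (hypothetical IDs, not real output).
query_fasta = io.StringIO(">0_0\nMKV\n>0_1\nMLL\n")
subject_fasta = io.StringIO(">1_0\nMKI\n")
blast_hits = io.StringIO("0_0\t1_0\t90.0\n0_1\t1_7\t85.0\n")

problems = list(
    find_unknown_ids(blast_hits, read_fasta_ids(query_fasta), read_fasta_ids(subject_fasta))
)
print(problems)
```

With real data, the `io.StringIO` objects would be replaced by open file handles for one BLAST results file and the two fasta files it should correspond to.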

claumer commented 8 years ago

Hello David,

Well timed - I was just about to write, myself. It seems the error was indeed mine -- my job array script swapped the query and subject sequences in the names of the Blast output files. I realized it by comparing the output from the example dataset which I got using my script vs. that generated by OrthoFinder. I've renamed the files accordingly, and have just started OF - which seems not to be stalling with the same error as it had before. So -- optimistic that it should actually progress to completion this time, knock on wood.

Sorry for the bother, and thanks very much for taking the time to add some extra lines in for this!

Best, Chris L
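A swap of query and subject in the file names, as described above, can be spotted mechanically by comparing the species indices in a file name with those in its first hit. This is a rough sketch under stated assumptions, not OrthoFinder code: it assumes file names of the form `Blast<query>_<subject>.txt` and sequence IDs of the form `<species>_<sequence>`, and `check_orientation` is a hypothetical helper.

```python
import re


def species_of(sequence_id):
    """Return the species part of an ID assumed to look like '<species>_<sequence>'."""
    return sequence_id.split("_", 1)[0]


def check_orientation(filename, first_hit_line):
    """Compare the species pair in a file name with the pair in its first hit.

    Returns 'ok', 'swapped' (query and subject reversed), or 'mismatch'.
    """
    match = re.search(r"Blast(\d+)_(\d+)\.txt$", filename)
    if match is None:
        raise ValueError("unexpected file name: %r" % filename)
    expected = (match.group(1), match.group(2))

    fields = first_hit_line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("hit line has fewer than two columns")
    observed = (species_of(fields[0]), species_of(fields[1]))

    if observed == expected:
        return "ok"
    if observed == expected[::-1]:
        return "swapped"
    return "mismatch"


print(check_orientation("Blast0_1.txt", "1_9\t0_5\t88.0"))
```

Within-species files (for example `Blast0_0.txt`) always report `ok` here, since reversing the pair changes nothing.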

claumer commented 8 years ago

Hi David,

Apologies for re-opening this, but I've been revisiting OrthoFinder's -b option with a slightly different dataset, this time using your updated script (0.3.1), which has the error messaging written to diagnose malformatted BLAST results.

I'm again seeing OrthoFinder fail to finish, but the error is a different one, and the message about malformatted BLAST results is not printed. Perhaps the problem this time is not that the input BLAST results are poorly formatted?


OrthoFinder version 0.3.1 Copyright (C) 2014 David Emms

This program comes with ABSOLUTELY NO WARRANTY.
This is free software, and you are welcome to redistribute it under certain conditions.
For details please see the License.md that came with this software.

Using previously calculated BLAST results in /n/regal/Giribet_lab/claumer/metazoa_Blastout4/

1. Checking required programs are installed

Test can run "mcl -h" - ok

2. Temporarily renaming sequences with unique, simple identifiers

Skipping

3. Dividing up work for BLAST for parallel processing

Skipping

4. Running BLAST all-versus-all

Skipping

5. Running OrthoFinder algorithm

Process Process-1:
Traceback (most recent call last):
  File "/n/sw/fasrcsw/apps/Core/Anaconda/2.1.0-fasrc01/x/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/n/sw/fasrcsw/apps/Core/Anaconda/2.1.0-fasrc01/x/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "orthofinder.py.1", line 805, in AnalyseSequences
    wfAlg.RunWaterfallMethod(graphFilename)
  File "orthofinder.py.1", line 577, in RunWaterfallMethod
    Bij = self.thisBfp.GetBLAST6Scores(iSpecies, jSpecies)
  File "orthofinder.py.1", line 539, in GetBLAST6Scores
    score = float(row[11])
IndexError: list index out of range
2015-12-14 03:34:29.460914 : Started
2015-12-14 03:34:59.989838 : Got sequence lengths
2015-12-14 03:34:59.989904 : Initial processing of each species
2015-12-14 04:08:58.732561 : Initial processing of species 0
2015-12-14 04:43:41.458877 : Initial processing of species 1
2015-12-14 05:07:09.826500 : Initial processing of species 2


Keen to hear your thoughts on this!

Best regards, Chris L
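The `IndexError` at `score = float(row[11])` in the traceback above means that some row had fewer than 12 fields, which is what a truncated or blank line in a BLAST results file would produce. The sketch below is one way to locate such lines. It is an illustration rather than OrthoFinder code: `find_short_rows` is a hypothetical helper, the sample hits are invented, and it assumes the standard 12-column tab-separated BLAST tabular format with the bit score in the last column.

```python
import io

EXPECTED_COLUMNS = 12  # standard BLAST tabular output; the score is column 12 (row[11])


def find_short_rows(handle, expected=EXPECTED_COLUMNS):
    """Yield (line_number, field_count, text) for lines with too few tab-separated fields."""
    for line_number, line in enumerate(handle, start=1):
        text = line.rstrip("\n")
        fields = text.split("\t")
        if len(fields) < expected:
            yield line_number, len(fields), text


# Toy input: one complete hit followed by one hit cut off after four columns.
complete = "0_0\t1_0\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t180.0\n"
truncated = "0_1\t1_3\t85.0\t40\n"

short_rows = list(find_short_rows(io.StringIO(complete + truncated)))
print(short_rows)
```

Run over each BLAST results file in turn, this would report the file position of any line that would trigger the same error; blank lines are reported as having one (empty) field.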