PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

Crash on input file does not stop fc_run.py #64

Closed lexnederbragt closed 9 years ago

lexnederbragt commented 9 years ago

Hi,

I accidentally provided a fastq file, which caused an error in the first step of fc_run.py. However, this did not stop the run; it continued for a few more steps before it died. It would be nice if fc_run.py did not try to continue after such an input-file error.
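For illustration, here is a minimal sketch of the kind of up-front check that could catch this before any jobs are launched. The function and its placement are hypothetical, not actual FALCON code; it only assumes that fasta2DB needs FASTA input (records starting with ">"), as confirmed further down the thread.

```python
# Hypothetical pre-flight check (not actual FALCON code): fail fast if the
# fofn is missing, or if any listed file is unreadable or not FASTA.
import sys

def validate_fofn(fofn_path):
    try:
        entries = [l.strip() for l in open(fofn_path) if l.strip()]
    except IOError as e:
        sys.exit("cannot read %s: %s" % (fofn_path, e))
    for path in entries:
        try:
            first = open(path).read(1)
        except IOError as e:
            sys.exit("cannot read input %s: %s" % (path, e))
        if first == "@":
            sys.exit("%s looks like FASTQ; fasta2DB needs FASTA" % path)
        if first != ">":
            sys.exit("%s does not look like FASTA" % path)

validate_fofn("input.fofn")
```

A check like this would also cover the missing-input.fofn case mentioned below.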

pb-jchin commented 9 years ago

If you use the latest check-in on the master branch, you should have a file called fc_run.log. Can you show me that file so I can see what exactly is going on?

lexnederbragt commented 9 years ago

I love the fc_run.log file! Here is the output with a fastq as input file. You could also try not having the input.fofn file in the folder where you start the run; that also crashes cryptically. Note that I use job_type = local for all runs.

input.fofn:

$ cat input.fofn
data/temp.fastq

stdout:

$ fc_run.py fc_run_ecoli.cfg
 No target specified, assuming "assembly" as target
fasta2DB: Cannot open data/temp.fastq.fasta for 'r'
DBsplit: Cannot open ./raw_reads.db for 'r'
cat: raw_reads.db: No such file or directory
HPCdaligner: Cannot open ./raw_reads.db for 'r'
Exception in thread Thread-5:
Traceback (most recent call last):
  File "/cluster/software/VERSIONS/python2-2.7.9/lib/python2.7/threading.py", line 810, in __bootstrap_inner
    self.run()
  File "/cluster/software/VERSIONS/python2-2.7.9/lib/python2.7/threading.py", line 763, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/node/work1/no_backup/lex/9-spine/bin/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/task.py", line 317, in __call__
    runFlag = self._getRunFlag()
  File "/node/work1/no_backup/lex/9-spine/bin/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/task.py", line 147, in _getRunFlag
    runFlag = any( [ f(self.inputDataObjs, self.outputDataObjs, self.parameters) for f in self._compareFunctions] )
  File "/node/work1/no_backup/lex/9-spine/bin/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/task.py", line 812, in timeStampCompare
    if min(outputDataObjsTS) < max(inputDataObjsTS):
ValueError: max() arg is an empty sequence

Traceback (most recent call last):
  File "/node/work1/no_backup/lex/9-spine/bin/fc_env/bin/fc_run.py", line 4, in <module>
    __import__('pkg_resources').run_script('falcon-kit==0.2.1', 'fc_run.py')
  File "/node/work1/no_backup/lex/9-spine/bin/fc_env/lib/python2.7/site-packages/pkg_resources/__init__.py", line 723, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/node/work1/no_backup/lex/9-spine/bin/fc_env/lib/python2.7/site-packages/pkg_resources/__init__.py", line 1636, in run_script
    exec(code, namespace, namespace)
  File "/node/work1/no_backup/lex/9-spine/bin/fc_env/lib/python2.7/site-packages/falcon_kit-0.2.1-py2.7-linux-x86_64.egg/EGG-INFO/scripts/fc_run.py", line 643, in <module>
    wf.refreshTargets(updateFreq = wait_time) # larger number better for more jobs
  File "/node/work1/no_backup/lex/9-spine/bin/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/controller.py", line 519, in refreshTargets
    rtn = self._refreshTargets(objs = objs, callback = callback, updateFreq = updateFreq, exitOnFailure = exitOnFailure)
  File "/node/work1/no_backup/lex/9-spine/bin/fc_env/lib/python2.7/site-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/controller.py", line 617, in _refreshTargets
    assert self.jobStatusMap[str(URL)] in ("done", "continue", "fail")
AssertionError

fc_run.log:

$ cat fc_run.log
2015-05-15 14:33:47,501 - fc_run - INFO - fc_run started with configuration fc_run_ecoli.cfg
2015-05-15 14:33:48,143 - fc_run - INFO - executing /node/work1/no_backup/lex/9-spine/bin/ecoli_test/0-rawreads/prepare_db.sh locally, start job: build_rdb-1c3c9478
2015-05-15 14:33:53,263 - fc_run - INFO - /node/work1/no_backup/lex/9-spine/bin/ecoli_test/0-rawreads/rdb_build_done generated. job: build_rdb-1c3c9478 finished.
pb-cdunn commented 9 years ago

Please also post the contents of /node/work1/no_backup/lex/9-spine/bin/ecoli_test/0-rawreads/prepare_db.sh.

pb-cdunn commented 9 years ago

Also, which commit of DAZZ_DB are you using? Use

git submodule status
pb-cdunn commented 9 years ago

The basic problem is that fasta2DB expects FASTA, not FASTQ. However, I'd like to see where the .fasta extension is appended (note fasta2DB trying to open data/temp.fastq.fasta in the stdout above).

pb-jchin commented 9 years ago

@lexnederbragt will you be able to show a snippet of the fastq file? Any particular reason for not starting with subreads.fasta files?

lexnederbragt commented 9 years ago

$ cat /node/work1/no_backup/lex/9-spine/bin/ecoli_test/0-rawreads/prepare_db.sh
source /node/work1/no_backup/lex/9-spine/bin/fc_env/bin/activate
cd /node/work1/no_backup/lex/9-spine/bin/ecoli_test/0-rawreads
hostname >> db_build.log
date >> db_build.log
for f in `cat /node/work1/no_backup/lex/9-spine/bin/ecoli_test/input.fofn`; do fasta2DB raw_reads $f; done >> db_build.log
DBsplit -x500 -s50 raw_reads
LB=$(cat raw_reads.db | awk '$1 == "blocks" {print $3}')
HPCdaligner -v -dal4 -t16 -e.70 -l1000 -s1000 -H12000 raw_reads 1-$LB > run_jobs.sh
touch /node/work1/no_backup/lex/9-spine/bin/ecoli_test/0-rawreads/rdb_build_done

$ git submodule status
-aea1a1dfbdac10a50a4bfbd81292842c7a7b4828 DALIGNER
-454ae5fe2ff4de6e03343480ae80f03f665d5992 DAZZ_DB
-23a0a9da3ab4584a59dafc35243987ff74a52b05 pypeFLOW

I collected all the subreads in a fastq file for PBcR/MHAP, as it expects that (or at least can work with it), and the documentation asks for a single input file. So I wanted to reuse that file. If you want, I can give you a snippet. Now I am happily correcting and assembling separate subread fasta files...
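As an aside, getting from such a combined fastq back to FASTA for fasta2DB takes only a few lines. A minimal sketch, assuming plain four-line FASTQ records (no wrapped sequences); the file names follow the input.fofn entry above:

```python
# Minimal FASTQ-to-FASTA conversion, assuming four-line FASTQ records.
# File names are taken from the input.fofn entry above.
with open("data/temp.fastq") as fq, open("data/temp.fasta", "w") as fa:
    for i, line in enumerate(fq):
        if i % 4 == 0:        # header: "@name ..." becomes ">name ..."
            fa.write(">" + line[1:])
        elif i % 4 == 1:      # sequence line, copied as-is
            fa.write(line)
        # the "+" separator and quality lines are dropped
```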

pb-cdunn commented 9 years ago

The "ValueError: max() arg is an empty sequence" sometimes happens when a previous step fails, so the task-input calculation never occurs. pypeFLOW is not happy when it thinks there are no inputs to a task.

We should detect a bad input (e.g. fastq), and we should give a better message when inputs from the previous stage's outputs are missing. But we should also specify the full set of inputs and outputs for each task. At any rate, a task failure now usually ends the run, so this is less of an issue. Please re-open if you see this in a more specific context, using code from the master branch.
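For context, a minimal sketch of the failure mode visible in the traceback above. The comparison in pypeflow/task.py is min(outputDataObjsTS) < max(inputDataObjsTS); when the inputs come from a step that failed, the input-timestamp list is empty and max() raises exactly the ValueError shown. Everything around the comparison here is illustrative, not pypeFLOW code:

```python
# Illustrative reproduction of the traceback, not pypeFLOW code.
import os

# raw_reads.db was never created because fasta2DB failed upstream.
missing_inputs = ["0-rawreads/raw_reads.db"]
inputDataObjsTS = [os.path.getmtime(p) for p in missing_inputs
                   if os.path.exists(p)]  # -> [] (empty)
outputDataObjsTS = [0.0]                  # pretend one output timestamp exists

# Same shape as pypeflow/task.py line 812: max() on the empty list raises
# "ValueError: max() arg is an empty sequence".
if min(outputDataObjsTS) < max(inputDataObjsTS):
    pass
```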