PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

incomplete run because 1 pread job failure - how to resolve/restart/ignore #277

Open ishengtsai opened 8 years ago

ishengtsai commented 8 years ago

Hi,

After many falcon runs on at least 5+ species, I finally came across a new species which failed for the first time. I tried to restart falcon by rerunning the same command:

$  nohup fc_run.py fc_run.cfg logging.ini &

but there is always one job that seems to fail. The sections of pypeflow.log and nohup.out where the failure is shown are below. My question is whether this job can be skipped/rescued/restarted, and if so, how do I restart it efficiently? Do I just change the settings in fc_run.cfg, rerun fc_run.py, and hope it works the next time?

Best, Jason

Section of pypeflow.log where the failure is shown:

...
Exception: Cmd 'bash /mnt/nas1/ijt/SpeciesC/falcon_run_cov108/1-preads_ovl/job_447f81ef/rj_447f81ef.sh 1> /mnt/nas1/ijt/SpeciesC/falcon_run_cov108/1-preads_ovl/job_447f81ef/rj_447f81ef.sh.log 2>&1' (job 'rj_447f81ef.sh-d_447f81ef_preads-d_447f81ef_preads') returned 35072.

Section of nohup.out below:

<redacted>
pb-cdunn commented 8 years ago

After many falcon runs on at least 5+ species, I finally came across a new species which failed for the first time.

Good! 5 of 6 is pretty good for experimental code.

My question is whether this job can be skipped/rescued/restarted, and if so, how do I restart it efficiently?

I understand what you're saying. Only 1 job failed, and you'd like to get some output anyway. Typically, skipping one set of data will only reduce coverage, so there might still be enough useful data for a final assembly.

If the job failed because of an intermittent failure in your system, simply restart. Finished tasks are skipped quickly.

However, if you need to adjust settings, then other tasks might end up re-running too. We plan to do something about that, but not soon.

Your best shot in that case is to alter the generated .sh files for that task only. Then, run it manually. If you get good data from daligner, you can create the _done file by hand. Then, you can restart the whole workflow and let it continue on.
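For example, something along these lines (only a sketch; the paths come from the log above, and the exact name and location of the done sentinel may differ between FALCON versions, so check the tail of the generated rj_*.sh, or a job that finished normally, to see which file it creates):

cd /mnt/nas1/ijt/SpeciesC/falcon_run_cov108/1-preads_ovl/job_447f81ef
# run the task by hand, logging the same way the workflow would
bash rj_447f81ef.sh 1> rj_447f81ef.sh.log 2>&1
# if the .las output looks sane, create the sentinel by hand so a restart
# treats this task as finished (the sentinel may need to live one level up,
# next to the job_* directories; compare with a finished job)
touch job_447f81ef_done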

That's all I can offer for now.

ishengtsai commented 8 years ago

Hi Chris,

The problem is that I did complete the job manually by running that xxx.sh. But when I restarted falcon using fc_run.py, it seems to me that fc_run.py recreates all the job_* and m_* folders and finishes the run with 1 job failure again.

Any suggestions? Which part of fc_run.cfg should I be looking at?

Thanks.

pb-cdunn commented 8 years ago

See these:

bredeson commented 8 years ago

Hey @ishengtsai,

If you have run the rj_*.sh script manually, have you made sure that the corresponding job_*_done file is present (and has a more recent timestamp than the output *.las files)? If you are absolutely sure that the manual rj_*.sh job completed successfully, you may touch the job_*_done file yourself. However, this is generally not recommended: the rj_*.sh script should create that file for you when it completes, so a missing job_*_done indicates that the run failed for one reason or another. If it did not complete, check the logs carefully.
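For instance, roughly (assuming the failing job from the log above; adjust the paths to wherever the done files live in your run):

cd /mnt/nas1/ijt/SpeciesC/falcon_run_cov108/1-preads_ovl
# the done sentinel should be newer than the .las outputs it covers; if it
# is missing or older, the pipeline will not consider the task finished
ls -lt job_447f81ef*done* job_447f81ef/*.las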

I've also found that deleting the job_*_done.exit files helps the pipeline restart when there has been a previous failure; otherwise the pipeline gets tripped up if there is a job_*_done.exit file without also a job_*_done file.

find 0-rawreads -name "job_*_done.exit" | while read file; do rm $file; done
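In this particular run the failing job is under 1-preads_ovl rather than 0-rawreads, so presumably the same cleanup applies to that stage directory as well:

find 1-preads_ovl -name "job_*_done.exit" | while read file; do rm $file; done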

pb-cdunn commented 8 years ago

I've also found that deleting the job_*_done.exit files helps the pipeline restart when there has been a previous failure; otherwise the pipeline gets tripped up if there is a job_*_done.exit file without also a job_*_done file.

Exactly. That is the current behavior.

Eventually (probably within about 2 weeks) we will separate success (the "done" files) from script completion/exit-status in an obvious way.