PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

no submission of new daligner jobs when running falcon #252

Open MichelMoser opened 8 years ago

MichelMoser commented 8 years ago

Hi, I am trying to run FALCON on a large dataset (50 SMRT cells). 126 subdirectories (from 126 blocks) were created in the 0-rawreads directory and are being processed.

But instead of processing all of them, I get an odd message from pypeflow.controller, which counts down the running jobs until they are all finished instead of starting new daligner jobs for the directories that have not been processed yet.

Stderr shows these warnings every 2 seconds:

2015-11-23 23:23:21,597 - pypeflow.controller - WARNING - Now, #tasks=15, #alive=4
2015-11-23 23:23:23,599 - pypeflow.controller - WARNING - Now, #tasks=15, #alive=4
2015-11-23 23:23:25,601 - pypeflow.controller - WARNING - Now, #tasks=15, #alive=4
2015-11-23 23:23:27,602 - pypeflow.controller - WARNING - Now, #tasks=15, #alive=4
2015-11-23 23:23:29,604 - pypeflow.controller - WARNING - Now, #tasks=15, #alive=4
2015-11-23 23:23:31,605 - pypeflow.controller - WARNING - Now, #tasks=15, #alive=4
2015-11-23 23:23:33,607 - pypeflow.controller - WARNING - Now, #tasks=15, #alive=4
2015-11-23 23:23:35,609 - pypeflow.controller - WARNING - Now, #tasks=15, #alive=4
2015-11-23 23:23:37,610 - pypeflow.controller - WARNING - Now, #tasks=15, #alive=4
2015-11-23 23:23:39,612 - pypeflow.controller - WARNING - Now, #tasks=15, #alive=4
2015-11-23 23:23:41,613 - pypeflow.controller - WARNING - Now, #tasks=15, #alive=4
2015-11-23 23:23:43,615 - pypeflow.controller - WARNING - Now, #tasks=15, #alive=4
2015-11-23 23:23:45,617 - pypeflow.controller - WARNING - Now, #tasks=15, #alive=3
2015-11-23 23:23:47,619 - pypeflow.controller - WARNING - Now, #tasks=15, #alive=3
2015-11-23 23:23:49,620 - pypeflow.controller - WARNING - Now, #tasks=15, #alive=3
2015-11-23 23:23:51,622 - pypeflow.controller - WARNING - Now, #tasks=15, #alive=3
2015-11-23 23:23:53,624 - pypeflow.controller - WARNING - Now, #tasks=15, #alive=3
2015-11-23 23:23:55,625 - pypeflow.controller - WARNING - Now, #tasks=15, #alive=3
....
....

When checking whether all jobs have finished, I see that a lot of them are still unprocessed or were terminated when I killed FALCON:

for i in $(ls -d job_0*); do echo $i; cd $i; ls | grep done; cd ..; done   # print each job directory and any *_done marker files inside it

...
job_00000095
job_00000095_done.exit
job_00000096
job_00000096_done
job_00000097
job_00000098
job_00000099
job_00000100
job_00000100_done
job_00000101
...

Do you know how I can resolve this issue without losing the already calculated .las files? Thank you, Michel

pb-cdunn commented 8 years ago

What commits of FALCON-integrate/FALCON/pypeFLOW are you using? (Run git rev-parse HEAD in each. git submodule status can be helpful from FALCON-integrate.)
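
For reference, a minimal way to gather that information (assuming the standard FALCON-integrate checkout layout):

cd FALCON-integrate
git rev-parse HEAD        # commit of the top-level FALCON-integrate repo
git submodule status      # commits of FALCON, pypeFLOW, DALIGNER, DAZZ_DB, ...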

MichelMoser commented 8 years ago

I use FALCON-integrate. Hope this helps:

git submodule status

5d527739295c82bf4a141532d61019b9d155cc99 DALIGNER (heads/master)
993e2bd578fbdf042cf52d41eaccf2cde35c444c DAZZ_DB (heads/master)
acb722c6f586d448690d2ed2db5c86fb049ae038 FALCON (0.2JASM-290-gacb722c)
0515239714fd3f25ce55fbf6bf7cde9a6aeead22 FALCON-examples (heads/master)
a05d2555eed025117469c9fc7718414f7a6f909c FALCON-make (heads/master)
483836854e1bafb6e40cf17f3a9382efe351daad git-sym (heads/master)
4333093a2e774101030cedac8869d03c7d0d8469 pypeFLOW (v0.1.0-46-g4333093)
8e9cb4a7c34a12762a0ef8b2c5003ded4346cd49 virtualenv (13.0.3-9-g8e9cb4a)

pb-cdunn commented 8 years ago

And what commit of FALCON-integrate?

MichelMoser commented 8 years ago

commit 0c11a7e30de3d2e65bc681818921f27e94f4d0ec

pb-cdunn commented 8 years ago

You're up-to-date.

If exitOnFailure is set for pypeflow (it probably is for you), then an exception is thrown on the first failure. You definitely have a failure: the job with a _done.exit file. After that, the already-running jobs might be allowed to finish (I'm not sure), but no new ones would start.

First, update to the latest master. That will give you much faster restarts.
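
A generic way to do that (the exact rebuild steps depend on the FALCON-integrate instructions for your version):

cd FALCON-integrate
git checkout master && git pull
git submodule update --init   # move the submodules to the recorded commits
# then rebuild/reinstall per the FALCON-integrate README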

Then remove the .exit files in the job directories. After that you can restart, and the failed jobs should run again.
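
Something like this should clear the failure markers (a sketch; it assumes the default 0-rawreads layout shown above):

cd 0-rawreads
find . -name '*_done.exit' -delete   # remove the failure markers so those jobs are retried on restart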

But first, try setting ovlp_concurrent_jobs=2 in your .cfg. That way you'll have exactly two jobs at a time, and you can get a better idea of what's going on. Run with job_type=local too.
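
For example, the relevant lines in your fc_run.cfg would look roughly like this (all other settings left as they are; [General] is the usual section header):

[General]
job_type = local
ovlp_concurrent_jobs = 2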

It's always best to experiment with smaller genomes. In particular, the example under FALCON-examples/run/synth0 is tiny. If you change its .cfg to use a smaller -u unit for DBsplit, you can create as many jobs as you'd like, all extremely fast. You can also experiment with Python logging, DEBUG level.
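
As a rough sketch, the block size is controlled through the DBsplit option strings in the .cfg; the key names below are the usual fc_run.cfg ones, and the exact flags depend on your DAZZ_DB version:

pa_DBsplit_option = -x500 -s1
ovlp_DBsplit_option = -s1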

But my guess is that you have a failing daligner job, and we don't currently have a good way to help you with that. If we can prove that's the problem, then I can convince my management to let me work on a way to ignore the bad job, since it might fail reliably on subsequent restarts.

In other words, I need your help to find the exact problem. I'll be back Monday.

pb-cdunn commented 8 years ago

I'm pretty sure the issue is that you have exitOnFailure and a failed task. I'm not sure what to do about that. We could keep running when there are some daligner failures (e.g. #254), but we need to make the subsequent tasks more robust, since in theory they could work even with a reduced set of .las files, if a lower quality assembly is ok. I'll talk to others about this.

MichelMoser commented 8 years ago

Hi, yes that's true, I have exitOnFailure set. Could you tell me how to unset it? Additionally, I might now be running into a disk space shortage on the cluster, so the termination of the daligner jobs is very likely due to running out of disk space (after using 2 TB).

Is it possible to copy the whole directory structure of the FALCON run to another destination and continue the calculations, or is it better to start from scratch?

Thank you, Michel

pb-cdunn commented 8 years ago

exitOnFailure set. Could you tell me how to unset it?

At the moment you'd have to fork FALCON and modify it. For now, try pulling the version from my pb-cdunn repo.
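
Something along these lines should work (the remote name, URL, and branch are illustrative; substitute the actual pb-cdunn fork):

cd FALCON-integrate/FALCON
git remote add pb-cdunn https://github.com/pb-cdunn/FALCON.git
git fetch pb-cdunn
git checkout pb-cdunn/master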

disk space

That's interesting! We'll think about what to do in that case.

better to start from scratch?

Well, at the moment that might be easiest for you, unfortunately.