MichelMoser opened this issue 8 years ago
What commits of FALCON-integrate/FALCON/pypeFLOW are you using? (Run `git rev-parse HEAD` in each. `git submodule status` can be helpful from FALCON-integrate.)
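For reference, from the root of a FALCON-integrate checkout that would be something like this (a sketch):

```sh
# Commit of FALCON-integrate itself:
git rev-parse HEAD
# Commits of all submodules (DALIGNER, FALCON, pypeFLOW, ...):
git submodule status
# or, per submodule:
git submodule foreach 'git rev-parse HEAD'
```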
I use FALCON-integrate. Hope this helps:
```
git submodule status
 5d527739295c82bf4a141532d61019b9d155cc99 DALIGNER (heads/master)
 993e2bd578fbdf042cf52d41eaccf2cde35c444c DAZZ_DB (heads/master)
 acb722c6f586d448690d2ed2db5c86fb049ae038 FALCON (0.2JASM-290-gacb722c)
 0515239714fd3f25ce55fbf6bf7cde9a6aeead22 FALCON-examples (heads/master)
 a05d2555eed025117469c9fc7718414f7a6f909c FALCON-make (heads/master)
 483836854e1bafb6e40cf17f3a9382efe351daad git-sym (heads/master)
 4333093a2e774101030cedac8869d03c7d0d8469 pypeFLOW (v0.1.0-46-g4333093)
 8e9cb4a7c34a12762a0ef8b2c5003ded4346cd49 virtualenv (13.0.3-9-g8e9cb4a)
```
And what commit of FALCON-integrate?
commit 0c11a7e30de3d2e65bc681818921f27e94f4d0ec
You're up-to-date.
If `exitOnFailure` is set for pypeflow (it probably is for you), then an exception is thrown on the first failure. You definitely have a failure: the job with a `_done.exit` file. After that, the other jobs might be allowed to finish (I'm not sure), but definitely no others would start.
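To find the failed job(s), something like this should do it (a sketch; it assumes the job tree lives under 0-rawreads, as in your run):

```sh
# Each failed job leaves a *_done.exit sentinel in its working directory.
find 0-rawreads -name '*.exit'
# The directory containing such a file is the failed job; the stderr/log
# files in it should show why daligner died there.
```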
First, update to the latest master. That will give you much faster restarts.
Then remove the `.exit` files in the job directories. After that, you can restart, and the failed jobs should run again.
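Roughly (a sketch; run from your FALCON-integrate checkout, and rebuild with whatever make target you normally use):

```sh
# Update FALCON-integrate and its submodules, then rebuild/reinstall
# (e.g. via FALCON-make, however you normally build).
git pull
git submodule update --init

# Clear the failure sentinels so those jobs are eligible to run again,
# then restart fc_run.py with your .cfg as before.
find 0-rawreads -name '*.exit' -delete
```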
But first, try setting `ovlp_concurrent_jobs=2` in your .cfg. That way you'll have exactly two jobs at a time, and you can get a better idea of what's going on. Run with `job_type=local` too.
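In the .cfg that would look roughly like this (a minimal sketch showing only those two options, assuming the usual [General] section; leave the rest of your settings untouched):

```ini
[General]
job_type = local
ovlp_concurrent_jobs = 2
```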
It's always best to experiment with smaller genomes. In particular, the example under FALCON-examples/run/synth0 is tiny. If you change its .cfg to use a smaller `-u` unit for DBsplit, you can create as many jobs as you'd like, all extremely fast. You can also experiment with Python logging at the DEBUG level.
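For instance, in synth0's fc_run.cfg you could shrink the split unit via the usual DBsplit option keys (a sketch; the flag values here are only illustrative, so check DBsplit's usage in your DAZZ_DB version):

```ini
[General]
# A smaller split unit cuts the tiny synth0 genome into more blocks,
# i.e. more (and very fast) daligner jobs to experiment with.
pa_DBsplit_option = -x500 -s1
ovlp_DBsplit_option = -x500 -s1
```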
But my guess is that you have a failing daligner job, and we don't currently have a good way to help you with that. If we can prove that's the problem, then I can convince my management to let me work on a way to ignore the bad job, since it might fail reliably on subsequent restarts.
In other words, I need your help to find the exact problem. I'll be back Monday.
I'm pretty sure the issue is that you have `exitOnFailure` and a failed task. I'm not sure what to do about that. We can keep running when there are some daligner failures (e.g. #254), but we need to make the subsequent tasks more robust, since in theory they could work even with a reduced set of .las files, if a lower-quality assembly is OK. I'll talk to others about this.
Hi, yes, that's true, I have `exitOnFailure` set. Could you tell me how to unset it? Additionally, I might now be running into a disk-space shortage on the cluster, so the termination of the daligner jobs is very likely due to no space left on the disk (after using 2 TB).
Is it possible to copy the whole data structure of the FALCON run to another destination and continue the calculations? Or is it better to start from scratch?
Thank you, Michel
> exitOnFailure set. Could you tell me how to unset it?
At the moment, you'd have to fork FALCON and modify it. For now, try pulling the version from my pb-cdunn repo.
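If you do want to patch it yourself, the thing to change is wherever FALCON passes exitOnFailure down to pypeflow; something like this should locate the spot (a sketch, run from the FALCON-integrate checkout):

```sh
# Find where the exitOnFailure flag is set/consumed, to know what to edit.
grep -rn exitOnFailure FALCON/ pypeFLOW/
```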
> disk space
That's interesting! We'll think about what to do in that case.
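To confirm that, a quick check on the filesystem holding the run directory is probably the first step (a sketch):

```sh
# Free space on the filesystem the run lives on, and how much the
# raw-read overlap stage has consumed so far.
df -h .
du -sh 0-rawreads
```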
> better to start from scratch?
Well, at the moment that might be easiest for you, unfortunately.
Hi, I am trying to run FALCON on a large dataset (50 SMRT Cells). 126 subdirectories were created (from 126 blocks) in the 0-rawreads directory and are being processed.
But instead of processing all of them, I get a weird message from pypeflow.controller, which counts down each job until they are all finished instead of starting new daligner jobs for the yet-unprocessed directories.
Stderr shows me these warnings every 2 seconds:
When checking whether all jobs are finished, I see that a lot of them are still unprocessed or were terminated when I killed FALCON:
Do you know how I can resolve this issue without losing the already calculated .las files? Thank you, Michel
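To see how much is already done without touching anything, something like this should work (a sketch; it assumes the daligner job directories match job_* under 0-rawreads and that finished jobs contain .las output, so adjust the patterns to whatever your 126 subdirectories actually look like):

```sh
# Classify job directories by whether they already produced .las output.
cd 0-rawreads
for d in job_*/; do
    if ls "$d"*.las >/dev/null 2>&1; then
        echo "done:    $d"
    else
        echo "pending: $d"
    fi
done
# The finished .las files are ordinary files on disk; clearing the *.exit
# sentinels and restarting does not delete them.
```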