PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

fast restart of Falcon idea #239

Open dgordon562 opened 8 years ago

dgordon562 commented 8 years ago

Right now it takes Falcon about an hour or so to read through (with mammalian genomes) all of the directories to determine what has been done and what jobs still need to be submitted. My proposal is to reduce this hour to a few seconds.

Have the done files written, in addition to the directories, to a single file (or, if there are locking problems, to a single directory) so that Falcon doesn't have to read all of the job and m_ directories...
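A minimal sketch of the proposal: on restart, read one append-only done-log into a set instead of stat-ing a sentinel file in every job and m_ directory. The file name `all_done.log` and the function names are hypothetical, not part of Falcon/pypeflow.

```python
import os

DONE_LOG = "all_done.log"  # hypothetical single append-only log of finished job names

def load_done_jobs(run_dir):
    """Read the single done-log once: one file open instead of one stat per job."""
    path = os.path.join(run_dir, DONE_LOG)
    if not os.path.exists(path):
        return set()
    with open(path) as f:
        return set(line.strip() for line in f if line.strip())

def jobs_to_submit(all_jobs, run_dir):
    """Return only the jobs not yet recorded as done."""
    done = load_done_jobs(run_dir)
    return [j for j in all_jobs if j not in done]
```

With thousands of jobs this turns thousands of NFS round trips into a single sequential read.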

I always have to restart Falcon between 8 and 20 times for a typical genomes (even more when there is a bug in one of the qsub'd programs such as daligner), so this would really save time.

pb-cdunn commented 8 years ago

Interesting. We definitely want fast restarts. How many done files are we talking about?

If you re-open this in pypeflow, we can move the discussion there.

pb-jchin commented 8 years ago

Ideally, replacing sentinel-file creation and detection with a database with ACID properties would help on a slow file system. However, it would bring in new dependencies. If we write all dependency sentinels into a single file, we will need to introduce a new data-object class in pypeflow and make sure the file locking works. (NFS may not support file locks well, and SQLite has also had problems with parallel writes.)

dgordon562 commented 8 years ago

NFS supports locks as long as you open the file for writing. So you would just have one additional 0-length file that you would open for writing, flock it, append to the "done" file, close the "done" file, and release the lock on the 0-length file. I can supply a test program (I've done this) if you like.
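A sketch of the locking scheme described above, using `fcntl.flock` on a separate 0-length lock file to serialize appends to the shared "done" file (Unix-only; whether `flock` behaves well over NFS depends on the NFS version and mount options, so treat this as an illustration, not a guarantee):

```python
import fcntl

def append_done(lock_path, done_path, job_name):
    """Append one 'done' record under an advisory lock.

    lock_path is a separate 0-length file opened for writing; holding an
    exclusive flock on it serializes appends across processes. The lock is
    released when the lock file is unlocked (and on close)."""
    with open(lock_path, "w") as lock_f:
        fcntl.flock(lock_f, fcntl.LOCK_EX)   # block until we hold the lock
        try:
            with open(done_path, "a") as done_f:
                done_f.write(job_name + "\n")
                done_f.flush()
        finally:
            fcntl.flock(lock_f, fcntl.LOCK_UN)
```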

pb-cdunn commented 8 years ago

@dgordon562, Any idea how many files? I'd like to know the size of the problem. Can you tell us how many blocks you have?

@pb-jchin, I would use services. I could draw it up for you sometime.

A simpler idea -- easy to implement quickly -- is to store the dependency graph persistently. "Done" could still be communicated via NFS, but the main program would always store its current graph on exit. No lock is necessary.
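A sketch of that simpler idea, assuming the graph fits in a plain dict: the driver atomically dumps its dependency graph to human-readable JSON on exit and reloads it on restart, falling back to a full directory scan if the file is missing or corrupt. The file name and function names are hypothetical.

```python
import json
import os

STATE = "workflow_graph.json"  # hypothetical persistent graph file

def save_graph(graph, path=STATE):
    """Atomically write the in-memory dependency graph on exit.
    No lock is needed: only the main driver process writes this file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(graph, f, indent=2, sort_keys=True)  # human-readable
    os.replace(tmp, path)  # atomic rename: readers never see a partial file

def load_graph(path=STATE):
    """On restart, reload the graph instead of walking job directories."""
    try:
        with open(path) as f:
            return json.load(f)
    except (IOError, ValueError):
        return {}  # missing or corrupt: fall back to a full directory scan
```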

This gets a bit complicated when you think about re-running some steps. Again, a service can help.

dgordon562 commented 8 years ago

Sorry for the delay. I was on the other communication medium (please check; I'll be working on this over the weekend).

Number of files: 2000+ job files and around that many m_ files. Having the done files distributed across so many directories is probably what slows Falcon/pypeflow down.

I would suggest a simple solution that is debuggable by users and human-readable.

pb-cdunn commented 8 years ago

2000 file-stats is not that bad. Maybe seconds, but not an hour. I think there is something going on in Python. I don't think this is another O(N^2) problem though. I don't like how long it takes to restart even for small examples. I haven't put any time into it yet though.
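A back-of-envelope check of the claim above: stat-ing ~2000 sentinel files on a local filesystem takes milliseconds, while on NFS each stat can be a network round trip, so the same loop can be orders of magnitude slower. This helper (hypothetical, for illustration only) creates n empty sentinel files and times checking them all:

```python
import os
import tempfile
import time

def time_stats(n=2000):
    """Create n empty sentinel files, then time one existence check per file.
    Returns (number found, elapsed seconds)."""
    d = tempfile.mkdtemp()
    paths = []
    for i in range(n):
        p = os.path.join(d, "job_%04d.done" % i)
        open(p, "w").close()
        paths.append(p)
    t0 = time.time()
    n_done = sum(1 for p in paths if os.path.exists(p))
    return n_done, time.time() - t0
```

If this runs in milliseconds locally but the real restart takes an hour, the bottleneck is probably not the stats themselves but what the Python driver does around them.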

pb-jchin commented 8 years ago

In our system, it takes less than 10 minutes.

dgordon562 commented 8 years ago

Started Falcon at Tue Nov 10 06:50:43 2015. First job submitted at Tue Nov 10 07:51:17 2015. That is a little over 1 hour. This is on the 0-rawreads LA4Falcon stage.

When Falcon is in the initial daligner stage it doesn't take so long.

Our system is more typical, judging from conversations with other users. So if you want Falcon to work well on most users' systems, it is insufficient to make it work well on yours.

pb-cdunn commented 8 years ago

[I]f you want Falcon to work well on most users' systems, it is insufficient to make it work well on yours.

True. But even 10 minutes is too long for my taste. We'll make it faster here, and then we'll look for feedback.

I think that part of the delay is pypeflow sleeping. It basically submits every job and then detects that they are already done. If we don't "submit" already done jobs into the queue, then we'll be down to just file-stats.
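The idea above can be sketched as a filter in front of the scheduler: consult the "done" set up front and never enqueue finished jobs, so the restart loop skips the submit-then-poll sleep cycle for work that is already complete. All names here are hypothetical.

```python
def submit_pending(jobs, done, submit):
    """Submit only jobs not already in the 'done' set.

    jobs:   iterable of job names, in dependency order
    done:   set of job names already completed (e.g. read from done files)
    submit: callable that actually enqueues one job
    Returns the list of jobs that were submitted."""
    submitted = []
    for job in jobs:
        if job in done:
            continue  # already finished: no submission, no polling sleep
        submit(job)
        submitted.append(job)
    return submitted
```

Usage: `submit_pending(all_jobs, load_done_set(), qsub_submit)` would hand the queue only the genuinely pending work.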

Or I could be wrong. If it's just file-stats, then we'll use some kind of database, and maybe a service.

Anyway, we'll definitely work on this, hopefully soon.

dgordon562 commented 8 years ago

The reason I prefer a human-readable file is that I can imagine several circumstances in which users will need to view, and even modify, that file. For example, it will inevitably become corrupted sometimes, and restarting the entire assembly would be a bad option. Another example: some jobs might be run outside of Falcon (as I sometimes do), and then I would want to update the file so fc_run.py doesn't try to run them again itself.

pb-cdunn commented 8 years ago

Ok. We will aim for human-readable.

dgordon562 commented 8 years ago

Thanks, Chris!

Last night I had to restart Falcon during the 1-preads stage after daligner, and this time it took 2 hours (!) before Falcon submitted any jobs. I restarted at 5:15pm, and it submitted the first job rp_00047.sh-m_00047_preads-m_00047_preads at 7:15pm. During these 2 hours fc_run.py was cranking away at between 100% and 150% CPU. I'm not sure what it was doing this entire time. I had never seen such a long delay before submitting jobs, but then I don't think I've ever had to restart Falcon during the 1-preads stage.