PacificBiosciences / FALCON

FALCON: experimental PacBio diploid assembler -- Out-of-date -- Please use a binary release: https://github.com/PacificBiosciences/FALCON_unzip/wiki/Binaries

daligner job distribution #8

Open pb-jchin opened 9 years ago

pb-jchin commented 9 years ago

Currently, fc_run uses HPCAligner to construct the daligner jobs. It would be useful to build the logic for distributing the jobs without calling HPCAligner. This will be necessary if we want to offer the ability to run daligner jobs progressively as new data are added. It would also let us eliminate the "find" command in the merge and sort step.
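Roughly, the job construction could look something like the sketch below (illustrative only, not FALCON's actual code; the block naming and the daligner call layout are simplified assumptions):

```python
# Illustrative sketch only: enumerate daligner block comparisons
# directly instead of parsing HPCAligner output. Block naming and the
# daligner call layout are simplified assumptions.

def daligner_jobs(db_prefix, n_blocks, blocks_per_call=4):
    """Yield daligner command lines, comparing each block i against
    blocks 1..i, with at most `blocks_per_call` B-blocks per command."""
    for i in range(1, n_blocks + 1):
        targets = ['{}.{}'.format(db_prefix, j) for j in range(1, i + 1)]
        for k in range(0, len(targets), blocks_per_call):
            chunk = targets[k:k + blocks_per_call]
            yield 'daligner {}.{} {}'.format(db_prefix, i, ' '.join(chunk))

# Because block i is only compared against blocks 1..i, appending new
# blocks later only adds new commands; existing jobs never change, which
# is what a progressive/incremental mode needs.
for cmd in daligner_jobs('raw_reads', 6):
    print(cmd)
```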

zyndagj commented 8 years ago

Jason,

Since this is currently a targeted enhancement, I have two suggestions.

Daligner commands

Not all systems have infinite argument limits.

$ getconf ARG_MAX

System     ARG_MAX
OSX        262144
Stampede   4611686018427387903

Depending on the number of blocks to compare, the daligner calls also grow in length as the number of jobs grows. If you are going to build the jobs without HPCAligner, is there a way to slim down the calls?
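For example, a small guard like this (illustrative Python, not part of FALCON) could check each generated command against ARG_MAX before using it:

```python
# Illustrative only: guard a generated command line against the system
# ARG_MAX (the same value reported by `getconf ARG_MAX`). The headroom
# and splitting policy are assumptions, not FALCON behavior.
import os

ARG_MAX = os.sysconf('SC_ARG_MAX')

def fits_arg_max(cmd, limit=ARG_MAX, headroom=4096):
    """True if `cmd` fits under the limit, leaving headroom because the
    environment also counts against ARG_MAX on most systems."""
    return len(cmd) + headroom < limit

# Commands that fail this check would need to be split into smaller
# daligner calls (fewer blocks per call) or read from a file instead.
```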

Job Bundling

Bundling all the jobs into a single srun with a distributed worker queue like TACC-launcher would be great for larger systems. There are usually job-submission limits that cap the number of concurrent jobs in the queue, so a single user can't take down the scheduler by forking off sruns. Using something like TACC-launcher allows a single srun to work through any number of commands on the maximum number of nodes without stressing the scheduler.

These large single jobs also do wonders for queue priority.
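A minimal sketch of the bundling idea (the file name and worker setup are made up; the real TACC-launcher configuration differs in its details):

```python
# Minimal sketch of job bundling: write one command per line into a flat
# job file that a launcher-style worker pool can consume. The file name
# is made up for the example.
def write_job_file(commands, path='daligner_jobs.txt'):
    with open(path, 'w') as fh:
        for cmd in commands:
            fh.write(cmd + '\n')
    return path

# A single scheduler submission (one srun/sbatch) would then start N
# workers that repeatedly claim the next unfinished line from this file,
# instead of submitting one cluster job per command.
```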

pb-cdunn commented 8 years ago

This idea may be obsolete.

Not all systems have infinite argument limits.

Have you actually neared a limit? You would need tens of thousands of blocks. When it becomes a problem, we can split up our parsed results. Or we can simply change HPCdaligner to limit the number of blocks per job. (That might be hard-coded somewhere already.)

Job Bundling

TACC-launcher is a good idea. We'll look into that.

These large single jobs also do wonders for queue priority.

Could you elaborate? Wouldn't they tend to prevent other, shorter jobs from sharing the cluster?

zyndagj commented 8 years ago

Have you actually neared the limit?

When I was still using the recommended E. coli parameters (-dal 4) on a much larger input, I probably crossed the 262k-character limit on OSX, but I haven't gotten near the 2^62 limit on Stampede.

Regarding queue priority, there's a detailed explanation of how SLURM calculates priority here:

https://computing.llnl.gov/linux/slurm/priority_multifactor.html#general

At TACC, and on most larger clusters, we favor larger jobs to optimize utilization. Ignoring all other factors, a single job requesting 50 nodes would have higher priority (starting faster) than 50 jobs each requesting a single node. If we looked at fair-share, the priority of each of the single jobs would decrease with each completed job as well.
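As a toy illustration of that weighting (the factors follow the multifactor plugin, but the weights and cluster size below are invented, not TACC's configuration):

```python
# Toy illustration of SLURM multifactor priority: a weighted sum of
# normalized factors (age, fair-share, job size, ...). The weights and
# the 4,000-node cluster size are invented for this example.
def job_priority(age, fairshare, job_size,
                 w_age=1000, w_fairshare=10000, w_jobsize=4000):
    # each factor is normalized to the range 0.0 - 1.0
    return w_age * age + w_fairshare * fairshare + w_jobsize * job_size

one_big = job_priority(age=0.1, fairshare=0.5, job_size=50 / 4000.0)
one_small = job_priority(age=0.1, fairshare=0.5, job_size=1 / 4000.0)
print(one_big > one_small)  # True: the 50-node job starts with higher priority
```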

pb-cdunn commented 8 years ago

But a single job requesting 1 node would then wait behind everything? For cluster responsiveness, it's really the smallish jobs that I'd worry about. The big jobs are expected to take a long time and to run only when not a lot else is going on. Am I wrong?

zyndagj commented 8 years ago

These large research clusters are usually designed to serve large MPI jobs that require either distributed data or a huge amount of cooperative computation, so priority is given to them because they cannot run anywhere else. However, you are correct that larger (> 256-node) jobs do run during specific reservations or after the queue is depleted. To ensure no job monopolizes resources, we also impose a hard time limit, which encourages demanding software to scale along with many-core and distributed hardware.

Each job also has overhead for preparing nodes and cleaning up afterwards to ensure the environment is clean and secure for the next user. This is why we prefer that tasks either be strung together with launcher for many-core and distributed execution, or be run throughout the day in a single interactive session, allowing for better cycle utilization.

pb-cdunn commented 8 years ago

Helpful info. Thanks.

We will definitely move toward combining pieces of work into "chunks" to run as single job submissions. But that seems simple. Why does TACC-launcher have so much code for something so simple? What else would it get us?

zyndagj commented 8 years ago

Besides knowing how to launch jobs on both SGE and SLURM, it can also launch jobs on our Xeon Phi accelerators. Much like traditional MPI jobs, launcher processes are spawned on each node according to the job's processes-per-node and number-of-nodes settings. Then each worker dynamically takes the next piece of work from the shared list instead of working through its own separate queue.
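Here is a minimal single-node sketch of that dynamic scheduling idea, with plain multiprocessing standing in for launcher's per-node workers (not launcher's actual implementation):

```python
# Sketch of dynamic task distribution: workers pull the next command from
# a shared queue instead of being handed a fixed slice of work up front.
# This only stands in for what launcher does across nodes.
import multiprocessing
import subprocess

def worker(queue):
    while True:
        cmd = queue.get()
        if cmd is None:              # sentinel: no more work
            return
        subprocess.call(cmd, shell=True)

def run_bundle(commands, nworkers=4):
    queue = multiprocessing.Queue()
    for cmd in commands:
        queue.put(cmd)
    for _ in range(nworkers):
        queue.put(None)              # one sentinel per worker
    procs = [multiprocessing.Process(target=worker, args=(queue,))
             for _ in range(nworkers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```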

I've never actually installed launcher myself and don't see anything TACC-specific except for maybe the plugins/*.rmi scripts. I'll ask @lwilson about general installation procedures and how feasible it would be to refactor this inside pypeFLOW.

zyndagj commented 8 years ago

I followed up with @lwilson today, and he confirmed that there are no TACC-specific dependencies in the latest version of TACC-launcher. The only thing you need to set is $LAUNCHER_RMI, which announces the type of scheduler on the system; it currently supports SLURM and SGE.

We have successfully used this to complete a single molecular docking workflow with 640,000 vina tasks in 18 minutes on 4,000 nodes, so it won't have trouble with large jobs.