camall3n / onager

Lightweight python library for launching experiments and tuning hyperparameters, either locally or on a cluster
MIT License

Add multiworker/subjobs functionality #43

Closed camall3n closed 2 years ago

camall3n commented 2 years ago

@samlobel and I made this upgrade that adds two new args to the launch subcommand: --tasks-per-node and --max-tasks-per-node.

The default value, --tasks-per-node=1, behaves the same way as before. But when --tasks-per-node > 1, it switches into "multiworker" mode. Each node that gets scheduled on the backend (slurm/gridengine) will subsequently run the local backend with the desired number of subjobs.

The --max-tasks-per-node arg has a default value of -1, which automatically detects the number of cores on the node and gives you that many workers for processing your subjobs. You can override it with a smaller value, for example if you wanted 4 jobs to run with 2 cpus each on a node with 4 cpus (so that only 2 of them run at a time). You can also override it with a larger value, for example if you knew the jobs would mostly be sleeping or waiting on I/O.
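
As a rough illustration of the worker-count rule described above, here is a minimal sketch. The helper name `resolve_num_workers` and its parameters are hypothetical and not taken from onager's code; the only behavior assumed from this PR is that -1 means "use as many workers as there are cores". Capping the result at the number of subjobs is an extra assumption made for the sketch.

```python
import os


def resolve_num_workers(tasks_per_node: int, max_tasks_per_node: int = -1) -> int:
    """Return how many subjobs may run at once on one node (hypothetical helper)."""
    if tasks_per_node < 1:
        raise ValueError("tasks_per_node must be at least 1")
    if max_tasks_per_node == -1:
        # Default: one worker per core reported by the OS.
        limit = os.cpu_count() or 1
    elif max_tasks_per_node >= 1:
        # Explicit override, which may be smaller or larger than the core count.
        limit = max_tasks_per_node
    else:
        raise ValueError("max_tasks_per_node must be -1 or a positive integer")
    # Assumption for this sketch: never start more workers than there are subjobs.
    return min(tasks_per_node, limit)
```

With `tasks_per_node=4` and `max_tasks_per_node=2`, this returns 2, so the four subjobs would be processed two at a time.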

Why do this crazy thing?

We tested it a bit on both slurm and gridengine and it seems to not break anything! 😃

camall3n commented 2 years ago

I converted the PR to draft because I want to rebase before merging. Would still love your review when you get a chance @neevparikh !

camall3n commented 2 years ago

- Add tests!
    - Show that it can divvy up complicated task lists (1-3,5-9,13) for subtasks
    - Make sure it works properly when there are more subjobs than cores
    - Make sure you can launch two of these subjob things at once and have it work nicely. This might mean the subjob file needs to have the slurm ID in it or something.

- Documentation:
    - explain that `duration`, `cpus`, etc. always apply to the multiworker, not the subjob
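
To make the first test idea concrete, here is a minimal sketch of expanding a task list such as `1-3,5-9,13` and dividing it among nodes. The function names `parse_task_list` and `split_into_nodes` are hypothetical and are not onager's actual API; the spec format is assumed to be comma-separated ids and inclusive ranges, as in the example above.

```python
from typing import List


def parse_task_list(spec: str) -> List[int]:
    """Expand a spec such as '1-3,5-9,13' into a sorted list of task ids."""
    task_ids = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            # Inclusive range, e.g. '5-9' covers 5, 6, 7, 8, 9.
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid range: {part!r}")
            task_ids.update(range(start, end + 1))
        else:
            task_ids.add(int(part))
    return sorted(task_ids)


def split_into_nodes(task_ids: List[int], tasks_per_node: int) -> List[List[int]]:
    """Group task ids into chunks of at most tasks_per_node, one chunk per node."""
    if tasks_per_node < 1:
        raise ValueError("tasks_per_node must be at least 1")
    return [
        task_ids[i:i + tasks_per_node]
        for i in range(0, len(task_ids), tasks_per_node)
    ]
```

For `1-3,5-9,13` with 4 tasks per node, this gives three groups: `[1, 2, 3, 5]`, `[6, 7, 8, 9]`, and `[13]`. A test could assert exactly that, plus the edge cases where the last node gets fewer subjobs than the others.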

neevparikh commented 2 years ago

Nice, give me today to look at it!

camall3n commented 2 years ago

I get an AttributeError on my machine:

```
$ python
Python 3.9.6 (default, Jun 29 2021, 06:20:32)
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.cpu_count()
4
>>> os.sched_getaffinity(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'os' has no attribute 'sched_getaffinity'
>>>
```
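
`os.sched_getaffinity` is only available on some Unix platforms (Linux has it, macOS does not), which would explain the error above. One possible workaround is sketched below; the helper name `available_cpu_count` is hypothetical and this is not necessarily the fix that was applied in this PR.

```python
import os


def available_cpu_count() -> int:
    """Best-effort count of the CPUs this process may use (hypothetical helper)."""
    if hasattr(os, "sched_getaffinity"):
        # Where supported, this respects CPU affinity limits set by the scheduler.
        return len(os.sched_getaffinity(0))
    # Fallback for platforms such as macOS: total CPUs, or 1 if undetermined.
    return os.cpu_count() or 1
```

The affinity-based count is preferred when it exists because a cluster scheduler may restrict the job to fewer cores than the machine has, while `os.cpu_count()` always reports the machine total.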

camall3n commented 2 years ago

Another potential use-case:

When grid jobs fail due to "Eqw", there are sometimes multiple non-contiguous job ranges that are affected, which need to be re-launched separately. Normally onager will split up the corresponding task list into different task blocks, but that requires executing multiple qsub commands in sequence to handle each block.

If we simply allowed the multiworker to run with 1 task per node, this feature would automatically re-number the jobs as subjobs and launch everything under a single new jobid, which would make it easier to keep track of everything.
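
The difference between the two approaches can be sketched as follows. Both function names are hypothetical and the task ids are made up for illustration; this is not onager's implementation, only a sketch of the idea of contiguous task blocks versus renumbered subjobs.

```python
from typing import Dict, List, Tuple


def contiguous_blocks(task_ids: List[int]) -> List[Tuple[int, int]]:
    """Group task ids into inclusive (start, end) blocks of consecutive ids."""
    blocks: List[Tuple[int, int]] = []
    for task_id in sorted(set(task_ids)):
        if blocks and task_id == blocks[-1][1] + 1:
            # Extends the current block by one.
            blocks[-1] = (blocks[-1][0], task_id)
        else:
            # Gap found: start a new block (one more submission).
            blocks.append((task_id, task_id))
    return blocks


def renumber_as_subjobs(task_ids: List[int]) -> Dict[int, int]:
    """Map contiguous subjob indices (starting at 1) to the original task ids."""
    return {
        subjob: task_id
        for subjob, task_id in enumerate(sorted(set(task_ids)), start=1)
    }


failed = [3, 4, 5, 9, 12, 13]  # made-up example of tasks stuck in "Eqw"
blocks = contiguous_blocks(failed)     # three blocks, so three separate submissions
mapping = renumber_as_subjobs(failed)  # subjobs 1..6 under a single submission
```

Here the failed tasks form the blocks `(3, 5)`, `(9, 9)`, and `(12, 13)`, whereas renumbering yields subjobs 1 through 6 that could all run under one job id.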

camall3n commented 2 years ago

@samlobel I updated the docs. There's a system overview and a multiworker design doc.

Let me know if I'm missing anything.