flatironinstitute / disBatch

Tool to distribute a list of computational tasks over a pool of compute resources. The pool can grow or shrink.
Apache License 2.0

Tasks-per-node fixes #25

Closed lgarrison closed 1 year ago

lgarrison commented 2 years ago

I started skimming the recent changes to the tasks-per-node handling, and the logic seems straightforward: tasks-per-node and cpus-per-task can be specified either as arguments to disBatch or to Slurm, which can result in conflicts. When they conflict, the disBatch arguments take precedence and the user is warned.
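In rough pseudocode, the rule is something like the following (a hypothetical helper for illustration, not the actual disBatch implementation):

```python
import os
import warnings

def resolve(name, db_value, slurm_env_var):
    # Hypothetical helper illustrating the precedence described above: if both
    # disBatch and Slurm specify a value and they disagree, keep the disBatch
    # value and warn the user.
    slurm_value = os.environ.get(slurm_env_var)
    if db_value is None:
        return int(slurm_value) if slurm_value is not None else None
    if slurm_value is not None and int(slurm_value) != db_value:
        warnings.warn(f"disBatch argument {name} ({db_value}) conflicts with "
                      f"{slurm_env_var} ({slurm_value}). Using disBatch value.")
    return db_value

cpus_per_task = resolve("cpusPerTask", 2, "SLURM_CPUS_PER_TASK")
```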

I did spot one likely bug, so I wanted to open this PR for @njcarriero to look at. I think there's a typo: taskPerNode where it should be tasksPerNode. The other changes in this PR are mostly cosmetic.

I haven't actually run the code yet but will try it out soon.

lgarrison commented 2 years ago

@njcarriero I ran the updated master under Slurm, and everything looks good to me. I got warnings in the logs when the Slurm values didn't match the disBatch flags, and the run proceeded with the disBatch values.

I guess my only remaining question is whether the warnings should be echoed to stdout/stderr by default, instead of hidden in the logs... If so, we'd probably want to suppress or move some of the developer-facing warnings, like "Retirement planning needs improvement". I don't have a strong opinion, but it's something to consider.

The only remaining changes on this branch are installation fixes: disBatch.py isn't used as a script anymore, so I removed it from the scripts argument in setup.py (pip install fails otherwise). Likewise, I removed a shebang line, since disBatch.py isn't invoked directly anymore.
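Roughly, the setup.py side of it looks like this (the metadata and names below are illustrative only, not the project's actual setup.py):

```python
from setuptools import setup

setup(
    name="disbatch",         # illustrative metadata only
    packages=["disbatch"],   # hypothetical package name
    scripts=[
        # "disBatch.py",     # removed: no longer invoked directly, and listing
        #                    # it here is what made `pip install` fail
    ],
)
```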

njcarriero commented 2 years ago

Good point. I wonder if we could/should add a custom log level for stuff that should be written to both the log and standard error?
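Something along these lines with the standard logging machinery, maybe (the level name and value are placeholders, not anything in the current code):

```python
import logging
import sys

# Placeholder custom level between WARNING (30) and ERROR (40).
USER = 35
logging.addLevelName(USER, "USER")

logger = logging.getLogger("disBatch")
logger.setLevel(logging.DEBUG)

# Everything still goes to the log file...
logger.addHandler(logging.FileHandler("disBatch.log"))

# ...but only records at the custom level or above also reach stderr.
console = logging.StreamHandler(sys.stderr)
console.setLevel(USER)
logger.addHandler(console)

logger.warning("Retirement planning needs improvement")      # log file only
logger.log(USER, "cpusPerTask conflicts with SLURM setting")  # log file and console
```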

lgarrison commented 1 year ago

I added a feature to echo certain context log statements to stderr in addition to the log. It wasn't as clean a change as I had hoped; it required removing the stderr redirects from both the subprocess.Popen() call and the invocation of Python in dbUtil.sh. It would probably be possible to tee both of these, but in any case it opens us up to getting all of the context's stderr on the console. One example is the "Invalid node name" message below, which shows up as a result:

(venv) lgarrison@scclin021:~/scc/disbatch-tasks-per-node$ salloc -c 4 disBatch -c 2 tasks.txt
salloc: Pending job allocation 1902959
salloc: job 1902959 queued and waiting for resources
salloc: job 1902959 has been allocated resources
salloc: Granted job allocation 1902959
salloc: Waiting for resource configuration
salloc: Nodes worker1215 are ready for job
UserWarning: disBatch argument cpusPerTask (2.0) conflicts with SLURM_CPUS_PER_TASK (4). Using disBatch value.
Invalid node name specified for job 1902959
salloc: Relinquishing job allocation 1902959

So... I'm not sure how much I love this idea after all. Maybe the best solution would be to have a "side channel" (i.e. another fd) that we write to and that gets echoed to the console, while normal stderr is redirected to a file as usual. But that's starting to get pretty convoluted.
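For the record, a rough sketch of what that side channel could look like (purely illustrative, not something in this PR):

```python
import os
import subprocess
import sys

# Create a pipe to serve as the user-facing side channel.
read_fd, write_fd = os.pipe()

with open("context.log", "w") as log:
    # The context's normal stderr still goes to a file, but the extra fd is
    # passed through so the child can write user-facing messages on it.
    child = subprocess.Popen(
        [sys.executable, "-c",
         f"import os; os.write({write_fd}, b'user-facing warning\\n')"],
        stderr=log,
        pass_fds=(write_fd,),
    )
    os.close(write_fd)  # parent keeps only the read end

    # Echo whatever arrives on the side channel to the console.
    with os.fdopen(read_fd) as side:
        for line in side:
            sys.stderr.write(line)

    child.wait()
```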

lgarrison commented 1 year ago

As discussed with @njcarriero, we'll leave stderr open to the console from the context, so we can write user-facing errors there (in addition to echoing them to the log). Console output will usually end up in a Slurm log, anyway.

I've tweaked the retirement command to capture any stdout/stderr and send it silently to the log. Seems to be working, so I think this is ready for merge.
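For anyone following along, the capture is along these lines (simplified; the actual command and logger names differ):

```python
import logging
import subprocess

logger = logging.getLogger("disBatch")

def retire_node(retire_cmd):
    # Run the node-retirement command with its output captured, so nothing
    # lands on the console; whatever it printed is recorded in the log instead.
    result = subprocess.run(retire_cmd, shell=True, capture_output=True, text=True)
    if result.stdout:
        logger.info("retirement stdout: %s", result.stdout.rstrip())
    if result.stderr:
        logger.info("retirement stderr: %s", result.stderr.rstrip())
    return result.returncode
```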