Closed samlobel closed 2 years ago
@samlobel Sorry for the delay in getting to this!
My initial reaction is that this is a great idea for a feature, but that this implementation is rather complicated, and I'm worried about adding the extra complexity.
That said, here's a sketch of a different implementation: we use the existing onager functionality to have onager launch itself on each node, except that each launched copy uses the local backend to service the jobs. I think this would take care of the multiprocessing, keep the logs from overwriting each other, and avoid the need for `__filler__` ids.
If you feel so inclined to mess around with something like that, feel free to update this PR or make a new one!
Passing `--tasks-per-job` to `onager launch` lets you run multiple tasks using the same resources, for example running 2 tasks on one GPU. The flag is silently ignored on non-Slurm backends.
The main current limitation is that it sends all logs in a job to the same place, so if you have 2 tasks in the same job the logs can be jumbled and confusing. Also, my feeling is that when tasks_per_job is 1, it would be better to make wrapper.sh look like it used to, for simplicity.
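For illustration only, here is a minimal sketch of one way the jumbled-log limitation could be avoided: start each task in the job as its own subprocess and give it its own log file. This is not onager's code; `run_tasks_with_separate_logs` and the `task_<index>.log` naming are hypothetical, and it assumes each task can be expressed as a command list.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_tasks_with_separate_logs(commands, log_dir):
    """Run all commands concurrently, each writing to its own log file.

    Hypothetical helper, not part of onager. Returns a list of
    (log_path, return_code) pairs in the same order as `commands`.
    """
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)

    running = []
    for index, command in enumerate(commands):
        # One log file per task, so output from different tasks never interleaves.
        log_path = log_dir / f"task_{index}.log"
        log_file = open(log_path, "w")
        proc = subprocess.Popen(command, stdout=log_file, stderr=subprocess.STDOUT)
        running.append((proc, log_file, log_path))

    results = []
    for proc, log_file, log_path in running:
        # Wait for each task to finish, then release its log file handle.
        return_code = proc.wait()
        log_file.close()
        results.append((log_path, return_code))
    return results


if __name__ == "__main__":
    # Two toy tasks sharing one "job"; each prints a single line.
    toy_commands = [
        [sys.executable, "-c", f"print('hello from task {i}')"] for i in range(2)
    ]
    with tempfile.TemporaryDirectory() as tmp:
        for path, code in run_tasks_with_separate_logs(toy_commands, tmp):
            print(path.name, code, path.read_text().strip())
```

All tasks are started before any of them is waited on, so they share the job's resources concurrently, which is the behavior the flag is meant to provide.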