btovar opened 1 year ago
Here is what I see in the code: https://github.com/cooperative-computing-lab/cctools/blob/325160cb1929339b596ec44ed796b2ff93acc3a9/batch_job/src/vine_factory.c#L574

When the factory is done, it calls remove_all_workers, which calls batch_job_remove for every worker job. That in turn runs qdel, condor_rm, or similar to remove the job, but does not wait for the job to complete. So the shutdown may well be slow, but more likely because it invokes an external command for each job.
Some alternative approaches might be:
Looking at the bug description, most likely this was with workers executing alongside the notebook, that is, with -Tlocal. The issue there, I think, is that the factory waits for each worker individually.
When the factory is shutting down, it terminates the workers one by one, which makes the process slow. This has two issues: the shutdown takes longer than it needs to, and the resulting lack of interactivity limits factory use inside a notebook.

Instead, the factory should send the "remove" directive to all workers at once, and then process the jobs as they terminate. E.g., for -Tlocal, send the signal to all workers at once; after that it does not matter much whether we wait for the workers one by one or use a wait-for-any loop. Similarly, for all other batch systems, send the remove to all jobs and then add a function to wait for them to terminate.