cooperative-computing-lab / cctools

The Cooperative Computing Tools (cctools) enable large scale distributed computations to harness hundreds to thousands of machines from clusters, clouds, and grids.
http://ccl.cse.nd.edu

vine: factory is slow to shutdown workers #3311

Open btovar opened 1 year ago

btovar commented 1 year ago

When the factory is shutting down, it terminates the workers one by one, which makes the process slow. This has two issues:

This lack of interactivity limits the use of the factory inside a notebook. Instead, the factory should send the "remove" directive to all workers at once. Then it can process all the jobs as they terminate. E.g., for -Tlocal, send the signal to all workers at once. Then it doesn't matter much whether we wait for the workers one by one or use a wait for any worker. Similarly, for all other batch jobs, send the remove to all jobs and then add a function to wait for them to terminate.
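To illustrate the "signal everything first, then wait" idea for the -Tlocal case, here is a minimal Python sketch. It is not the factory's code: `start_workers` and `remove_all_then_wait` are hypothetical names, and the "workers" are just sleeping child processes standing in for real ones.

```python
import subprocess
import sys
import time


def start_workers(n):
    # Hypothetical stand-ins for local workers: each child simply sleeps.
    cmd = [sys.executable, "-c", "import time; time.sleep(60)"]
    return [subprocess.Popen(cmd) for _ in range(n)]


def remove_all_then_wait(procs, timeout=10.0):
    # Phase 1: ask every worker to terminate before waiting on any of them,
    # so all of them shut down concurrently.
    for p in procs:
        p.terminate()

    # Phase 2: reap the workers. Waiting one by one is fine here because
    # they are already exiting in parallel; the deadline is shared.
    deadline = time.monotonic() + timeout
    codes = []
    for p in procs:
        remaining = max(0.0, deadline - time.monotonic())
        try:
            codes.append(p.wait(timeout=remaining))
        except subprocess.TimeoutExpired:
            # Assumed fallback: force-kill a worker that ignores the request.
            p.kill()
            codes.append(p.wait())
    return codes


if __name__ == "__main__":
    workers = start_workers(4)
    print(remove_all_then_wait(workers))
```

With this ordering, the total shutdown time is roughly that of the slowest worker, not the sum over all workers.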

dthain commented 3 weeks ago

Here is what I see in the code: https://github.com/cooperative-computing-lab/cctools/blob/325160cb1929339b596ec44ed796b2ff93acc3a9/batch_job/src/vine_factory.c#L574

When the factory is done, it calls remove_all_workers, which calls batch_job_remove for every worker job, which then runs qdel, condor_rm or similar to remove the job, but does not wait for the job to complete. So this may very well be slow, but more likely because it invokes an external command for each job.

Some alternative approaches might be:

btovar commented 3 weeks ago

Looking at the bug description, most likely this was with workers executing alongside the notebook, that is, -Tlocal. The issue there is that the factory waits for each worker individually, I think.