cooperative-computing-lab / cctools

The Cooperative Computing Tools (cctools) enable large scale distributed computations to harness hundreds to thousands of machines from clusters, clouds, and grids.
http://ccl.cse.nd.edu

Vine: Send Final Update on Delete #3035

Open BarrySlyDelgado opened 1 year ago

BarrySlyDelgado commented 1 year ago

An issue came up when testing poncho + vine that I'll detail here:

When a work_queue/vine factory is started with N workers, it submits N workers in some fashion. Once the manager connects to the workers, the manager's catalog entry is updated to show N connected workers. When the manager exits, its catalog entry remains, still showing N workers connected. If a user exits the factory and then restarts it for a manager of the same name with the same number of workers, the factory reads from the catalog that N workers are connected and does not submit new workers, even though the earlier workers were terminated when the previous factory was closed. This resolves itself once the manager is started again and reports 0 workers. However, a performance test that waits for N workers to be submitted before starting the manager will stall.
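To make the failure mode concrete, here is a minimal sketch of the decision described above. It is a toy model, not the factory's actual code: `workers_to_submit` and the `workers_connected` field are hypothetical names chosen for illustration.

```python
# Hypothetical model of the factory's decision; these names are
# illustrative and are not taken from the cctools source.

def workers_to_submit(target, catalog_entry):
    """Return how many workers a factory would submit for one manager.

    catalog_entry is the last record the catalog holds for the manager,
    or None if the catalog has no record of it.
    """
    connected = catalog_entry["workers_connected"] if catalog_entry else 0
    return max(0, target - connected)


# The previous manager exited without updating its entry, so the catalog
# still claims 8 connected workers even though all of them were terminated.
stale_entry = {"name": "perf-test", "workers_connected": 8}
print(workers_to_submit(8, stale_entry))   # 0: the factory submits nothing

# Once a new manager of the same name starts, it reports 0 workers and
# the factory resumes submitting.
fresh_entry = {"name": "perf-test", "workers_connected": 0}
print(workers_to_submit(8, fresh_entry))   # 8
```

With the stale entry the factory and the test wait on each other: the factory waits for the reported count to drop, and the test waits for workers before starting the manager that would report it.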

btovar commented 1 year ago

@BarrySlyDelgado Here is something you can try. Use q.tune("wait-for-workers", n), which makes wq/tv hold off on submitting tasks until n workers have connected. You would then have to look at the logs and trim the time before the n workers connect.

For performance tests this is something you always want to do, even when not using the factory, as you want to make sure that all your runs have the desired number of workers for most of the time.
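A rough sketch of both steps suggested above, for reference. The `q.tune("wait-for-workers", n)` call is the one named in the comment; everything else is an assumption: `require_workers` and `trim_warmup` are hypothetical helpers, and the `(timestamp, workers_connected)` sample format stands in for whatever the real wq/tv logs contain.

```python
def require_workers(q, n):
    """Ask the manager q to hold tasks until n workers have connected."""
    q.tune("wait-for-workers", n)


def trim_warmup(samples, n):
    """Drop samples taken before n workers were first connected.

    samples is a list of (timestamp, workers_connected) pairs in time
    order. This format is an assumption, not the actual log layout.
    """
    for i, (_, workers) in enumerate(samples):
        if workers >= n:
            return samples[i:]
    return []  # the target was never reached


# Made-up samples: the fourth worker connects at t=9.0.
samples = [(0.0, 0), (5.0, 2), (9.0, 4), (12.0, 4), (20.0, 3)]
print(trim_warmup(samples, 4))   # [(9.0, 4), (12.0, 4), (20.0, 3)]
```

The trim keeps everything after the first time the target is reached, including later dips, so a run that loses workers partway through is still visible in the data.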

BarrySlyDelgado commented 1 year ago

Thanks Ben!

dthain commented 1 year ago

Another aspect to consider:

The manager ought to send a final update at vine_delete that describes its final state. (I'm not sure if it is actually doing that.)

Of course, if the manager crashes or exits abnormally, that won't happen, and the stale entry has to time out instead.
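The two cases above (clean exit with a final update, abnormal exit with a timeout) can be sketched with a toy catalog. This is only a model under stated assumptions: the `Catalog` class, its methods, and the 300-second lifetime are hypothetical and do not reflect the real catalog server or its settings.

```python
class Catalog:
    """Toy catalog: keeps the latest report per manager and when it arrived."""

    def __init__(self, lifetime):
        self.lifetime = lifetime   # seconds before a silent entry expires
        self.entries = {}          # name -> (workers_connected, last_update)

    def update(self, name, workers_connected, now):
        self.entries[name] = (workers_connected, now)

    def workers(self, name, now):
        """Return the reported worker count, or None if unknown or expired."""
        entry = self.entries.get(name)
        if entry is None:
            return None
        workers_connected, last_update = entry
        if now - last_update > self.lifetime:
            del self.entries[name]
            return None
        return workers_connected


catalog = Catalog(lifetime=300)   # hypothetical lifetime

# Clean exit: the manager sends one last update describing its final state.
catalog.update("clean", 8, now=0)
catalog.update("clean", 0, now=60)            # final update on delete
print(catalog.workers("clean", now=61))       # 0

# Crash: no final update arrives, so the entry stays stale until it expires.
catalog.update("crashed", 8, now=0)
print(catalog.workers("crashed", now=61))     # 8
print(catalog.workers("crashed", now=301))    # None
```

The final update corrects the entry immediately, while the timeout only bounds how long a stale count can mislead the factory.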

dthain commented 1 year ago

@mcarbona1 please have a look at vine_delete() in C, and make sure (or fix it so) that the manager sends one final update to the catalog before it goes away.