cooperative-computing-lab / cctools

The Cooperative Computing Tools (cctools) enable large scale distributed computations to harness hundreds to thousands of machines from clusters, clouds, and grids.
http://ccl.cse.nd.edu

vine: Wait no wait #3815

Closed btovar closed 1 month ago

btovar commented 1 month ago

Return any completed task to the application without doing any work.

dthain commented 1 month ago

I see the need for this and it makes sense, but a few things about the API:

btovar commented 1 month ago

A timeout of zero has meant waiting at least one second, so I didn't want to mess with that. But yes, if it weren't for backwards compatibility, I would have preferred not to add a call and to use timeout=0 instead. If you think it is ok, we can use timeout=0; that way we don't have to come up with a new name.

btovar commented 1 month ago

It has to have a tag, just in case the daskvine manager is managing two DAGs at the same time. (This is currently not possible, as it waits for the queues to be empty, but we will need it for notebooks in the future.)

dthain commented 1 month ago

So the timeout value is funny because we want to put some approximate limit on waiting time. But once the manager begins to interact with a worker, certain actions cannot be interrupted (e.g. a file transfer), so it's easy to take longer than the timeout value. I think the timeout value really means "max time to wait idle for a message to arrive." So, I can see several regimes for waiting:

1 - Do not wait for anything, only return a completed task if available.
2 - Do not wait idly for messages to arrive, but process any pending messages on sockets. (Requires calling link_wait at least once.)
3 - Wait idly for messages to arrive, up to N seconds.
4 - Wait idly forever until something happens.

Did I miss any cases?

I believe that case 2 corresponds to timeout==0, case 3 to timeout>0, and case 4 to timeout==VINE_WAIT_FOREVER.

Would it be better to change the meaning for timeout==0 or to introduce a new symbol for case 1?
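
To make the four regimes concrete, here is a toy model in Python. It is only a sketch under stated assumptions, not TaskVine code: `ToyManager`, `NO_WAIT`, and `WAIT_FOREVER` are hypothetical names, time is a simulated counter rather than a real clock, and "processing pending messages" is reduced to moving items from one queue to another.

```python
from collections import deque

WAIT_FOREVER = -1    # hypothetical stand-in for VINE_WAIT_FOREVER
NO_WAIT = "no_wait"  # hypothetical symbol for case 1


class ToyManager:
    """Toy model of the waiting regimes; not the TaskVine implementation."""

    def __init__(self, completed=(), on_sockets=(), arrivals=()):
        self.completed = deque(completed)    # tasks ready to hand back
        self.on_sockets = deque(on_sockets)  # messages received but not yet processed
        self.arrivals = deque(arrivals)      # (time, task) pairs arriving later, sorted by time
        self.clock = 0                       # simulated seconds

    def _process_pending(self):
        # Stands in for servicing sockets once: drain what is already there.
        while self.on_sockets:
            self.completed.append(self.on_sockets.popleft())

    def wait(self, timeout):
        # Case 1: do no work, only hand back a task that is already complete.
        if timeout == NO_WAIT:
            return self.completed.popleft() if self.completed else None

        # Cases 2-4: process whatever is already pending on the sockets.
        self._process_pending()
        if self.completed:
            return self.completed.popleft()

        # Case 2: never wait idly.
        if timeout == 0:
            return None

        # Cases 3 and 4: wait idly for the next message to arrive.
        if not self.arrivals:
            if timeout == WAIT_FOREVER:
                raise RuntimeError("toy model would block forever")
            self.clock += timeout
            return None

        when, task = self.arrivals[0]
        delay = max(0, when - self.clock)
        if timeout != WAIT_FOREVER and delay > timeout:
            self.clock += timeout  # case 3: timed out while idle
            return None

        self.arrivals.popleft()
        self.clock += delay
        return task
```

In this model, a manager holding one completed task, one unprocessed message, and one message due at t=5 returns the completed task under case 1, then nothing under case 1 again (the unprocessed message is not touched), then the second task under case 2.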

btovar commented 1 month ago

Since we convert timeout=0 to 1, it loosely corresponds to case 2. From a user perspective, case 2 is hard to explain without going into the particulars of the implementation, so I wouldn't make it an official case. My preference is to have timeout=0 do what the wait_no_wait call does here, similar to WNOHANG.
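
For reference, WNOHANG is the POSIX `waitpid` option that makes the call return immediately when no child has exited yet, instead of blocking. A minimal Python sketch of that behavior follows (POSIX only). The helper names `spawn_sleeper`, `poll_child`, and `wait_by_polling` are made up for illustration and have nothing to do with the TaskVine API.

```python
import os
import sys
import time


def spawn_sleeper(seconds):
    """Start a child Python process that sleeps, and return its pid."""
    code = f"import time; time.sleep({seconds})"
    return os.spawnv(os.P_NOWAIT, sys.executable, [sys.executable, "-c", code])


def poll_child(pid):
    """Non-blocking check: the exit code if the child has finished, else None."""
    done_pid, status = os.waitpid(pid, os.WNOHANG)
    if done_pid == 0:
        return None  # child still running; waitpid returned without waiting
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


def wait_by_polling(pid, interval=0.05):
    """Poll until the child exits; the caller could do other work between polls."""
    while True:
        exit_code = poll_child(pid)
        if exit_code is not None:
            return exit_code
        time.sleep(interval)
```

The analogy to the proposal is that a wait with timeout=0 would behave like `poll_child`: return a result if one is already available, and otherwise return right away.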

dthain commented 1 month ago

That makes sense to me.