cooperative-computing-lab / cctools

The Cooperative Computing Tools (cctools) enable large scale distributed computations to harness hundreds to thousands of machines from clusters, clouds, and grids.
http://ccl.cse.nd.edu

vine: treat worker:// and http:// differently #3731

Open btovar opened 6 months ago

btovar commented 6 months ago

Continued discussion from #3729

In the current master, when an input URL for a task fails to transfer, the task is retried indefinitely. Previously, the task would fail immediately with input missing. This was changed because URL transfers often come from workers, which are subject to transient errors.

One view is that http:// errors are the responsibility of the application, while worker:// errors are the responsibility of taskvine proper. E.g., a task with http:// errors could return immediately with input missing, while worker:// errors can be retried indefinitely (with transfers from other workers, recovery tasks, etc.).
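To make the proposed split concrete, here is a rough sketch of a per-scheme failure policy. It is not taskvine code: `FailureAction` and `action_on_transfer_failure` are hypothetical names invented for illustration, and the real decision would presumably live in the manager's C code.

```python
from enum import Enum
from urllib.parse import urlparse


class FailureAction(Enum):
    """Hypothetical outcomes for a failed input transfer."""
    FAIL_TASK = "return task with input missing"
    RETRY = "retry internally (other workers, recovery tasks)"


def action_on_transfer_failure(source_url: str) -> FailureAction:
    """Pick an action based only on the scheme of the failed source URL."""
    scheme = urlparse(source_url).scheme
    if scheme == "worker":
        # Worker transfers are taskvine's responsibility: keep retrying.
        return FailureAction.RETRY
    # http:// and any other external source: report back to the application.
    return FailureAction.FAIL_TASK
```

Under this sketch, a failed `worker://` transfer stays inside taskvine, while a failed `http://` transfer surfaces to the application right away.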

Another option is to add parameters to declare_url that would allow taskvine to determine the health of the source: acceptable failure rate per minute, maximum number of connections, etc.
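As a sketch of what such health parameters could track, the class below keeps a sliding one-minute window of failures and a connection count. `SourceHealth`, `max_failures_per_minute`, and `max_connections` are hypothetical names, not existing declare_url arguments; timestamps are passed in explicitly to keep the example deterministic.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SourceHealth:
    """Hypothetical per-source health record; not part of the taskvine API."""
    max_failures_per_minute: int
    max_connections: int
    active_connections: int = 0
    _failures: deque = field(default_factory=deque)

    def record_failure(self, now: float) -> None:
        """Remember the time (in seconds) of a failed transfer."""
        self._failures.append(now)

    def recent_failures(self, now: float) -> int:
        """Count failures in the last 60 seconds, dropping older ones."""
        while self._failures and now - self._failures[0] >= 60.0:
            self._failures.popleft()
        return len(self._failures)

    def is_healthy(self, now: float) -> bool:
        """Healthy while the failure count stays within the allowed rate."""
        return self.recent_failures(now) <= self.max_failures_per_minute

    def can_connect(self, now: float) -> bool:
        """Allow a new transfer only from a healthy, non-saturated source."""
        return self.is_healthy(now) and self.active_connections < self.max_connections
```

A source that exceeds its failure rate could then cause tasks to return with input missing, while one that is merely saturated would just delay new transfers.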

dthain commented 3 months ago

Yes, I agree with this interpretation. HTTP transfers and worker transfers are different primarily because the latter allows failures to be handled internally. However, note that there is a distinction between the original source of a file and the current location. A file obtained by HTTP can be put into the cluster, and then transferred laterally using worker transfers. That distinction is not cleanly maintained in the worker.
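One way to picture the origin/location distinction is a record that stores the original source separately from the workers currently holding a copy. This is only an illustrative sketch: `FileRecord`, its fields, and the `worker://<name>` address form are assumptions, not how the worker or manager actually represents files.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    """Hypothetical bookkeeping that separates origin from current locations."""
    origin_url: str                                # where the file first came from
    replicas: set = field(default_factory=set)     # workers holding a copy now

    def transfer_sources(self) -> list:
        """Prefer lateral worker transfers; fall back to the original source."""
        peers = [f"worker://{w}" for w in sorted(self.replicas)]
        return peers + [self.origin_url]

    def origin_is_external(self) -> bool:
        """True when a final failure is the application's responsibility."""
        return not self.origin_url.startswith("worker://")
```

With a record like this, a failed lateral transfer of an HTTP-originated file can still be retried internally, and only a failure at the origin itself needs to be reported to the application.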