cooperative-computing-lab / cctools

The Cooperative Computing Tools (cctools) enable large scale distributed computations to harness hundreds to thousands of machines from clusters, clouds, and grids.
http://ccl.cse.nd.edu

Vine: Edge Cases for Temporary Files #3563

Open dthain opened 8 months ago

dthain commented 8 months ago

Per our discussion yesterday, here are several challenges and ideas regarding the handling of temporary files. Currently, temporary files are associated with the tasks that create them, so if a file is lost, it gets recreated by re-running its task. This works well so far, but causes trouble when the failure rate becomes sufficiently high.
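To illustrate why recreation gets expensive, here is a minimal sketch of the recovery walk, assuming each temp file has exactly one producer task whose inputs may themselves be temp files. `Task`, `tasks_to_rerun`, `producer_of`, and `available` are hypothetical names for illustration, not cctools identifiers.

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    inputs: list = field(default_factory=list)  # names of temp files consumed
    output: str = ""                            # name of the temp file produced

def tasks_to_rerun(lost_file, producer_of, available):
    """Return the tasks needed to recreate lost_file, in execution order.

    producer_of maps a temp file name to the Task that creates it.
    available is the set of file names that still have at least one replica.
    """
    order = []

    def visit(fname):
        if fname in available:
            return  # still stored somewhere, nothing to redo
        task = producer_of[fname]
        for dep in task.inputs:
            visit(dep)  # a lost input must be recreated first
        if task not in order:
            order.append(task)

    visit(lost_file)
    return order
```

If `c` depends on `b`, which depends on `a`, and all three are lost, recreating `c` re-runs all three tasks. With a high enough failure rate, inputs can disappear again before the chain completes.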

So, we need one or more operation options like this:

@colinthomas-z80 other ideas?

colinthomas-z80 commented 8 months ago

I am thinking a user can declare a replica count, perhaps in an overload of declare_temp. The manager would coordinate the distribution, maybe when it handles the cache update for the first copy, making sure each replica is stored on a different hostname.
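A rough sketch of the placement step, assuming the manager knows each worker's hostname and which workers already hold the file. `choose_replica_targets` and its arguments are hypothetical, not part of the TaskVine API.

```python
def choose_replica_targets(workers, holders, replica_count):
    """Pick workers to copy a temp file to so replicas land on distinct hosts.

    workers: dict mapping worker id -> hostname
    holders: set of worker ids that already hold a replica
    replica_count: total number of replicas the user asked for
    Returns worker ids to transfer to (possibly fewer than needed).
    """
    used_hosts = {workers[w] for w in holders}
    targets = []
    for wid, host in workers.items():
        if len(holders) + len(targets) >= replica_count:
            break
        if wid in holders or host in used_hosts:
            continue  # skip workers on a host that already has a copy
        targets.append(wid)
        used_hosts.add(host)
    return targets
```

When there are not enough distinct hostnames, the sketch returns fewer targets than requested rather than doubling up on one host. Whether that is the right fallback is an open design choice.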

I don't know how convenient a "conservative" mode would be. This would essentially just make it a normal file, and if the user explicitly declared a temp I would assume there are disk space issues, or perhaps something else that makes it undesirable to come back to the manager.

Adaptive methods are alluring but they might be difficult to configure. One workflow may tolerate more failures than another so the user would have to think of the exact threshold to switch to conservative. Then, once we switch to conservative mode, do we bring everything back all at once? Or only new files?
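To make the configuration burden concrete, here is a minimal sketch of one possible adaptive switch, assuming the manager records whether each temp file was lost. `LossMonitor`, `threshold`, and `window` are hypothetical; the user would have to choose both numbers per workflow.

```python
from collections import deque

class LossMonitor:
    """Switch to conservative mode once the recent loss rate crosses a threshold."""

    def __init__(self, threshold, window):
        self.threshold = threshold          # fraction of lost files that triggers the switch
        self.events = deque(maxlen=window)  # sliding window of recent outcomes
        self.conservative = False

    def record(self, lost):
        """Record one temp file outcome; return True if conservative mode is on."""
        self.events.append(bool(lost))
        if len(self.events) == self.events.maxlen:
            rate = sum(self.events) / len(self.events)
            if rate >= self.threshold:
                self.conservative = True    # one-way switch in this sketch
        return self.conservative
```

The sketch only flips a flag. It does not answer whether existing temp files are brought back at once or only new files are affected.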

dthain commented 8 months ago

I am all in favor of doing simple things first before complicated things!

Specifying the replica count in the file object makes sense to me.

colinthomas-z80 commented 8 months ago

Exploring the implementation of this has me thinking about another topic I am considering opening an issue about: the disconnect, in terms of data structures, between vine file objects and the manager's understanding of the worker's cache.

That is, we declare files at the manager and keep around the objects for local use, reference counting, etc...

Yet the hash tables tracking each worker's cached files contain only cachenames. The cachename is a field of struct vine_file, but there is no back-reference from a cachename to its vine_file. The consequence is that we often do linear searches through the workers' hash tables looking for a cachename to count its replicas.
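The linear search can be pictured as follows. This is a simplified model, assuming each worker's cache is a set of cachenames keyed by a worker id; `count_replicas_linear` and `worker_caches` are hypothetical names, not the manager's actual structures.

```python
def count_replicas_linear(worker_caches, cachename):
    """Count replicas by probing every worker's cache table.

    worker_caches: dict mapping worker id -> set of cachenames on that worker.
    Each probe is a cheap hash lookup, but there is one probe per worker,
    so every query costs time proportional to the number of workers.
    """
    return sum(1 for cache in worker_caches.values() if cachename in cache)
```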

The naive thought is that we could attach some replica info to a vine_file object. I think I understand the motivation behind the current construction, since we would need to update the state of vine_file objects each time something happens, which might include a number of unexpected situations.

The issue is that I am looking for a place for the manager to decide to replicate temp files, but in order for this to work asynchronously it will need to query the replica count of a temp file each time it queues a transfer, and the overhead of that would probably be very bad.

Either that, or some local structure could be put together to handle the temp-specific case, which could work but might stick out like a sore thumb.

dthain commented 8 months ago

I think that's a good observation. There are a few inherent complications:

1. The worker may contain a file (perhaps from a previous run) that the manager has not declared at all (or has not declared yet).
2. Because the application creates and deletes files at will, the manager does reference counting so that objects don't go away while it is still using them.

Perhaps we need to more clearly distinguish between "files" (things that the application declares) and "replicas" (instances of stored objects on the workers). It seems to me there may be 0 to n replicas for a given file at any time.

Also note that every time a replica is added or removed, we perform an operation on vine_file_replica_table. It wouldn't be hard to modify that object to track a count per replica, without changing the API.
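One way to picture this, as a sketch only: keep a count keyed by cachename and update it inside the existing add and remove paths. `ReplicaTable` and its methods are hypothetical stand-ins and do not reproduce the real vine_file_replica_table signatures.

```python
from collections import defaultdict

class ReplicaTable:
    """Per-worker replica sets plus a running replica count per cachename."""

    def __init__(self):
        self._by_worker = defaultdict(set)  # worker id -> cachenames it holds
        self._count = {}                    # cachename -> number of replicas

    def insert(self, worker_id, cachename):
        if cachename not in self._by_worker[worker_id]:
            self._by_worker[worker_id].add(cachename)
            self._count[cachename] = self._count.get(cachename, 0) + 1

    def remove(self, worker_id, cachename):
        held = self._by_worker.get(worker_id, set())
        if cachename in held:
            held.remove(cachename)
            self._count[cachename] -= 1
            if self._count[cachename] == 0:
                del self._count[cachename]  # zero replicas: drop the entry

    def replica_count(self, cachename):
        """Constant-time lookup; no scan over workers."""
        return self._count.get(cachename, 0)
```

Because the count is keyed by cachename rather than stored in a file object, it also covers replicas of files that the manager has not declared.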

dthain commented 4 months ago

Notes from Ben merged from #3693:

Currently the manager assumes that a temp file can be transferred between workers. If for some reason a worker can't do peer transfers, then the temp file can only be used by tasks that run on that worker.
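The scheduling consequence can be sketched like this, assuming the manager knows for each worker whether peer transfers work, and assuming a worker needs that capability both to serve and to fetch a file. `eligible_workers` and `can_peer_transfer` are hypothetical names.

```python
def eligible_workers(all_workers, holders, can_peer_transfer):
    """Return the workers that could run a task needing a given temp file.

    all_workers: iterable of worker ids
    holders: set of worker ids that hold a replica of the temp file
    can_peer_transfer: dict mapping worker id -> bool (assumed known)
    """
    eligible = set(holders)  # a holder can always use its own copy
    if any(can_peer_transfer[h] for h in holders):
        # Some holder can serve the file, so peer-capable workers can fetch it.
        eligible |= {w for w in all_workers if can_peer_transfer[w]}
    return sorted(eligible)
```

If the only holder cannot do peer transfers, the result collapses to that single worker, which is the case described above.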

Some random ideas: