dthain opened 8 months ago
I am thinking a user could declare a replica count, perhaps via an overload of declare_temp. The manager would coordinate its distribution, perhaps when handling the cache update for the first copy, making sure each replica is stored on a different hostname.
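As a rough illustration of the placement step (all names here are hypothetical, not the actual TaskVine API), the manager could select additional hosts for replicas while skipping the host that already holds the first copy and avoiding duplicate hostnames:

```c
#include <string.h>
#include <assert.h>

/* Hypothetical sketch, not the actual cctools code: suppose the file
   object carries a desired replica count set by a declare_temp
   overload, and the manager, on seeing the first cache update, picks
   additional workers with distinct hostnames. */

struct temp_file {
    const char *cachename;
    int desired_replicas;   /* assumed new field */
};

/* Choose up to n distinct hostnames from the candidate list, skipping
   duplicates and the host that already holds the first copy. */
int choose_replica_hosts(const char **candidates, int ncand,
                         const char *first_host,
                         const char **chosen, int n)
{
    int count = 0;
    for (int i = 0; i < ncand && count < n; i++) {
        if (strcmp(candidates[i], first_host) == 0) continue;
        int dup = 0;
        for (int j = 0; j < count; j++) {
            if (strcmp(candidates[i], chosen[j]) == 0) { dup = 1; break; }
        }
        if (!dup) chosen[count++] = candidates[i];
    }
    return count;
}
```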
I don't know how useful a "conservative" mode would be. It would essentially just make the temp a normal file, and if the user explicitly declared a temp, I would assume there are disk space issues, or perhaps something else that makes it undesirable to bring the file back to the manager.
Adaptive methods are alluring, but they might be difficult to configure. One workflow may tolerate more failures than another, so the user would have to choose the exact threshold at which to switch to conservative mode. And once we switch to conservative mode, do we bring everything back all at once, or only new files?
I am all in favor of doing simple things first before complicated things!
Specifying the replica count in the file object makes sense to me.
Exploring the implementation of this has me thinking about another topic I am considering opening an issue about: the disconnect, in terms of data structures, between vine file objects and the manager's understanding of the workers' caches.
That is, we declare files at the manager and keep the objects around for local use, reference counting, etc. Yet the hash tables tracking each worker's cached files contain only cachenames. A cachename is a field of struct vine_file, but there is no backwards reference from cachename to vine_file. The consequence is that we often do linear searches through the workers' hash tables looking for a cachename in order to count its replicas.
The naive thought is that we could attach some replica info to the vine_file object. I think I understand the motivation behind the current construction: we would need to update the state of vine_file objects each time something happens on a worker, which might include a number of unexpected situations.

The issue is that I am looking for a place for the manager to decide to replicate temp files, but for this to work asynchronously it would need to query the replica count of a temp file each time it queues a transfer, and the overhead of that would probably be very bad.

Either that, or some local structure could be put together to handle the temp-specific case, which could work but might stick out like a sore thumb.
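One way to avoid the linear searches would be a small side table keyed by cachename that keeps a running replica count, updated on each cache-update or unlink message. A minimal sketch (all names assumed, not the actual cctools structures; a real version would use the project's hash_table and handle growth):

```c
#include <string.h>
#include <assert.h>

#define TABLE_SIZE 64   /* fixed size for the sketch only */

struct replica_entry {
    const char *cachename;  /* key */
    int count;              /* number of workers holding a copy */
};

static struct replica_entry table[TABLE_SIZE];

static unsigned hash(const char *s) {
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

static struct replica_entry *find_slot(const char *cachename) {
    unsigned i = hash(cachename);
    while (table[i].cachename && strcmp(table[i].cachename, cachename))
        i = (i + 1) % TABLE_SIZE;   /* linear probing */
    return &table[i];
}

/* Called when a worker reports that it cached a copy. */
void replica_added(const char *cachename) {
    struct replica_entry *e = find_slot(cachename);
    e->cachename = cachename;
    e->count++;
}

/* Called when a worker unlinks a copy or disconnects. */
void replica_removed(const char *cachename) {
    struct replica_entry *e = find_slot(cachename);
    if (e->cachename && e->count > 0) e->count--;
}

/* O(1) answer to "how many copies exist right now?" */
int replica_count(const char *cachename) {
    struct replica_entry *e = find_slot(cachename);
    return e->cachename ? e->count : 0;
}
```

With something like this, the replication logic could query the count each time it queues a transfer without scanning every worker's cache table.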
I think that's a good observation. There are a few inherent complications:
1. The worker may contain a file (perhaps from a previous run) that the manager has not declared at all, or not declared yet.
2. Because the application creates and deletes files at will, the manager does reference counting so that the objects don't go away while it is still using them.
Perhaps we need to more clearly distinguish between "files" (things that the application declares) and "replicas" (instances of stored objects on the workers). It seems to me there may be 0 to n replicas for a given file at any time.
Also note that every time a replica is added or removed, we perform an operation on vine_file_replica_table. It wouldn't be hard to modify that object to track a count per replica without changing the API.
Notes from Ben merged from #3693:
Currently the manager assumes that a temp file can be transferred between workers. If for some reason a worker can't do peer transfers, then the temp file can only be used by tasks that run on that worker.
Some random ideas:
Per our discussion yesterday, here are several challenges and ideas regarding the handling of temporary files. Currently, temporary files are associated with the tasks that create them, so if a file is lost, it gets recreated. This is good so far, but causes trouble when the failure rate becomes sufficiently high.
So, we need one or more operational options like these:

- [x] Set a minimum replica count for temporary files. When a temporary file is created (or also lost?), immediately send out `put url` replication requests to workers to increase the replica count. This should probably be a configurable option, since it could significantly increase cost.
- [x] Make temporary file handling an option for the user to set. In "conservative" mode, just treat all temporary files like normal local files, so that they are brought home to the manager. Slower, yes, but also more reliable. (They can still be cached on workers, of course.)
- [ ] Make one or both of these options adaptive. When "enough" failures happen, the system should automatically become more conservative to avoid cascading failures.
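The minimum-replica-count policy in the first item reduces to a small decision each time a temp file is first cached (or a copy is lost): how many more transfers to queue. A sketch with assumed parameter names:

```c
#include <assert.h>

/* Sketch of the minimum-replica-count decision (names assumed):
   given how many copies exist now, the configured minimum, and how
   many peer workers could accept a copy, return how many replication
   transfers to queue.  Capped by available peers, since two copies
   should not land on the same worker. */
int replicas_to_schedule(int current_replicas, int min_replicas,
                         int available_peers)
{
    int needed = min_replicas - current_replicas;
    if (needed < 0) needed = 0;
    return needed < available_peers ? needed : available_peers;
}
```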
@colinthomas-z80 other ideas?