Open nbren12 opened 3 years ago
The cached filesystem is largely untested from multiple processes. I would probably expect multi-thread to be OK, though. The current cache implementations might not work well, in general, for large loads. There is an idea elsewhere to try to decouple the logical layer (what gets cached, when) from the storage end (dealing with the disk or other backend). I could use help planning these.
Note that the cache metadata is saved using shutil.move, which indeed is the same as os.rename where the paths are on the same device. This should probably be used for the files themselves - which would perhaps best be accomplished by using LocalFileSystem.open(, autocommit=False) or transactions.
The cached filesystem is largely untested from multiple processes.
Yah, I just started playing around with it. It seems like a potentially powerful feature, but as they say: "there are two hard problems in CS..."
This should probably be used for the files themselves - which would perhaps best be accomplished by using LocalFileSystem.open(, autocommit=False) or transactions.
Refactors aside, I think this would resolve this issue.
Essentially, the cachers call fs.get(remotepaths, localpaths) in just a couple of places, so it would be simple to use a temporary location until downloading has completed. I don't think there is any foolproof way to ensure that the temporary file is on the same device as the eventual target, except to put them in the same directory (e.g., "{localpath}.temp.UUID").
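A rough sketch of that idea, not the actual fsspec code: `get_atomically` is a hypothetical helper name, and `fs` is assumed to be any object with a `get(rpath, lpath)` method. It uses `os.replace` rather than `os.rename` so that an existing target is also overwritten on Windows.

```python
import os
import uuid


def get_atomically(fs, remotepath, localpath):
    """Download via fs.get into a sibling temp file, then rename into place."""
    # Keeping the temp file in the target's directory keeps the rename
    # on a single device, e.g. "{localpath}.temp.UUID".
    tmppath = f"{localpath}.temp.{uuid.uuid4().hex}"
    try:
        fs.get(remotepath, tmppath)
        # Atomic on POSIX when both paths are on the same filesystem.
        os.replace(tmppath, localpath)
    finally:
        # Clean up a partial download if fs.get or the rename failed.
        if os.path.exists(tmppath):
            os.remove(tmppath)
```

With something like this, a reader of `localpath` would see either the previous complete file or the new complete file, never a half-written download.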
except to put them in the same directory (e.g., "{localpath}.temp.UUID")
This sounds like a good idea. It would also probably not work reliably on NFS.
I recently had some strange data corruption errors happen when using a cached fsspec filesystem from parallel processes. Seemed like a race condition. Is the cached file system thread/process safe?
According to this Stack Overflow answer, os.rename can ensure that parallel writes do not corrupt a single file.
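For what it's worth, here is a small experiment along those lines. It is only a sketch: `write_then_rename` and `race` are made-up names, threads stand in for separate processes, and it assumes a local POSIX filesystem (NFS or Windows may behave differently). Each writer fills its own temp file and then renames it onto the shared target.

```python
import os
import tempfile
import uuid
from concurrent.futures import ThreadPoolExecutor


def write_then_rename(target, payload):
    # Each writer gets a unique temp file next to the target.
    tmp = f"{target}.temp.{uuid.uuid4().hex}"
    with open(tmp, "wb") as f:
        f.write(payload)
    # The last rename wins, but the target is always one complete payload.
    os.replace(tmp, target)


def race(target, n_writers=8, size=1_000_000):
    """Return True if the target ends up equal to one writer's full payload."""
    payloads = [bytes([i]) * size for i in range(n_writers)]
    with ThreadPoolExecutor(max_workers=n_writers) as pool:
        list(pool.map(lambda p: write_then_rename(target, p), payloads))
    with open(target, "rb") as f:
        return f.read() in payloads


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(race(os.path.join(d, "cached.bin")))
```

If the writers instead opened the target directly, their writes could interleave, which may be the kind of corruption described above.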