birnbaum opened this issue 6 years ago
Has any work been done towards a custom Walker for key-value-store filesystems? Walking with the standard Walker is extremely slow on my gcsfs implementation because of all the requests, and I can imagine it's the same on s3fs.

Walking on buckets could be implemented fairly efficiently, because it comes down to something like bucket.list(), and one would just need to format the walk output correctly. This way we would need far fewer S3/GCS calls. Am I missing something here, or is this correct? Are there currently any custom Walker implementations? And where would such a custom Walker live? In the main pyfilesystem2 repo?

Thanks! :)

Not that I know of, but it would be a good idea.
There is a walker_class class attribute on the base FS. In theory, if you supply a custom walker class with the same interface, everything should just work. Let me know if you need any help with that; I would almost certainly want to borrow the implementation for S3FS.
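A minimal sketch of that plumbing (the FlatWalker/GCSFS names are hypothetical, and this assumes the Walker and BoundWalker classes in fs.walk):

```python
from fs.base import FS
from fs.walk import BoundWalker, Walker

class FlatWalker(Walker):
    """Hypothetical walker that would batch key listings.

    It only has to match the Walker interface: walk() yields
    Step(path, dirs, files) tuples, with dirs/files as lists
    of Info objects.
    """

    def walk(self, fs, path="/", namespaces=None):
        # A real implementation would list all keys in one request
        # and build the steps locally; this sketch just defers to
        # the default per-level behaviour.
        return super().walk(fs, path=path, namespaces=namespaces)

class FlatBoundWalker(BoundWalker):
    """BoundWalker whose default walker is FlatWalker."""

    def __init__(self, fs, walker_class=FlatWalker):
        super().__init__(fs, walker_class=walker_class)

class GCSFS(FS):
    # fs.walk on instances of this filesystem now hands out
    # FlatWalker-backed walks (abstract FS methods omitted here).
    walker_class = FlatBoundWalker
```

The BoundWalker subclass is only there so the single walker_class hook keeps working; the actual batching would live in FlatWalker.walk().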
BTW, if you are copying files, the slow walking is somewhat ameliorated by the multi-threaded copying, since the walking can be done in the background.
Cool, thanks for the tips, I'll give it a try!
The walking is slow in my use case because I am walking over deeply "nested" keys. For every level, a separate request is sent to GCS, which is, of course, a lot slower than retrieving the keys in large batches and "faking" the (path, dirs, files) tuples under the hood.
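Roughly, the faking could look like this (a sketch only; steps_from_keys is a made-up helper, and the keys would come from one batched listing, e.g. iterating google-cloud-storage's client.list_blobs):

```python
from collections import defaultdict

def steps_from_keys(keys):
    """Group a flat list of object keys into (path, dirs, files)
    tuples, mimicking what a walk over real directories yields.
    Assumes keys name objects, not directory placeholders.
    """
    dirs = defaultdict(set)    # parent path -> child dir names
    files = defaultdict(list)  # parent path -> file names

    for key in keys:
        parts = key.strip("/").split("/")
        parent = "/"
        # every intermediate component is a "directory"
        for part in parts[:-1]:
            dirs[parent].add(part)
            parent = parent.rstrip("/") + "/" + part
        files[parent].append(parts[-1])

    # yield the root first, then every discovered directory
    for path in sorted(set(dirs) | set(files)):
        yield path, sorted(dirs[path]), sorted(files[path])
```

With keys = ["a/b/c.txt", "a/d.txt", "e.txt"] this yields ("/", ["a"], ["e.txt"]), ("/a", ["b"], ["d.txt"]), ("/a/b", [], ["c.txt"]): one round trip per listing page instead of one per directory.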
Unfortunately, it's a little more complicated than I thought. For example: the first element returned by walk is supposed to contain all dirs and files on the root level. Now you have two options:

1. keep sending one listing request per level, so the root-level dirs and files are available right away (essentially what the default walker does), or
2. retrieve all keys in one batch up front and construct all the dirs and files before yielding the first element.

Unfortunately, there is no real way to be smart here: one cannot anticipate how many files or folders are in a bucket, or which algorithm will be faster or make more sense. In general, if you know that you will need to walk the entire fs anyway, option 2 will be a lot faster (which is my use case). I don't think it should be the default walker, though.
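As a sketch of the opt-in route (assuming the hypothetical FlatWalker above, and Walker.bind from fs.walk):

```python
from fs import open_fs

my_fs = open_fs("mem://")  # stand-in for a GCS/S3 filesystem

# Default, lazy walk: one listing request per directory level.
for step in my_fs.walk():
    print(step.path, [info.name for info in step.files])

# Opt-in eager walk: FlatWalker (hypothetical, sketched above)
# would list every key up front, then yield steps locally.
for step in FlatWalker.bind(my_fs).walk():
    print(step.path)
```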