PyFilesystem / s3fs

Amazon S3 filesystem for PyFilesystem2
http://fs-s3fs.readthedocs.io/en/latest/
MIT License

Custom Walker for key-value-store filesystems #45

Open birnbaum opened 6 years ago

birnbaum commented 6 years ago

Has any work been done towards a custom Walker for key-value-store filesystems? Walking with the standard Walker is extremely slow on my gcsfs implementation because of all the requests, and I imagine it's the same on s3fs.

Walking on buckets could be implemented fairly efficiently, because it comes down to something like bucket.list() and one would just need to format the walk output correctly. This way we would need far fewer S3/GCS calls. Am I missing something here, or is this correct?
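For illustration, here is a minimal sketch of that idea: turn one flat key listing into the (path, dirs, files) tuples that walk is expected to yield. The flat `keys` list stands in for whatever `bucket.list()` or a paged GCS listing would return; nothing here uses the real s3fs or gcsfs APIs.

```python
from collections import defaultdict
import posixpath


def steps_from_keys(keys):
    """Group a flat list of object keys into (path, dirs, files) tuples,
    roughly the shape that walk() yields, using '/' as the delimiter."""
    files = defaultdict(list)  # directory path -> file names
    dirs = defaultdict(set)    # directory path -> child directory names
    for key in keys:
        parent, name = posixpath.split("/" + key)
        files[parent].append(name)
        # register every ancestor so intermediate "directories" show up too
        while parent != "/":
            grandparent = posixpath.dirname(parent)
            dirs[grandparent].add(posixpath.basename(parent))
            parent = grandparent
    for path in sorted(set(files) | set(dirs)):
        yield path, sorted(dirs[path]), sorted(files[path])


# One listing (however many pages) instead of one request per directory:
keys = ["a/b/one.txt", "a/b/two.txt", "a/three.txt", "four.txt"]
for step in steps_from_keys(keys):
    print(step)
```

(A real Walker yields Info objects for dirs and files rather than plain names, if I remember the interface right, but the grouping logic would be the same.)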

Are there currently any custom Walker implementations? And where would such a custom Walker live? In the main pyfilesystem2 repo?

Thanks! :)

willmcgugan commented 6 years ago

Not that I know of, but it would be a good idea.

There is a walker_class class attribute on the base FS. In theory, if you supply a custom walker class with the same interface, everything should just work.

Let me know if you need any help with that. I would almost certainly want to borrow the implementation for S3FS.
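For reference, a rough sketch of that wiring, as far as I can tell from the pyfilesystem2 source: `FS.walker_class` defaults to `fs.walk.Walker`, and the `fs.walk` property binds whatever class is set there. The walker below is only a placeholder to make the hook visible (it just skips dot-directories); a real key-value-store walker would override the walking itself, and `check_open_dir` is used here on the assumption that it is part of the Walker interface.

```python
from fs.memoryfs import MemoryFS
from fs.walk import Walker


class SkipHiddenWalker(Walker):
    """Placeholder walker: same interface as Walker, different policy,
    so the effect of overriding walker_class is visible."""

    def check_open_dir(self, fs, path, info):
        # a batched S3/GCS walker would answer this (and the listing itself)
        # from one cached flat listing instead of per-directory requests
        return not info.name.startswith(".")


class MyFS(MemoryFS):
    walker_class = SkipHiddenWalker  # the class attribute mentioned above


mem = MyFS()
mem.makedirs("/.cache/tmp")
mem.makedirs("/docs")
mem.touch("/docs/readme.txt")
mem.touch("/.cache/tmp/junk.bin")

print(list(mem.walk.files()))  # ['/docs/readme.txt'] -- .cache was never opened
```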

BTW, if you are copying files, the slow walking is somewhat ameliorated by the multi-threaded copying, since the walking can be done in the background.

birnbaum commented 6 years ago

Cool, thanks for the tips, I'll give it a try!

The walking is slow in my use case because I am walking over deeply "nested" keys. For every level, a separate request is sent to GCS, which is of course a lot slower than retrieving the keys in large batches and "faking" the (path, dirs, files) tuples under the hood.
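Just to put rough numbers on that (all figures below are made up; the point is only the shape of the two costs):

```python
import math

# hypothetical bucket layout
directories = 5_000    # "folders" implied by the keys
total_keys = 200_000   # objects in the bucket
page_size = 1_000      # max results per list request (typical for S3/GCS)

# default walker: at least one delimited list request per directory
per_level_requests = directories

# batched approach: page through one flat listing of the whole bucket
batched_requests = math.ceil(total_keys / page_size)

print(per_level_requests, batched_requests)  # 5000 vs. 200
```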

birnbaum commented 6 years ago

Unfortunately, it's a little more complicated than I thought. For example:

The first element returned by walk is supposed to contain all dirs and files at the root level. Now you have two options:

  1. Either you follow the standard walk algorithm and only list up to the first delimiter. This means you get the first result very quickly, but the walk becomes extremely slow if there are a lot of folders further down the tree (see the sketch after this list).
  2. You do it the way I proposed and send as few requests as possible by retrieving your keys in large batches. The problem here: you have to list the entire bucket before you can return the first element, because you cannot compute the tuple fields dirs and files any earlier.
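To make the contrast concrete, here is a sketch of option 1; `list_dir` is a hypothetical paged, delimiter-aware call (one request per directory), while option 2 is essentially the grouping sketched earlier in the thread, which only starts yielding once the whole bucket has been listed.

```python
import posixpath
from collections import deque


def walk_with_delimiter(list_dir, root="/"):
    """Option 1: breadth-first walk, one delimited list request per directory.
    The root step is available after a single request, but a bucket with many
    'folders' costs one request each."""
    queue = deque([root])
    while queue:
        path = queue.popleft()
        dirs, files = list_dir(path)  # hypothetical: keys up to the next '/'
        yield path, dirs, files
        queue.extend(posixpath.join(path, name) for name in dirs)
```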

Unfortunately, there is no real way to be smart here: one cannot anticipate how many files or folders are in a bucket, or which algorithm will be faster or make more sense. In general, if you know that you will need to walk the entire filesystem anyway, option 2 will be a lot faster (which is my use case). I don't think it should be the default walker, though.