fsspec / filesystem_spec

A specification that python filesystems should adhere to.
BSD 3-Clause "New" or "Revised" License
1.02k stars 358 forks source link

Explicit "here's what to implement guide" #78

Open jtratner opened 5 years ago

jtratner commented 5 years ago

Hi!

I am one of the maintainers of stor a filesystem library that attempts to implement a cross-compatible API for POSIX, OpenStack Swift, S3, and DNAnexus paths. @kyleabeauchamp pointed out your spec and it seemed pretty interesting :) . I am considering providing an alternate version of stor.Path that implements the filesystem_spec API. I know I'm combining a couple things here, but I didn't see a mailing list so I figured I'd ask 'em together.

  1. What are the specific methods I need to implement to be compatible with the system? What's the minimal set? Is there an "advanced" set of methods? Is the minimal set all methods on AbstractFileSystem?
  2. What does maxdepth mean in the context of S3 or OpenStack Swift? Is it referring to directories delimited by /? Is there a particular ordering that's required? (for example, is it necessary to guarantee that you fully list each "directory" before emitting files in another directory?)

Aside - as a general comment, I wonder if the API would be better served by having all methods work as iterators and/or have all methods support some kind of limit argument. We originally implemented stor to have most file listing methods return lists; in hindsight, this was a mistake - it's much too easy to accidentally write code that tries to list millions of objects, and thus gets really slow or unintentionally expensive.

Interested to test out the system more! :)

Jeff

jtratner commented 5 years ago

One more point on the maxdepth / limit / iterator concept: we have hit situations (because of import from swift) where we have a parent directory that has 30K subdirectories underneath it, and then each of those might have two levels of hierarchy before getting to anything useful.

For both S3 and DNAnexus, it (can be) quite a bit more performant to list files without concern for directories (which is particularly nice when you want every single file, say for filtering by prefix) vs. attempting to list by prefix. I think the maxdepth parameter makes this kind of optimization impossible for walk (tho I'd need to take a look further and likely s3fs handles this case already). Do you think it'd be useful to add an API method that returns an iterator for that kind of use case?

martindurant commented 5 years ago

The the first question: the simplest thing to do is to inherit from AbstractFileSystem. Many methods are not implemented, but few raise exceptions: it is up to the author of a new implementation to consider which to use. For example, HTTPFileSystem is read-only, and cannot manipulate folders; on the other hand, I can imagine a use for a file-system that only manipulates folders, but cannot interact with files.

Yes, maxdepth refers to walking through file-listings by "folder", which in S3 means using the delimiter approach throughout. This matches the structure of most file-systems, although I of course recognise that cloud key-value stores are different. (There might even be a redis file-system at some point...) The implementation of ls in s3fs does use pagination internally, because of limits on the maximum size of a single request; it would not be a stretch to provide this as an iterator on request. However, there are many options here, and it might be tricky to provide a coherent API to the user. This is a general point, since the method signatures ought to be very similar for all implementations, and some of the backends can only do things in the simplest way. However, overrides and alternate extra methods are always possible.

martindurant commented 5 years ago

it (can be) quite a bit more performant to list files without concern for directories

On second thoughts, perhaps the right method to override with a one-shot or iterator mechanism would be .find().

TomAugspurger commented 4 years ago

I'd echo the request for an explicit list of what's abstract and needs to be implemented.

For example, is AbstractFileSystem.protocol abstract? Should failing to override that cause an exception?

I'm a bit worried about the default behavior of "semi"-abstract methods like mkdir silently returning None.

martindurant commented 4 years ago

I'm a bit worried about the default behavior of "semi"-abstract methods like mkdir

They could in principle also be NotImplemented, requiring downstream authors to explicitly implement no-op stubs like this one. I am not sure.

jtratner commented 4 years ago

I circled back to this briefly and I’m still not clear on min methods to get something to work (I’d like to provide an fsspec compatible stor.Path variant). Would you mind listing them here? (Or perhaps I’m missing something? :-/)

On Mon, Nov 11, 2019 at 10:15 AM Martin Durant notifications@github.com wrote:

I'm a bit worried about the default behavior of "semi"-abstract methods like mkdir

They could in principle also be NotImplemented, requiring downstream authors to explicitly implement no-op stubs like this one. I am not sure.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/intake/filesystem_spec/issues/78?email_source=notifications&email_token=AAMGHK53QAZVPAST763MUGLQTGONPA5CNFSM4IGV24DKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEDXU5NY#issuecomment-552554167, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAMGHK4PMJTX5L7JKMS4OOLQTGONPANCNFSM4IGV24DA .

svohara-aurora commented 4 years ago

I echo the sentiments in this thread. One question -- does subclassing AbstractFileSystem automatically register the new class with the registry / known_implementations dict? So that any imported subclass of AbstractFileSystem can be "found" with fsspec.filesystem('protocol') ?

martindurant commented 4 years ago

does subclassing AbstractFileSystem automatically register the new class with the registry / known_implementations

No, it does not. At the moment, you would need to submit a PR to here, or have users import your package, and place the class in the dicts directly. Going the metaclass route to do this automatically sounds convoluted.

It would be good to follow the pattern in Intake and use entrypoints to declare fsspec backends in packages.

martindurant commented 4 years ago

(ref #206 )

martindurant commented 4 years ago

To the people on this thread: do you have implementations that you have been developing? Obviously you are quite right about the documentation, but I am pleased to see that there is some interest here.

jtratner commented 4 years ago

If there was a clear example of minimal set of functions, I would integrate stor with this library, giving easy access to open stack swift and DNAnexus.

On Tue, Mar 17, 2020 at 7:58 AM Martin Durant notifications@github.com wrote:

To the people on this thread: do you have implementations that you have been developing? Obviously you are quite right about the documentation, but I am pleased to see that there is some interest here.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/intake/filesystem_spec/issues/78#issuecomment-600117323, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAMGHKYW7323IHDUH3AMRITRH6FX5ANCNFSM4IGV24DA .

martindurant commented 4 years ago

@jtratner : is the set of docstrings in AbstractFileSystem not sufficient?

The smallest implementation example I currently have (read-only) is the github FS.

martindurant commented 4 years ago

The smallest r/w implementations are probably FTP and local, all in the builtin implementations. The important distinction between the two, is that FTP uses AbstractBufferedFile too, so that's handy to see, if your implementation doesn't naturally produce file-like objects.

jtratner commented 4 years ago

I’ll take another pass soon! (My initial pass about a year ago I wasn’t quite clear so I stopped)

On Wed, Mar 18, 2020 at 13:11 Martin Durant notifications@github.com wrote:

The smallest r/w implementations are probably FTP and local, all in the builtin implementations. The important distinction between the two, is that FTP uses AbstractBufferedFile too, so that's handy to see, if your implementation doesn't naturally produce file-like objects.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/intake/filesystem_spec/issues/78#issuecomment-600835403, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAMGHKYDNTBONH5HURRC2N3RIETHJANCNFSM4IGV24DA .