fsspec / filesystem_spec

A specification that python filesystems should adhere to.
BSD 3-Clause "New" or "Revised" License
1.05k stars 362 forks source link

os.path equivalent for fsspec #747

Open d4l3k opened 3 years ago

d4l3k commented 3 years ago

One of the things that have come up when trying to integrate fsspec into tensorboard (https://github.com/tensorflow/tensorboard/pull/5248) is that there aren't any standard path operations as part of fsspec as far as I can tell. When dealing with more complex chained filesystems the rules get pretty complex for users to implement.

Ideally fsspec would provide the common operations such as:

I.e.

>>> path = "simplecache::zip://foo::s3://bucket/path.zip"
>>> fsspec.path.join(path, "bar") 
"simplecache::zip://foo/bar::s3://bucket/path.zip"
>>> fsspec.path.dirname(path)
"foo"
>>> fsspec.path.basename(path)
"simplecache::zip://::s3://bucket/path.zip"

With the chained filesystems this gets really complex and I'm not 100% sure how to implement this in all cases/filesystems so feedback would be appreciated here

An option here might be to try and implement https://docs.python.org/3/library/pathlib.html#pathlib.PurePath

martindurant commented 3 years ago

On the pathlib option, please see the https://github.com/Quansight/universal_pathlib project, a layer on top of fsspec.

Otherwise, the situation is indeed complex and I'm not certain we can easily define the expected behaviour of os.path.* for fsspec-compatible paths. Indeed, you can apply those builtin functions right now, and get something reasonable back (which are not what you were after, though!).

>>> os.path.dirname("simplecache::zip://foo/bar::s3://bucket/path.zip")
'simplecache::zip://foo/bar::s3://bucket'
>>> os.path.basename("simplecache::zip://foo/bar::s3://bucket/path.zip")
'path.zip'

(I would probably argue that fsspec.path.basename(path) -> "foo")

d4l3k commented 3 years ago

Just took a look at universal_pathlib. It seems to throw exceptions for most chained file systems. For most filesystems posixpath works but since it defines the sep as a top level field seems like there should be an implementation that respects that. With chained paths, it's certainly more complex.

Potentially each FS could define the path operations and the top level fsspec join handles filesystem specific implementation? That would allow defining a "standard" posixpath based solution as well as allowing for overrides for specific filesystems with non posix style paths. Though there may be file systems without any concept of directories which could be painful to support here

martindurant commented 3 years ago

So what we would need is not fsspec.path., but fs.path. (i.e., accessed though the filesystem instance), because different file systems will follow different patterns. It sounds doable. The universal_pathlib can call those things to complete the circle.

martindurant commented 3 years ago

cc @andrewfulton9 @brl0

andrewfulton9 commented 3 years ago

I'd definitely be interested in integrating something like that into universal_pathlib. Right now, I am defaulting on using pathlib._PosixFlavour to handle a lot of path manipulations. Adding one or more flavours is currently an open issue that I haven't gotten around to addressing yet. Having something on the filesystem to help with some of those operations would make it a lot easier to implement though.

d4l3k commented 2 years ago

One thing that would help a lot would be to expose _unstrip_protocol as part of core. https://github.com/fsspec/filesystem_spec/blob/1f3b6d81feb3927d368727012823292e1da7cd2d/fsspec/utils.py#L454-L462

A lot of operations need you to call fsspec.core.url_to_fs and having the inverse would be a big help since not all fs/operations return full paths w/ protocols.

https://github.com/fsspec/filesystem_spec/blob/f236c4f3aeba244d40f8e25102ceb9aba25fe48b/fsspec/core.py#L357

Would be great to be able to do:

full_path = "memory://bar"
fs, path = fsspec.core.url_to_fs(full_path)
for file in fs.ls(path):
    print(fsspec.core.fs_to_url(fs, file))

That would help cut down a lot of duplicate code/boilerplate like

https://github.com/tensorflow/tensorboard/blob/4568e2bc948311b78ee7f2e775e5bcb589ecacf8/tensorboard/compat/tensorflow_stub/io/gfile.py#L487-L491

martindurant commented 2 years ago

In the current latest, _unstrip_protocol is available on the class as a method (because different implementations might have various opinions, particularly http).

Generally, ls/find/glob operations return URLs as seen by the implementation in question, so not including the protocol. There's probably no changing that. The new generics module in #828 will convert these into complete URLs in every case.

efiop commented 2 years ago

My 2c: we've also introduced fs.path for convenience methods like name/parts/parent/parents/join/etc in dvc https://github.com/iterative/dvc/blob/master/dvc/fs/path.py Compared to using path-like objects as we did before, it has better performance when dealing with large number of objects. It would also be convenient for anyone writing a new fs implementation.

martindurant commented 2 years ago

Ah yes, you did mention this sometime before, @efiop . You might consider upstreaming that module, maybe.

d4l3k commented 2 years ago

@martindurant I see that it's provided via "full_name" on the file class but often I want to be able to ls files without the overhead of opening them. https://github.com/fsspec/filesystem_spec/blob/master/fsspec/spec.py#L1394-L1395

Providing it on the Filesystem class would work for me though

I'm not seeing anything special on http. https://github.com/fsspec/filesystem_spec/blob/master/fsspec/implementations/http.py

Good to know about the generics change! Would be nice if the _unstripprotocol method was public (no ) in those changes

martindurant commented 2 years ago

(ah sorry, the method is only in the same PR I mentioned above, #828 - but it will come!)

martindurant commented 2 years ago

Would be nice if the _unstripprotocol method was public (no )

That's reasonable

efiop commented 2 years ago

After using our fs.path for almost a year and making attempts (private) to create a patch (and finding a lot of fs.path already used and also some existing path-manipulation methods in some filesystems), I think the most natural way is to avoid fs.path layer and just have fs.join/basename/etc directly in fs. I found myself mistyping fs.join instead of fs.path.join so often that I think it proves the point 😄

Working on submitting a PR.

logan-markewich commented 1 year ago

Is this still being worked on? Would love to this feature implemented

efiop commented 1 year ago

Other stuff got in the way, so I didn't manage to contribute it yet 🙁

agrinh commented 1 year ago

@efiop Also very interested in getting this in, any chance some time might open up / is there anything anyone else can do to help out here?

efiop commented 11 months ago

@agrinh Finally getting around to moving fs.path to plain fs methods in dvc, so hoping to get around to contributing it to fsspec around new years 🙂 (said that before, but still).