Open d4l3k opened 3 years ago
On the pathlib option, please see the https://github.com/Quansight/universal_pathlib project, a layer on top of fsspec.
Otherwise, the situation is indeed complex and I'm not certain we can easily define the expected behaviour of os.path.* for fsspec-compatible paths. Indeed, you can apply those builtin functions right now and get something reasonable back (which is not what you were after, though!).
>>> os.path.dirname("simplecache::zip://foo/bar::s3://bucket/path.zip")
'simplecache::zip://foo/bar::s3://bucket'
>>> os.path.basename("simplecache::zip://foo/bar::s3://bucket/path.zip")
'path.zip'
(I would probably argue that fsspec.path.basename(path) -> "foo")
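To make the mismatch concrete, here is a minimal sketch that splits a chained URL into its links before applying any path operation. The helper name split_chain is hypothetical and not part of fsspec; the sketch only assumes that "::" separates links and "://" separates a protocol from its path. Which link's result should count as "the" basename is exactly the open design question.

```python
import posixpath


def split_chain(url):
    """Split a chained URL into (protocol, path) pairs, one per link.

    Hypothetical helper, not part of fsspec. A link without '://'
    (such as 'simplecache') is returned with an empty path.
    """
    links = []
    for link in url.split("::"):
        protocol, sep, path = link.partition("://")
        links.append((protocol, path if sep else ""))
    return links


url = "simplecache::zip://foo/bar::s3://bucket/path.zip"
links = split_chain(url)

# Apply posixpath per link rather than to the whole chained string.
per_link_basename = [(proto, posixpath.basename(path)) for proto, path in links]

print(links)
print(per_link_basename)
```

Treating each link separately avoids the case above where os.path.dirname cuts through the middle of the chain.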
Just took a look at universal_pathlib. It seems to throw exceptions for most chained file systems. For most filesystems posixpath works, but since the filesystem class defines sep as a top-level field, it seems like there should be an implementation that respects that. With chained paths, it's certainly more complex.
Potentially each FS could define the path operations, and the top-level fsspec join would dispatch to the filesystem-specific implementation? That would allow defining a "standard" posixpath-based solution as well as allowing overrides for specific filesystems with non-POSIX-style paths. Though there may be file systems without any concept of directories, which could be painful to support here.
So what we would need is not fsspec.path.*, but fs.path.* (i.e., accessed through the filesystem instance), because different file systems will follow different patterns. It sounds doable. The universal_pathlib can call those things to complete the circle.
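A rough sketch of what path operations reached through the filesystem instance could look like follows. Everything here is hypothetical: PathOps, FakePosixFS and FakeBackslashFS are illustrative names, not fsspec classes, and the only assumption taken from the discussion is that each filesystem declares its own sep.

```python
class PathOps:
    """Hypothetical string-based path helper parameterised by a separator."""

    def __init__(self, sep="/"):
        self.sep = sep

    def join(self, first, *rest):
        result = first
        for part in rest:
            if part.startswith(self.sep):
                # A part starting with the separator resets the result,
                # mirroring posixpath.join.
                result = part
            elif not result or result.endswith(self.sep):
                result += part
            else:
                result += self.sep + part
        return result

    def basename(self, path):
        return path.rsplit(self.sep, 1)[-1]

    def dirname(self, path):
        head, sep, _ = path.rpartition(self.sep)
        # Keep the root separator for paths like "/file".
        return head or sep


class FakePosixFS:
    """Stand-in for a filesystem class that declares its separator."""

    sep = "/"

    def __init__(self):
        # Path operations hang off the instance, so they follow its sep.
        self.path = PathOps(self.sep)


class FakeBackslashFS(FakePosixFS):
    """Stand-in for a filesystem with non-POSIX-style paths."""

    sep = "\\"


posix_fs = FakePosixFS()
backslash_fs = FakeBackslashFS()
print(posix_fs.path.join("bucket", "dir", "file.txt"))
print(backslash_fs.path.basename("share\\dir\\file.txt"))
```

A filesystem with unusual rules could supply its own PathOps subclass, while the default stays posixpath-like.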
cc @andrewfulton9 @brl0
I'd definitely be interested in integrating something like that into universal_pathlib. Right now, I am defaulting to using pathlib._PosixFlavour to handle a lot of path manipulations. Adding one or more flavours is currently an open issue that I haven't gotten around to addressing yet. Having something on the filesystem to help with some of those operations would make it a lot easier to implement though.
One thing that would help a lot would be to expose _unstrip_protocol as part of core. https://github.com/fsspec/filesystem_spec/blob/1f3b6d81feb3927d368727012823292e1da7cd2d/fsspec/utils.py#L454-L462
A lot of operations need you to call fsspec.core.url_to_fs, and having the inverse would be a big help since not all fs/operations return full paths w/ protocols.
Would be great to be able to do:
full_path = "memory://bar"
fs, path = fsspec.core.url_to_fs(full_path)
for file in fs.ls(path):
    print(fsspec.core.fs_to_url(fs, file))
That would help cut down a lot of duplicate code/boilerplate like
In the current latest, _unstrip_protocol is available on the class as a method (because different implementations might have various opinions, particularly http).
Generally, ls/find/glob operations return URLs as seen by the implementation in question, so not including the protocol. There's probably no changing that. The new generics module in #828 will convert these into complete URLs in every case.
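As a sketch of what re-attaching the protocol to listing results involves, the standalone function below mimics the idea in plain Python. It is an illustration under assumptions, not the fsspec implementation: add_protocol is a hypothetical name, the listing is made-up sample data standing in for what an ls call might return, and real implementations may handle multiple protocols or special cases differently.

```python
def add_protocol(protocol, path):
    """Return path as a full URL, leaving it alone if already prefixed.

    Hypothetical helper for illustration only.
    """
    prefix = protocol + "://"
    if path.startswith(prefix):
        return path
    return prefix + path


# Assumed sample data: protocol-less paths as a listing might return them.
listing = ["bucket/data/a.csv", "bucket/data/b.csv", "s3://bucket/data/c.csv"]
urls = [add_protocol("s3", p) for p in listing]
print(urls)
```

Making the helper idempotent means callers do not need to know whether a given operation already returned full URLs.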
My 2c: we've also introduced fs.path for convenience methods like name/parts/parent/parents/join/etc in dvc https://github.com/iterative/dvc/blob/master/dvc/fs/path.py Compared to using path-like objects as we did before, it has better performance when dealing with a large number of objects. It would also be convenient for anyone writing a new fs implementation.
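For readers unfamiliar with the string-based style, here is a small sketch of parts/parent/parents written as plain functions over strings. It is not the dvc code linked above; the function names and the "/" default are assumptions for illustration, and the sketch only handles relative, already-normalised paths.

```python
def parts(path, sep="/"):
    """Split a path string into its non-empty components."""
    return tuple(p for p in path.split(sep) if p)


def parent(path, sep="/"):
    """Return everything before the last separator ('' if there is none)."""
    return path.rpartition(sep)[0]


def parents(path, sep="/"):
    """Yield each ancestor of path, nearest first."""
    current = parent(path, sep)
    while current:
        yield current
        current = parent(current, sep)


key = "bucket/a/b/c.txt"
print(parts(key))
print(parent(key))
print(list(parents(key)))
```

Because every value stays a plain str, no per-path object is created, which is the kind of overhead the comment above refers to.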
Ah yes, you did mention this sometime before, @efiop . You might consider upstreaming that module, maybe.
@martindurant I see that it's provided via "full_name" on the file class but often I want to be able to ls files without the overhead of opening them. https://github.com/fsspec/filesystem_spec/blob/master/fsspec/spec.py#L1394-L1395
Providing it on the Filesystem class would work for me though
I'm not seeing anything special on http. https://github.com/fsspec/filesystem_spec/blob/master/fsspec/implementations/http.py
Good to know about the generics change! Would be nice if the _unstrip_protocol method was public (no _) in those changes
(ah sorry, the method is only in the same PR I mentioned above, #828 - but it will come!)
Would be nice if the _unstrip_protocol method was public (no _)
That's reasonable
After using our fs.path for almost a year and making attempts (private) to create a patch (and finding a lot of fs.path already used and also some existing path-manipulation methods in some filesystems), I think the most natural way is to avoid the fs.path layer and just have fs.join/basename/etc directly in fs. I found myself mistyping fs.join instead of fs.path.join so often that I think it proves the point 😄
Working on submitting a PR.
Is this still being worked on? Would love to see this feature implemented.
Other stuff got in the way, so I didn't manage to contribute it yet 🙁
@efiop Also very interested in getting this in, any chance some time might open up / is there anything anyone else can do to help out here?
@agrinh Finally getting around to moving fs.path to plain fs methods in dvc, so hoping to get around to contributing it to fsspec around new years 🙂 (said that before, but still).
One of the things that have come up when trying to integrate fsspec into tensorboard (https://github.com/tensorflow/tensorboard/pull/5248) is that there aren't any standard path operations as part of fsspec as far as I can tell. When dealing with more complex chained filesystems the rules get pretty complex for users to implement.
Ideally fsspec would provide the common operations such as:
I.e.
With the chained filesystems this gets really complex and I'm not 100% sure how to implement this in all cases/filesystems so feedback would be appreciated here
An option here might be to try and implement https://docs.python.org/3/library/pathlib.html#pathlib.PurePath