Open scholtalbers opened 2 weeks ago
Hello @scholtalbers
Here is an explanation for how all of this works together. Some of it skips over details, so I really recommend to read the filesystem_spec docs thoroughly and look at the fsspec.spec.AbstractFileSystem
implementation in filesystem_spec.
All the filesystem abstractions and filesystem operations are implemented and defined in filesystem_spec. The base class of all these filesystems is in fsspec.spec.AbstractFileSystem
.
To use a filesystem registered in fsspec, you instantiate the specific subclass of AbstractFileSystem
you want to use S3FileSystem
for example, by providing storage options to the class constructor.
The instantiated filesystem then takes paths (or paths prefixed with the same protocol) as arguments in all its methods.
When you want to reference an object on a filesystem, you need 3 pieces of information:
AbstractFileSystem
can register their supported protocols in fsspec.registry
. (for example: "s3")"mybucket/my/special/key"
)If you have all three pieces of information, you can provide access to the information stored in the object you're referencing.
Some filesystems in fsspec support combining the 3 pieces of information into a single string urlpath, usually of the form: protocol://path?storage_option1=1&storage_option=2
The UPath
class does 2 things:
pathlib.PurePath
and pathlib.Path
interface for modifying the path and for reading/writing to the path, this allows you to easily add support for arbitrary filesystems if your existing code uses the pathlib.Path
interface. (Note there will be a minor, but significant change with #193, which will basically remove the __fspath__
method from the UPath interface)The copy functionality between filesystems will be available in UPath once https://github.com/fsspec/universal_pathlib/issues/227 is completed, which relies on #193. And once that interface exists, the internal implementation will likely be based on the generic filesystem, but that's tbd.
UPath
simplifies passing around paths in your code and it's a convenient tool for building uris for supported protocols. It's not performance optimized, and adds (in some cases a lot of) overhead to operations.
For cases where it's you need to move between them: UPath
provides you with an easy way to move to fsspec:
UPath().protocol # the protocol string used by fsspec
UPath().storage_options # the storage_options required to create the fsspec filesystem instance
UPath().path # a path string that can be used by the fsspec instance
UPath().fs # a convenience helper to do fsspec.filesystem(pth.protocol, **pth.storage_options)
Regarding your question
It feels (at least) the part dest.fs.put(str(src), str(dest)) is not the right way for my use case, is there a better way
As mentioned above: not yet. But hopefully soon.
Let me know if this helped, Cheers, Andreas 😃
Yes thanks a lot Andreas, this clarifies a few things for me and I'm looking forward for the upcoming features!
Not sure if this is the right place to ask this question, and I debated if it should go in the community Q&A?
Anyway, I want to leverage UPath and fsspec to work with multiple storage backends, for now limited to local disk and s3. One of the main use cases is to copy files and folders from one to the other. So triggered by the answer, I feel I may have some misunderstanding of how to use the library effectively. I have read some of the documentation, but I may have missed something obvious.
What I started doing is along these lines:
I also explored the generic filesystem with something like
It feels (at least) the part
dest.fs.put(str(src), str(dest))
is not the right way for my use case, is there a better way?