fsspec / universal_pathlib

pathlib api extended to use fsspec backends
MIT License
248 stars 43 forks source link

Question on using UPath and fs copy/put #309

Open scholtalbers opened 5 days ago

scholtalbers commented 5 days ago

Not sure if this is the right place to ask this question, and I debated if it should go in the community Q&A?

Anyway, I want to leverage UPath and fsspec to work with multiple storage backends, for now limited to local disk and s3. One of the main use cases is to copy files and folders from one to the other. So triggered by the answer, I feel I may have some misunderstanding of how to use the library effectively. I have read some of the documentation, but I may have missed something obvious.

What I started doing is along these lines:

class StorageVolume:
   (…)
   def get_fsspec_storage_options(self) -> dict:
        if self.storage_options:
            return { "endpoint_url": self.endpoint, "key": self.key, "secret", "self.secret"}
        return {}
   def get_upath(self, file_obj) -> UPath
    return UPath(file_obj.uri, **self.get_fsspec_storage_options())

src_obj = FileObject.objects.get(uri="/tmp/dir/myfile.txt")
src = src_obj.storagevolume.get_upath(src_obj) 
dest_obj = FileObject.objects.get(uri="s3://bucket1/")
dest = dest_obj.storagevolume.get_upath(dest_obj)

if both_local or src_obj.storagevolume == dest_obj.storagevolume:
  dest.fs.copy(str(src), str(dest), recursive=src.is_dir())
elif not is_remote_upath(src):
  dest.fs.put(str(src), str(dest), recursive=src.is_dir())

I also explored the generic filesystem with something like

  # try generic 
  generic.set_generic_fs(src.protocol, src.fs)
  generic.set_generic_fs(dest.protocol, dest.fs)
  fs = fsspec.filesystem("generic", default_method="generic”)
  fs.copy(str(src), str(dest), recursive=src.is_dir())

It feels (at least) the part dest.fs.put(str(src), str(dest)) is not the right way for my use case, is there a better way?

ap-- commented 4 days ago

Hello @scholtalbers

Here is an explanation for how all of this works together. Some of it skips over details, so I really recommend to read the filesystem_spec docs thoroughly and look at the fsspec.spec.AbstractFileSystem implementation in filesystem_spec.

fsspec

All the filesystem abstractions and filesystem operations are implemented and defined in filesystem_spec. The base class of all these filesystems is in fsspec.spec.AbstractFileSystem.

To use a filesystem registered in fsspec, you instantiate the specific subclass of AbstractFileSystem you want to use S3FileSystem for example, by providing storage options to the class constructor. The instantiated filesystem then takes paths (or paths prefixed with the same protocol) as arguments in all its methods.

When you want to reference an object on a filesystem, you need 3 pieces of information:

  1. which filesystem? this is provided by the protocol, since subclasses of AbstractFileSystem can register their supported protocols in fsspec.registry. (for example: "s3")
  2. what filesystem configuration? this is provided by the storage_options. They are the constructor parameters for building the filesystem instance. (this could be your aws credentials)
  3. what object this is provided by the path. this is a string that references an object for the filesystem instance. (for example: "mybucket/my/special/key")

If you have all three pieces of information, you can provide access to the information stored in the object you're referencing.

Some filesystems in fsspec support combining the 3 pieces of information into a single string urlpath, usually of the form: protocol://path?storage_option1=1&storage_option=2

upath

The UPath class does 2 things:

  1. it provides a way to store all these 3 pieces of information in one object and makes it convenient to pass the information along.
  2. it provides the pathlib.PurePath and pathlib.Path interface for modifying the path and for reading/writing to the path, this allows you to easily add support for arbitrary filesystems if your existing code uses the pathlib.Path interface. (Note there will be a minor, but significant change with #193, which will basically remove the __fspath__ method from the UPath interface)

The copy functionality between filesystems will be available in UPath once https://github.com/fsspec/universal_pathlib/issues/227 is completed, which relies on #193. And once that interface exists, the internal implementation will likely be based on the generic filesystem, but that's tbd.

when to use what

UPath simplifies passing around paths in your code and it's a convenient tool for building uris for supported protocols. It's not performance optimized, and adds (in some cases a lot of) overhead to operations.

For cases where it's you need to move between them: UPath provides you with an easy way to move to fsspec:

UPath().protocol  # the protocol string used by fsspec
UPath().storage_options  # the storage_options required to create the fsspec filesystem instance
UPath().path  # a path string that can be used by the fsspec instance
UPath().fs  # a convenience helper to do fsspec.filesystem(pth.protocol, **pth.storage_options)

Regarding your question

It feels (at least) the part dest.fs.put(str(src), str(dest)) is not the right way for my use case, is there a better way

As mentioned above: not yet. But hopefully soon.

Let me know if this helped, Cheers, Andreas 😃