drivendataorg / cloudpathlib

Python pathlib-style classes for cloud storage services such as Amazon S3, Azure Blob Storage, and Google Cloud Storage.
https://cloudpathlib.drivendata.org
MIT License
477 stars 62 forks source link

Some libraries generate confusing errors when opening CloudPath objects #315

Open pjbull opened 1 year ago

pjbull commented 1 year ago

The main problem is the libs based on opening files with fsspec won't work properly with CloudPaths. They will call fspath to resolve it (instead of str()) and therefore the file called will be cached (this is not exaclty what we want, and this will be done silently)

For single files like tif this will be not optimal but will work (your example) But for nested files likes VRT, this will crash as only the header is cached and not the relative files. This will break the paths inside the VRT. Note that GDAL (and therefore rasterio) can read VRT stored on the cloud, and this is the expected behavior.

A little example:


vrt = AnyPath("https://s3.unistra.fr/merge_32-31.vrt")
rasterio.open(str(vrt)) # will work
rasterio.open(vrt) # will fail for the wrong reason, saying that the files linked inside the VRT don't exist

Originally posted by @remi-braun in https://github.com/drivendataorg/cloudpathlib/issues/96#issuecomment-1402068539

pjbull commented 1 year ago

There aren't great options here. This is what I see. Open to other options if people have them. Maybe some combination of these are good mitigations.

Stop being os.PathLike

A lot of these libraries do something like:

def open(f):
    is instance(f, os.PathLike):
        str_path = os.fspath(f)
    ...

They then have special-case logic to handle str_path and work with the local filesystem path. This only works with CloudPaths in read scenarios for single files. Other scenarios (writing, multi-file formats, directory walking) are all broken with code that works like this.

When the code eventually goes looking on the filesystem relative to str_path it errors in potentially confusing ways.

If we are no longer os.PathLike then code will likely error earlier with clearer error messages.

Pros:

Cons:

Cache entire directory on fspath call

We could do something like the following (and make this an option)

def __fspath__(self):
    if self.is_dir():
        self.download_to(self._local)  # cache entire directory recursively on `fspath` call      

Pros:

Cons:

Warn on __fspath__ being called

We could just always warn about writing/multi-file whenever __fspath__ is called.

def __fspath__(self):
    warnings.warn(f"File {self} has been converted to a local path; writing and uploading or opening multi-file formats will not work as expected.")
    ...

Pros:

Cons:

Try to detect who is trying to do what and how and raise errors in those cases

We may be able to provide more useful errors if we can detect who is calling .str, .__fspath__ or the built-in open function.

Pros:

Cons:

Provide documentation for other library developers to support CloudPaths

The best way to support CloudPaths is for developers to just use the pathlib API throughout their code base. Since we support the majority of the pathlib API, then most code should work reasonably.

We could add a documentation page with recommendations on how to do this effectively. We could provide code snippet examples. Here are the key points:

Pros:

Cons:

Coordinate a sentinel convention to trigger fsspec code

We could potentially have a sentinel _fsspec property that indicates to a library they should open with fsspec. For example:

# in cloudpathlib

@property
def _fsspec(self):
    return str(self)
# in fsspec 
def open(f):
    if hasattr(f, '_fsspec'):
        f = f._fsspec
# in other libraries
def open(f):
    is if hasattr(f, '_fsspec'):
        handle = fsspec.open(f)
    ...

Alternatively, fsspec write a shim to handle CloudPath objects:

# in fsspec 
def open(f):
    if isinstance(f, CloudPath):
        str_cloud_path = str(f)
        ... # dispatch with normal `fsspec`

Either would require that third parties pass the object to fsspec without manipulating it first (e.g., they should not call str or fspath).

Pros:

Cons:

remi-braun commented 1 year ago

Just to add that the workaroud is for the user to pass str(cloudpath) to the problematic libraries, so this is not urgent (I think). But it would be very comfortable to have a universal way of handling cloud paths (I cannot help you for that, this is way beyond my skills haha)

pjbull commented 1 year ago

Good point, thanks for adding!

barneygale commented 1 year ago

FWIW I'm working on adding a pathlib.AbstractPath class in the standard library. If that's introduced, and libraries like cloudpathlib start subclassing it, then other code can do:

def do_thing(path):
    if not isinstance(path, pathlib.AbstractPath):
        path = pathlib.Path(path)
    with path.open('rb') as f:
        ...

In this way, functions can accept string paths, pathlib.Path objects, and any other objects that inherit from pathlib.AbstractPath and therefore provide an open() method.

FWIW, I'm also planning to make __fspath__() a Path-only feature (not present in PurePath or AbstractPath) so that open(ZipFile('README.txt', archive=ZipFile('foo.zip'))) doesn't try to open README.txt in the current directory, which would be massively confusing.

pjbull commented 1 year ago

Thanks @barneygale! Yes, we'd support having a way to signal pathlib API compatibility in the standard library so that we can signal to third part libraries to treat us like a path (which really helps with the problems listed above about people calling os.fspath too early.

Will os.PathLike still be dependent on __fspath__ being implemented? I guess in that case we would want to do is not be os.PathLike so that we don't get fspath'd, but be an AbstractPath so we signal to libraries to use pathlib functions.

Let me know if we can help testing or reviewing anything there.

(That said, this still puts us in the "get other library maintainers to rewrite their code to use/support pathlib APIs" solution bucket.)

barneygale commented 1 year ago

I'm not planning to change anything about os.PathLike, os.fspath() or __fspath__(), except that I'll move PurePath.__fspath__() to Path.__fspath__(). The proposed AbstractPath would sit between those classes, so it wouldn't have __fspath__(), and therefore wouldn't be path-like. This will require a proper deprecation process in CPython (2 or more minor releases) in which we raise warnings from PurePath.__fspath__().

I think it's a losing game trying to provide __fspath__() in other path implementations - it's impossible to know exactly how the function that calls os.fspath() intends to use the resulting string path. You can do clever things with caching of course, and unfortunately that's going to be necessary for at least another couple years.