Open pjbull opened 1 year ago
There aren't great options here. This is what I see. Open to other options if people have them. Maybe some combination of these are good mitigations.
os.PathLike
A lot of these libraries do something like:
def open(f):
is instance(f, os.PathLike):
str_path = os.fspath(f)
...
They then have special-case logic to handle str_path and work with the local filesystem path. This only works with CloudPath
s in read scenarios for single files. Other scenarios (writing, multi-file formats, directory walking) are all broken with code that works like this.
When the code eventually goes looking on the filesystem relative to str_path
it errors in potentially confusing ways.
If we are no longer os.PathLike
then code will likely error earlier with clearer error messages.
Pros:
Cons:
fspath
callWe could do something like the following (and make this an option)
def __fspath__(self):
if self.is_dir():
self.download_to(self._local) # cache entire directory recursively on `fspath` call
Pros:
Cons:
__fspath__
being calledWe could just always warn about writing/multi-file whenever __fspath__
is called.
def __fspath__(self):
warnings.warn(f"File {self} has been converted to a local path; writing and uploading or opening multi-file formats will not work as expected.")
...
Pros:
Cons:
We may be able to provide more useful errors if we can detect who is calling .str
, .__fspath__
or the built-in open
function.
Pros:
Cons:
CloudPath
sThe best way to support CloudPath
s is for developers to just use the pathlib
API throughout their code base. Since we support the majority of the pathlib
API, then most code should work reasonably.
We could add a documentation page with recommendations on how to do this effectively. We could provide code snippet examples. Here are the key points:
os.
methods for path manipulation to the pathlib
APIPath
or CloudPath
objects immediately*Path*
objects as long as possible. Don't do calls to str
or fspath
until you absolutely need to open a single file.Pros:
Cons:
fsspec
codeWe could potentially have a sentinel _fsspec
property that indicates to a library they should open with fsspec
. For example:
# in cloudpathlib
@property
def _fsspec(self):
return str(self)
# in fsspec
def open(f):
if hasattr(f, '_fsspec'):
f = f._fsspec
# in other libraries
def open(f):
is if hasattr(f, '_fsspec'):
handle = fsspec.open(f)
...
Alternatively, fsspec
write a shim to handle CloudPath
objects:
# in fsspec
def open(f):
if isinstance(f, CloudPath):
str_cloud_path = str(f)
... # dispatch with normal `fsspec`
Either would require that third parties pass the object to fsspec
without manipulating it first (e.g., they should not call str
or fspath
).
Pros:
fsspec
ecosystemCons:
Just to add that the workaroud is for the user to pass str(cloudpath)
to the problematic libraries, so this is not urgent (I think).
But it would be very comfortable to have a universal way of handling cloud paths (I cannot help you for that, this is way beyond my skills haha)
Good point, thanks for adding!
FWIW I'm working on adding a pathlib.AbstractPath
class in the standard library. If that's introduced, and libraries like cloudpathlib
start subclassing it, then other code can do:
def do_thing(path):
if not isinstance(path, pathlib.AbstractPath):
path = pathlib.Path(path)
with path.open('rb') as f:
...
In this way, functions can accept string paths, pathlib.Path
objects, and any other objects that inherit from pathlib.AbstractPath
and therefore provide an open()
method.
FWIW, I'm also planning to make __fspath__()
a Path
-only feature (not present in PurePath
or AbstractPath
) so that open(ZipFile('README.txt', archive=ZipFile('foo.zip')))
doesn't try to open README.txt in the current directory, which would be massively confusing.
Thanks @barneygale! Yes, we'd support having a way to signal pathlib
API compatibility in the standard library so that we can signal to third part libraries to treat us like a path (which really helps with the problems listed above about people calling os.fspath
too early.
Will os.PathLike
still be dependent on __fspath__
being implemented? I guess in that case we would want to do is not be os.PathLike
so that we don't get fspath'd, but be an AbstractPath
so we signal to libraries to use pathlib
functions.
Let me know if we can help testing or reviewing anything there.
(That said, this still puts us in the "get other library maintainers to rewrite their code to use/support pathlib
APIs" solution bucket.)
I'm not planning to change anything about os.PathLike
, os.fspath()
or __fspath__()
, except that I'll move PurePath.__fspath__()
to Path.__fspath__()
. The proposed AbstractPath
would sit between those classes, so it wouldn't have __fspath__()
, and therefore wouldn't be path-like. This will require a proper deprecation process in CPython (2 or more minor releases) in which we raise warnings from PurePath.__fspath__()
.
I think it's a losing game trying to provide __fspath__()
in other path implementations - it's impossible to know exactly how the function that calls os.fspath()
intends to use the resulting string path. You can do clever things with caching of course, and unfortunately that's going to be necessary for at least another couple years.
The main problem is the libs based on opening files with
fsspec
won't work properly withCloudPaths
. They will callfspath
to resolve it (instead of str()) and therefore the file called will be cached (this is not exaclty what we want, and this will be done silently)For single files like
tif
this will be not optimal but will work (your example) But for nested files likes VRT, this will crash as only the header is cached and not the relative files. This will break the paths inside the VRT. Note that GDAL (and therefore rasterio) can read VRT stored on the cloud, and this is the expected behavior.A little example:
Originally posted by @remi-braun in https://github.com/drivendataorg/cloudpathlib/issues/96#issuecomment-1402068539