Closed martindurant closed 1 year ago
👋 @martindurant sounds cool!
There are probably a few level of integration that could work well. We intend to make it pretty easy to add custom clients/backends by implementing a relatively small surface area. We're adding more docs on this, but here's the boto3
one as an example. Implementing each of the functions there should make it simple to enable all of the pathlib
API that we support:
Option (1):
I could see a FsspecClient
as part of cloudpathlib
that if you pip install cloudpathlib[fsspec]
then it uses the fsspec client for any of the schemes that fsspec supports. There's probably some details in the coupling between the specific kind of path object we return (e.g., S3Path
) and the Client
object that would need to be worked out, but that all seems pretty doable.
Option (2):
You could that same extension work for a custom Client
and CloudPath
object within fsspec
and the expose the factory methods as you see fit in your API.
Other thoughts on interoperability? There's definitely a big chunk of cool things in fsspec
that we don't want to duplicate (chunks/buffering, compressing cache, async, etc.).
I would say those two options sound pretty similar, except for where the code lives - which I am not too worried about at all. You are probably more likely to get to it, if you have time to experiment - but seeing that this was already requested at fsspec, maybe someone else in the community contributes some pathlib interface anyway.
Do you think there's an argument for S3FileSystem, GCSFileSystem and FTPFileSystem from fsspec-land to supplant the implementations you have or are developing? They may be more complete and reduce the amount of work you need to do.
@martindurant @pjbull
I took a quick stab at making FsspecClient
and FsspecPath
classes that is backed by an fsspec FileSystem
in #109.
from cloudpathlib.fsspec import FsspecClient, FsspecPath
from s3fs import S3FileSystem
FsspecClient(filesystem=S3FileSystem()).set_as_default_client()
cp = FsspecPath("s3://drivendata-public-assets/")
list(cp.iterdir())[0:3]
#> [FsspecPath('s3://drivendata-public-assets/CIP-from-brochure.jpg'), FsspecPath('s3://drivendata-public-assets/EPD-report.png'), FsspecPath('s3://drivendata-public-assets/Ending-Extreme-Poverty_A-Focus-on-Children.png')]
I think the tricky part that I don't have a great sense of is how to design the right interface to let users choose to use fsspec and to initialize the correct fsspec FileSystem
class with minimal friction. In cloudpathlib currently, we already have a way of specifying dependencies with extras_require
and with class dispatching based on the prefix/protocol in the bath. Choosing to use fsspec and dispatching to the right fsspec filesystem, and having dependencies specified is like two more layers of user choices.
The question of moving our backend logic entirely to fsspec is a pretty big one and would involve a lot of changes. This is my first time using fsspec, so I still don't have a great feel for the fsspec design or ecosystem.
Typical usage in fsspec allows access to the set of backend implementations using a protocol string and arguments, e.g., fsspec.filesystem("s3", anon=True)
, or by passing URLs including a prefix fsspec.open("s3://bucket/path/file", anon=True)
. (anon
is just an example here, specific to s3)
Many packages that use fsspec, and some internal logic too, uses storage_options
as a dict of kwargs to pass to the main entry functions, e.g.,
pd.read_csv("s3://bucket/path/file", storage_options={"anon": True})
.
The current set of optional deps for fsspec are:
extras_require={
"abfs": ["adlfs"],
"adl": ["adlfs"],
"dask": ["dask", "distributed"],
"dropbox": ["dropboxdrivefs", "requests", "dropbox"],
"gcs": ["gcsfs"],
"git": ["pygit2"],
"github": ["requests"],
"gs": ["gcsfs"],
"hdfs": ["pyarrow"],
"http": ["requests", "aiohttp"],
"sftp": ["paramiko"],
"s3": ["s3fs"],
"smb": ["smbprotocol"],
"ssh": ["paramiko"],
},
but we don't depend on anything just to import the package, and some of the backends like file and memory. Sounds like we came up with a similar set of solutions!
Note that there are some implementations that are meant/able to be used with a further target filesystem, such as the local caches.
I'd be happy to talk about how fsspec works and describe our design/ecosystem.
@remi-braun Thanks for the vote for fsspec
! So that I understand your use case, is the reason that you want an fsspec
implementation that you want to be able to do streaming read/write rather than downloading the whole file to a local cache first?
FYI, @jayqi did a good implementation of fsspec
as a backend in #109. However, it's different from just a pathlib
API for fsspec
since it uses the cloudpathlib
caching logic and business logic by implementing our *Client
and *Path
classes. To really take advantage of their approach to caching and reading/writing we'd likely need to re-implement a lot of the reading/writing (and even things like listing which can be cached in fsspec
).
(Here's the issue just on streaming: https://github.com/drivendataorg/cloudpathlib/issues/9 )
Arf sadly it is much more prosaic: my prime concern is to unify my code using agnostically files in the cloud or on disk and for now it is not possible :
path = AnyPath(path)
# Load vector in cache if needed (geopandas and cloudpathlib are not compatible for now)
if isinstance(path, CloudPath):
path = AnyPath(path.fspath)
vect = geopandas.read_file(vect_path)
However I do not use files big enough to see any slowdown caching files in my usecase (apart from rasterio
rasters, but they do not use fsspec if I'm correct)
Ah, interesting. That looks to me like a geopandas
feature request since they don't support os.PathLike
. They do say they support filelike objects. You could try the following and see if it works. If they support filelike objects then I think it should be fine with CloudPath
or Path
objects:
path = AnyPath(path)
with path.open() as f:
vect = geopandas.read_file(f)
Recent versions of pandas
should work fine with CloudPath
. For example:
In [1]: import pandas as pd
In [2]: from cloudpathlib import CloudPath
In [3]: df = pd.read_csv(CloudPath('s3://drivendata-public-assets/odsc-west-2019/california-tracts.csv'))
In [4]: df.head()
Out[4]:
GEOID year name parent-location population ... eviction-rate eviction-filing-rate low-flag imputed subbed
0 6001400100 2000 4001.0 Alameda County, California 2497.87 ... 0.00 0.00 1 0 0
1 6001400100 2001 4001.0 Alameda County, California 2497.87 ... 0.00 0.86 1 0 0
2 6001400100 2002 4001.0 Alameda County, California 2497.87 ... 0.80 0.80 1 0 0
3 6001400100 2003 4001.0 Alameda County, California 2497.87 ... 1.49 1.49 1 0 0
4 6001400100 2004 4001.0 Alameda County, California 2497.87 ... 0.00 0.00 1 0 0
[5 rows x 27 columns]
I think the point of the discussion is that we could use cloudpath to have fsspec support generic pathlike objects without having to ask third-party libraries such as geopandas to do extra work.
Ah, interesting. That looks to me like a
geopandas
feature request since they don't supportos.PathLike
. They do say they support filelike objects. You could try the following and see if it works. If they support filelike objects then I think it should be fine withCloudPath
orPath
objects:path = AnyPath(path) with path.open() as f: vect = geopandas.read_file(f)
Recent versions of
pandas
should work fine withCloudPath
. For example:In [1]: import pandas as pd In [2]: from cloudpathlib import CloudPath In [3]: df = pd.read_csv(CloudPath('s3://drivendata-public-assets/odsc-west-2019/california-tracts.csv')) In [4]: df.head() Out[4]: GEOID year name parent-location population ... eviction-rate eviction-filing-rate low-flag imputed subbed 0 6001400100 2000 4001.0 Alameda County, California 2497.87 ... 0.00 0.00 1 0 0 1 6001400100 2001 4001.0 Alameda County, California 2497.87 ... 0.00 0.86 1 0 0 2 6001400100 2002 4001.0 Alameda County, California 2497.87 ... 0.80 0.80 1 0 0 3 6001400100 2003 4001.0 Alameda County, California 2497.87 ... 1.49 1.49 1 0 0 4 6001400100 2004 4001.0 Alameda County, California 2497.87 ... 0.00 0.00 1 0 0 [5 rows x 27 columns]
Sadly it is not the case, and I don't know why... I haven't investigated though, but my workaround is sufficient for now 😉
Thanks, @martindurant. How would you use fsspec
when opening a file with geopandas
? I'm curious what fsspec
implements that would make this work without having to explicitly download to a local file first and point to that local path.
(That said, I would encourage library maintainers to consider supporting os.PathLike
since it is the official Python way to try to go from an object to a local file path.)
(Also, would be curious about your thoughts on testing #109 to see if it is viable! The fact it doesn't take advantage of all of the hard work you've done on caching, streaming, etc. does make it seem less appealing.)
what fsspec implements that would make this work without having to explicitly download to a local file first and point to that local path
fsspec can do this job implicitly; but yes, there are libraries which, typically because of some inner C code, can only work with real OS file handles on the local filesystem.
it is the official Python way to try to go from an object to a local file path
This is an exaggeration! :)
On #109, I think it's totally the way forward. For configuring the relevant backends, you could rely on the fsspec config system, or maybe allow extra kwargs when instantiating a Path, which get passed to other downstream Paths.
This is an exaggeration!
😆 certainly an exaggeration in terms of it actually getting implemented by libraries, but it is a PEP that was accepted for exactly this purpose! https://www.python.org/dev/peps/pep-0519/
fsspec can do this job implicitly; but yes, there are libraries which, typically because of some inner C code, can only work with real OS file handles on the local filesystem.
Ok, I'll do some digging to check out the mechanisms you use for this. When we visited this question, implementing os.PathLike
seemed to us the "right" answer to make this seamless, but maybe there are other approaches worth considering.
implementing os.PathLike seemed to us the "right" answer
I'm sure both are useful! fsspec wants to provide full filesystem classes with all the functionality that implies, and several high-level features that you might find interesting, but not necessary for your core offering. To have those and a nice pathlib interface would be awesome!
(( For instance (I happen to be working on this today), ReferenceFileSystem lets you view blocks of bytes at arbitrary offsets within arbitrary URLs as files, so that a library like zarr (which uses file naming conventions to locate blocks of array data) can load data from third party HDF5, tiff and grib2 files whether or not the libraries made for those formats know how to handle remote files or not. ))
By the way, in my experience, most third party libraries work with file-like objects and paths. If a path-like is passed, they do str()
at the earliest opportunity and end up treating it like a (local) path. Of course, I may well be biased in the set of packages I am exposed to - many of which know to defer opening paths to fsspec.
Hi there! I just wanted to mention that I would benefit from support for pathlib with fsspec: https://github.com/dask/dask/issues/8006 Many thanks for any help! :)
Hi! Ditto @asmith26 - I'd love to see more compatibility between fsspec and cloudpathlib's wonderful pathlib.Path interface.
My use case is when using pandas. I mention this in https://github.com/drivendataorg/cloudpathlib/issues/128.
Related issues:
pd.HDFStore(cloud_path)
https://github.com/drivendataorg/cloudpathlib/issues/240pd.DataFrame.to_csv(cloud_path)
https://github.com/drivendataorg/cloudpathlib/issues/242See also universal_pathlib, which I think is more recent than this discussion - although I am not in a position to compare the two projects.
universal_pathlib
seems nice, but I didn't find any documentation and the project seems (too) young.
As per @grisaitis, I would also love those two approaches to mix!
Other well-known projects supporting fsspec
are rasterio
and xarray
. With geopandas, those libs are pillars to the remote-sensing opensource community .
@remi-braun Are there specific issues that you're seeing? If there are bugs, it would be great to get them addressed. From an API perspective, both rasterio
(as of this PR) and rioxarray
seem to work with CloudPath
objects already without going through fsspec
.
from cloudpathlib import CloudPath
from matplotlib import pyplot as plt
import rasterio as rio
import rioxarray
cp = CloudPath("s3://drivendata-competition-biomassters-public-us/train_agbm/0003d2eb_agbm.tif")
# test rasterio directly
dataset = rio.open(cp)
plt.imshow(dataset.read(1), cmap='pink')
#> <matplotlib.image.AxesImage object at 0x1a2a333d0>
dataset.close()
#> None
# test rasterio + xarray
rds = rioxarray.open_rasterio(cp)
rds
#> <xarray.DataArray (band: 1, y: 256, x: 256)>
#> [65536 values with dtype=float32]
#> Coordinates:
#> * band (band) int64 1
#> * x (x) float64 0.5 1.5 2.5 3.5 4.5 ... 252.5 253.5 254.5 255.5
#> * y (y) float64 0.5 1.5 2.5 3.5 4.5 ... 252.5 253.5 254.5 255.5
#> spatial_ref int64 0
#> Attributes:
#> scale_factor: 1.0
#> add_offset: 0.0
I looked at the XArray docs, and there are a done of complex IO options. It looks like a number of the writing methods won't work since they don't support file-like objects. In this case, passing the str(CloudPath)
off should hand it to their fsspec
implementation. The method that does support file-like objects is engine="scipy"
for netcdf. This has a bug in their implementation where close
gets called multiple times, so we upload multiple times in quick succession. We have a small bug there to make our close
idempotent. Are there other key scenarios in XArray to test/support?
That said, there still seems to be some confusion here about what we can implement within cloudpathlib
. Unfortunately, the bulk of interoperability issues are outside of our control. I'll do my best to clear that up here and then close this issue.
First, cloudpathlib
is already generally compatible with fsspec
. Doing fs.open(str(cloudpath_object))
should work in most scenarios as long as the fsspec
filesystem is configured properly. This lets you manipulate paths with cloudpathlib and read/write with fsspec
. I wish you didn't have to call str
for this, but to change that fsspec
would need a simple shim to make the calling of str
unncessary and do the dispatch/validation themselves.
Second, the general case for other libraries. cloudpathlib
provides standard PathLike
and file-like (through .open
) objects for working with the contents of files. How other libraries decide to treat PathLike and file-like objects is out of our control. Third party libraries that support fsspec
have their own code that they wrote to special case being passed strings like s3://...
. There is no way for cloudpathlib
to know a library has these special cases, and even if we did there would be no way for us to pretend to be a normal string to trigger this code. If you want to trigger other libraries' special-case code like this either calling str
on the CloudPath
or CloudPath.as_uri()
is the most likely way to make this work. Calling str
on a Path
object should also work for most libraries, so you can still write CloudPath/Path agnostic code that should work with these libraries and take advantage of how they work with fsspec
.
Third, developers of other libraries shouldn't need any special case code for CloudPath
s to "just work" (as long as they handle PathLike and file-like objects normally). The one caveat is that per this PEP fspath
can only return a string. Because of this, in writing scenarios code that uses CloudPath
s developers should use CloudPath.open
and hand off the file-like object to other libraries. Unfortunately, there is no feasible way to effectively detect if libraries use the built-in open
to operate on a CloudPath
.
Another separate question is if we will develop and maintain an fsspec
provider like the PoC in #109. We don't have any plans to do so. Doing so would not address any of the considerations above. It would just mean we do provider communication through fsspec
, but not that any third party library would magically do the right thing with the file-like/PathLike interfaces we provide. You would still need the str
/.as_uri
calls to get that library to run their fsspec
code which is almost always based on interpreting strings. In addition to not addressing these concerns, implementing that backend will add significant maintenance burden that doesn't seem worth it until there is sufficient user demand for providers/functionality that we don't have implemented with the official SDKs. Maybe the best option is for an enterprising developor to take #109 and create a seprate project that supports an fsspec
client. We already support and easy way to make extesions by creating and registering custom CloudPath
implementations, that is a relatively small codebase. It's not well documented, but looking at any of the existing implementations contains everything necessary to build an extension to support a new provider. More docs coming soon. Again, worth reiterating that doing so would not mean that libraries that know about fsspec
would then magically know what to do.
So with all of that said, there's nothing we can implement in cloudpathlib
to make substantive progress here. The only thing that we can do is encourage developers of other libraries where things are broken to support best practices, which allow them to support CloudPath
s. These include:
Checking if objects passed to read/write functions are PathLike
, and if so using what it gets from fspath
to do the reading/writing.
Ensure that functions that do reading and writing support file-like objects, not just strings as paths.
The other best practice to ensure everything works well is for consumers of cloudpathlib
to use CloudPath.open
in write scenarios and pass the file-like object to other libraries.
One last word here: there many be real bugs on our side that prevent third party libraries from working with CloudPath
s. All of this discussion is about the design and intent, and we're happy to fix bugs that are discovered.
Hello,
Just to answer, since this issue is closed.
The main problem is the libs based on opening files with fsspec
won't work properly with CloudPaths
.
They will call fspath
to resolve it (instead of str()) and therefore the file called will be cached (this is not exaclty what we want, and this will be done silently)
For single files like tif
this will be not optimal but will work (your example)
But for nested files likes VRT, this will crash as only the header is cached and not the relative files. This will break the paths inside the VRT.
Note that GDAL (and therefore rasterio) can read VRT stored on the cloud, and this is the expected behavior.
A little example:
vrt = AnyPath("https://s3.unistra.fr/merge_32-31.vrt")
rasterio.open(str(vrt)) # will work
rasterio.open(vrt) # will fail for the wrong reason, saying that the files linked inside the VRT don't exist
Thanks @remi-braun, I opened a new issue to track that kind of scenario. fsspec
integration isn't the root issue, so it merits its own conversation in #315.
I see that we have some goals in common. fsspec is a mature library in use by Dask, pandas, zarr and others. However, we don't have a good implementation of
pathlib
on our API. Would you like to join forces to implement pathlib-on-fsspec, using the code you have already developed?xref: https://github.com/intake/filesystem_spec/issues/434
another attempt to do this, explicitly for fsspec: https://github.com/datarevenue-berlin/drfs