developmentseed / obstore

Simple, fast integration with Amazon S3, Google Cloud Storage, Azure Storage, and S3-compliant APIs like Cloudflare R2
https://developmentseed.org/obstore
MIT License
146 stars 3 forks source link

fsspec wrapper #32

Closed kylebarron closed 3 weeks ago

kylebarron commented 1 month ago

Wrapper around the existing object store API we present to Python to be used

E.g. something like this should work, but doesn't quite yet:

import object_store_rs as obs
import pyarrow.parquet as pq
from object_store_rs.fsspec import AsyncFsspecStore

store = obs.store.HTTPStore.from_url("https://github.com")
fs = AsyncFsspecStore(store)
url = "opengeospatial/geoparquet/raw/refs/heads/main/examples/example.parquet"
test = pq.read_metadata(url, filesystem=fs)
File ~/github/developmentseed/object-store-rs/.venv/lib/python3.11/site-packages/fsspec/spec.py:1918, in AbstractBufferedFile._fetch_range(self, start, end)
   1916 def _fetch_range(self, start, end):
   1917     """Get the specified set of bytes from remote"""
-> 1918     raise NotImplementedError

cc @martindurant

martindurant commented 1 month ago

Here, you are using the file-like API. I can make a thin wrapper which calls cat_file with a range for any input filesystem, and then this should work.

kylebarron commented 1 month ago

I figured that I only needed as a minimum to implement the methods on the AsyncFileSystem base class that raised NotImplementedError, and that everything else would be supported out of the box?

martindurant commented 1 month ago

Yeah, a fallback implementation in fsspec would totally make sense. This does the business

--- a/fsspec/spec.py
+++ b/fsspec/spec.py
@@ -1915,7 +1915,7 @@ class AbstractBufferedFile(io.IOBase):

     def _fetch_range(self, start, end):
         """Get the specified set of bytes from remote"""
-        raise NotImplementedError
+        return self.fs.cat_file(self.path, start=start, end=end)

     def read(self, length=-1):

together with

--- a/object-store-rs/python/object_store_rs/fsspec.py
+++ b/object-store-rs/python/object_store_rs/fsspec.py
@@ -55,7 +55,7 @@ class AsyncFsspecStore(fsspec.asyn.AsyncFileSystem):
             resp = await obs.get_async(self.store, path)
             return await resp.bytes_async()

-        if start and end:
+        if start is not None and end is not None:
             return await obs.get_range_async(
                 self.store, path, offset=start, length=end - start
             )

(because start can be and often is 0, and actually None means 0 anyway).

Leads to:

In [1]: import object_store_rs as obs
   ...: import pyarrow.parquet as pq
   ...: from object_store_rs.fsspec import AsyncFsspecStore
   ...:
   ...: store = obs.store.HTTPStore.from_url("https://github.com")
   ...: fs = AsyncFsspecStore(store)
   ...: url = "opengeospatial/geoparquet/raw/refs/heads/main/examples/example.parquet"
   ...: pq.read_metadata(url, filesystem=fs)

Out[1]:
<pyarrow._parquet.FileMetaData object at 0x10a8539c0>
  created_by: parquet-cpp-arrow version 16.1.0
  num_columns: 10
  num_rows: 5
  num_row_groups: 1
  format_version: 2.6
  serialized_size: 6685
kylebarron commented 1 month ago

Until https://github.com/fsspec/filesystem_spec/pull/1732 is merged and released, is it expected for AsyncFsspecStore to subclass both from fsspec.asyn.AsyncFileSystem and AbstractBufferedFile?

martindurant commented 1 month ago

is it expected for AsyncFsspecStore to subclass both from fsspec.asyn.AsyncFileSystem and AbstractBufferedFile?

That sounds ... painful. The linked PR has no negative side effects, so I only need to write some kind of test for it.

kylebarron commented 3 weeks ago

@martindurant Do you have any interest in pushing this PR over the line, since you know fsspec so well?

martindurant commented 3 weeks ago

Sure! What needs doing besides the fsspec PR?

kylebarron commented 3 weeks ago

I'm not fully sure how to test it. I'm also not sure if there are any provided methods in the fsspec base class that we should be implementing, that could be more efficiently provided from the Rust side (without adding a bunch of new Rust code; I'd like this fsspec wrapper to ideally be python-only)

kumare3 commented 3 weeks ago

we would love to contribute to this as well and make it work with flyteorg/flyte - wdyt?

kylebarron commented 3 weeks ago

I'm happy to accept contributions

pingsutw commented 3 weeks ago

@kylebarron, When will this be merged? I'd like to help work on it as well.

kylebarron commented 3 weeks ago

I'd be happy to accept a PR with tests. There was some progress from @martindurant in #60 which was unblocked by adding test setup in https://github.com/developmentseed/obstore/pull/62

kylebarron commented 3 weeks ago

Superseded by https://github.com/developmentseed/obstore/pull/63