Open jorisvandenbossche opened 2 years ago
It looks like GDAL's VSI plugin architecture should work, but (if I follow correctly) there are some potential issues for us lurking here.
If the input is a single entity, such as a single GPKG or a zipped shapefile (if opened via ZipFile(path) or possibly chained with /vsizip/), then we should be able to read it using a file-based interface. However, I think if we open a multi-file dataset (e.g., a shapefile with sidecar files) via fsspec, we don't have a handle on the filesystem except perhaps through attributes specific to fsspec (e.g., an open file has an .fs attribute). I think we need to provide some of the filesystem methods in order for GDAL to read the sidecar files (and create them when we enable write functionality for this).
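To make the sidecar problem concrete, here is a minimal sketch of how sidecar paths could be located once some filesystem "exists" method is available. The names find_sidecars and SIDECAR_SUFFIXES are hypothetical, the suffix list is not exhaustive, and the exists callable only stands in for something like the exists method on the filesystem behind an fsspec file's .fs attribute. It is not how GDAL or pyogrio actually resolve sidecars.

```python
import posixpath

# Common shapefile sidecar extensions (illustrative, not exhaustive).
SIDECAR_SUFFIXES = (".shx", ".dbf", ".prj", ".cpg")


def find_sidecars(path, exists):
    """Return the sidecar paths next to `path` that `exists` reports as present.

    `exists` is any callable taking a path string and returning a bool;
    it is a hypothetical stand-in for a filesystem method.
    """
    # splitext works on the raw string, so URL-style paths are not normalized.
    stem, _ = posixpath.splitext(path)
    candidates = [stem + suffix for suffix in SIDECAR_SUFFIXES]
    return [candidate for candidate in candidates if exists(candidate)]


# Stand-in "filesystem": a set of known paths.
files = {"data/my.shp", "data/my.shx", "data/my.dbf"}
sidecars = find_sidecars("data/my.shp", files.__contains__)
```

With only the open .shp handle and no filesystem object, there is nothing to pass as exists, which is the gap described above.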
Rasterio approached this by adding the concept of openers to the API, where you pass in a path as a string and an opener that provides an interface for opening the file / filesystem. It then detects the type of opener passed in and wraps it in appropriate wrapper classes to interface with the underlying VSI filesystem plugin API calls.
I'm not particularly keen on adding opener as a keyword here. Are we OK with instead implementing this more narrowly for specific filesystems, namely fsspec and ZipFile (sidestepping GDAL's built-in /vsizip/), where we can check the instance type of an open handle and provide specific wrappers for these?
I think this would give us the following options for read (and hopefully write):
- BytesIO instance containing dataset bytes
- ZipFile with opened specific file where we can use something to get at filesystem (not sure what yet)

    from zipfile import ZipFile
    z = ZipFile("my.zip")
    handle = z.open("my.shp")
    df = read_dataframe(handle, ...)

- fsspec file where we can use the .fs attribute to get filesystem

    import fsspec
    fs = fsspec.filesystem("http")
    handle = fs.open("https://.../my.shp")
    df = read_dataframe(handle, ...)

(note: examples are speculative, not sure if this will work)
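The instance-type check proposed above could look roughly like the following. This is only a sketch under assumptions: classify_handle and its return labels are hypothetical, and the fsspec case is detected by duck-typing on the .fs attribute so the example runs without fsspec installed. A real implementation would return wrapper objects rather than strings.

```python
from io import BytesIO
from zipfile import ZipExtFile, ZipFile


def classify_handle(handle):
    """Return a label for the kind of wrapper an open handle would need.

    Hypothetical helper: the labels stand in for wrapper classes.
    """
    if isinstance(handle, BytesIO):
        return "memory"
    if isinstance(handle, ZipExtFile):
        # ZipFile.open() returns a ZipExtFile for reading a member.
        return "zip"
    if hasattr(handle, "fs"):
        # Assumption: an fsspec open file exposes its filesystem as `.fs`.
        return "fsspec"
    raise TypeError(f"unsupported handle type: {type(handle).__name__}")


# Build a small archive in memory so the example needs no files on disk.
buffer = BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("my.shp", b"placeholder")

with ZipFile(buffer) as archive, archive.open("my.shp") as member:
    zip_kind = classify_handle(member)

memory_kind = classify_handle(BytesIO(b"placeholder"))
```

Note that the order of checks matters if a handle type ever satisfies more than one test, so the most specific checks come first.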
Copying from https://github.com/geopandas/pyogrio/issues/22#issuecomment-965134925
It should be possible to connect a Python file-like object with a VSIVirtualHandle (https://gdal.org/api/cpl_cpp.html#classVSIVirtualHandle), so you could avoid reading the full file-like object into memory as bytes, instead forwarding the C "read"/"seek" calls to Python read/seek calls. And this is exactly what @djhoese contributed to rasterio in https://github.com/rasterio/rasterio/pull/2141
I don't know if many OGR file formats would benefit from this, though (they would need to support reading/parsing only part of the file, if asking for a subset of columns/features)
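As a rough illustration of the forwarding idea, the pure-Python sketch below mimics the shape of the Seek/Tell/Read methods on VSIVirtualHandle and passes each call through to a file-like object. FileLikeAdapter is a hypothetical name, the actual bridge would live in Cython/C and is not shown, and this is not a description of how the rasterio PR implements it.

```python
import io


class FileLikeAdapter:
    """Forward Seek/Tell/Read-style calls to a Python file-like object.

    Hypothetical sketch: only the bytes asked for are pulled from the
    underlying object, so the full dataset is never copied into memory.
    """

    def __init__(self, fileobj):
        self._file = fileobj

    def seek(self, offset, whence=io.SEEK_SET):
        self._file.seek(offset, whence)
        return 0  # C-style success code

    def tell(self):
        return self._file.tell()

    def read(self, size, count):
        """Read up to `count` items of `size` bytes and return the bytes."""
        return self._file.read(size * count)


source = io.BytesIO(b"0123456789")
adapter = FileLikeAdapter(source)
adapter.seek(4)
chunk = adapter.read(1, 3)
position = adapter.tell()
```

Whether this pays off depends on the driver issuing small, targeted reads. A format that has to be parsed from start to finish would still end up reading the whole object through the adapter.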