geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License

ENH: reading from file-like objects #42

Open jorisvandenbossche opened 2 years ago

jorisvandenbossche commented 2 years ago

Copying from https://github.com/geopandas/pyogrio/issues/22#issuecomment-965134925

It should be possible to connect a Python file-like object with a VSIVirtualHandle (https://gdal.org/api/cpl_cpp.html#classVSIVirtualHandle), so you could avoid reading the full file-like object into memory as bytes and instead forward the C "read"/"seek" calls to Python read/seek calls.
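
To illustrate the forwarding idea only: below is a minimal pure-Python sketch, with no GDAL involved. `ReadSeekForwarder` is a hypothetical name, not part of pyogrio, rasterio, or GDAL; it stands in for the Python side that C-level read/seek callbacks would delegate to, and counts how many bytes were actually requested.

```python
import io


class ReadSeekForwarder:
    """Hypothetical stand-in for the Python side of a virtual file handle.

    Each method simply delegates to the wrapped file-like object, which is
    what C-level read/seek/tell callbacks would end up calling.
    """

    def __init__(self, fileobj):
        self._fileobj = fileobj
        self.bytes_read = 0  # bookkeeping for illustration only

    def read(self, nbytes):
        data = self._fileobj.read(nbytes)
        self.bytes_read += len(data)
        return data

    def seek(self, offset, whence=io.SEEK_SET):
        return self._fileobj.seek(offset, whence)

    def tell(self):
        return self._fileobj.tell()


# Read only the first and last 6 bytes of a ~1 KB in-memory "file".
source = io.BytesIO(b"HEADER" + b"x" * 1000 + b"FOOTER")
handle = ReadSeekForwarder(source)
header = handle.read(6)
handle.seek(-6, io.SEEK_END)
footer = handle.read(6)
```

Here only 12 of the 1012 bytes are pulled through the wrapper, which is the benefit being described; whether a given OGR driver actually reads this selectively is a separate question.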

And this is exactly what @djhoese contributed to rasterio in https://github.com/rasterio/rasterio/pull/2141

I don't know if many OGR file formats would benefit from this, though (they would need to support reading/parsing only part of the file, if asking for a subset of columns/features)

brendan-ward commented 2 months ago

It looks like GDAL's VSI plugin architecture should work, but (if I follow correctly) there are some potential issues lurking here for us.

If the input is a single entity, such as a single GPKG or zipped shapefile (if opened via ZipFile(path) or possibly chained with /vsizip/), then we should be able to read it using a file-based interface. However, I think if we open a multi-file dataset (e.g., shapefile with sidecar files) via fsspec, we don't have a handle on the filesystem except perhaps through attributes specific to fsspec (e.g., an open file has an .fs attribute). I think we need to provide some of the filesystem methods in order for GDAL to read the sidecar files (and create them when we enable write functionality for this).
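
A rough sketch of the sidecar problem, assuming the open handle exposes a filesystem object through an `.fs` attribute that has an `exists(path)` method (as fsspec filesystems do). `find_sidecars`, `FakeFS`, and `FakeHandle` are hypothetical names used only for illustration, and the extension list is not exhaustive.

```python
import posixpath

# Common shapefile sidecar extensions (not exhaustive).
SIDECAR_EXTENSIONS = (".shx", ".dbf", ".prj", ".cpg")


def find_sidecars(handle, path):
    """Return sidecar paths found next to `path`.

    Returns an empty list when the handle gives no access to a filesystem,
    which is the case where GDAL could not locate the sidecar files.
    """
    fs = getattr(handle, "fs", None)
    if fs is None:
        return []
    stem, _ = posixpath.splitext(path)
    candidates = [stem + ext for ext in SIDECAR_EXTENSIONS]
    return [p for p in candidates if fs.exists(p)]


class FakeFS:
    """Hypothetical in-memory filesystem with only an exists() method."""

    def __init__(self, paths):
        self._paths = set(paths)

    def exists(self, path):
        return path in self._paths


class FakeHandle:
    """Hypothetical open-file object that carries its filesystem as .fs."""

    def __init__(self, fs):
        self.fs = fs


fs = FakeFS(["data/my.shp", "data/my.shx", "data/my.dbf"])
found = find_sidecars(FakeHandle(fs), "data/my.shp")
missing_fs = find_sidecars(object(), "data/my.shp")
```

With a plain file-like object that has no `.fs`, the function has nothing to query, which mirrors the concern above.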

Rasterio approached this by adding the concept of openers to the API, where you pass in a path as a string and an opener that provides an interface for opening the file / filesystem. It then detects the type of opener passed in and wraps it in the appropriate wrapper classes to interface with the underlying VSI filesystem plugin API calls.

I'm not particularly keen on adding opener as a keyword here. Are we OK with instead implementing this more narrowly for specific filesystems, namely fsspec and ZipFile (sidestepping GDAL's built-in /vsizip/), where we can check the instance type of an open handle and provide specific wrappers for these?
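
As a sketch of what checking the instance type could look like: `classify_handle` and its return labels are hypothetical, and detecting fsspec via an `.fs` attribute is an assumption rather than a confirmed design. A real implementation would return wrapper objects instead of strings.

```python
import io
import zipfile


def classify_handle(handle):
    """Pick a (hypothetical) wrapper strategy based on the handle's type."""
    if isinstance(handle, zipfile.ZipExtFile):
        # Member opened via ZipFile.open()
        return "zipfile"
    if hasattr(handle, "fs"):
        # Assumption: fsspec open files expose their filesystem as .fs
        return "fsspec"
    if hasattr(handle, "read"):
        # Any other readable file-like object
        return "generic"
    raise TypeError(f"unsupported handle type: {type(handle).__name__}")


# Build a small zip archive in memory and classify one of its members.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("my.shp", b"not a real shapefile")

with zipfile.ZipFile(buffer) as z:
    with z.open("my.shp") as member:
        kind = classify_handle(member)
```

The order of the checks matters: the most specific types are tested first, with the generic file-like case as the fallback.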

I think this would give us the following options for read (and hopefully write):

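from zipfile import ZipFile

from pyogrio import read_dataframe

# Read a shapefile member directly from an open zip archive
with ZipFile("my.zip") as z:
    handle = z.open("my.shp")
    df = read_dataframe(handle, ...)

import fsspec

# Read a shapefile through an fsspec filesystem handle
fs = fsspec.filesystem("http")
handle = fs.open("https://.../my.shp")
df = read_dataframe(handle, ...)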

(note: examples are speculative, not sure if this will work)