geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License

ENH: add better support for remote URLs #252

Open brendan-ward opened 1 year ago

brendan-ward commented 1 year ago

See GeoPandas #2908 and the issues related to GeoPandas #2796, which concern the handling of URLs whose filename lacks an extension.

jorisvandenbossche commented 1 year ago

Testing the GDAL support for urls, using a small patch to disable any pre-processing of file paths:

```diff
--- a/pyogrio/raw.py
+++ b/pyogrio/raw.py
@@ -129,7 +129,8 @@ def read(
         "geometry_type": ""
     }
     """
-    path, buffer = get_vsi_path(path_or_buffer)
+    path, buffer = path_or_buffer, None
```

I see the following:

The same can be seen using ogrinfo (so I assume ogrinfo doesn't do any custom handling of the path, compared to the C API we use, and can thus also more easily be used for testing this):

$ ogrinfo https://raw.githubusercontent.com/geopandas/geopandas/main/geopandas/tests/data/null_geom.geojson
$ ogrinfo /vsizip/vsicurl/https://raw.githubusercontent.com/geopandas/geopandas/main/geopandas/datasets/nybb_16a.zip
$ ogrinfo /vsizip/{/vsicurl/https://geonode.goosocean.org/download/480}
$ ogrinfo https://demo.pygeoapi.io/stable/collections/obs/items

jorisvandenbossche commented 1 year ago

Something else to note: a raw URL to a file doesn't only work with GeoJSON (first example above); it also works with Parquet files, for example (ogrinfo https://github.com/opengeospatial/geoparquet/raw/main/examples/example.parquet works). But with a larger test file, there is a clear difference between the plain URL and the /vsicurl/ variant. With the plain URL:

$ ogrinfo https://storage.googleapis.com/open-geodata/linz-examples/nz-building-outlines.parquet

This takes a long time and generates a lot of download traffic (I stopped it because the file is more than a gigabyte), so a plain URL will probably always download the full file.

In contrast, when using vsicurl, GDAL can use random access to peek at the file, and so it can quickly detect it is a GeoParquet file and what type of data it contains etc (the following only takes around 1 to 2s):

$ ogrinfo /vsicurl/https://storage.googleapis.com/open-geodata/linz-examples/nz-building-outlines.parquet
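
The random access mentioned here boils down to HTTP range requests. As a rough illustration only, here is what such a request looks like when built with the Python standard library. `build_range_request` is a hypothetical helper, and the default chunk of 16384 bytes is an assumption taken from the `Downloading 0-16383` lines in the debug output later in this thread, not a documented constant.

```python
from urllib.request import Request


def build_range_request(url: str, start: int = 0, length: int = 16384) -> Request:
    """Build (but do not send) a GET request for `length` bytes from `start`."""
    if start < 0 or length <= 0:
        raise ValueError("start must be >= 0 and length must be > 0")
    end = start + length - 1  # HTTP byte ranges are inclusive on both ends
    return Request(url, headers={"Range": f"bytes={start}-{end}"})


req = build_range_request(
    "https://storage.googleapis.com/open-geodata/linz-examples/nz-building-outlines.parquet"
)
print(req.get_header("Range"))  # bytes=0-16383
```

A server that honors the header answers with status 206 and only the requested bytes; a server that ignores it answers with 200 and the full body.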

In summary, I think it is always good to add /vsicurl/ when possible (and we certainly need to keep doing it when it is required, e.g. for zip files). But that means we need some way to detect when it is not possible (the 4th case above, a URL with content negotiation).
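
A minimal sketch of the prefixing described above, reproducing the path forms used in the ogrinfo commands earlier in this comment. `to_vsi_path` and its `is_zip` parameter are hypothetical names for illustration; this is not pyogrio's `get_vsi_path`, and it makes no attempt to detect URLs for which /vsicurl/ fails.

```python
from urllib.parse import urlparse


def to_vsi_path(url: str, is_zip: bool = False) -> str:
    """Prefix an http(s) URL with /vsicurl/, adding /vsizip/ for zip archives.

    Set is_zip=True for archives whose URL has no .zip extension.
    """
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return url  # local paths and existing /vsi.../ paths are left untouched
    vsi = f"/vsicurl/{url}"
    if parsed.path.lower().endswith(".zip"):
        return f"/vsizip{vsi}"
    if is_zip:
        # Braces mark where the archive name ends when there is no extension.
        return f"/vsizip/{{{vsi}}}"
    return vsi


print(to_vsi_path("https://geonode.goosocean.org/download/480", is_zip=True))
# /vsizip/{/vsicurl/https://geonode.goosocean.org/download/480}
```

The caller still has to know that an extension-less URL points at a zip file; nothing in the URL itself reveals that.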

jorisvandenbossche commented 1 year ago

Some other findings: the URLs that don't work with /vsicurl/ often do work with it when the option to skip the HEAD request (use_head=no) is specified.

Without vsicurl it works (just downloads the full file):

$ ogrinfo -ro -so --debug on 'https://demo.pygeoapi.io/stable/collections/obs/items'
HTTP: Fetch(https://demo.pygeoapi.io/stable/collections/obs/items)
HTTP: libcurl/7.87.0 OpenSSL/3.1.1 zlib/1.2.13 libssh2/1.10.0 nghttp2/1.47.0
HTTP: These HTTP headers were set: Accept: text/plain, application/json
GDAL: GDALOpen(https://demo.pygeoapi.io/stable/collections/obs/items, this=0x557748f5ff30) succeeds as GeoJSON.
INFO: Open of `https://demo.pygeoapi.io/stable/collections/obs/items'
      using driver `GeoJSON' successful.
OGR: GetLayerCount() = 1

1: items (Point)
GDAL: GDALClose(https://demo.pygeoapi.io/stable/collections/obs/items, this=0x557748f5ff30)
GDAL: In GDALDestroy - unloading GDAL shared library.

With vsicurl it fails:

$ ogrinfo -ro -so --debug on '/vsicurl/https://demo.pygeoapi.io/stable/collections/obs/items'
HTTP: libcurl/7.87.0 OpenSSL/3.1.1 zlib/1.2.13 libssh2/1.10.0 nghttp2/1.47.0
Warning 1: HTTP response code on https://demo.pygeoapi.io/stable/collections/obs/items: 500
VSICURL: GetFileSize(https://demo.pygeoapi.io/stable/collections/obs/items)=0  response_code=500
VSICURL: Request at offset 0, after end of file
VSICURL: GetFileList(/vsicurl/https://demo.pygeoapi.io/stable/collections/obs)
FAILURE:
Unable to open datasource `/vsicurl/https://demo.pygeoapi.io/stable/collections/obs/items' with the following drivers.
...

With use_head=no it works again:

$ ogrinfo -ro -so --debug on '/vsicurl?use_head=no&url=https://demo.pygeoapi.io/stable/collections/obs/items'
HTTP: libcurl/7.87.0 OpenSSL/3.1.1 zlib/1.2.13 libssh2/1.10.0 nghttp2/1.47.0
VSICURL: Downloading 0-16383 (https://demo.pygeoapi.io/stable/collections/obs/items)...
VSICURL: Got response_code=200
GeoJSON: First pass: 100.00 %
GDAL: GDALOpen(/vsicurl?use_head=no&url=https://demo.pygeoapi.io/stable/collections/obs/items, this=0x55df11fdc180) succeeds as GeoJSON.
INFO: Open of `/vsicurl?use_head=no&url=https://demo.pygeoapi.io/stable/collections/obs/items'
      using driver `GeoJSON' successful.
OGR: GetLayerCount() = 1

1: items (Point)
GDAL: GDALClose(/vsicurl?use_head=no&url=https://demo.pygeoapi.io/stable/collections/obs/items, this=0x55df11fdc180)
GDAL: In GDALDestroy - unloading GDAL shared library.
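
For reference, a small sketch of building the query-style path used in the command above from Python. `vsicurl_with_options` is a hypothetical helper. The URL is inserted unencoded, exactly as in that command; a URL that itself contains `&` would likely need percent-encoding, which this sketch does not attempt.

```python
def vsicurl_with_options(url: str, **options: str) -> str:
    """Build a '/vsicurl?opt=val&url=...' path; plain '/vsicurl/<url>' if no options."""
    if not options:
        return f"/vsicurl/{url}"
    parts = [f"{name}={value}" for name, value in options.items()]
    parts.append(f"url={url}")  # the url parameter goes last, as in the command above
    return "/vsicurl?" + "&".join(parts)


print(vsicurl_with_options(
    "https://demo.pygeoapi.io/stable/collections/obs/items", use_head="no"
))
# /vsicurl?use_head=no&url=https://demo.pygeoapi.io/stable/collections/obs/items
```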

But even that is not a guarantee; for example, this URL fails regardless:

$ ogrinfo -ro -so --debug on "/vsicurl?use_head=no&url=https://gis-calema.opendata.arcgis.com/datasets/59d92c1bf84a438d83f78465dce02c61_0.geojson"
VSICURL: GetFileList(/vsicurl?use_head=no&url=https://gis-calema.opendata.arcgis.com/datasets)
HTTP: libcurl/7.87.0 OpenSSL/3.1.1 zlib/1.2.13 libssh2/1.10.0 nghttp2/1.47.0
VSICURL: GetFileSize(https://gis-calema.opendata.arcgis.com/datasets/59d92c1bf84a438d83f78465dce02c61_0.geojson)=0  response_code=200
VSICURL: Request at offset 0, after end of file
VSICURL: Request at offset 0, after end of file
VSICURL: GetFileSize(https://gis-calema.opendata.arcgis.com/datasets/59d92c1bf84a438d83f78465dce02c61_0.geojso1)=44601  response_code=200
VSICURL: Downloading 0-16383 (https://gis-calema.opendata.arcgis.com/datasets/59d92c1bf84a438d83f78465dce02c61_0.geojso1)...
VSICURL: Got response_code=200
VSICURL: Got more data than expected : 44601 instead of 16384
FAILURE:
Unable to open datasource `/vsicurl?use_head=no&url=https://gis-calema.opendata.arcgis.com/datasets/59d92c1bf84a438d83f78465dce02c61_0.geojson' with the following drivers.

jorisvandenbossche commented 1 year ago

To summarize my exploration of this topic: it's unfortunately just not that simple. URLs and servers have varying capabilities, the vsicurl filesystem has various options that might need to be configured to be able to read a given URL (https://gdal.org/user/virtual_file_systems.html#vsicurl-http-https-ftp-files-random-access), etc.

The approach I am currently taking in geopandas (https://github.com/geopandas/geopandas/pull/2914) to try to fix this is to check whether the response headers for the URL indicate support for range requests ("Accept-Ranges"). That keeps the improvement from 0.13, which uses /vsicurl/ for large files on servers that support this instead of downloading them upfront in geopandas itself, while still downloading upfront for other URLs.

We could do something similar in pyogrio as well. But whatever we infer by default, we should probably also give users more control over what is done (e.g. allow disabling this).
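
A rough sketch of what that could look like, using only the standard library. This is not the code from the geopandas PR; `accepts_byte_ranges`, `fetch_headers`, `resolve_path` and the `use_vsicurl` override are all hypothetical names. Note also that, unlike geopandas, the fallback here simply hands the plain URL to GDAL rather than downloading the file first.

```python
from typing import Mapping, Optional
from urllib.request import Request, urlopen


def accepts_byte_ranges(headers: Mapping[str, str]) -> bool:
    """True if the response headers advertise byte-range support."""
    for name, value in headers.items():
        if name.lower() == "accept-ranges":  # header names are case-insensitive
            return value.strip().lower() == "bytes"
    return False


def fetch_headers(url: str, timeout: float = 10.0) -> Mapping[str, str]:
    """Issue a HEAD request and return the response headers (needs network)."""
    with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
        return dict(response.headers.items())


def resolve_path(
    url: str, headers: Mapping[str, str], use_vsicurl: Optional[bool] = None
) -> str:
    """Return '/vsicurl/<url>' or the plain URL.

    use_vsicurl=None infers from the headers; True or False overrides it.
    """
    if use_vsicurl is None:
        use_vsicurl = accepts_byte_ranges(headers)
    return f"/vsicurl/{url}" if use_vsicurl else url


# Inference from headers, then an explicit override that disables it.
print(resolve_path("https://example.com/a.parquet", {"Accept-Ranges": "bytes"}))
print(resolve_path("https://example.com/a.parquet", {"Accept-Ranges": "bytes"}, use_vsicurl=False))
```

`fetch_headers` raises `urllib.error.HTTPError` when the server rejects the HEAD request, so a real implementation would need to catch that and fall back to the plain URL.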