Open brendan-ward opened 1 year ago
Testing GDAL's support for URLs, using a small patch that disables any pre-processing of file paths, I see the following:
- Plain URL to a file: works without /vsicurl/ prepended.
  pyogrio.read_dataframe("https://raw.githubusercontent.com/geopandas/geopandas/main/geopandas/tests/data/null_geom.geojson")
- URL to a zip file: needs /vsizip/ prepended (and thus also /vsicurl/).
  pyogrio.read_dataframe("/vsizip/vsicurl/https://raw.githubusercontent.com/geopandas/geopandas/main/geopandas/datasets/nybb_16a.zip")
- URL to a zip file without a .zip extension: /vsizip/vsicurl/ also doesn't work without adding { }.
  pyogrio.read_dataframe("/vsizip/{/vsicurl/https://geonode.goosocean.org/download/480}")
- URL with content negotiation: works as a plain URL, but fails with /vsicurl/ prepended (this gives a "HTTP response code: 500").
  pyogrio.read_dataframe("https://demo.pygeoapi.io/stable/collections/obs/items")
  pyogrio.read_dataframe("/vsicurl/https://demo.pygeoapi.io/stable/collections/obs/items")

The same can be seen using ogrinfo (so I assume ogrinfo doesn't do any custom handling of the path, compared to the C API we use, and can thus more easily be used for testing this):
$ ogrinfo https://raw.githubusercontent.com/geopandas/geopandas/main/geopandas/tests/data/null_geom.geojson
$ ogrinfo /vsizip/vsicurl/https://raw.githubusercontent.com/geopandas/geopandas/main/geopandas/datasets/nybb_16a.zip
$ ogrinfo /vsizip/{/vsicurl/https://geonode.goosocean.org/download/480}
$ ogrinfo https://demo.pygeoapi.io/stable/collections/obs/items
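The four cases above differ only in which prefixes are put in front of the URL. As a rough sketch of that prefixing logic (the helper name `to_gdal_path` and its parameters are hypothetical, not part of pyogrio or GDAL), it could look like this:

```python
def to_gdal_path(url, zipped=False, has_zip_extension=True, use_vsicurl=True):
    """Build a GDAL path string for a remote URL (hypothetical helper)."""
    if not zipped:
        # Plain remote file: either the raw URL (GDAL does a simple HTTP
        # fetch) or the /vsicurl/ handler for random access.
        return f"/vsicurl/{url}" if use_vsicurl else url

    # Zip archives need /vsizip/, which in turn needs /vsicurl/ for URLs.
    inner = f"/vsicurl/{url}"
    if has_zip_extension:
        # Produces the /vsizip//vsicurl/... form mentioned in this thread.
        return f"/vsizip/{inner}"
    # Without a .zip extension the inner path is wrapped in { }.
    return f"/vsizip/{{{inner}}}"


print(to_gdal_path("https://geonode.goosocean.org/download/480",
                   zipped=True, has_zip_extension=False))
```

Whether a URL points to a zip archive, and whether `/vsicurl/` will actually work for it, still has to be decided by the caller; the sketch only assembles the string.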
Something else to notice is that the raw URL to a file doesn't only work with GeoJSON (first example above); it also works with Parquet files, for example (ogrinfo https://github.com/opengeospatial/geoparquet/raw/main/examples/example.parquet works).
But here, using a larger file to test with, there is a clear difference between the plain URL and /vsicurl/. First the plain URL:
$ ogrinfo https://storage.googleapis.com/open-geodata/linz-examples/nz-building-outlines.parquet
This takes a long time generating a lot of download traffic (I stopped it since it's more than a gig), and so using a plain url will probably always download the full file.
In contrast, when using vsicurl, GDAL can use random access to peek at the file, and so it can quickly detect it is a GeoParquet file and what type of data it contains etc (the following only takes around 1 to 2s):
$ ogrinfo /vsicurl/https://storage.googleapis.com/open-geodata/linz-examples/nz-building-outlines.parquet
In summary, I think it is always good to add /vsicurl when possible (and we certainly need to keep doing it when needed, e.g. for zip files). But that means we need some way to detect when it is not possible (the 4th case above, URL with content negotiation).
Some other findings: the URLs that don't work with /vsicurl/ often do work with it when specifying the option to not use HEAD requests (use_head=no).
Without vsicurl it works (just downloads the full file):
$ ogrinfo -ro -so --debug on 'https://demo.pygeoapi.io/stable/collections/obs/items'
HTTP: Fetch(https://demo.pygeoapi.io/stable/collections/obs/items)
HTTP: libcurl/7.87.0 OpenSSL/3.1.1 zlib/1.2.13 libssh2/1.10.0 nghttp2/1.47.0
HTTP: These HTTP headers were set: Accept: text/plain, application/json
GDAL: GDALOpen(https://demo.pygeoapi.io/stable/collections/obs/items, this=0x557748f5ff30) succeeds as GeoJSON.
INFO: Open of `https://demo.pygeoapi.io/stable/collections/obs/items'
using driver `GeoJSON' successful.
OGR: GetLayerCount() = 1
1: items (Point)
GDAL: GDALClose(https://demo.pygeoapi.io/stable/collections/obs/items, this=0x557748f5ff30)
GDAL: In GDALDestroy - unloading GDAL shared library.
With vsicurl it fails:
$ ogrinfo -ro -so --debug on '/vsicurl/https://demo.pygeoapi.io/stable/collections/obs/items'
HTTP: libcurl/7.87.0 OpenSSL/3.1.1 zlib/1.2.13 libssh2/1.10.0 nghttp2/1.47.0
Warning 1: HTTP response code on https://demo.pygeoapi.io/stable/collections/obs/items: 500
VSICURL: GetFileSize(https://demo.pygeoapi.io/stable/collections/obs/items)=0 response_code=500
VSICURL: Request at offset 0, after end of file
VSICURL: GetFileList(/vsicurl/https://demo.pygeoapi.io/stable/collections/obs)
FAILURE:
Unable to open datasource `/vsicurl/https://demo.pygeoapi.io/stable/collections/obs/items' with the following drivers.
...
With use_head=no it works again:
$ ogrinfo -ro -so --debug on '/vsicurl?use_head=no&url=https://demo.pygeoapi.io/stable/collections/obs/items'
HTTP: libcurl/7.87.0 OpenSSL/3.1.1 zlib/1.2.13 libssh2/1.10.0 nghttp2/1.47.0
VSICURL: Downloading 0-16383 (https://demo.pygeoapi.io/stable/collections/obs/items)...
VSICURL: Got response_code=200
GeoJSON: First pass: 100.00 %
GDAL: GDALOpen(/vsicurl?use_head=no&url=https://demo.pygeoapi.io/stable/collections/obs/items, this=0x55df11fdc180) succeeds as GeoJSON.
INFO: Open of `/vsicurl?use_head=no&url=https://demo.pygeoapi.io/stable/collections/obs/items'
using driver `GeoJSON' successful.
OGR: GetLayerCount() = 1
1: items (Point)
GDAL: GDALClose(/vsicurl?use_head=no&url=https://demo.pygeoapi.io/stable/collections/obs/items, this=0x55df11fdc180)
GDAL: In GDALDestroy - unloading GDAL shared library.
But even that is not a guarantee; for example, this URL fails regardless:
$ ogrinfo -ro -so --debug on "/vsicurl?use_head=no&url=https://gis-calema.opendata.arcgis.com/datasets/59d92c1bf84a438d83f78465dce02c61_0.geojson"
VSICURL: GetFileList(/vsicurl?use_head=no&url=https://gis-calema.opendata.arcgis.com/datasets)
HTTP: libcurl/7.87.0 OpenSSL/3.1.1 zlib/1.2.13 libssh2/1.10.0 nghttp2/1.47.0
VSICURL: GetFileSize(https://gis-calema.opendata.arcgis.com/datasets/59d92c1bf84a438d83f78465dce02c61_0.geojson)=0 response_code=200
VSICURL: Request at offset 0, after end of file
VSICURL: Request at offset 0, after end of file
VSICURL: GetFileSize(https://gis-calema.opendata.arcgis.com/datasets/59d92c1bf84a438d83f78465dce02c61_0.geojso1)=44601 response_code=200
VSICURL: Downloading 0-16383 (https://gis-calema.opendata.arcgis.com/datasets/59d92c1bf84a438d83f78465dce02c61_0.geojso1)...
VSICURL: Got response_code=200
VSICURL: Got more data than expected : 44601 instead of 16384
FAILURE:
Unable to open datasource `/vsicurl?use_head=no&url=https://gis-calema.opendata.arcgis.com/datasets/59d92c1bf84a438d83f78465dce02c61_0.geojson' with the following drivers.
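The /vsicurl?use_head=no&url=... form used in the commands above can also be assembled programmatically. A minimal sketch, assuming a hypothetical helper `vsicurl_with_options` (not a pyogrio or GDAL function); it mirrors the unencoded form used in this thread, whereas URLs that themselves contain `&` or `=` would likely need URL-encoding first:

```python
def vsicurl_with_options(url, **options):
    """Build a '/vsicurl?opt=val&url=...' path (hypothetical helper).

    Option names such as use_head or empty_dir are passed through as-is;
    nothing here checks that GDAL actually recognises them.
    """
    parts = [f"{key}={value}" for key, value in options.items()]
    # The url parameter goes last, as in the ogrinfo commands above.
    parts.append(f"url={url}")
    return "/vsicurl?" + "&".join(parts)


print(vsicurl_with_options(
    "https://demo.pygeoapi.io/stable/collections/obs/items", use_head="no"
))
```

As the last log shows, setting such an option is not guaranteed to make a given URL readable, so a fallback to the plain URL may still be needed.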
Trying to summarize my exploration on this topic: it's unfortunately just not that simple. URLs / servers have various capabilities, the vsicurl file system has various options that might need to be configured to be able to read a certain URL (https://gdal.org/user/virtual_file_systems.html#vsicurl-http-https-ftp-files-random-access), etc.

- GDAL can read plain URLs (without /vsicurl). GDAL will then not use a VSI but do a simple HTTP Fetch of the data. This however doesn't work or is sub-optimal in some cases:
  - For zipped files, you need the /vsizip handler to be able to read from the archive on the fly without decompressing beforehand. And it seems that just adding vsizip doesn't work with URLs (/vsizip/https://...), so in that case you have to add /vsizip//vsicurl/ (or /vsizip/{/vsicurl/..} without .zip extension).
  - For large files, you want the /vsicurl file system handler to enable random access into the files (e.g. original driver to change this in geopandas: geopandas/geopandas#2795).
- Always adding /vsicurl/ in front of any URL (as we currently do, and fiona as well) is clearly not the best default behaviour, as there are many URLs that cannot be read this way (either they don't work with vsicurl at all or they need some extra curl option such as /vsicurl?use_head=no or /vsicurl?empty_dir=yes).

The approach I am currently taking in geopandas (https://github.com/geopandas/geopandas/pull/2914) to try to fix this is to check whether the response headers for the URL indicate support for reading ranges from the file ("Accept-Ranges"). That keeps the improvement from 0.13 of using /vsicurl for large files that support this, instead of downloading them upfront in geopandas itself, while still downloading upfront for other URLs.
We could do something similar in pyogrio as well. But whatever we "infer" by default, we should probably also give some more control over what is done (e.g. allow disabling this)?
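To make the idea concrete, here is a rough sketch of an "Accept-Ranges" check combined with a user override. This is not the code from the geopandas PR; the function names and the `use_vsicurl` parameter are hypothetical, and zip handling is left out:

```python
import urllib.request


def supports_range_requests(headers):
    """Return True if the response headers advertise byte-range support."""
    # Header names are case-insensitive, so normalise before looking up.
    normalised = {str(k).lower(): str(v) for k, v in headers.items()}
    return normalised.get("accept-ranges", "").strip().lower() == "bytes"


def fetch_headers(url, timeout=10):
    """Issue a HEAD request and return the response headers as a dict.

    Network and HTTP errors (for example a server answering HEAD with 500)
    yield an empty dict, which callers treat as "no range support".
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return dict(response.headers.items())
    except OSError:
        return {}


def resolve_path(url, use_vsicurl=None, headers=None):
    """Pick between the plain URL and /vsicurl/ (hypothetical helper).

    use_vsicurl=True/False forces the choice; None infers it from the
    headers, which are fetched with a HEAD request unless passed in.
    """
    if use_vsicurl is None:
        if headers is None:
            headers = fetch_headers(url)
        use_vsicurl = supports_range_requests(headers)
    return f"/vsicurl/{url}" if use_vsicurl else url


# Headers are passed in here so the example runs without network access.
print(resolve_path("https://example.com/data.parquet",
                   headers={"Accept-Ranges": "bytes"}))
```

Keeping the override as a tri-state (True, False, None) lets the inferred default stay in place while still allowing users to force either behaviour for servers that misreport their capabilities.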
See GeoPandas #2908 and the issues related to GeoPandas #2796 on handling URLs that lack an extension in the filename.