geopandas / pyogrio

Vectorized vector I/O using OGR
https://pyogrio.readthedocs.io
MIT License

http 403 on intermediate files when reading GPKG from S3 #413

Open gtmaskall opened 1 month ago

gtmaskall commented 1 month ago

I'm testing reading vector data from S3. s3fs is installed in my environment. I've created a public bucket with a bucket policy granting any principal the s3:GetObject action to objects in the bucket. I'm deliberately avoiding access and secret access keys because the intention is to enable access via roles on the hosting EC2 instance. Thus, I'm specifying:

from pyogrio import set_gdal_config_options
set_gdal_config_options(
    {'AWS_NO_SIGN_REQUEST': True}
)

The pyogrio engine does successfully return the data, but with RuntimeWarnings:

test_vec_s3_path = "s3://BKTNAME/watershed_results_ndr_prelim.gpkg"
test_vec_s3 = gp.read_file(test_vec_s3_path, engine="pyogrio")
/home/guy/anaconda3/envs/test_s3/lib/python3.12/site-packages/pyogrio/raw.py:196: RuntimeWarning: HTTP response code on https://BKTNAME.s3.eu-west-2.amazonaws.com/watershed_results_ndr_prelim.gpkg-journal: 403
  return ogr_read(
/home/guy/anaconda3/envs/test_s3/lib/python3.12/site-packages/pyogrio/raw.py:196: RuntimeWarning: HTTP response code on https://BKTNAME.s3.eu-west-2.amazonaws.com/watershed_results_ndr_prelim.gpkg-wal: 403
  return ogr_read(
/home/guy/anaconda3/envs/test_s3/lib/python3.12/site-packages/pyogrio/raw.py:196: RuntimeWarning: HTTP response code on https://BKTNAME.s3.eu-west-2.amazonaws.com/watershed_results_ndr_prelim.gpkg.aux.xml: 403
  return ogr_read(
/home/guy/anaconda3/envs/test_s3/lib/python3.12/site-packages/pyogrio/raw.py:196: RuntimeWarning: HTTP response code on https://BKTNAME.s3.eu-west-2.amazonaws.com/watershed_results_ndr_prelim.aux: 403
  return ogr_read(
/home/guy/anaconda3/envs/test_s3/lib/python3.12/site-packages/pyogrio/raw.py:196: RuntimeWarning: HTTP response code on https://BKTNAME.s3.eu-west-2.amazonaws.com/watershed_results_ndr_prelim.AUX: 403
  return ogr_read(
/home/guy/anaconda3/envs/test_s3/lib/python3.12/site-packages/pyogrio/raw.py:196: RuntimeWarning: HTTP response code on https://BKTNAME.s3.eu-west-2.amazonaws.com/watershed_results_ndr_prelim.gpkg.aux: 403
  return ogr_read(
/home/guy/anaconda3/envs/test_s3/lib/python3.12/site-packages/pyogrio/raw.py:196: RuntimeWarning: HTTP response code on https://BKTNAME.s3.eu-west-2.amazonaws.com/watershed_results_ndr_prelim.gpkg.AUX: 403
  return ogr_read(

fiona also hits a 403, but bombs out and doesn't return data:

test_vec_s3_path = "s3://BKTNAME/watershed_results_ndr_prelim.gpkg"
test_vec_s3 = gp.read_file(test_vec_s3_path, engine="fiona")
...
DriverError: b'HTTP response code on https://BKTNAME.s3.eu-west-2.amazonaws.com/watershed_results_ndr_prelim.gpkg-wal: 403'

Is this use case an anti-pattern? Is pyogrio probing for files with these extensions so it can use them if they exist (with the 403s an unfortunate side effect), or is it trying to create them as intermediate files? I'm assuming the former, given the reference to ogr_read().

If it's just a warning, that's unfortunate but tolerable. But should I allow write access so something (GDAL?) can create these intermediate files, if there's an advantage to pyogrio when reading data?
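For what it's worth, if these 403s do turn out to be harmless probes, a message-targeted filter would silence just them without hiding other RuntimeWarnings (a hypothetical workaround, not part of the pyogrio API):

```python
import warnings

# Hypothetical workaround: ignore only the 403 sidecar-probe warnings,
# leaving all other RuntimeWarnings visible.
warnings.filterwarnings(
    "ignore",
    message=r"HTTP response code on .*: 403",
    category=RuntimeWarning,
)

# ... then read as before, e.g.:
# test_vec_s3 = gp.read_file(test_vec_s3_path, engine="pyogrio")
```

The `message` argument is a regex matched against the start of the warning text, so unrelated 403s on other URLs would still be suppressed; a tighter pattern could pin it to the bucket in question.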

gtmaskall commented 1 month ago

Further, I get something similar when reading raster data using xarray/rioxarray, viz.

dem_s3_path = "s3://BKTNAME/filled_colne_dem_new_nodata.tiff"
dem_s3 = xr.load_dataarray(dem_s3_path)
...
/home/guy/anaconda3/envs/test_s3/lib/python3.12/site-packages/rioxarray/_io.py:430: RuntimeWarning: HTTP response code on https://BKTNAME.s3.eu-west-2.amazonaws.com/filled_colne_dem_new_nodata.tiff.msk: 403
  out = riods.read(band_key, window=window, masked=self.masked)
/home/guy/anaconda3/envs/test_s3/lib/python3.12/site-packages/rioxarray/_io.py:430: RuntimeWarning: HTTP response code on https://BKTNAME.s3.eu-west-2.amazonaws.com/filled_colne_dem_new_nodata.tiff.MSK: 403
  out = riods.read(band_key, window=window, masked=self.masked)

So, should I really add an s3:PutObject allow (and presumably an s3:DeleteObject too, since I assume these are temporary working files), or are these warnings that can be ignored?

Corollary: do these calls generally require write access to the local filesystem when you read data?

jorisvandenbossche commented 1 month ago

@gtmaskall this is probably more a question for GDAL (and how it connects with S3 exactly). I also see an initial warning when trying this with the GDAL command line (before failing because I don't have access):

$ ogrinfo --config AWS_NO_SIGN_REQUEST=NO -ro -al -so /vsis3/BKTNAME/watershed_results_ndr_prelim.gpkg
Warning 1: HTTP response code on https://BKTNAME.s3.amazonaws.com/watershed_results_ndr_prelim.gpkg: 403
...

To understand better which requests are being made, you could maybe set the CPL_CURL_VERBOSE=YES env variable (https://gdal.org/user/configoptions.html#logging)
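A sketch of what that might look like from Python (the variable has to be set before GDAL makes its first request):

```python
import os

# Set before GDAL opens anything, so its curl layer logs every HTTP
# request, including the probes for .gpkg-wal / .aux.xml sidecar files.
os.environ["CPL_CURL_VERBOSE"] = "YES"

# import geopandas as gp
# gp.read_file("s3://BKTNAME/watershed_results_ndr_prelim.gpkg", engine="pyogrio")
```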

rouault commented 1 month ago

you may want to set the GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR config option (https://github.com/OSGeo/gdal/issues/9443#issuecomment-1988483122 / https://gdal.org/user/configoptions.html#performance-and-caching) to prevent GDAL from issuing a directory-listing HTTP request. That might not be sufficient here, so you may also need to set CPL_VSIL_CURL_ALLOWED_EXTENSIONS to ".gpkg" (cf https://gdal.org/user/configoptions.html#networking-options) so GDAL doesn't probe for the sidecar files.
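Combined with the no-sign setting from the original report, that could look like the sketch below. These are standard GDAL config options, settable either as environment variables (as here, before the first read) or via pyogrio's set_gdal_config_options as in the first comment:

```python
import os

# Sketch: all three GDAL config options as environment variables,
# set before the first read so GDAL honours them.
os.environ.update({
    "AWS_NO_SIGN_REQUEST": "YES",                 # public bucket, no credentials
    "GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR",  # skip the directory listing
    "CPL_VSIL_CURL_ALLOWED_EXTENSIONS": ".gpkg",  # don't probe for sidecar files
})

# import geopandas as gp
# gp.read_file("s3://BKTNAME/watershed_results_ndr_prelim.gpkg", engine="pyogrio")
```

Note that CPL_VSIL_CURL_ALLOWED_EXTENSIONS restricts which URLs GDAL will even attempt, so it would also suppress the .gpkg-wal / .aux.xml probes that triggered the 403 warnings above.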