corteva / rioxarray

geospatial xarray extension powered by rasterio
https://corteva.github.io/rioxarray

Lazy access of jp2 file in private S3 bucket works, but subsequent computation outside of the session fails with HTTP response 403 #816

Open konstntokas opened 1 week ago

konstntokas commented 1 week ago

Code Sample, a copy-pastable example if possible

The code sample below raises a warning with an HTTP response 403. Note that the key and secret for the AWS bucket can be obtained from CDSE.

import rasterio
import rioxarray

uri = (
    "s3://eodata/Sentinel-2/MSI/L2A/2020/07/05/S2B_MSIL2A_20200705T101559_N0214_R065"
    "_T32TMT_20200705T135630.SAFE/GRANULE/L2A_T32TMT_A017394_20200705T101917/"
    "IMG_DATA/R10m/T32TMT_20200705T101559_B02_10m.jp2"
)
session = rasterio.session.AWSSession(
    aws_unsigned=False,
    endpoint_url="eodata.dataspace.copernicus.eu",
    aws_access_key_id="xxx",
    aws_secret_access_key="xxx",
)
with rasterio.env.Env(session=session, AWS_VIRTUAL_HOSTING=False):
    ds = rioxarray.open_rasterio(uri, chunks=dict(x=1024, y=1024))

mean = ds.mean()
print(mean)
print(mean.compute())

The output is given in the following cell. When the dask graph is computed in the last line, print(mean.compute()), the actual data needs to be accessed, which raises the warning. In larger examples it raises the error 'Aborting load due to failure while reading'.

<xarray.DataArray ()> Size: 8B
dask.array<mean_agg-aggregate, shape=(), dtype=float64, chunksize=(), chunktype=numpy.ndarray>
Coordinates:
    spatial_ref  int64 8B 0
Warning 1: HTTP response code on https://eodata.s3.us-east-2.amazonaws.com/Sentinel-2/MSI/L2A/2020/07/05/S2B_MSIL2A_20200705T101559_N0214_R065_T32TMT_20200705T135630.SAFE/GRANULE/L2A_T32TMT_A017394_20200705T101917/IMG_DATA/R10m/T32TMT_20200705T101559_B02_10m.jp2.msk: 403
Warning 1: HTTP response code on https://eodata.s3.us-east-2.amazonaws.com/Sentinel-2/MSI/L2A/2020/07/05/S2B_MSIL2A_20200705T101559_N0214_R065_T32TMT_20200705T135630.SAFE/GRANULE/L2A_T32TMT_A017394_20200705T101917/IMG_DATA/R10m/T32TMT_20200705T101559_B02_10m.jp2.MSK: 403
<xarray.DataArray ()> Size: 8B
array(802.02040494)
Coordinates:
    spatial_ref  int64 8B 0

Process finished with exit code 0

When the last line is run inside the rasterio Env, it works just fine.

import rasterio
import rioxarray

uri = (
    "s3://eodata/Sentinel-2/MSI/L2A/2020/07/05/S2B_MSIL2A_20200705T101559_N0214_R065"
    "_T32TMT_20200705T135630.SAFE/GRANULE/L2A_T32TMT_A017394_20200705T101917/"
    "IMG_DATA/R10m/T32TMT_20200705T101559_B02_10m.jp2"
)
session = rasterio.session.AWSSession(
    aws_unsigned=False,
    endpoint_url="eodata.dataspace.copernicus.eu",
    aws_access_key_id="xxx",
    aws_secret_access_key="xxx",
)
with rasterio.env.Env(session=session, AWS_VIRTUAL_HOSTING=False):
    ds = rioxarray.open_rasterio(uri, chunks=dict(x=1024, y=1024))

mean = ds.mean()
print(mean)

with rasterio.env.Env(session=session, AWS_VIRTUAL_HOSTING=False):
    print(mean.compute())

Problem description

When the dask graph is computed in the last line, print(mean.compute()), the actual data needs to be accessed, which raises the warning. In larger examples it raises 'Aborting load due to failure while reading'.

Expected Output

The access credentials should somehow be saved when the data is opened. Otherwise, a new environment has to be created and applied for every computation that touches the actual data.

Environment Information

python -c "import rioxarray; rioxarray.show_versions()"

rioxarray (0.17.0) deps:
  rasterio: 1.4.1
  xarray: 2024.6.0
  GDAL: 3.9.3
  GEOS: 3.13.0
  PROJ: 9.5.0
  PROJ DATA: /home/konstantin/micromamba/envs/xcube-stac/share/proj
  GDAL DATA: /home/konstantin/micromamba/envs/xcube-stac/share/gdal

Other python deps:
  scipy: 1.14.1
  pyproj: 3.7.0

System:
  python: 3.12.7 | packaged by conda-forge | (main, Oct 4 2024, 16:05:46) [GCC 13.3.0]
  executable: /home/konstantin/micromamba/envs/xcube-stac/bin/python
  machine: Linux-6.8.0-47-generic-x86_64-with-glibc2.35

Installation method

Conda environment information (if you installed with conda):


Environment (micromamba list):

```
$ micromamba list | grep -E "rasterio|xarray|gdal"
gdal                   3.9.3      py312h1299960_0  conda-forge
libgdal                3.9.3      ha770c72_0       conda-forge
libgdal-core           3.9.3      hd5b9bfb_0       conda-forge
libgdal-fits           3.9.3      h2db6552_0       conda-forge
libgdal-grib           3.9.3      hc3b29a1_0       conda-forge
libgdal-hdf4           3.9.3      hd5ecb85_0       conda-forge
libgdal-hdf5           3.9.3      h6283f77_0       conda-forge
libgdal-jp2openjpeg    3.9.3      h1b2c38e_0       conda-forge
libgdal-kea            3.9.3      h1df15e4_0       conda-forge
libgdal-netcdf         3.9.3      hf2d2f32_0       conda-forge
libgdal-pdf            3.9.3      h600f43f_0       conda-forge
libgdal-pg             3.9.3      h5e77dd0_0       conda-forge
libgdal-postgisraster  3.9.3      h5e77dd0_0       conda-forge
libgdal-tiledb         3.9.3      h4a3bace_0       conda-forge
libgdal-xls            3.9.3      h03c987c_0       conda-forge
rasterio               1.4.1      py312h8456570_0  conda-forge
rioxarray              0.17.0     pyhd8ed1ab_0     conda-forge
xarray                 2024.10.0  pyhd8ed1ab_0     conda-forge
```


Details about micromamba and system (micromamba info):

```
$ micromamba info
libmamba version : 1.5.8
micromamba version : 1.5.8
curl version : libcurl/8.6.0 OpenSSL/3.2.1 zlib/1.2.13 zstd/1.5.5 libssh2/1.11.0 nghttp2/1.58.0
libarchive version : libarchive 3.7.2 zlib/1.2.13 bz2lib/1.0.8 libzstd/1.5.5
envs directories : /home/konstantin/micromamba/envs
package cache : /home/konstantin/micromamba/pkgs
                /home/konstantin/.mamba/pkgs
environment : xcube-stac (active)
env location : /home/konstantin/micromamba/envs/xcube-stac
user config files : /home/konstantin/.mambarc
populated config files :
virtual packages : __unix=0=0
                   __linux=6.8.0=0
                   __glibc=2.35=0
                   __archspec=1=x86_64-v3
channels :
base environment : /home/konstantin/micromamba
platform : linux-64
```

snowman2 commented 1 week ago

That sounds correct. The data is lazy loaded from disk. If you load in all of the data inside the session, then this likely won't be an issue.

konstntokas commented 1 week ago

This I understand. However, the idea is to lazy load the data in a reading routine and only load it later, for example when plotting. Otherwise, every operation that loads the data has to be performed within the environment session.
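One possible way to decouple the reading routine from later computations is to hand the settings to GDAL as process-wide configuration instead of through a scoped Env. The sketch below builds the relevant GDAL /vsis3/ options (AWS_S3_ENDPOINT, AWS_VIRTUAL_HOSTING, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY) and exports them as environment variables. The helper name `gdal_s3_options` is hypothetical, and whether rasterio leaves these variables in effect outside an explicit Env for this bucket is an assumption that has not been verified here.

```python
import os


def gdal_s3_options(endpoint: str, key_id: str, secret: str) -> dict[str, str]:
    """Build GDAL /vsis3/ configuration options for a non-AWS S3 endpoint.

    Hypothetical helper; the option names are GDAL configuration options.
    """
    return {
        "AWS_S3_ENDPOINT": endpoint,     # host only, without a scheme
        "AWS_VIRTUAL_HOSTING": "FALSE",  # path-style URLs: endpoint/bucket/key
        "AWS_ACCESS_KEY_ID": key_id,
        "AWS_SECRET_ACCESS_KEY": secret,
    }


options = gdal_s3_options("eodata.dataspace.copernicus.eu", "xxx", "xxx")

# Assumption: exporting the options process-wide lets GDAL find them on every
# read, including reads triggered later by .compute() outside any Env block.
os.environ.update(options)
```

The trade-off is that the credentials then apply to every S3 access in the process, which may be undesirable when several buckets with different credentials are in use.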

I downloaded one tile of the dataset and stored it in one of our private S3 buckets. When accessing this file, it works as expected: the lazy loading is done within the environment session, and operations that load data can be done outside of the Env.

import rasterio
import rioxarray

uri = "s3://xxx/L2A_T33SXA_20150715T094306_B02_10m.jp2"

session = rasterio.session.AWSSession(
    aws_unsigned=False,
    aws_access_key_id="xxx",
    aws_secret_access_key="xxx",
)
with rasterio.env.Env(session=session):
    ds = rioxarray.open_rasterio(uri, chunks=dict(x=1024, y=1024))

print(ds)
mean = ds.mean()
print(mean)
print(mean.compute())

The only difference is that for the CDSE bucket I need to set an endpoint_url and `AWS_VIRTUAL_HOSTING=False`. Do you think this can have an impact?
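The warning text from the first example gives a hint here. The check below uses only the URL printed in that warning: the request made during compute() went to a virtual-hosted amazonaws.com address rather than to the configured endpoint. This suggests, but does not prove, that endpoint_url and AWS_VIRTUAL_HOSTING=False were no longer in effect at that point.

```python
from urllib.parse import urlparse

# URL copied from the first "Warning 1" line in the output of the first example.
warning_url = (
    "https://eodata.s3.us-east-2.amazonaws.com/Sentinel-2/MSI/L2A/2020/07/05/"
    "S2B_MSIL2A_20200705T101559_N0214_R065_T32TMT_20200705T135630.SAFE/GRANULE/"
    "L2A_T32TMT_A017394_20200705T101917/IMG_DATA/R10m/"
    "T32TMT_20200705T101559_B02_10m.jp2.msk"
)
configured_endpoint = "eodata.dataspace.copernicus.eu"

host = urlparse(warning_url).hostname
uses_configured_endpoint = host == configured_endpoint or host.endswith(
    "." + configured_endpoint
)
# With virtual hosting, the bucket name becomes the first label of the host.
bucket_in_host = host.split(".")[0] == "eodata"

print(host)                      # eodata.s3.us-east-2.amazonaws.com
print(uses_configured_endpoint)  # False
print(bucket_in_host)            # True
```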