NASA-IMPACT / veda-backend

Backend services for VEDA

Tiling/AWS Session problems in veda-dev #192

Open anayeaye opened 1 year ago

anayeaye commented 1 year ago

What

We are attempting to promote a large diff from our dev backend to staging, but have encountered problems with the maps in the discovery and explore views of a dashboard preview running against the development backend. This is a hard error to document because a request for /cog/info that fails on the first attempt succeeds on the second (I've seen that before but don't recall our answer at the moment). I suspect at least part of the solution lies in our raster-api GDAL environment; perhaps the configuration has drifted?

Dashboard preview:

https://deploy-preview-281--visex.netlify.app/

Mosaic examples

Failing mosaic in dev

https://dev-raster.delta-backend.com/mosaic/tiles/795277e64375a264bf3f73506a6cd2d0/WebMercatorQuad/2/0/1@1x?assets=cog_default&resampling=bilinear&bidx=1&colormap_name=rdylbu_r&rescale=0,1

First try: '/vsis3/veda-data-store-staging/OMSO2PCA-COG/OMSO2PCA_LUT_SCD_2005.tif' does not exist in the file system, and is not recognized as a supported dataset name.

Second attempt after executing /cog/info: Read or write failed. IReadBlock failed at X offset 0, Y offset 0: /vsis3/veda-data-store-staging/OMSO2PCA-COG/OMSO2PCA_LUT_SCD_2005.tif, band 1: IReadBlock failed at X offset 0, Y offset 0: TIFFReadEncodedTile() failed.

Mosaic works in staging

https://staging-raster.delta-backend.com/mosaic/tiles/795277e64375a264bf3f73506a6cd2d0/WebMercatorQuad/2/0/1@1x?assets=cog_default&resampling=bilinear&bidx=1&colormap_name=rdylbu_r&rescale=0,1

COG info examples

Note: we are unable to read COG info for the file used in the mosaic, but we can access other files in the same collection, so it is not purely a permissions issue.

https://dev-raster.delta-backend.com/cog/info?url=s3://veda-data-store-staging/OMSO2PCA-COG/OMSO2PCA_LUT_SCD_2005.tif

On first attempt: '/vsis3/veda-data-store-staging/OMSO2PCA-COG/OMSO2PCA_LUT_SCD_2005.tif' does not exist in the file system, and is not recognized as a supported dataset name.

On second attempt: endpoint returns cog/info

These are yearly COGs so the error should be reproducible by incrementing the date in the tif name.
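A minimal sketch of building the per-year /cog/info requests for reproduction. The year range is an assumption (only 2005 appears in this issue), and `cog_info_urls` is a hypothetical helper, not part of veda-backend.

```python
from urllib.parse import urlencode

RASTER_API = "https://dev-raster.delta-backend.com"
COG_TEMPLATE = "s3://veda-data-store-staging/OMSO2PCA-COG/OMSO2PCA_LUT_SCD_{year}.tif"


def cog_info_urls(years):
    """Build one /cog/info request URL per yearly COG (hypothetical helper)."""
    urls = []
    for year in years:
        # Keep ':' and '/' unescaped so the URLs match the ones in this issue.
        query = urlencode({"url": COG_TEMPLATE.format(year=year)}, safe=":/")
        urls.append(f"{RASTER_API}/cog/info?{query}")
    return urls


# The 2005-2007 range is illustrative only; adjust to the years that exist.
for url in cog_info_urls(range(2005, 2008)):
    print(url)
```

Requesting each URL twice in a row would show whether the fail-then-succeed pattern repeats for every year.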

COG Tiles example

We already know that /cog is handling the env; this tiles example works as expected: https://dev-raster.delta-backend.com/cog/tiles/WebMercatorQuad/0/0/0@1x?url=s3://veda-data-store-staging/OMSO2PCA-COG/OMSO2PCA_LUT_SCD_2005.tif&bidx=1&rescale=0,1

Stack Notes

We cannot make a one to one comparison of the dev and staging veda-backend stacks because we have upgraded the version of pgstac for the dev database but not staging.

Similarities

Differences

smohiudd commented 1 year ago

@anayeaye this is the issue we encountered before where the endpoint was failing intermittently. The problem that time was the creds weren't being passed to gdal. This is what that fix looked like: https://github.com/NASA-IMPACT/veda-backend/pull/144/files

The error that we're seeing now:

"detail": "'/vsis3/veda-data-store-staging/EIS/COG/coastal-flooding-and-slr/MODIS_LC_2001_BD_v2.cog.tif' does not exist in the file system, and is not recognized as a supported dataset name."

looks very similar to what we saw in the previous issue.

smohiudd commented 1 year ago

In the PR we had some changes to how GDAL envs are passed through titiler based on the 0.7.0 breaking changes: https://github.com/developmentseed/titiler/blob/main/CHANGES.md#070-2022-06-08

Do we know if these gdal config changes were tested in dev?

anayeaye commented 1 year ago

Just for the record (no new insights): I tried some pinning in the raster-api. These changes did not solve our problem and the dev deployment is reverted to the current develop branch.

First, to be extra sure we weren't getting the breaking version of starlette. This looked promising because there is a subtle difference between the cold-start true/false conditions that causes slightly different results on repeated tries of the same request (examples in the issue description):

"fastapi>=0.87,<0.92",
"starlette>=0.21.0,<0.25",

And, on a whim, to see if the recent release of rasterio was related to our woes:

"rasterio<1.3.8",

So our current condition remains: /cog routes are happily using the sts assume role session credentials, /mosaic and /stac endpoints are not. I don't see where the divergence happens--I'm pretty sure they all have titiler core's BaseTilerFactory underneath.

vincentsarago commented 1 year ago

mosaic and stac may use another level of threading, which might explain why the environment is not the same. I've been trying to tackle this issue for a while without success.

Can you test by setting RIO_TILER_MAX_THREADS=1 and MOSAIC_CONCURRENCY=1 (this in theory will remove any multi-threading)?

ref: https://github.com/developmentseed/titiler/issues/186
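The threading hypothesis can be illustrated with the standard library alone. This is a sketch of the general mechanism only, not a confirmed diagnosis: it assumes the relevant configuration lives in thread-local storage, and `local_env` and `read_credentials` are hypothetical stand-ins rather than rasterio or rio-tiler names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for per-thread GDAL/rasterio configuration.
local_env = threading.local()


def read_credentials():
    """Return the credentials visible to the calling thread, or None."""
    return getattr(local_env, "credentials", None)


# The request thread sets up its session credentials...
local_env.credentials = "assumed-role-session"
seen_in_main = read_credentials()

# ...but a worker thread starts with an empty thread-local namespace.
with ThreadPoolExecutor(max_workers=1) as pool:
    seen_in_worker = pool.submit(read_credentials).result()

print(seen_in_main)    # assumed-role-session
print(seen_in_worker)  # None
```

If the extra threading layer in /mosaic and /stac were the only cause, forcing both settings to 1 would be expected to make the error disappear.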

anayeaye commented 1 year ago

With RIO_TILER_MAX_THREADS=1 (our current deployment already sets this) and MOSAIC_CONCURRENCY=1 (set just now), I'm still seeing an Access Denied 403 on the first hit, followed by a "does not exist in the file system" error on retries.

EDIT/note: I've now reverted the lambda environment to match the env variables stored for github actions: RIO_TILER_MAX_THREADS=1; MOSAIC_CONCURRENCY is unset.

ranchodeluxe commented 1 year ago

@vincentsarago: My hacky fix for this issue created a success case for veda-backend and shows where I believe the issue resides. I wish I would've seen this convo yesterday b/c it would've saved hours 😆

The issue explained:

vincentsarago commented 1 year ago

Thanks so much @ranchodeluxe for this deep dive. This is definitely a bug that we should fix at rio-tiler level

I wonder if using a combination of https://github.com/rasterio/rasterio/blob/main/rasterio/env.py#L328C1-L339C1 to get the options in the environment and forward them to a new Env will work 🤷
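A rough sketch of the "read the current options and forward them to a new Env" idea, using only the standard library. The `env` context manager and `current_options` function are hypothetical stand-ins for the rasterio pieces; this shows the forwarding pattern, not the actual rio-tiler fix.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from contextlib import contextmanager

_local = threading.local()


def current_options():
    """Return a copy of the options active in the calling thread."""
    return dict(getattr(_local, "options", {}))


@contextmanager
def env(**options):
    """Hypothetical stand-in for an Env: layer options for this thread only."""
    previous = current_options()
    _local.options = {**previous, **options}
    try:
        yield
    finally:
        _local.options = previous


def worker(forwarded):
    # Re-create the environment inside the worker from the forwarded options.
    with env(**forwarded):
        return current_options()


with env(AWS_ACCESS_KEY_ID="<key>", AWS_SECRET_ACCESS_KEY="<secret>"):
    captured = current_options()  # read the options in the parent thread
    with ThreadPoolExecutor(max_workers=1) as pool:
        without_forwarding = pool.submit(current_options).result()
        with_forwarding = pool.submit(worker, captured).result()

print(without_forwarding)  # {}
print(with_forwarding)
```

The design choice here is to capture the options in the thread that owns the environment and pass them explicitly, rather than relying on the worker to inherit them.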

vincentsarago commented 1 year ago

FYI this can be demonstrated simply with:

import rasterio
from rasterio.session import AWSSession

# Outer Env carries an explicit AWS session.
with rasterio.Env(
    session=AWSSession(
        aws_access_key_id="MyDevseedId",
        aws_secret_access_key="MyDevseedKey",
    )
):
    with rasterio.open("s3://ds-satellite/cogs/NaturalEarth/world_grey.tif") as src:
        print(src.profile)

    # Nested Env created without options; this open raises the error shown below.
    with rasterio.Env():
        with rasterio.open("s3://ds-satellite/cogs/NaturalEarth/world_grey_1024_512.tif") as src:
            print(src.profile)

{'driver': 'GTiff', 'dtype': 'uint8', 'nodata': None, 'width': 21580, 'height': 10780, 'count': 3, 'crs': CRS.from_epsg(4326), 'transform': Affine(0.01666666666667, 0.0, -179.8333333333333,
       0.0, -0.01666666666667, 89.83333333333331), 'blockxsize': 128, 'blockysize': 128, 'tiled': True, 'compress': 'jpeg', 'interleave': 'pixel', 'photometric': 'ycbcr'}

rasterio/_base.pyx in rasterio._base.DatasetBase.__init__()

RasterioIOError: Access Denied

vincentsarago commented 1 year ago

ok, I may have a fix for this but it will require a full rio-tiler/titiler/titiler-pgstac update

I see veda raster-api is a bit behind the current versions (it is on titiler-pgstac=0.2.3 / titiler 0.10.2). Ideally I'll release titiler-pgstac 0.5 and titiler 0.12 with a new rio-tiler 4.2.

The move from titiler-pgstac 0.2.3 to 0.5 will have a couple of breaking changes:

Now

{endpoint}/collections/collection1/items/item1/info

Before

{endpoint}/mosaic/tiles/20200307aC0853900w361030/0/0/0

Now

{endpoint}/mosaic/20200307aC0853900w361030/tiles/0/0/0


- https://github.com/stac-utils/titiler-pgstac/blob/0.4.1/CHANGES.md#040-2023-05-22

Before

/{searchid}/{z}/{x}/{y}/assets

Now

/{searchid}/tiles/{z}/{x}/{y}/assets


- https://github.com/stac-utils/titiler-pgstac/blob/0.4.1/CHANGES.md#041-2023-06-21

rename add_map_viewer to add_viewer option in MosaicTilerFactory for consistency with titiler's options
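For clients that still build old-style mosaic tile URLs, the route change listed above could be handled with a small rewrite. This is a sketch based only on the Before/Now pair for the mosaic tiles route; `to_new_mosaic_tile_path` is a hypothetical helper, not part of titiler-pgstac.

```python
import re

# Matches the old layout: .../mosaic/tiles/{searchid}/...
_OLD_TILES = re.compile(
    r"^(?P<prefix>.*/mosaic)/tiles/(?P<searchid>[^/]+)/(?P<rest>.+)$"
)


def to_new_mosaic_tile_path(path):
    """Move the search id in front of the `tiles` segment (hypothetical helper)."""
    match = _OLD_TILES.match(path)
    if match is None:
        return path  # not an old-style mosaic tile path; leave it untouched
    return "{prefix}/{searchid}/tiles/{rest}".format(**match.groupdict())


print(to_new_mosaic_tile_path("/mosaic/tiles/20200307aC0853900w361030/0/0/0"))
# /mosaic/20200307aC0853900w361030/tiles/0/0/0
```

Paths that are already in the new layout pass through unchanged, so the helper can be applied to every request path during a migration.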

ranchodeluxe commented 1 year ago

> FYI this can be simply demo with [...] RasterioIOError: Access Denied

I'm confused as to why rasterio is operating this way in the same thread. Based on the source code it should be picking these things up!: https://github.com/rasterio/rasterio/blob/main/rasterio/env.py#L272-L291

vincentsarago commented 1 year ago

Even in the same thread it seems the session is not forwarded. I'm opening an issue in rasterio because to me it seems to be a Bug

ranchodeluxe commented 1 year ago

> Even in the same thread it seems the session is not forwarded. I'm opening an issue in rasterio because to me it seems to be a Bug

Yeah, based on the code I'm reading it is a bug

ranchodeluxe commented 1 year ago

@vincentsarago: In a single thread, a nested rasterio.Env DOES find the previous environment. The exact same thing works fine for me below, though not against the same S3 endpoint (I don't have a DS AWS account). Can you double-check that you don't have any existing AWS_* OS environment variables exported, and remove them if you do?

import rasterio
import pprint

session = {
    "session": rasterio.session.AWSSession(
        aws_access_key_id="<blah>",
        aws_secret_access_key="<blah>",
        aws_session_token="<blah>",
    )
}

with rasterio.Env(**session) as rioenv1:
    print('########### rioenv1 ###########')
    pprint.pprint(rioenv1.options, indent=4)
    with rasterio.open("s3://veda-data-store-staging/geoglam/CropMonitor_202001.tif") as src:
        pprint.pprint(src.profile, indent=4)
    with rasterio.Env() as rioenv2:
        print('########### rioenv2 ###########')
        pprint.pprint(rioenv2.options, indent=4)
        with rasterio.open("s3://veda-data-store-staging/geoglam/CropMonitor_202001.tif") as src:
            pprint.pprint(src.profile, indent=4)

vincentsarago commented 1 year ago

########### rioenv1 ###########
{   'AWS_ACCESS_KEY_ID': '<blah>',
    'AWS_REGION': 'us-east-1',
    'AWS_SECRET_ACCESS_KEY': '<blah>'}

{   'blockxsize': 128,
    'blockysize': 128,
    'compress': 'jpeg',
    'count': 3,
    'crs': CRS.from_epsg(4326),
    'driver': 'GTiff',
    'dtype': 'uint8',
    'height': 10780,
    'interleave': 'pixel',
    'nodata': None,
    'photometric': 'ycbcr',
    'tiled': True,
    'transform': Affine(0.01666666666667, 0.0, -179.8333333333333,
       0.0, -0.01666666666667, 89.83333333333331),
    'width': 21580}

########### rioenv2 ###########
{}

{   'blockxsize': 512,
    'blockysize': 512,
    'compress': 'jpeg',
    'count': 3,
    'crs': CRS.from_epsg(4326),
    'driver': 'GTiff',
    'dtype': 'uint8',
    'height': 10780,
    'interleave': 'pixel',
    'nodata': None,
    'photometric': 'ycbcr',
    'tiled': True,
    'transform': Affine(0.01666666666667, 0.0, -179.8333333333333,
       0.0, -0.01666666666667, 89.83333333333331),
    'width': 21580}

Note: the second call should fail, but because my default AWS profile is devseed, it works 😅
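Since ambient credentials can mask the bug, a quick check before running the reproduction may help. This sketch only inspects exported AWS_* environment variables; it does not detect a default profile in the shared AWS config files, and `ambient_aws_settings` is a hypothetical helper.

```python
import os


def ambient_aws_settings(environ=None):
    """Return the sorted names of AWS_* variables in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return sorted(name for name in environ if name.startswith("AWS_"))


found = ambient_aws_settings()
if found:
    print("Ambient AWS settings that may mask a credentials bug:", ", ".join(found))
else:
    print("No AWS_* variables exported.")
```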

vincentsarago commented 1 year ago

@ranchodeluxe feel free to add more comments in the rasterio ticket 🙏

ranchodeluxe commented 1 year ago

> @ranchodeluxe feel free to add more comments in the rasterio ticket 🙏

will do, but I have to build my rasterio image and want to do it as a test case for them

moradology commented 8 months ago

After asking around, it appears this has been resolved for the time being. The ultimate fix is in rasterio, so the next step is bumping rasterio versions once the next release is cut (>1.3.9)