developmentseed / titiler-cmr

Dynamic tiles from CMR queries
MIT License

start sketching the CMR mosaic backend #10

Closed vincentsarago closed 5 months ago

vincentsarago commented 5 months ago
from titiler.cmr.backend import CMRBackend

with CMRBackend("C1996881146-POCLOUD") as src:
    assets = src.assets_for_tile(10,10,10)

Granules found: 7903
print(assets)
['https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20020601090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
 'https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20020602090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
 'https://archive.podaac.earthdata.nasa.gov/podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20020603090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc',
 ...
]
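
For readers wondering how a tile index such as (10, 10, 10) can become a spatial filter for a CMR query, here is a minimal sketch. It assumes the standard Web Mercator XYZ tiling scheme; `tile_bounds` is a hypothetical helper written for illustration, not a function from titiler-cmr, and the backend may well compute bounds differently (for example through a tile matrix set library).

```python
import math


def tile_bounds(x: int, y: int, z: int) -> tuple[float, float, float, float]:
    """Return (west, south, east, north) in degrees for a Web Mercator XYZ tile.

    Hypothetical helper for illustration only.
    """
    n = 2 ** z  # number of tiles along each axis at this zoom level

    def lon(col: int) -> float:
        return col / n * 360.0 - 180.0

    def lat(row: int) -> float:
        # Inverse of the spherical Mercator projection for a tile row edge.
        return math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * row / n))))

    # Row y is the northern edge of the tile, row y + 1 the southern edge.
    return lon(x), lat(y + 1), lon(x + 1), lat(y)


west, south, east, north = tile_bounds(10, 10, 10)
# CMR's bounding_box parameter is ordered west,south,east,north.
bounding_box = f"{west},{south},{east},{north}"
```

The resulting string could then be passed as the spatial constraint of a granule search, with any temporal filter added on top.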
vincentsarago commented 5 months ago

Data access

I'm not sure how we should configure the data access.

In earthaccess there are multiple ways: https://earthaccess.readthedocs.io/en/latest/howto/edl/

Right now I've set .data_links(access="direct") for the DataGranule, which (if I understand correctly) will return an S3 URL.

Then, to access this S3 URL, we need AWS credentials. earthaccess provides a way to get credentials with auth.get_s3_credentials, but we need to know the DAAC name 🤷

Should we assume titiler-cmr will run in an environment where we have direct access to the S3 files?

cc @sharkinsspatial @abarciauskas-bgse

sharkinsspatial commented 5 months ago

@vincentsarago This is the concern I was discussing in our call with @abarciauskas-bgse a while back. Does earthaccess have an authentication "escape hatch" that we can use?

The initial goal for us would be to deploy titiler-cmr to a Lambda which will be executed using a role that has direct access credentials for the DAAC. This is the setup we use now with the VEDA titiler instances. To use this approach, I think earthaccess will need an "escape hatch" so that it just assumes that it doesn't need to pass any credentials to boto3 and will just use the execution role. @abarciauskas-bgse can you confirm that the direct access role we have in VEDA can support multiple DAACs (or is it just LPDAAC at the moment).

This will handle our initial NASA-controlled deployments. But we should also consider eventually having another option for non-NASA users to use temporary s3_credentials. Because the Cumulus EDL s3_credentials are slow, we've previously supported this with an external credential rotation service that periodically fetches temporary S3 creds for each DAAC and stores them in an SSM parameter that the tiling Lambda can read at runtime.
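
To make the rotation idea concrete, here is a minimal sketch of the caching logic only. It keeps credentials in memory rather than in an SSM parameter, and everything in it is hypothetical: `CredentialCache`, the `fetch` callable (which would wrap a DAAC's s3_credentials endpoint), and the 50-minute default, which assumes the temporary credentials stay valid for roughly an hour.

```python
import time
from typing import Callable, Dict, Tuple

Credentials = Dict[str, str]


class CredentialCache:
    """Reuse per-DAAC temporary credentials until they get too old.

    Hypothetical illustration: `fetch` stands in for a slow call to a DAAC's
    s3_credentials endpoint; `max_age_seconds` is an assumed safety margin.
    """

    def __init__(
        self,
        fetch: Callable[[str], Credentials],
        max_age_seconds: float = 50 * 60,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._max_age = max_age_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Credentials]] = {}

    def get(self, daac: str) -> Credentials:
        now = self._clock()
        entry = self._entries.get(daac)
        # Fetch on first use, or when the cached copy has reached max age.
        if entry is None or now - entry[0] >= self._max_age:
            entry = (now, self._fetch(daac))
            self._entries[daac] = entry
        return entry[1]
```

In a Lambda, the same pattern would more likely read the SSM parameter written by the rotation service instead of calling the endpoint directly, so that tile requests never wait on EDL.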

vincentsarago commented 5 months ago

With the last commit I've implemented two ways to obtain credentials:

import earthaccess
from earthaccess.daac import find_provider
cmr_auth = earthaccess.login(strategy="netrc")

from titiler.cmr.backend import CMRBackend
from titiler.cmr.reader import ZarrReader

with CMRBackend("C1996881146-POCLOUD", cmr_auth, reader=ZarrReader, reader_options={"variable": "analysed_sst"}) as src:
   img = src.tile(4, 4, 4, cmr_query={"temporal": ("2020-02-01", "2020-02-01")}, )

PermissionError: Forbidden

😬 not sure why it doesn't work for now

vincentsarago commented 5 months ago
import earthaccess
cmr_auth = earthaccess.login(strategy="netrc")

from titiler.cmr.backend import aws_s3_credential
import rasterio

s3_credentials = aws_s3_credential(cmr_auth, "POCLOUD")
aws_session = rasterio.session.AWSSession(
   aws_access_key_id=s3_credentials["accessKeyId"],
   aws_secret_access_key=s3_credentials["secretAccessKey"],
   aws_session_token=s3_credentials["sessionToken"],
)

with rasterio.Env(aws_session):
   with rasterio.open("s3://podaac-ops-cumulus-protected/MUR-JPL-L4-GLOB-v4.1/20200201090000-JPL-L4_GHRSST-SSTfnd-MUR-GLOB-v02.0-fv04.1.nc") as src:
      print(src.meta)

RasterioIOError: Access Denied

same with rasterio, so I'm not sure what the credentials are for 🤷

sharkinsspatial commented 5 months ago

@vincentsarago Where are you trying to execute this code from? Though unfortunately not all the DAACs seem to document this when describing their s3_credentials endpoints, the temporary credentials can only be used "in-region" to prevent S3 based operations from incurring an egress cost. So if you want to run this code, it needs to be in an environment in us-west-2 or you'll receive access errors. https://earthaccess.readthedocs.io/en/latest/tutorials/getting-started/#accessing-the-data
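
A quick guard can make this failure mode easier to spot. The sketch below is an assumption-laden illustration: `current_region` and `can_use_direct_access` are hypothetical helpers, and they only inspect the `AWS_REGION` / `AWS_DEFAULT_REGION` environment variables. Lambda sets those automatically, but on a laptop they merely reflect local configuration, so treat the result as a hint rather than proof of being in-region.

```python
import os
from typing import Mapping, Optional

# Region named in the comment above for in-region access.
DATA_REGION = "us-west-2"


def current_region(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Best-effort guess of the AWS region from environment variables."""
    return env.get("AWS_REGION") or env.get("AWS_DEFAULT_REGION")


def can_use_direct_access(env: Mapping[str, str] = os.environ) -> bool:
    """True when the environment claims to run in the data's region."""
    return current_region(env) == DATA_REGION


if not can_use_direct_access():
    print(f"Not in {DATA_REGION}: expect access errors on direct S3 reads")
```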

abarciauskas-bgse commented 5 months ago

@sharkinsspatial

> can you confirm that the direct access role we have in VEDA can support multiple DAACs (or is it just LPDAAC at the moment).

I believe the veda-date-reader-dev role in the smce-veda account has the same access as the nasa-veda-prod role of the VEDA JHub. Using the nasa-veda-prod role, I confirmed it can access PO.DAAC, GESDISC, LPDAAC, ORNL, and NSIDC.

I tested access to GHRC (s3://ghrcw-protected), LADS (s3://prod-lads), ASF (s3://asf-ngap2w-p-s1-ocn-1e29d408/) and got access denied.

I can't find anything for ASDC, CDDIS, OB.DAAC or SEDAC in Earthdata Cloud but I may not be searching everything as I used the organization filter and Earthdata Cloud filter for those DAACs and nothing came up.

I do think we want to use the veda-data-read-dev role or one of the other roles we have whitelisted for access in the deployment so that we don't have to handle rotating credentials and fetching credentials from different endpoints for now.

However, I have low confidence role-based access works with earthaccess at this time. We should help fix this if we can and I can continue to look into it next week but we can also use earthaccess to find the data and then xarray + s3fs to open it ourselves.

See https://github.com/nsidc/earthaccess/issues/431
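
As a rough sketch of the "xarray + s3fs ourselves" fallback: the helpers below are hypothetical (`s3fs_kwargs`, `open_granule`), and the S3 URL is assumed to come from `data_links(access="direct")` as discussed earlier in the thread. When no credentials payload is given, s3fs is created without explicit keys, so botocore falls back to its default credential chain, which in a Lambda would be the execution role. The `h5netcdf` engine is an assumption about the file format.

```python
from typing import Any, Dict, Optional


def s3fs_kwargs(creds: Optional[Dict[str, str]] = None) -> Dict[str, Any]:
    """Map an s3_credentials payload to s3fs.S3FileSystem keyword arguments.

    With no payload, return no keys so the default credential chain
    (for example a Lambda execution role) is used instead.
    """
    if creds is None:
        return {}
    return {
        "key": creds["accessKeyId"],
        "secret": creds["secretAccessKey"],
        "token": creds["sessionToken"],
    }


def open_granule(s3_url: str, variable: str, creds: Optional[Dict[str, str]] = None):
    """Open one variable of a NetCDF granule directly from S3 (in-region only)."""
    # Imported lazily so s3fs_kwargs stays usable without these packages.
    import s3fs
    import xarray as xr

    fs = s3fs.S3FileSystem(**s3fs_kwargs(creds))
    ds = xr.open_dataset(fs.open(s3_url, mode="rb"), engine="h5netcdf")
    return ds[variable]
```

Calling `open_granule(url, "analysed_sst")` with no credentials would exercise the role-based path; passing the dictionary returned by the credentials endpoint would exercise the temporary-credentials path.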

sharkinsspatial commented 5 months ago

@abarciauskas-bgse For clarification, the issue that @vincentsarago is experiencing here is unrelated to direct role-based access (he isn't using the IAM credentials option yet). The first comment https://github.com/developmentseed/titiler-cmr/pull/10#issuecomment-1912628269 is s3fs throwing an access error from the use of https://github.com/developmentseed/titiler-cmr/blob/4b383a5fc07c684c44f443e1190c7b60f1f079a3/titiler/cmr/reader.py#L69-L75. The second comment https://github.com/developmentseed/titiler-cmr/pull/10#issuecomment-1912656467 is from rasterio's underlying boto3 calls throwing an access error. I suspect that both are caused by attempting to use the creds outside of us-west-2.

vincentsarago commented 5 months ago

Yes I was running this locally not on AWS 😅

sharkinsspatial commented 5 months ago

@vincentsarago A few questions on filesystem instantiation

Rather than manage our own fsspec filesystem, should we defer to just using the earthaccess convenience open method? I think @abarciauskas-bgse and I can implement an IAM-based escape hatch for earthaccess to support the alternative authorization method https://github.com/nsidc/earthaccess/issues/431#issuecomment-1912887762.

There are a few possible issues here

  1. We won't have direct access to the fsspec filesystem instantiation, so if we wanted to continue with @abarciauskas-bgse's filesystem-level caching work we'd need to add more config options to earthaccess (🤔 maybe not a bad thing).
  2. The other area where my understanding is limited is the possible performance difference in rasterio when using an fsspec filesystem rather than vsicurl or vsis3 for access. Have we done any benchmarking to understand fine-grained differences in the Range requests generated when using different file access mechanisms (vsis3, s3fs) in rasterio? Maybe we can write up an issue for testing this as part of this development.
vincentsarago commented 5 months ago

> Rather than manage our own fsspec filesystem, should we defer to just using the earthaccess convenience open method?

👍, I think I wanted to reuse the code @abarciauskas-bgse worked on for titiler-xarray, believing there was some optimization done!

> I think @abarciauskas-bgse and I can implement an IAM-based escape hatch for earthaccess to support the alternative authorization method https://github.com/nsidc/earthaccess/issues/431#issuecomment-1912887762.

TBH, I don't think we need to do this; we can simply require this project to be deployed on AWS in the same region as the data 🤷

> The other area where my understanding is limited is the possible performance difference in rasterio when using an fsspec filesystem rather than vsicurl or vsis3 for access. Have we done any benchmarking to understand fine-grained differences in the Range requests generated when using different file access mechanisms (vsis3, s3fs) in rasterio? Maybe we can write up an issue for testing this as part of this development.

TBH, I don't think there will be much difference between the two. fsspec and rasterio might not merge range requests (this would be interesting to validate; more info on multi-range requests can be found in https://github.com/rasterio/rasterio/pull/2969).

The main issue would be that rasterio does not support multi-dimensional data, so it will be easier to use fsspec + xarray + rioxarray.
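
For anyone unfamiliar with what merging range requests means in the comment above, here is a small self-contained sketch. It is a generic illustration, not a description of what GDAL, rasterio, or fsspec actually do: `merge_ranges` and its `max_gap` parameter are hypothetical, and ranges are inclusive byte offsets as in an HTTP Range header.

```python
from typing import Iterable, List, Tuple

# Inclusive (start, end) byte offsets, as in "Range: bytes=start-end".
ByteRange = Tuple[int, int]


def merge_ranges(ranges: Iterable[ByteRange], max_gap: int = 0) -> List[ByteRange]:
    """Coalesce byte ranges that overlap, touch, or sit within max_gap bytes.

    Fewer, larger requests trade some wasted bytes for fewer round trips.
    """
    merged: List[ByteRange] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1 + max_gap:
            # Close enough to the previous range: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


# Three chunk reads; the first two are contiguous and collapse into one.
requests = merge_ranges([(0, 99), (100, 199), (500, 599)])
```

A benchmark of vsis3 versus s3fs could log the ranges each one requests for the same tile and compare how many requests remain after this kind of coalescing.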

vincentsarago commented 5 months ago

I think the easiest way to test everything is to deploy this to AWS 🚢

I've opened #11 and will update the CI (#13) in another PR