NASA-IMPACT / veda-ui

Frontend for the Dashboard Evolution project

How to integrate titiler-cmr #1000

Closed abarciauskas-bgse closed 1 month ago

abarciauskas-bgse commented 2 months ago

Context

Titiler-cmr offers a way to dynamically return map tiles via NASA's Common Metadata Repository (CMR). The special sauce of titiler-cmr is that it does not require an asset url for tiling - it handles querying CMR for items and assets and then delivers tiles in response. This is similar to how titiler-pgstac works.

We want to demonstrate the use of titiler-cmr in VEDA in case it could be useful to other NASA applications or extend the set of data available to instances of VEDA.

titiler-cmr integration poses a challenge because it doesn't fit into the same model as any of the existing layer configurations. It is similar to how the MapLayerRasterTimeseries layer works with pgSTAC in that both submit a query to the tile API which handles querying the data catalog. But in the case of the existing "raster" type, there is no initial query to the STAC API (for tiling that is). A query is sent to the tile api (titiler-pgstac) which returns a mosaic URL which is used in the tile layer as the source. In this new case, we want to submit the catalog query parameters (to CMR in this case) as part of the tile layer source.

Having a STAC request where the STAC response is used as the parameters to the tile API (i.e. not a query to the mosaic tile registration endpoint) is more similar to how the "zarr" and "cmr" types work. However, the existing "zarr" and "cmr" types both query a STAC endpoint for an asset URL and use that asset url in the parameters sent to the tiling request. In the case of "zarr" it uses the asset URL from the collection's assets and in the case of "cmr" it uses the asset url from an item search[^1].

In this new scenario, we don't want to pass an asset URL in the tile parameters. Instead, we want to pass the collection concept id and the renders data from the STAC response to the tile endpoint.
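To make the contrast concrete, here is a minimal TypeScript sketch of the two kinds of tile parameters. Everything in it is hypothetical: the function names, the `/tiles/{z}/{x}/{y}` path template, and the parameter keys (`url`, `concept_id`, `datetime`, `variable`) are assumptions for illustration and would need to be checked against the tiler's actual API.

```typescript
// Illustrative only: names, path template and parameter keys are assumptions,
// not taken from veda-ui or from the titiler-cmr API.
type TileParams = Record<string, string>;

// Existing "zarr" / "cmr" style: the tiler is told which asset to read.
function assetUrlTileParams(assetUrl: string, sourceParams: TileParams): TileParams {
  return { ...sourceParams, url: assetUrl };
}

// New titiler-cmr style: the tiler is told which CMR collection to query.
function conceptIdTileParams(
  conceptId: string,
  datetime: string,
  renderParams: TileParams
): TileParams {
  return { ...renderParams, concept_id: conceptId, datetime };
}

// Serialize the parameters into a tile URL template for a raster source.
function buildTileUrlTemplate(tilerEndpoint: string, params: TileParams): string {
  const query = Object.keys(params)
    .sort() // stable key order so the resulting URL is predictable
    .map((key) => `${encodeURIComponent(key)}=${encodeURIComponent(params[key])}`)
    .join('&');
  return `${tilerEndpoint}/tiles/{z}/{x}/{y}?${query}`;
}

// Placeholder endpoint and concept id, for illustration only.
const exampleUrl = buildTileUrlTemplate(
  'https://tiler.example.com',
  conceptIdTileParams('C0000000000-EXAMPLE', '2024-01-01T00:00:00Z', { variable: 'sst' })
);
```

The only difference between the two helpers is which identifier the tiler receives, which is why a single paint layer could plausibly serve both.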

Here is an imperfect diagram of the scenarios we support for raster layers, with the scenario in red with dotted lines representing the scenario we want to support:

[Figure: zarr-viz-arch]

How things work now

From what I understand about how the code currently works for raster layers:

  1. There is a type value you can set on the collection which can be "raster", "cmr" or "zarr" (not including vector in this discussion)
  2. If the type is "raster", the MapLayerRasterTimeseries will submit a search payload to the STAC endpoint and then use the response's href (mosaicUrl) to create a tile layer.
  3. If the type is "cmr" or "zarr" the layer will use ZarrPaintLayer + useCmr or + useZarr, respectively.
    1. useCMR will search the STAC API endpoint for items and return the assetUrl
    2. useZarr will fetch the STAC API endpoint for the collection and return the assetUrl on the collection for the zarr asset
  4. The ZarrPaintLayer will use the asset url returned from the use* hook and the sourceParams from the collection's config file to create a tile layer.
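The dispatch described in the list above can be summarised as a small lookup. This is a sketch of the behaviour as described in this issue, not a copy of the veda-ui source; the `LayerPath` shape, `CURRENT_PATHS` and `describeCurrentPath` are hypothetical names.

```typescript
// Sketch of the current behaviour as described in this issue.
// The component and hook names come from the issue text; everything else is hypothetical.
type RasterLayerType = 'raster' | 'cmr' | 'zarr';

interface LayerPath {
  component: string;   // layer component named in the issue
  hook: string | null; // use* hook that supplies the asset URL, if any
  stacQuery: string;   // what is requested from the STAC endpoint
  tileInput: string;   // what ends up driving the tile layer
}

const CURRENT_PATHS: Record<RasterLayerType, LayerPath> = {
  raster: {
    component: 'MapLayerRasterTimeseries',
    hook: null,
    stacQuery: 'search payload',
    tileInput: 'mosaicUrl',
  },
  cmr: {
    component: 'ZarrPaintLayer',
    hook: 'useCmr',
    stacQuery: 'item search',
    tileInput: 'assetUrl + sourceParams',
  },
  zarr: {
    component: 'ZarrPaintLayer',
    hook: 'useZarr',
    stacQuery: 'collection fetch',
    tileInput: 'assetUrl + sourceParams',
  },
};

// Produce a one-line description of the path a given layer type takes.
function describeCurrentPath(type: RasterLayerType): string {
  const path = CURRENT_PATHS[type];
  const via = path.hook ? `${path.component} + ${path.hook}` : path.component;
  return `${type}: ${via}; STAC ${path.stacQuery}; tiles from ${path.tileInput}`;
}
```

Seen this way, "cmr" and "zarr" differ only in the STAC query, while "raster" differs in both the component and what it hands to the tile layer.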

Proposal

Ultimately, we will have a RasterPaintLayer which accepts the tile parameters returned from a use* hook. The use* hooks will have type-specific logic for how to query STAC and combine the results with the collection configuration to define a tile layer in the generic RasterPaintLayer class.

Some thoughts on generalization

We could potentially generalize this further. Since the STAC API and TILER API endpoints are defined in the configuration file, what differs in these scenarios that we need to account for in the layer configuration is:

  1. What concept are we searching for in STAC? This can be a collection or item.
  2. What parameters from the response are being sent to the tiler? This could be a growing and configurable list, so may be defined in the collection configuration.
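As a rough illustration of this idea, the collection configuration could name the STAC concept to query and list which response fields map to which tiler parameters. The config keys (`stacConcept`, `tilerParams`), the dot-path lookup and the sample response below are all hypothetical, not an existing veda-ui schema.

```typescript
// Hypothetical configuration shape; not an existing veda-ui schema.
interface TilerParamMapping {
  param: string;        // name of the query parameter sent to the tiler
  responsePath: string; // dot-separated path into the STAC response
}

interface GenericRasterConfig {
  stacConcept: 'collection' | 'item';
  tilerParams: TilerParamMapping[];
}

// Walk a dot-separated path through a parsed JSON value.
function getByPath(value: unknown, path: string): unknown {
  return path.split('.').reduce<unknown>((current, key) => {
    if (current !== null && typeof current === 'object') {
      return (current as Record<string, unknown>)[key];
    }
    return undefined;
  }, value);
}

// Collect the configured parameters, skipping any that are missing
// or that are not plain strings or numbers.
function pickTilerParams(
  response: unknown,
  config: GenericRasterConfig
): Record<string, string> {
  const params: Record<string, string> = {};
  for (const { param, responsePath } of config.tilerParams) {
    const found = getByPath(response, responsePath);
    if (typeof found === 'string' || typeof found === 'number') {
      params[param] = String(found);
    }
  }
  return params;
}

// Example: a "zarr"-like collection whose asset URL becomes the `url` parameter.
const zarrLikeConfig: GenericRasterConfig = {
  stacConcept: 'collection',
  tilerParams: [{ param: 'url', responsePath: 'assets.zarr.href' }],
};
const zarrLikeParams = pickTilerParams(
  { assets: { zarr: { href: 's3://example-bucket/data.zarr' } } },
  zarrLikeConfig
);
```

A new scenario would then only need a new mapping list in its configuration rather than a new hook, at the cost of pushing more structure into the config files.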

Interested in your thoughts @anayeaye @sandrahoang686 @hanbyul-here and @j08lue @aboydnw as an FYI.

[^1]: In the case of CMR item search, the selected item is just the first one. If multiple items are returned, this is a limitation in the design and implementation, i.e. there is no way to merge multiple items.

aboydnw commented 2 months ago

The special sauce of titiler-cmr is that it does not require an asset url for tiling

Totally naive question, what is the value of this? Is it easier to integrate with, more accurate, faster, something else?

Another naive question, what does this integration provide access to? Is it all of the data in Earthdata search? Or some other set of data? Asking because I'm curious how the outcome of this work is similar to or different from the ArcGIS integration work intended to integrate data from DAACs into VEDA.

abarciauskas-bgse commented 2 months ago

Definitely not a naïve question 😄

There are 2 ways in which the set of datasets titiler-cmr can serve is limited:

  1. titiler-cmr, as it is deployed now, will only work with files in a subset of data in earthdata cloud (S3). This access is possible because the lambda running the service is using the veda-data-reader-dev role, which has been allowed access to certain earthdata cloud buckets. As the lambda is not deployed into a VPC, it cannot access the public internet. This means it cannot access files in NASA's archives over HTTPS. I am pretty sure, though I can't recall why at this moment, that not being in a VPC is required for this S3 access to earthdata cloud buckets to work. Further, not all DAACs have enabled the veda-data-reader-dev role to access their S3 buckets.

  2. Not all data formats will be supported. It should support most NetCDF4 and COG collections, and even many HDF5 collections. But within those formats, there is creative use of the format that means we can't always guarantee they will work. And then there are many other formats employed by NASA that I would not claim titiler-cmr will work with, such as binary.

In terms of communication, in brief I would say something like:

"titiler-cmr can directly create tiles from many collections in NASA's Earthdata cloud so long as they are in NetCDF4 or COG and are using conventional dimensions for latitude, longitude and time. But titiler-cmr is a relatively new technology which has had limited testing, so we welcome new use cases to be tested with titiler-cmr so we can expand its application".

hanbyul-here commented 2 months ago

Thanks for writing down the thoughts @abarciauskas-bgse

We now have more types of layers than we initially planned. They have different ways of getting the data they need, but eventually, most of them are displayed as a raster layer in Mapbox. I was wondering if it would be better to separate out the layer for drawing so it can be used throughout the layers and make each different layer pass the information that the raster layer needs (tile URL and parameters).

What you described here basically aligns with my thoughts, and I liked that this is scoped down to the zarr and cmr-related layers now.

As @sandrahoang686 mentioned in PR #436, it will be helpful to know the schema of each type of response. Where can we find these?

abarciauskas-bgse commented 2 months ago

helpful to know the schema of each type of response. Where can we find these?

I think you are asking how we can know the schema of the response from STAC for each dataset type. We can definitely declare that. But I think we may also want to consider that the workflow is different, not just the type.

  1. The default is for RasterTimeseries, where the current collection and datetime selection is used to send a mosaic register request and the response contains a mosaic tiles json URL which is used as the source endpoint in the raster layer.
  2. For cmr-stac, the current collection and datetime selection are used to send a STAC item search request, and the response includes an asset URL that is used as a parameter in the tiles endpoint in the raster layer.
  3. For zarr, the current collection is used to send a STAC collection get request to a STAC endpoint, and the response includes an asset URL that is used as a parameter in the tiles endpoint in the raster layer, along with the datetime selection.
  4. For titiler-cmr (new), the current collection is used to send a STAC collection get and then the collection concept id (CMR concept id) and renders parameters in the response are used to construct the tiles URL for the raster layer.

You probably know all of this but I am just thinking through how we would need different interfaces for the request + response needed for each dataset type 🤔 .
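One way to write down those four request/response pairs is a discriminated union, with a single function that reduces each workflow to what the raster layer needs. This is a sketch only: the `type` values, the response field names and the output parameter keys (`url`, `datetime`, `concept_id`) are assumptions, not the actual STAC payloads or veda-ui types.

```typescript
// Sketch only: type values, field names and parameter keys are assumptions.
type Workflow =
  // 1. Mosaic register response carries a tiles json URL.
  | { type: 'raster'; response: { mosaicTileJsonUrl: string } }
  // 2. STAC item search response carries an asset URL.
  | { type: 'cmr'; response: { assetUrl: string } }
  // 3. STAC collection response carries an asset URL; datetime comes from the selection.
  | { type: 'zarr'; response: { assetUrl: string }; datetime: string }
  // 4. STAC collection response carries the concept id and renders parameters.
  | {
      type: 'titiler-cmr';
      response: { conceptId: string; renders: Record<string, string> };
    };

// What a generic raster paint layer would consume.
type TileSource =
  | { kind: 'tilejson'; url: string }
  | { kind: 'tiler-params'; params: Record<string, string> };

function toTileSource(workflow: Workflow): TileSource {
  switch (workflow.type) {
    case 'raster':
      return { kind: 'tilejson', url: workflow.response.mosaicTileJsonUrl };
    case 'cmr':
      return { kind: 'tiler-params', params: { url: workflow.response.assetUrl } };
    case 'zarr':
      return {
        kind: 'tiler-params',
        params: { url: workflow.response.assetUrl, datetime: workflow.datetime },
      };
    case 'titiler-cmr':
      return {
        kind: 'tiler-params',
        params: { ...workflow.response.renders, concept_id: workflow.response.conceptId },
      };
  }
}
```

With the union in place, the compiler flags any workflow that `toTileSource` does not handle, which is the kind of typed interface the responses could benefit from.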

I think we agree that there could definitely be a way to extract this out so there is just one paint layer class. I have done that for the latter 3 layer types, which all use the raster paint layer, but it could probably be improved, made cleaner, and benefit from more typed interfaces for the responses.

I could try and refactor raster timeseries to use raster paint layer but am not as comfortable with all of the things that are currently also happening in the raster timeseries module, so I am probably not the best person to do it.

hanbyul-here commented 1 month ago

Closing since all the follow-up issues are created.