NASA-IMPACT / veda-ui

Frontend for the Dashboard Evolution project

[Analysis page] Query for datasets via collection metadata? #658

Closed j08lue closed 1 year ago

j08lue commented 1 year ago

Currently, we find the datasets available for the user-defined area and date range of interest by asking the STAC API for all matching items and then deriving the set of collections those items belong to.

This approach has several issues: it is costly, it transfers a lot of data to the client that is not needed, and it currently does not return all collections that should be returned, probably because not all items are loaded due to some limit / no pagination.
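To illustrate the pagination point, here is a minimal sketch of what collecting collection IDs from a paginated item search involves. It assumes the search response is a GeoJSON FeatureCollection whose `links` array carries a `rel: "next"` entry with a GET `href`, as in the STAC API item search spec. `fetch_page` and `max_pages` are hypothetical names introduced for this example, not part of veda-ui.

```python
def collection_ids_from_search(fetch_page, first_url, max_pages=100):
    """Follow 'next' links and collect the collection ID of every item.

    fetch_page is a hypothetical callable: URL in, parsed JSON page out.
    max_pages is a safety cap so a misbehaving API cannot loop forever.
    """
    seen = set()
    url = first_url
    pages = 0
    while url and pages < max_pages:
        page = fetch_page(url)
        for item in page.get("features", []):
            if item.get("collection"):
                seen.add(item["collection"])
        # Stop when the page has no 'next' link.
        url = next(
            (link["href"] for link in page.get("links", []) if link.get("rel") == "next"),
            None,
        )
        pages += 1
    return seen
```

Reading only the first page would miss any collection whose items first appear on a later page, and every page transfers full item records just to extract one ID per item.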

We have been discussing adding an aggregation endpoint to the STAC API / pgSTAC that could perform these queries in the database. However, even there, the issue remains that these queries are very costly, and pgSTAC (unlike Elasticsearch) is not particularly fast at them.

An alternative solution is to make use of the total bounding box and date range information on the STAC collection level: STAC collection metadata already contains this information and we would just need to do the intersection in the client. While this approach is less accurate than the item query for edge cases where data coverage is sparse with large gaps, it is a lot faster and could at least limit the number of collections to query.
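A minimal sketch of that intersection, assuming each collection record follows the STAC collection spec (`extent.spatial.bbox` as a list of `[west, south, east, north]` boxes and `extent.temporal.interval` as a list of `[start, end]` pairs where `null` means open-ended). The function names are made up for this example, and it ignores 3D bboxes and boxes that cross the antimeridian.

```python
from datetime import datetime, timezone


def parse_dt(value):
    """Parse an RFC 3339 timestamp; None stands for an open-ended bound."""
    if value is None:
        return None
    dt = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)


def bbox_intersects(a, b):
    """True if two [west, south, east, north] boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def interval_intersects(interval, start, end):
    """True if a collection interval overlaps the query range [start, end]."""
    c_start, c_end = parse_dt(interval[0]), parse_dt(interval[1])
    starts_in_time = c_start is None or c_start <= end
    ends_in_time = c_end is None or c_end >= start
    return starts_in_time and ends_in_time


def candidate_collections(collections, bbox, start, end):
    """Return IDs of collections whose extent overlaps the query."""
    ids = []
    for collection in collections:
        extent = collection["extent"]
        in_space = any(bbox_intersects(b, bbox) for b in extent["spatial"]["bbox"])
        in_time = any(
            interval_intersects(i, start, end) for i in extent["temporal"]["interval"]
        )
        if in_space and in_time:
            ids.append(collection["id"])
    return ids
```

The result is a superset of what the item query would return: a collection can pass this check and still have no items in the area or period of interest.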

We will push for developing an aggregation function on our STAC backend, but that will take a while to develop. In the meantime, replacing the current approach with the fast collection metadata method would be great.

Acceptance criteria

anayeaye commented 1 year ago

A few quick thoughts here about high-level, full-catalog searches without collection filters:

stac-api/collection/items/search|aggregation (answer specific questions)

When we implement some aggregation functionality, we will have lots of opportunity for innovation and will be able to support investigations like:

stac-api/collections (provide a little spatiotemporal info about all collections)

The collections endpoint gives us gross information about where and when collections have coverage. There is a lot of flexibility in the descriptive metadata we can add to collection records, including more precise geometry.

RE: "An alternative solution is to make use of the total bounding box and date range information on the STAC collection level: STAC collection metadata already contains this information and we would just need to do the intersection in the client. While this approach is less accurate than the item query for edge cases where data coverage is sparse with large gaps, it is a lot faster and could at least limit the number of collections to query."

Relevant properties for using only the stac-api/collections response and one suggestion

Smallsat data explorer

For the case in which a user arrives at an explore interface and simply wants to know which collections have any data within a time and area of interest, we should look into how the smallsat explorer supports completely open-ended searches with a sampling grid. Is this something we can do? I think the backend is very similar. https://github.com/NASA-IMPACT/csdap-frontend/ https://csdap.earthdata.nasa.gov/explore/

hanbyul-here commented 1 year ago

I used the collections endpoint in https://github.com/NASA-IMPACT/veda-ui/pull/666. I think the main concern with this approach is that we can filter datasets only by their bbox, so spatially sparse datasets can be listed and still return empty results. Check the preview and let me know what you think / if the filter can be better fine-tuned.

j08lue commented 1 year ago

Wow, that turnaround was quick.

I am sure we will hit the challenge with spatially (or temporally) sparse datasets eventually, but this solution is better than the current situation, at least for the GHG datasets. Better to show a few too many datasets (and then have empty plots) than too few.

We need to run a few random tests and validate that the results are as expected. All datasets that (possibly) have any data within the query should be listed.

To address the spatial case in the future, maybe we could compute the real coverage upon ingest (union(existing_geom, new_geom)) and store that in addition to the max bbox. 🤷
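A rough sketch of that idea, with one deliberate substitution: instead of a true geometry union (which would need a geometry library), it tracks coverage as a set of coarse grid cells and merges new cells in on each ingest. The 1-degree cell size and all function names are assumptions for illustration, not anything pgSTAC or veda-ui provides.

```python
import math


def cells_for_bbox(bbox, cell_deg=1.0):
    """Return the set of (col, row) grid cells a bbox touches.

    bbox is [west, south, east, north]. A box ending exactly on a cell
    edge also claims the neighbouring cell, so the result errs on the
    side of too much coverage rather than too little.
    """
    west, south, east, north = bbox
    cols = range(math.floor(west / cell_deg), math.floor(east / cell_deg) + 1)
    rows = range(math.floor(south / cell_deg), math.floor(north / cell_deg) + 1)
    return {(c, r) for c in cols for r in rows}


def update_coverage(existing_cells, new_bbox, cell_deg=1.0):
    """Grid analogue of union(existing_geom, new_geom), run on ingest."""
    return existing_cells | cells_for_bbox(new_bbox, cell_deg)


def coverage_intersects(coverage_cells, query_bbox, cell_deg=1.0):
    """True if the query bbox touches any cell that holds data."""
    return not coverage_cells.isdisjoint(cells_for_bbox(query_bbox, cell_deg))
```

With two items far apart, a query in the gap between them no longer matches, whereas the max bbox alone would still report a hit.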

j08lue commented 1 year ago

Done!