NASA-IMPACT / veda-data-pipelines

data transformation - ingestion - publication pipelines to support VEDA

High-Level: HLS data to the dashboard #70

Closed abarciauskas-bgse closed 2 years ago

abarciauskas-bgse commented 2 years ago

@sharkinsspatial will be our point of contact for this dataset since he has produced and published the COGs to LP.DAAC

Sean says the pipeline is creating 150k COGs every day 😱 what do we need for the dashboard?

abarciauskas-bgse commented 2 years ago

@jvntf just fyi, I noticed we can also access the HLS files using URS authentication (e.g. https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSS30.020/HLS.S30.T55GEM.2022035T235241.v2.0/HLS.S30.T55GEM.2022035T235241.v2.0.B03.tif), but I don't think we'll want to use that if we can configure direct S3 access through AWS IAM policies.
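For illustration only (not part of our pipeline code), here is a minimal sketch of what direct S3 access could look like once the IAM side is in place. It assumes the calling role already has read access to the lp-prod-protected bucket and runs in the same region as the data (us-west-2):

```python
# Minimal sketch of direct S3 access to an HLS COG, assuming the caller's
# IAM role already has s3:GetObject on lp-prod-protected and runs
# in-region (us-west-2). For illustration only.
import boto3

s3 = boto3.client("s3", region_name="us-west-2")
obj = s3.get_object(
    Bucket="lp-prod-protected",
    Key=(
        "HLSS30.020/HLS.S30.T55GEM.2022035T235241.v2.0/"
        "HLS.S30.T55GEM.2022035T235241.v2.0.B03.tif"
    ),
)
print(obj["ContentLength"], "bytes")
```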

abarciauskas-bgse commented 2 years ago

Thinking about this a little bit more, I'm pretty sure we'll want to reuse the mosaic generation that Sean has implemented for Planetary Computer. For Planetary Computer I think Sean mentioned a daily global mosaic, which may still be higher temporal resolution than what we need for the dashboard, but if it's easy to copy those existing mosaic records into our database then maybe that is the best thing to do.

I'm also going to check in to see if we have any preliminary land cover stories to inform the decision about which HLS temporal and spatial extents and resolutions are right for the first release of the dashboard.

abarciauskas-bgse commented 2 years ago

Notes from today’s meeting with @sharkinsspatial :

Aimee: I think we want the daily mosaics that are being used for MS Planetary Computer

Sean: there is no HLS data in the Planetary Computer; they store portions of the old v1.5 data, and that's not global, just sample regions

We're working on a BLM / Forest Service request for SWIR imagery for fire season and have started to experiment with HLS global daily mosaics for this purpose, but we have been running into performance problems:

Sean: In general, tile construction should not take 30 seconds, so we have to determine what is causing the titiler lambda function to take over 30 seconds.

About mosaics:

Sean

Next steps for HLS data for EO dashboard:

There are nearly 3 million HLS Sentinel-2 granules and nearly 4.5 million Landsat granules. The story that is currently planned is to highlight flooding in Manville, NJ September 2021.

TL;DR: David Bitner says to just stick it in the database

Brian F showed an S30 tile (HLS.S30.T18TWK.2021245T154911.v2.0) so assuming that's the dataset we want to use for this use case:

I was proposing that we generate STAC records for the subset we need for our use case. Brian F showed an S30 tile for the flooding in Manville, NJ on Sept 2, 2021, so I was thinking of generating a few daily global mosaics of S30 for a few days before and after the flood, which would require loading about 113,700 granules into our STAC database (query: https://search.earthdata.nasa.gov/search/granules?p=C2021957295-LPCLOUD&pg[0][v]=f&pg[0][gsk]=-start_date&q=hls%20sentinel&qt=2021-08-20T00%3A00%3A00.000Z%2C2021-09-04T11%3A59%3A59.000Z&tl=1644432500!3!!&m=40.76707484150656!-75.22251319885254!7!1!0!0%2C2).

I think we just need Sentinel

https://github.com/stac-utils/pgstac#bulk-data-loading

STAC records are stored inline alongside the CMR metadata and the data itself; see the metadata link with "rel" of "http://esipfed.org/ns/fedsearch/1.1/metadata", or any of the links in the metadata which end in "_stac.json"
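As a rough illustration (not part of our pipeline), those "_stac.json" links can be pulled straight out of a CMR granule search. The collection concept id below is the one from the Earthdata Search query above; the temporal window is just an example:

```python
# Rough illustration: query CMR for HLSS30 granules and collect the links
# that end in "_stac.json". The concept id comes from the Earthdata Search
# query above; the temporal window is an arbitrary example.
import requests

resp = requests.get(
    "https://cmr.earthdata.nasa.gov/search/granules.json",
    params={
        "collection_concept_id": "C2021957295-LPCLOUD",
        "temporal": "2021-09-01T00:00:00Z,2021-09-03T00:00:00Z",
        "page_size": 100,
    },
    timeout=60,
)
resp.raise_for_status()

stac_links = [
    link["href"]
    for granule in resp.json()["feed"]["entry"]
    for link in granule.get("links", [])
    if link["href"].endswith("_stac.json")
]
print(f"{len(stac_links)} STAC item URLs")
```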

Note from David:

The docs on that are pretty lacking, but there are three "magic tables" that allow you to load data with three strategies: insert and error on anything duplicate, insert while ignoring anything that is duplicate, or insert and overwrite with any new data. There are triggers on those tables that then move the data into the right place. You can either use regular pg tools like psql to COPY data into one of those magic tables, or you can use pypgstac load, which is a Python utility/library that is just a wrapper around using COPY to load the data.
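To make David's description concrete, here is a minimal sketch of the "COPY into a magic table" strategy. The staging table and column names below follow the pgstac bulk-loading docs linked above and should be checked against the deployed pgstac version; the DSN and filename are placeholders:

```python
# Minimal sketch of bulk loading ndjson STAC items by COPYing into one of
# pgstac's staging ("magic") tables -- here items_staging_ignore, i.e.
# "insert while ignoring duplicates". Table/column names follow the pgstac
# README linked above; verify against the deployed pgstac version.
import psycopg2

DSN = "postgresql://pgstac:password@localhost:5432/postgis"  # placeholder

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur, open("hls_items.ndjson") as f:
        # One STAC item JSON document per line; triggers on the staging
        # table move the rows into pgstac's items table.
        cur.copy_expert(
            "COPY pgstac.items_staging_ignore (content) FROM STDIN", f
        )
```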

Other next steps:

Notes from Sean: The most difficult area here is maintaining consistent synchronization with CMR. Given the async nature of the HLS processing pipelines, granules might be created at variable times after collection. For example, our nominal latency is 3 days, so you might query CMR for a date 4 days after it has passed, but we might process a new granule for that date 5 days later (which you'd then be missing in pgstac). I'd like to work with Lauren so that Cumulus publication can support a configurable SNS topic that we could use to continuously ingest forward-processing data and avoid any syncing issues (this is how we currently handle the Landsat data in the HLS pipeline).

@sharkinsspatial please correct anything I misstated
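To make the SNS idea in the notes above concrete, here is a speculative sketch of what continuous ingest could look like on our side: a small lambda subscribed to the (hypothetical) Cumulus publication topic that forwards each newly published granule to the same ingest queue used for backfill. Only the SNS event envelope is standard AWS; the message field and environment variable are assumptions:

```python
# Speculative sketch of continuous ingest via SNS: a lambda subscribed to
# a (hypothetical) Cumulus publication topic forwards each new granule's
# STAC item URL to the ingest queue. The SNS event envelope is standard
# AWS; the "stac_item_url" field and the env var name are assumptions.
import json
import os

import boto3

sqs = boto3.client("sqs")
ITEM_QUEUE_URL = os.environ["ITEM_QUEUE_URL"]  # assumed configuration


def handler(event, context):
    for record in event["Records"]:
        message = json.loads(record["Sns"]["Message"])
        stac_url = message.get("stac_item_url")  # hypothetical field name
        if stac_url:
            sqs.send_message(QueueUrl=ITEM_QUEUE_URL, MessageBody=stac_url)
```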

abarciauskas-bgse commented 2 years ago

@freitagb mentioned that many of these STAC records already exist, and we could probably obtain them so that we don't have to go through URS authentication for STAC metadata generation. We should check in with him to see whether it makes sense for us to use the STAC records he is maintaining in the staging environment versus the STAC records maintained by LP.DAAC.

sharkinsspatial commented 2 years ago

@abarciauskas-bgse There are inline STAC records available as public links for all the HLS granules via the LP.DAAC lp-prod-public CloudFront endpoint, so no authentication is necessary. See the example below:

curl --location --request GET 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-public/HLSS30.020/HLS.S30.T51SYC.2021210T022549.v2.0/HLS.S30.T51SYC.2021210T022549.v2.0_stac.json'

abarciauskas-bgse commented 2 years ago

Cool thanks @sharkinsspatial that's good to know, that wasn't clear in the conversation today with @freitagb

abarciauskas-bgse commented 2 years ago

@sharkinsspatial mentioned that he has a lambda that updates the AWS credentials stored for the ingest lambda every 30 minutes (the temporary credentials expire, so they need to be refreshed regularly)

Sean also has a lambda that queues the list of CMR records from a query onto SQS to generate STAC records, which are then bulk loaded

The plan is to tag up with @jvntf and @anayeaye about this infrastructure at some point soon

abarciauskas-bgse commented 2 years ago

Here are some notes from the meeting mostly focused on the HLS stack that I took earlier today: https://docs.google.com/document/d/15XB0lP3bm8MbtgLZb_bdALu0JJX0w9OlmivvBwNht7Y/edit

I think we agreed to use the same infrastructure configuration as Sean but starting with just one day of Sentinel data to benchmark timing and test configuration between all the components in our AWS environment. @jvntf does that make sense to you? It sounded like you may be able to start on this ticket soon.

abarciauskas-bgse commented 2 years ago

My current understanding of Sean's workflow for HLS data (a rough sketch of the builder step follows the list below):

  1. Query Earthdata CMR for XX HLS data (in our case, 1 day of Sentinel data) and queue each STAC item JSON URL as one message on the ItemQueue. (Someone already generated the STAC JSON and stored it in the metadata links for the HLS products.)
  2. The NDJson Builder Lambda receives messages from the ItemQueue, reads the STAC metadata for up to 100 items into a single ndjson file, and puts that NDJson file onto an NDJson queue for the last lambda.
  3. The final Lambda receives messages from the NDJson queue and does a batch write to pgstac.
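As referenced above, here is a rough sketch (not Sean's actual code) of what the step-2 builder lambda could look like. Bucket and queue names are placeholders, and writing the ndjson to S3 and queueing a pointer (rather than the file body itself) is an assumption to stay under SQS's 256 KB message limit:

```python
# Rough sketch of step 2: consume a batch of SQS messages whose bodies are
# URLs of pre-generated *_stac.json items, bundle them into one ndjson
# object, and queue its location for the final pgstac bulk-write lambda.
# Bucket/queue names and the "write to S3, queue the key" detail are
# assumptions, not the pipeline's actual configuration.
import json
import os
import uuid

import boto3
import requests

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
NDJSON_BUCKET = os.environ["NDJSON_BUCKET"]        # assumed env var
NDJSON_QUEUE_URL = os.environ["NDJSON_QUEUE_URL"]  # assumed env var


def handler(event, context):
    lines = []
    for record in event["Records"]:
        item_url = record["body"]  # one STAC item JSON URL per message
        resp = requests.get(item_url, timeout=30)
        resp.raise_for_status()
        # One compact STAC item per line of ndjson.
        lines.append(json.dumps(resp.json(), separators=(",", ":")))

    key = f"ndjson/{uuid.uuid4()}.ndjson"
    s3.put_object(
        Bucket=NDJSON_BUCKET, Key=key, Body="\n".join(lines).encode("utf-8")
    )

    # Tell the bulk-load lambda where to find the batch.
    sqs.send_message(
        QueueUrl=NDJSON_QUEUE_URL,
        MessageBody=json.dumps({"bucket": NDJSON_BUCKET, "key": key}),
    )
```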
jvntf commented 2 years ago

Our HLS collection was inserted using the id HLSS30.002, but the CMR STAC items use the id HLSS30. I have temporarily inserted a second collection using HLSS30. cc @abarciauskas-bgse

jvntf commented 2 years ago

Ingested 4 hours of HLS data ending at 2021-07-29 05:00:0. The pipeline should be ready to ingest all the HLS records.

jvntf commented 2 years ago

Ingested the records under HLSS30.002 as well. I'll remove the HLSS30 collection / records from the DB.

abarciauskas-bgse commented 2 years ago

Awesome!

abarciauskas-bgse commented 2 years ago

Since HLS is a large volume dataset, we are working on it in parts and I'm leaving this ticket open as a catch all for HLS work for now.

aboydnw commented 2 years ago

@aboydnw to make some follow-up tickets for continuous ingest & multi-DAAC work and link them to this ticket

aboydnw commented 2 years ago

Closing in favor of https://github.com/NASA-IMPACT/veda-data-airflow/issues/99 and https://github.com/NASA-IMPACT/veda-data-airflow/issues/100