NASA-IMPACT / veda-data-pipelines

data transformation - ingestion - publication pipelines to support VEDA

High-Level: HLS data to the dashboard #70

Closed abarciauskas-bgse closed 2 years ago

abarciauskas-bgse commented 2 years ago

@sharkinsspatial will be our point of contact for this dataset since he has produced and published the COGs to LP.DAAC

Sean says the pipeline is creating 150k COGs every day 😱 what do we need for the dashboard?

abarciauskas-bgse commented 2 years ago

@jvntf just fyi, I noticed we can also access the HLS files using URS authentication (e.g. https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-protected/HLSS30.020/HLS.S30.T55GEM.2022035T235241.v2.0/HLS.S30.T55GEM.2022035T235241.v2.0.B03.tif), but I don't think we'll want to use that if we can configure direct S3 access through AWS IAM policies.
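For illustration only (not part of our pipeline code), here is a minimal sketch of what direct S3 access could look like once the IAM side is in place. It assumes the calling role already has read access to the lp-prod-protected bucket and runs in the same region as the data (us-west-2):

```python
# Minimal sketch of direct S3 access to an HLS COG, assuming the caller's
# IAM role already has s3:GetObject on lp-prod-protected and runs
# in-region (us-west-2). For illustration only.
import boto3

s3 = boto3.client("s3", region_name="us-west-2")
obj = s3.get_object(
    Bucket="lp-prod-protected",
    Key=(
        "HLSS30.020/HLS.S30.T55GEM.2022035T235241.v2.0/"
        "HLS.S30.T55GEM.2022035T235241.v2.0.B03.tif"
    ),
)
print(obj["ContentLength"], "bytes")
```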

abarciauskas-bgse commented 2 years ago

Thinking about this a little bit more, I'm pretty sure we'll want to reuse the mosaic generation that Sean has implemented for Planetary Computer. For Planetary Computer I think Sean mentioned a daily global mosaic, which may still be higher temporal resolution than what we need for the dashboard, but if it's easy to copy those existing mosaic records into our database then maybe that is the best thing to do.

I'm also going to check in to see if we have any preliminary land cover stories to inform the decision about which HLS temporal and spatial extents and resolutions are right for the first release of the dashboard.

abarciauskas-bgse commented 2 years ago

Notes from today’s meeting with @sharkinsspatial :

Aimee: I think we want the daily mosaics that are being used for MS Planetary Computer

Sean: there is no HLS data in the Planetary Computer; they store portions of the old v1.5 data, and that's not global, just sample regions

We're working on a BLM / Forest Service request for SWIR imagery for fire season and have started to experiment with HLS global daily mosaics for this purpose, but we have been running into performance problems:

Sean: In general, tile construction should not take 30 seconds, so we have to determine what is causing the titiler lambda function to take over 30 seconds.

About mosaics:

Sean

Next steps for HLS data for EO dashboard:

There are nearly 3 million HLS Sentinel-2 granules and nearly 4.5 million Landsat granules. The story that is currently planned is to highlight flooding in Manville, NJ September 2021.

TL;DR: David Bitner says to just stick it in the database

Brian F showed an S30 tile (HLS.S30.T18TWK.2021245T154911.v2.0) so assuming that's the dataset we want to use for this use case:

I was proposing that we generate STAC records for the subset we need for our use case. Brian F showed an S30 tile for the flooding in Manville, NJ on Sept 2, 2021, so I was thinking of generating a few daily global mosaics of S30 for a few days before and after the flood, which would require loading about 113,700 granules into our STAC database (query: https://search.earthdata.nasa.gov/search/granules?p=C2021957295-LPCLOUD&pg[0][v]=f&pg[0][gsk]=-start_date&q=hls%20sentinel&qt=2021-08-20T00%3A00%3A00.000Z%2C2021-09-04T11%3A59%3A59.000Z&tl=1644432500!3!!&m=40.76707484150656!-75.22251319885254!7!1!0!0%2C2).

I think we just need Sentinel

https://github.com/stac-utils/pgstac#bulk-data-loading

STAC records are stored inline alongside the CMR metadata and the data itself; see the metadata link with "rel" of "http://esipfed.org/ns/fedsearch/1.1/metadata", or any of the links in the metadata which end in "_stac.json"
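As a rough illustration (not part of our pipeline), those "_stac.json" links can be pulled straight out of a CMR granule search. The collection concept id below is the one from the Earthdata Search query above; the temporal window is just an example:

```python
# Rough illustration: query CMR for HLSS30 granules and collect the links
# that end in "_stac.json". The concept id comes from the Earthdata Search
# query above; the temporal window is an arbitrary example.
import requests

resp = requests.get(
    "https://cmr.earthdata.nasa.gov/search/granules.json",
    params={
        "collection_concept_id": "C2021957295-LPCLOUD",
        "temporal": "2021-09-01T00:00:00Z,2021-09-03T00:00:00Z",
        "page_size": 100,
    },
    timeout=60,
)
resp.raise_for_status()

stac_links = [
    link["href"]
    for granule in resp.json()["feed"]["entry"]
    for link in granule.get("links", [])
    if link["href"].endswith("_stac.json")
]
print(f"{len(stac_links)} STAC item URLs")
```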

Note from David:

The docs on that are pretty lacking, but there are three "magic tables" that allow you to load data with three strategies: insert and error on anything duplicate, insert while ignoring anything that is duplicate, or insert and overwrite with any new data. There are triggers on those tables that then move the data into the right place. You can either use regular pg tools like psql to COPY data into one of those magic tables, or you can use pypgstac load, which is a Python utility/library that is just a wrapper around using COPY to load the data.
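To make David's description concrete, here is a minimal sketch of the "COPY into a magic table" strategy. The staging table and column names below follow the pgstac bulk-loading docs linked above and should be checked against the deployed pgstac version; the DSN and filename are placeholders:

```python
# Minimal sketch of bulk loading ndjson STAC items by COPYing into one of
# pgstac's staging ("magic") tables -- here items_staging_ignore, i.e.
# "insert while ignoring duplicates". Table/column names follow the pgstac
# README linked above; verify against the deployed pgstac version.
import psycopg2

DSN = "postgresql://pgstac:password@localhost:5432/postgis"  # placeholder

with psycopg2.connect(DSN) as conn:
    with conn.cursor() as cur, open("hls_items.ndjson") as f:
        # One STAC item JSON document per line; triggers on the staging
        # table move the rows into pgstac's items table.
        cur.copy_expert(
            "COPY pgstac.items_staging_ignore (content) FROM STDIN", f
        )
```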

Other next steps:

Notes from Sean: The most difficult area here is maintaining consistent synchronization with CMR. Given the async nature of the HLS processing pipelines, granules might be created at variable times after collection. For example, our nominal latency is 3 days, so you might query CMR for a date 4 days after it has passed, but we might process a new granule for that date 5 days later (which you'd then be missing in pgstac). I'd like to work with Lauren so that Cumulus publication can support a configurable SNS topic that we could use to continuously ingest forward-processing data and avoid any syncing issues (this is how we currently handle the Landsat data in the HLS pipeline).

@sharkinsspatial please correct anything I misstated
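To make the SNS idea in the notes above concrete, here is a speculative sketch of what continuous ingest could look like on our side: a small lambda subscribed to the (hypothetical) Cumulus publication topic that forwards each newly published granule to the same ingest queue used for backfill. Only the SNS event envelope is standard AWS; the message field and environment variable are assumptions:

```python
# Speculative sketch of continuous ingest via SNS: a lambda subscribed to
# a (hypothetical) Cumulus publication topic forwards each new granule's
# STAC item URL to the ingest queue. The SNS event envelope is standard
# AWS; the "stac_item_url" field and the env var name are assumptions.
import json
import os

import boto3

sqs = boto3.client("sqs")
ITEM_QUEUE_URL = os.environ["ITEM_QUEUE_URL"]  # assumed configuration


def handler(event, context):
    for record in event["Records"]:
        message = json.loads(record["Sns"]["Message"])
        stac_url = message.get("stac_item_url")  # hypothetical field name
        if stac_url:
            sqs.send_message(QueueUrl=ITEM_QUEUE_URL, MessageBody=stac_url)
```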

abarciauskas-bgse commented 2 years ago

@freitagb mentioned that many of these STAC records already exist, and we could probably obtain them so that we don't have to go through URS authentication for STAC metadata generation. We should check in with him to see whether it makes sense for us to use the STAC records he is maintaining in the staging environment versus the STAC records maintained by LP.DAAC.

sharkinsspatial commented 2 years ago

@abarciauskas-bgse There are inline STAC records available as public links for all the HLS granules via the LP.DAAC lp-prod-public CloudFront endpoint, so no authentication is necessary. See the example below:

curl --location --request GET 'https://data.lpdaac.earthdatacloud.nasa.gov/lp-prod-public/HLSS30.020/HLS.S30.T51SYC.2021210T022549.v2.0/HLS.S30.T51SYC.2021210T022549.v2.0_stac.json'

abarciauskas-bgse commented 2 years ago

Cool thanks @sharkinsspatial that's good to know, that wasn't clear in the conversation today with @freitagb

abarciauskas-bgse commented 2 years ago

@sharkinsspatial mentioned that he has a lambda that updates the AWS credentials stored for the ingest lambda every 30 minutes (the temporary credentials expire, so they need to be refreshed regularly)

Sean also has a lambda that queues the list of CMR records from a query onto SQS to generate STAC records, which are then bulk loaded

The plan is to tag up with @jvntf and @anayeaye about this infrastructure at some point soon

abarciauskas-bgse commented 2 years ago

Here are some notes from the meeting mostly focused on the HLS stack that I took earlier today: https://docs.google.com/document/d/15XB0lP3bm8MbtgLZb_bdALu0JJX0w9OlmivvBwNht7Y/edit

I think we agreed to use the same infrastructure configuration as Sean but starting with just one day of Sentinel data to benchmark timing and test configuration between all the components in our AWS environment. @jvntf does that make sense to you? It sounded like you may be able to start on this ticket soon.

abarciauskas-bgse commented 2 years ago

My current understanding of Sean's workflow for HLS data (a rough sketch of the builder step follows the list below):

  1. Query Earthdata CMR for XX HLS data (in our case, 1 day of Sentinel data) and queue each STAC item JSON URL as one message on the ItemQueue. (Someone already generated the STAC JSON and stored it in the metadata links for the HLS products.)
  2. The NDJson Builder Lambda receives messages from the ItemQueue, reads the STAC metadata for up to 100 items into a single ndjson file, and puts that NDJson file onto an NDJson queue for the last lambda.
  3. The final Lambda receives messages from the NDJson queue and does a batch write to pgstac.
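As referenced above, here is a rough sketch (not Sean's actual code) of what the step-2 builder lambda could look like. Bucket and queue names are placeholders, and writing the ndjson to S3 and queueing a pointer (rather than the file body itself) is an assumption to stay under SQS's 256 KB message limit:

```python
# Rough sketch of step 2: consume a batch of SQS messages whose bodies are
# URLs of pre-generated *_stac.json items, bundle them into one ndjson
# object, and queue its location for the final pgstac bulk-write lambda.
# Bucket/queue names and the "write to S3, queue the key" detail are
# assumptions, not the pipeline's actual configuration.
import json
import os
import uuid

import boto3
import requests

s3 = boto3.client("s3")
sqs = boto3.client("sqs")
NDJSON_BUCKET = os.environ["NDJSON_BUCKET"]        # assumed env var
NDJSON_QUEUE_URL = os.environ["NDJSON_QUEUE_URL"]  # assumed env var


def handler(event, context):
    lines = []
    for record in event["Records"]:
        item_url = record["body"]  # one STAC item JSON URL per message
        resp = requests.get(item_url, timeout=30)
        resp.raise_for_status()
        # One compact STAC item per line of ndjson.
        lines.append(json.dumps(resp.json(), separators=(",", ":")))

    key = f"ndjson/{uuid.uuid4()}.ndjson"
    s3.put_object(
        Bucket=NDJSON_BUCKET, Key=key, Body="\n".join(lines).encode("utf-8")
    )

    # Tell the bulk-load lambda where to find the batch.
    sqs.send_message(
        QueueUrl=NDJSON_QUEUE_URL,
        MessageBody=json.dumps({"bucket": NDJSON_BUCKET, "key": key}),
    )
```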
jvntf commented 2 years ago

Our HLS collection was inserted using the id HLSS30.002, but the CMR STAC items use the id HLSS30. I have temporarily inserted a second collection using HLSS30. cc @abarciauskas-bgse

jvntf commented 2 years ago

Ingested 4 hours of HLS data ending at 2021-07-29 05:00:0. The pipeline should be ready to ingest all the HLS records.

jvntf commented 2 years ago

Ingested the records under HLSS30.002 as well. I'll remove the HLSS30 collection / records from the DB.

abarciauskas-bgse commented 2 years ago

Awesome!

abarciauskas-bgse commented 2 years ago

Since HLS is a large volume dataset, we are working on it in parts and I'm leaving this ticket open as a catch all for HLS work for now.

aboydnw commented 2 years ago

@aboydnw to make some follow-up tickets for continuous ingest & multi-DAAC work and link them to this ticket

aboydnw commented 2 years ago

Closing in favor of https://github.com/NASA-IMPACT/veda-data-airflow/issues/99 and https://github.com/NASA-IMPACT/veda-data-airflow/issues/100