NASA-IMPACT / veda-data-pipelines

data transformation - ingestion - publication pipelines to support VEDA

Create COGs and publish LIS dataset to the API #144

Closed abarciauskas-bgse closed 2 years ago

abarciauskas-bgse commented 2 years ago

NOTE: The dataset ingest + publication workflows are currently undergoing a refactor in this branch: https://github.com/NASA-IMPACT/cloud-optimized-data-pipelines/tree/refactor

Brendan McAndrew is one of the science leads on the freshwater team. We will help the freshwater team convert this dataset to COGs and publish it so that the freshwater story on Midwest Flooding can be told in the new climate dashboard.

1. Identify the dataset and what the processing needs are

Brendan McAndrew shared a sample NetCDF file: https://drive.google.com/file/d/1i8-hEa2jl4E36TK78fIMUMvh_RHKOR-T/view?usp=sharing which we can use to test COG conversion.

More from Brendan:

I’ll need to ask which variables we want to have included—it won’t be all of them. I’ll get back to you with a list on Monday.

The projection is equidistant cylindrical/plate carrée.

Currently the files are on our HPC system. I can upload them to our S3 bucket on SMCE, but it is in us-east-1. Total filesize for the collection is ~375GB.

2. Create COG conversion code and verify the COG output with a data product expert (for example, someone at the DAAC that hosts the native format) by sharing it in a visual interface.
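As a sketch of what the conversion step might look like: the snippet below builds a `gdal_translate` command that extracts one NetCDF subdataset and writes a COG. The variable name `TWS_tavg` and the file paths are placeholders (not confirmed LIS names), and the `EPSG:4326` assignment assumes the plate carrée grid mentioned above is on WGS84.

```python
# Hedged sketch: convert one variable of a LIS NetCDF file to a Cloud
# Optimized GeoTIFF using GDAL's COG driver (requires GDAL >= 3.1).
# Variable name and paths are placeholders, not the final LIS choices.

def cog_translate_cmd(nc_path: str, variable: str, out_path: str) -> list:
    """Build a gdal_translate command that converts one NetCDF
    subdataset to a COG."""
    return [
        "gdal_translate",
        f'NETCDF:"{nc_path}":{variable}',  # select a single subdataset
        out_path,
        "-of", "COG",                      # Cloud Optimized GeoTIFF driver
        "-co", "COMPRESS=DEFLATE",
        # Equidistant cylindrical / plate carrée; assuming WGS84 -> EPSG:4326
        "-a_srs", "EPSG:4326",
    ]

cmd = cog_translate_cmd("LIS_HIST.nc", "TWS_tavg", "lis_tws.tif")
print(" ".join(cmd))
# To actually run the conversion (requires GDAL on PATH):
# import subprocess; subprocess.run(cmd, check=True)
```

The resulting GeoTIFF can then be loaded into a visual interface (e.g. QGIS or a tile-server preview) for review with a data product expert.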

3. Design the metadata and publish to the Dev API

  1. Review conventions for generating STAC collection and item metadata:
    • Collections: https://github.com/NASA-IMPACT/delta-backend/issues/29 and STAC version 1.0 specification for collections
    • Items: https://github.com/NASA-IMPACT/delta-backend/issues/28 and STAC version 1.0 specification for items
    • NOTE: The delta-backend instructions are specific to datasets for the climate dashboard; however, not all datasets will be part of the dashboard's visual layers, so I believe you can ignore the instructions specific to the "dashboard" extension, "item_assets" in the collection, and the "cog_default" asset type in the item.

A collection will need the following fields. Some may be self-evident from the filename or an about page for the product, but in many cases we may need to reach out to product owners to define the right values for these fields.
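For illustration, a minimal STAC 1.0 collection record with the commonly required fields might look like the sketch below. The `id` matches a collection named later in this thread; the description, license, and extent values are placeholders to be confirmed with the product owners.

```python
# Hedged sketch of a minimal STAC 1.0 collection record. Only the field
# names come from the STAC 1.0 collection spec; the values are placeholders.
import json

collection = {
    "type": "Collection",
    "stac_version": "1.0.0",
    "id": "lis-tws-anomaly",
    "description": "LIS terrestrial water storage anomaly (placeholder).",
    "license": "proprietary",  # confirm the actual license with product owners
    "extent": {
        # Global plate carrée grid; confirm exact bounds with the science team
        "spatial": {"bbox": [[-180.0, -90.0, 180.0, 90.0]]},
        # Roughly 20 years of model output starting in 2002 (placeholder)
        "temporal": {"interval": [["2002-01-01T00:00:00Z", None]]},
    },
    "links": [],
}

print(json.dumps(collection, indent=2))
```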

  1. After reviewing the STAC documentation for collections and items, and the existing scripts for generating collection metadata (generally with SQL) and item metadata, generate or reuse scripts for your collection and a few items to publish to the testing API. There is documentation and examples for generating a pipeline or otherwise documenting your dataset workflow in https://github.com/NASA-IMPACT/cloud-optimized-data-pipelines. We would like to maintain the scripts folks use to publish datasets in that repo so we can easily re-run those datasets' ingest and publish workflows if necessary.
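A sketch of the item side of the metadata, in the same spirit: the builder below assembles a STAC 1.0 item for one monthly COG. The asset href, datetime, and global bbox are illustrative placeholders, not values from the actual LIS ingest scripts.

```python
# Hedged sketch of a STAC 1.0 item for one monthly LIS COG.
# Href, datetime, and geometry are placeholders.
import json

def make_item(item_id: str, cog_href: str, dt: str) -> dict:
    """Assemble a minimal STAC item dict for a single COG asset."""
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": item_id,
        "geometry": {
            "type": "Polygon",
            "coordinates": [[[-180, -90], [180, -90], [180, 90],
                             [-180, 90], [-180, -90]]],  # global footprint
        },
        "bbox": [-180.0, -90.0, 180.0, 90.0],
        "properties": {"datetime": dt},
        "assets": {
            "data": {
                "href": cog_href,
                "type": "image/tiff; application=geotiff; profile=cloud-optimized",
            }
        },
        "collection": "lis-tws-anomaly",
        "links": [],
    }

item = make_item(
    "lis-tws-anomaly-200212",                   # placeholder id
    "s3://example-bucket/lis-tws/200212.tif",   # placeholder href
    "2002-12-01T00:00:00Z",
)
print(json.dumps(item, indent=2))
```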

  2. If necessary, request access and credentials to the dev database, then ingest and publish to the Dev API. Submit a PR with the manual or CDK scripts used to run the publication workflow, and include links to the published datasets in the Dev API.
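The ingest step above could look roughly like the following, assuming the Dev API implements the STAC API Transactions extension (`POST /collections/{id}/items`). The base URL is a placeholder; the real URL and any auth headers come with the credentials mentioned above.

```python
# Hedged sketch: publish one STAC item to a STAC API that supports the
# Transactions extension. STAC_API_URL is a placeholder, and the actual
# network call is left commented out.
import json
import urllib.request

STAC_API_URL = "https://dev-api.example.com"  # placeholder base URL

def build_ingest_request(collection_id: str, item: dict) -> urllib.request.Request:
    """Build the POST request that would insert `item` into `collection_id`."""
    url = f"{STAC_API_URL}/collections/{collection_id}/items"
    return urllib.request.Request(
        url,
        data=json.dumps(item).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_ingest_request("lis-tws-anomaly", {"type": "Feature", "id": "demo"})
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req)  # uncomment once credentials and URL are in place
```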

Publish to the Staging API

Once the PR is approved, we can merge and publish those datasets to the Staging API

slesaad commented 2 years ago

@abarciauskas-bgse the link to the sample file says that it's been deleted. do you have it by any chance?

abarciauskas-bgse commented 2 years ago

@slesaad thanks for looking at this, link is updated

abarciauskas-bgse commented 2 years ago

@sahmad3 is taking over the story development from the EIS Freshwater team side. He sent some additional details via email that I am adding here:

Visualizing the LIS data is what we are focusing on, but at the global scale. The model outputs are ready for around 20 years. The [rest of the NetCDF files] are on discover and Brendan copied them over to SMCE S3 bucket (eis-dh-hydro/LIS_NETCDF/DA_10km/GLOBAL/SURFACEMODEL/*/LIS_HIST.nc)

slesaad commented 2 years ago

The lis-tws-trend collection and items have been published here. The lis-tws-anomaly collection and a subset of items (Dec 2002) have been published here.

aboydnw commented 2 years ago

Waiting on the rest of the data to be processed for the anomaly dataset

abarciauskas-bgse commented 2 years ago

As discussed in sprint planning today, @slesaad will check that everything that can be published has been published, and we will create a new ticket for issues with other LIS COGs (the empty data mentioned by @anayeaye).

slesaad commented 2 years ago

The datasets are published to the STAC API. The remaining task is specified in #192, so this is being closed!