Transform annual tmaxXF to COG and publish STAC metadata for a single NEX-GDDP model

anayeaye commented 8 months ago

What

The annual number of days with a maximum temperature greater than 90F has been selected as the pilot Climdex NEX-GDDP dataset for VEDA. This metric is one of the 5 thresholds included in the tmaxXF netCDFs. We have a version of this index for each of the 35 NEX-GDDP CMIP6 models, with multiple SSPs each. This pilot is to transform and ingest tmaxXF for a single model, not all 35 yet.

Details

Raw data in protected veda-uah bucket: s3://cmip6-staging/climdex/tmaxXF/ACCESS-CM2/*.nc
Destination pattern: ~s3://veda-data-store-staging/climdex/tmaxXF/ACCESS-CM2/*.tif (if we do ingest all 35 models we will want this key structure to compare model usage and for browsability)~ EDIT see update
Pilot model: ACCESS-CM2
Pilot scenarios (based on shared socioeconomic pathways): SSP126, SSP245, SSP370, SSP585, (& historical)
Climdex.org

Transformation notes

explode the netCDF variables to a single COG per threshold. The 90F threshold is the priority for this Climdex; start with that and add the others, time permitting.
these data have already been aligned to -180 to 180
pixel reference has been corrected to pixel as area (the expected upper left reference cell description for a tif versus referring to the center of the grid cell)
these data do need to be flipped when transformed to COG, though
choose rasterio's COG deflate profile and use a predictor (check that output file size is properly compressed)
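
To illustrate the pixel-as-area and flip notes above, here is a minimal sketch that derives a north-up, GDAL-style geotransform from cell-centre coordinates. The function name, the assumption that the coordinates are cell centres on a regular grid, and the toy 1-degree grid are all illustrative and are not taken from the Climdex files.

```python
def north_up_geotransform(lons, lats):
    """Return (lats ordered north to south, GDAL-style geotransform).

    Assumes `lons` and `lats` are cell-centre coordinates on a regular
    grid, with `lons` ascending. This is an assumption for illustration.
    """
    dx = (lons[-1] - lons[0]) / (len(lons) - 1)
    dy = abs(lats[-1] - lats[0]) / (len(lats) - 1)
    # A GeoTIFF expects the first row to be the northernmost one.
    flipped_lats = sorted(lats, reverse=True)
    # Pixel-as-area: the origin is the outer corner of the upper-left cell,
    # half a cell away from that cell's centre.
    west_edge = lons[0] - dx / 2
    north_edge = flipped_lats[0] + dy / 2
    return flipped_lats, (west_edge, dx, 0.0, north_edge, 0.0, -dy)


# Toy grid: three 1-degree cells in each direction, latitudes stored south to north.
lons = [-179.5, -178.5, -177.5]
lats = [-0.5, 0.5, 1.5]
flipped, transform = north_up_geotransform(lons, lats)
print(flipped)
print(transform)
```

The data rows have to be reversed together with the latitude coordinate; flipping only one of the two would mirror the image north to south.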

STAC notes

Refer to the CMIP6 STAC extension and implement it where straightforward. This does not need to be perfect or complete; we can learn from our choices before ingesting more Climdex and models.
Titles, variable names, and units are all described in the NetCDF
Consider 4 collections for this pilot. While this flat structure will not scale for NEX-GDDP, it could make it easier for us to fast-track Climdex for the dashboard and provide an experience similar to the old CMIP6 dashboard
Proposed 4 Collections
- tmaxxf-access-cm2-ssp126
- tmaxxf-access-cm2-ssp245
- tmaxxf-access-cm2-ssp370
- tmaxxf-access-cm2-ssp585
Proposed items will have one asset per each of the 5 thresholds
Consider padding each collection with duplicate item records for the 65 years of historical data for the model (this would make a time series of 1950 to 2100 possible)
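
As a rough sketch of the padding idea, the snippet below builds the year sequence one padded collection would cover. It assumes the 65 historical years are 1950 through 2014 and that the SSP years run from 2015 through 2100, which is consistent with the 1950 to 2100 range mentioned above but is not confirmed against the files. The function name is hypothetical.

```python
# Assumption: 65 historical years (1950-2014) followed by SSP years (2015-2100).
HISTORICAL_YEARS = range(1950, 2015)
SSP_YEARS = range(2015, 2101)


def padded_timeline(ssp):
    """Return (year, source) pairs for one SSP collection padded with historical years."""
    historical = [(year, "historical") for year in HISTORICAL_YEARS]
    projected = [(year, ssp) for year in SSP_YEARS]
    return historical + projected


timeline = padded_timeline("ssp126")
print(len(timeline), timeline[0], timeline[-1])
```

Because the same historical records would be duplicated into each of the four collections, every collection could be charted as one continuous series.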

AC

[x] netCDFs 'exploded' to single band yearly COGs (1 per threshold EDIT: starting with 90F threshold only)
[x] stac metadata generated and ingested
[ ] collection definition(s) stored in veda-data
[ ] transformation code and metadata generation stored in veda-data or stactools (a branch on either project with a notebook is fine) ~- [ ] BONUS if this is cake, consider repeating for one more model for the pilot~
[x] Coordinate veda-config (usually data services end with STAC metadata but for this rush delivery we should ensure that the data make it to the dashboard so update the mdx or make sure there is clear information for a hand off)

anayeaye commented 8 months ago

UPDATE: Streamlined Plan

4 collections

One collection for each SSP

climdex-tmaxxf-access-cm2-ssp126
climdex-tmaxxf-access-cm2-ssp245
climdex-tmaxxf-access-cm2-ssp370
climdex-tmaxxf-access-cm2-ssp585
historical was not requested so we will not use it for stage 1

Items within these collections:

each item will have a single asset named tmax_above_90 (later we may add other thresholds as assets but not for Nov 17)
example asset href: s3://veda-data-store-staging/climdex-tmaxxf-access-cm2-ssp126/tmaxXF-ACCESS-CM2-ssp126_tmax_above_90_<year>.tif (_compressed.nc is replaced with _tmax_above_90_<year>.tif)
86 items will be created per collection, one for each year in the SSP
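
The plan above can be enumerated with a short sketch like the following. It assumes the 86 yearly items span 2015 through 2100; the helper names are hypothetical, and the href follows the example pattern given in this comment.

```python
BUCKET = "veda-data-store-staging"
SSPS = ("ssp126", "ssp245", "ssp370", "ssp585")
# Assumption: the 86 yearly items span 2015 through 2100.
YEARS = range(2015, 2101)


def collection_id(ssp):
    """Collection id for one SSP of the pilot model."""
    return f"climdex-tmaxxf-access-cm2-{ssp}"


def asset_href(ssp, year):
    """Href of the single tmax_above_90 asset for one year."""
    filename = f"tmaxXF-ACCESS-CM2-{ssp}_tmax_above_90_{year}.tif"
    return f"s3://{BUCKET}/{collection_id(ssp)}/{filename}"


planned = [(collection_id(ssp), asset_href(ssp, year)) for ssp in SSPS for year in YEARS]
print(len(planned))
print(planned[0])
```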

Ingest plan

Publish transformed COGs to `veda-data-store-staging//
Publish the 4 collections
If we set things up this way we should be able to use airflow pipelines to generate items and insert metadata. Confirm that we can use start/end datetime as expected, along with any other common properties we need
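
For reference, a yearly item with start/end datetimes could look roughly like the sketch below. This is a hand-built dictionary, not the output of the airflow pipeline; the item id, the global bbox, and the function name are placeholders chosen for illustration. The STAC item spec requires `start_datetime` and `end_datetime` when `datetime` is null.

```python
import json
from datetime import datetime, timezone


def yearly_item(item_id, collection, year, href, bbox):
    """Build a minimal STAC item dict covering one calendar year."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    start = datetime(year, 1, 1, tzinfo=timezone.utc)
    end = datetime(year, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
    west, south, east, north = bbox
    ring = [[west, south], [east, south], [east, north], [west, north], [west, south]]
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": item_id,
        "collection": collection,
        "bbox": list(bbox),
        "geometry": {"type": "Polygon", "coordinates": [ring]},
        "properties": {
            "datetime": None,
            "start_datetime": start.strftime(fmt),
            "end_datetime": end.strftime(fmt),
        },
        "assets": {
            "tmax_above_90": {
                "href": href,
                "type": "image/tiff; application=geotiff; profile=cloud-optimized",
                "roles": ["data"],
            }
        },
        "links": [],
    }


item = yearly_item(
    item_id="example-item-2015",  # placeholder id
    collection="climdex-tmaxxf-access-cm2-ssp126",
    year=2015,
    href="s3://veda-data-store-staging/climdex-tmaxxf-access-cm2-ssp126/tmaxXF-ACCESS-CM2-ssp126_tmax_above_90_2015.tif",
    bbox=(-180.0, -90.0, 180.0, 90.0),  # placeholder extent, not read from the data
)
print(json.dumps(item["properties"], indent=2))
```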

anayeaye commented 7 months ago

@SwordSaintLancelot I had a look at the first outputs in s3://climatedashboard-data/climdex/tmaxXF/ACCESS-CM2/ and they look good. I have a couple requests for the files before we publish the objects in veda-data-store-staging

Suggested changes

Use DEFLATE instead of LZW compression (as in: da.rio.to_raster("<outname>.tif", driver="COG", compress=compress))
Filename adjustment: instead of tmaxXF-ACCESS-CM2-ssp126_tmax_above_90_2015.tif, use tmaxXF-ACCESS-CM2-ssp126_2015_tmax_above_90.tif. That is, put the year before the netCDF variable name: <netcdf-basename>_<YYYY>_<VARIABLE_NAME>.tif. _I think this will make it easier to generate multi-asset STAC items: for the 86 years in the source file with basename tmaxXF-ACCESS-CM2-ssp126_compressed.nc we will want to generate STAC items with ids 'tmaxXF-ACCESS-CM2-ssp126_<YYYY>_

Object publication

After those adjustments I think we are good to publish the objects for the 4 collections to veda-data-store-staging as s3://veda-data-store-staging/<collection-id>/<filename.tif>. For this pilot work I think we should just use a simple collection-id/files path instead of copying the complex storage structure that was in the original request (for the sake of making airflow ingests easy--does that sound right @ividito?). As in:

s3://veda-data-store-staging/climdex-tmaxxf-access-cm2-ssp126/
     tmaxXF-ACCESS-CM2-ssp126_2015_tmax_above_90.tif
     tmaxXF-ACCESS-CM2-ssp126_2016_tmax_above_90.tif
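
A small helper along these lines could derive the destination href from the source netCDF name, using the <netcdf-basename>_<YYYY>_<VARIABLE_NAME>.tif pattern suggested above. The function name is hypothetical, and deriving the collection id by lowercasing the basename and prefixing "climdex-" is an assumption that happens to match the four pilot collection ids.

```python
import posixpath

BUCKET = "veda-data-store-staging"
SOURCE_SUFFIX = "_compressed.nc"


def cog_href(source_url, year, variable):
    """Map a source netCDF URL to the href of one yearly COG."""
    filename = posixpath.basename(source_url)
    if not filename.endswith(SOURCE_SUFFIX):
        raise ValueError(f"unexpected source filename: {filename}")
    basename = filename[: -len(SOURCE_SUFFIX)]
    # Assumption: collection id is "climdex-" plus the lowercased basename.
    collection_id = f"climdex-{basename.lower()}"
    return f"s3://{BUCKET}/{collection_id}/{basename}_{year}_{variable}.tif"


href = cog_href(
    "s3://cmip6-staging/climdex/tmaxXF/ACCESS-CM2/tmaxXF-ACCESS-CM2-ssp126_compressed.nc",
    2015,
    "tmax_above_90",
)
print(href)
```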

Sample nc2cog transformation code

import rioxarray  # registers the .rio accessor on xarray objects
import s3fs
import xarray as xr

# Open NetCDF with s3fs and read to xarray using h5netcdf engine
fs = s3fs.S3FileSystem()

VARIABLE_NAME = "tmax_above_90"
aws_url = "s3://cmip6-staging/climdex/tmaxXF/ACCESS-CM2/tmaxXF-ACCESS-CM2-ssp126_compressed.nc"

fileObj = fs.open(aws_url)
ds = xr.open_dataset(fileObj, engine="h5netcdf")
da = ds[VARIABLE_NAME].isel(time=0)

# Add crs and set spatial dims if needed
if not da.rio.crs:
    da.rio.write_crs("epsg:4326", inplace=True)

# Flip latitude (reverse the row order), then set spatial dimensions on the flipped array
da = da.reindex(lat=da.lat.values[::-1])
da.rio.set_spatial_dims("lon", "lat", inplace=True)

# Cloud optimize and generate raster
driver = "COG"
compress = "DEFLATE"
da.rio.to_raster("test_compressed.tif", driver=driver, compress=compress)
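
The sample above does not set a predictor, which the transformation notes ask for. A hedged sketch of how that could be added: the GDAL COG driver accepts a PREDICTOR creation option, and "YES" lets GDAL pick the standard or floating-point predictor from the data type. The helper names are hypothetical, and `write_cog` expects a DataArray prepared as in the sample (with rioxarray imported).

```python
import os


def cog_write_kwargs(compress="DEFLATE", predictor="YES"):
    """Keyword arguments to pass through da.rio.to_raster as creation options."""
    return {"driver": "COG", "compress": compress, "predictor": predictor}


def write_cog(da, path):
    """Write `da` (an xarray.DataArray with the .rio accessor) as a compressed COG."""
    da.rio.to_raster(path, **cog_write_kwargs())


def compression_ratio(path, da):
    """File size on disk divided by the in-memory size of the array."""
    return os.path.getsize(path) / da.nbytes


print(cog_write_kwargs())
```

A ratio well below 1.0 from `compression_ratio` would suggest the output is actually compressed, which is the file-size check the notes ask for.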

slesaad commented 7 months ago

The four collections have been published to the staging STAC catalog.

Each item has 5 assets: above 86, above 90, above 100, above 110, and above 115.

tmax_above_86
tmax_above_90
tmax_above_100
tmax_above_110
tmax_above_115
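
For illustration, the assets of one item could be assembled as in the sketch below, using the asset names listed above. The hrefs assume the <netcdf-basename>_<YYYY>_<VARIABLE_NAME>.tif pattern suggested earlier in this thread; the filenames actually published were not confirmed here, and the function names are hypothetical.

```python
THRESHOLDS_F = (86, 90, 100, 110, 115)
COG_MEDIA_TYPE = "image/tiff; application=geotiff; profile=cloud-optimized"


def asset_key(threshold):
    """Asset name for one temperature threshold in degrees F."""
    return f"tmax_above_{threshold}"


def item_assets(prefix, basename, year):
    """Build the assets mapping for one yearly item (hrefs are assumed, see above)."""
    return {
        asset_key(t): {
            "href": f"{prefix}/{basename}_{year}_{asset_key(t)}.tif",
            "type": COG_MEDIA_TYPE,
            "roles": ["data"],
        }
        for t in THRESHOLDS_F
    }


assets = item_assets(
    "s3://veda-data-store-staging/climdex-tmaxxf-access-cm2-ssp126",
    "tmaxXF-ACCESS-CM2-ssp126",
    2015,
)
print(list(assets))
```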

anayeaye commented 7 months ago

Config notes (wip)

NASA Earth Exchange (NEX) Global Daily Downscaled Projections (GDDP) background and DOI on cmip6 dashboard
NEX-GDDP Tech Note
NASA Global Daily Downscaled Projections, CMIP6 in Nature
Climdex homepage

anayeaye commented 7 months ago

~https://github.com/NASA-IMPACT/veda-config-eic/pull/21~ https://github.com/NASA-IMPACT/veda-config-eic/pull/32

j08lue commented 7 months ago

This is complete, right? 🎉

slesaad commented 6 months ago

PR for the collection configs - https://github.com/NASA-IMPACT/veda-data/pull/97 Should now be complete!
