Closed: @abarciauskas-bgse closed this issue 1 year ago.
@abarciauskas-bgse Out of curiosity, what are the plans for the COG layout of the IMERG variables? Will you create multi-band COGs containing the variables, or a host of single-band COGs with a variable naming convention? If there is a consideration for generating COGs for large numbers of NetCDF files, it might be worthwhile to consult with the user community, since we'll be diverging from the commonly accepted CF Conventions (https://cfconventions.org/), which most scientific producers and consumers try to adhere to. For a reference example of working with the IMERG data, here is the recipe we developed for pangeo-forge:
https://github.com/pangeo-forge/gpm-imerge-hhr-feedstock/blob/main/feedstock/recipe.py
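If the one-COG-per-variable layout wins out, the variable name has to land in the output filename somewhere. A minimal sketch of one possible convention (the helper and the pattern are assumptions for illustration, not anything agreed in this thread):

```python
# Hypothetical naming helper for a single-band-COG-per-variable layout.
# The "<granule stem>.<variable>.tif" pattern is an assumed convention.
def cog_name(granule: str, variable: str) -> str:
    """Derive a single-band COG filename that embeds the variable name."""
    stem = granule.rsplit("/", 1)[-1]            # drop any path prefix
    for ext in (".HDF5", ".nc4", ".nc"):         # common IMERG extensions
        if stem.endswith(ext):
            stem = stem[: -len(ext)]
            break
    return f"{stem}.{variable}.tif"

print(cog_name(
    "3B-HHR-E.MS.MRG.3IMERG.20220616-S000000-E002959.0000.V06C.HDF5",
    "precipitationCal",
))
```

Keeping the full granule stem means the timestamp and version survive into the COG name, so files from different half-hours never collide in a flat prefix.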
Another consideration is the update strategy. We are still working out our incremental-append strategy for pangeo-forge, but we should have something well defined in the next few sprints. This question has also come up recently in relation to the IMERG data: https://github.com/pangeo-forge/gpm-imerge-hhr-feedstock/issues/2
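For illustration, an incremental append typically starts by planning which timesteps are missing from the store. A sketch of just that planning step, assuming a fixed half-hourly cadence (the helper is hypothetical, not the pangeo-forge implementation):

```python
from datetime import datetime, timedelta

# Hypothetical incremental-append planner: given the last timestep already
# written, list the half-hourly IMERG timesteps that still need appending.
def pending_steps(last_written: datetime, now: datetime,
                  step: timedelta = timedelta(minutes=30)) -> list:
    steps = []
    t = last_written + step
    while t <= now:
        steps.append(t)
        t += step
    return steps

todo = pending_steps(datetime(2022, 6, 16, 0, 0), datetime(2022, 6, 16, 2, 0))
print(len(todo))  # 4 half-hour steps: 00:30, 01:00, 01:30, 02:00
```

Each pending timestep would then map to one granule to fetch and one append to the store; making the planner idempotent keeps a failed run safe to retry.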
@sharkinsspatial these are all good questions.
For IMERG, I think @ingalls is starting by creating an API and UI so it is easy to modify the configuration for how variables are selected and named. @ingalls, have you considered how to specify things like which variables correspond to which bands, whether output goes to one file or many, and the option for variable-based file naming? I'm assuming that if one wishes to store a different variable in each output COG, the generation would be configured to name each output file with a substring that includes the band/variable name.
I need to read up on CF Conventions, so I will have to get back to you on the question of how we can adhere to them for IMERG and future collections.
In general, I want to centralize questions and answers about generating cloud-optimized (analysis-ready?) data. So far @wildintellect has helped start these documents:
I would be interested to know what you think, @sharkinsspatial, about the layout and content so far in those documents. I know there are a lot of resources on COG and Zarr out there, but the intention with these documents is to have somewhere to point our stakeholders when they are looking for guidance in creating COGs or Zarr.
@abarciauskas-bgse Current codebase is here as we sketch this out: https://github.com/developmentseed/raster-uploader/
Current API Location: raster-uploader-prod-1759918000.us-east-1.elb.amazonaws.com
Username: default
Password: [DM Me]
@ingalls got this working today (I believe; still looking at the result and making sure it looks correct), so will generate a few samples tomorrow to send to the ADC team.
@ingalls can you share the IMERG COG output you generated with raster-uploader, along with the source NetCDF and the config you used to generate it? I want to compare it with the one I produced and previously shared with the ADC team.
@abarciauskas-bgse The general directory can be found here:
aws s3 ls s3://raster-uploader-prod-853558080719-us-east-1/uploads/67/
The input file exists here:
aws s3 ls s3://raster-uploader-prod-853558080719-us-east-1/uploads/67/imerg_test.nc
And the precipitationCal output exists here:
aws s3 ls s3://raster-uploader-prod-853558080719-us-east-1/uploads/67/step/77/final.tif
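As a quick sanity check on outputs like final.tif (a sketch only; real COG validation needs something like rio-cogeo's cog_validate), the first four bytes at least tell you whether GDAL wrote a TIFF at all or the conversion silently produced something else:

```python
# Minimal TIFF-header check (not a full COG validation): verify the file
# starts with one of the TIFF/BigTIFF magic sequences before handing the
# output off for review.
def looks_like_tiff(first_bytes: bytes) -> bool:
    return first_bytes[:4] in (
        b"II*\x00",   # little-endian classic TIFF (GDAL's default)
        b"MM\x00*",   # big-endian classic TIFF
        b"II+\x00",   # little-endian BigTIFF
    )

print(looks_like_tiff(b"II*\x00" + b"\x00" * 4))  # True
```

You could feed this the first bytes of the S3 object (e.g. via a ranged GET) without downloading the whole file.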
I just grabbed a random IMERG dataset to use for testing. Would be happy to get some time on the calendar and run through your process vs. mine with the same input file. Alternatively, happy to do it async if you can provide the input file you used, so we can make sure we have parity.
I'm probably going to try this myself, but did you generate this before or after you added the flipping option? When I compare it to the sample I created, it looks like one is flipped and one is not, but that could depend on the source.
Comparing the one I generated:
with the one linked above (locally, using rio viz):
For reference, I think the file you generated was from https://gpm1.gesdisc.eosdis.nasa.gov/data/GPM_L3/GPM_3IMERGHHE.06/2022/167/3B-HHR-E.MS.MRG.3IMERG.20220616-S000000-E002959.0000.V06C.HDF5, determined by running gdalinfo on the NetCDF file.
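On the flipping question: IMERG source arrays store latitude ascending (south to north), while GeoTIFF convention is north-up, so whether a flip is needed shows up in the sign of the geotransform's row step. A minimal pure-Python sketch of that check (the function name and list-of-lists representation are mine, not raster-uploader's):

```python
# Sketch: normalize a grid to north-up row order based on the sign of the
# geotransform row step (GDAL geotransform term "e"). A positive row step
# means rows run south-to-north and must be reversed for a north-up GeoTIFF.
def ensure_north_up(rows, row_step):
    """rows: 2-D grid as a list of row lists; row_step: geotransform e term."""
    if row_step > 0:               # south-up: row 0 is the southernmost
        return rows[::-1], -row_step
    return rows, row_step          # already north-up; leave untouched

data = [[1, 2], [3, 4]]            # row 0 = southern edge in source order
flipped, step = ensure_north_up(data, 0.1)
print(flipped, step)  # [[3, 4], [1, 2]] -0.1
```

Running both outputs through a check like this (or simply comparing gdalinfo geotransforms) would tell you which of the two COGs is inverted.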
Just also noting some of the conversation from email and slack:
There's another HTTPS way to access IMERG that does not use EDL, which we used in the Pangeo-Forge recipe (which @sharkinsspatial and I wrote). Also, the naming pattern is very well known; there is no need to discover it once you know the date range and product you want. https://github.com/pangeo-forge/staged-recipes/blob/b3f80f1e23ff9df1a1cf9622a7d7fa9107305754/recipes/gpm-imerg/recipe.py#L11-L26
I believe this access method might allow fsspec (or s3fs) access to the files without pre-downloading them.
cc: @abarciauskas-bgse @ingalls @sharkinsspatial
Here are the bulk access instructions: https://gpm.nasa.gov/sites/default/files/2021-01/arthurhouhttps_retrieval.pdf
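Since the naming pattern is deterministic, granule URLs can be generated rather than discovered. A sketch that reconstructs the GES DISC path from a timestamp (the base prefix mirrors the example URL earlier in the thread, but treat the exact template as an assumption):

```python
from datetime import datetime, timedelta

# Assumed GES DISC layout for the half-hourly early-run product, modeled
# on the example granule URL referenced in this thread.
BASE = "https://gpm1.gesdisc.eosdis.nasa.gov/data/GPM_L3/GPM_3IMERGHHE.06"

def granule_url(start: datetime) -> str:
    """Build the V06C half-hourly granule URL for a given start time."""
    end = start + timedelta(minutes=29, seconds=59)
    minutes = start.hour * 60 + start.minute      # minutes since midnight
    doy = start.timetuple().tm_yday               # day of year for the path
    name = (f"3B-HHR-E.MS.MRG.3IMERG.{start:%Y%m%d}"
            f"-S{start:%H%M%S}-E{end:%H%M%S}.{minutes:04d}.V06C.HDF5")
    return f"{BASE}/{start:%Y}/{doy:03d}/{name}"

print(granule_url(datetime(2022, 6, 16, 0, 0)))
```

A URL builder like this pairs naturally with fsspec's HTTP filesystem for opening granules without a pre-download step.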
I picked this up again and started deploying and testing it, and everything is going smoothly; kudos to @slesaad and @xhagrg for the veda-data-pipelines refactor. Work is in https://github.com/NASA-IMPACT/veda-data-pipelines/tree/ab/deploy-for-imerg
Work to go:
I uploaded around 50 COG samples to s3://climatedashboard-data/GPM_3IMERGHHE/
@abarciauskas-bgse can we send this to Owen and George for review?
Thanks @smohiudd. Sorry if this wasn't clear, but we should put them in s3://veda-data-store-staging before sending them to Owen and George, so they can confirm they can access the files while they are in an "official" staging bucket (though the data should eventually live in s3://veda-data-store).
The GPM IMERG data is also available as Zarr. That does not help us for visualization, but it is relevant to include in our catalog anyway.
Stale
Epic
None, but to support the ArcGIS Enterprise in the Cloud Effort
Description
Convert the half-hour product to COG for use by the ADC initiative.
Background
Brian Tisdale, who is leading the ArcGIS Enterprise in the Cloud effort, reached out on Slack:
I sent Brian an email message: If I understand correctly, to support the ADC (or is it a different acronym now "ArcGIS Enterprise in the Cloud"?) we want to:
GPM IMERG is a high-value first example of executing the above steps, but many other datasets will follow a similar model.
Acceptance Criteria: