NASA-IMPACT / veda-data


Convert and publish GPM IMERG dataset to COG #73

Closed abarciauskas-bgse closed 1 year ago

abarciauskas-bgse commented 2 years ago

Epic

None, but to support the ArcGIS Enterprise in the Cloud Effort

Description

Convert the half-hour product to COG for use by ADC initiative

Background

Brian Tisdale who is leading the ArcGIS Enterprise in the Cloud effort reached out on slack:

The newly formed ArcGIS Enterprise in the Cloud team is starting to get their footing and ready to dive into the details of how the GIS component of VEDA will be integrated. I know we have the larger stakeholder sync next week but hoping we can coordinate on a few questions prior. As GPM is a priority dataset for both VEDA and ArcGIS Enterprise, we'd like to propose focusing on it for initial prototyping to inform the cross-team decisions that will need to be made. Provided below is a link to our GPM ArcGIS Image Service API. It's currently hosted at Goddard but will be migrated to the Earthdata Cloud as part of the ArcGIS Enterprise in the Cloud activity. Most of our initial questions are based on how the COG generation is going to occur. Do you know if VEDA or EIS has started COG generation for GPM? GPM ArcGIS Image Service API: https://arcgis.gesdisc.eosdis.nasa.gov/authoritative/rest/services/GPM_3IMERGHHE_06/ImageServer

I sent Brian an email message: If I understand correctly, to support the ADC (or is it a different acronym now "ArcGIS Enterprise in the Cloud"?) we want to:

  1. Ingest and publish GPM 3IMERG HHE 06 data into the VEDA metadata API (STAC)
  2. We want to create Cloud-Optimized GeoTIFFs to support services described below
  3. Publish services for visualization: I see there is a WMS service - would WMTS be OK to support ArcGIS in the Cloud, or must it be WMS?
  4. Publish services for access: We will need WCS support for ArcGIS Enterprise in the Cloud

GPM IMERG is a high value first example of executing the above steps but there will be many other datasets to follow a similar model to the above.
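Step 1 above boils down to writing one STAC item per granule. A minimal sketch of such an item as a plain dict, using a granule name from this thread; the collection id, asset key, and COG href are illustrative guesses, not necessarily what VEDA actually uses:

```python
from datetime import datetime, timezone

def imerg_stac_item(granule_id: str, cog_href: str, dt: datetime) -> dict:
    """Build a minimal STAC 1.0 item for a global half-hourly IMERG COG.

    The collection id and asset key here are illustrative, not the
    real VEDA values. IMERG is a global product, hence the world bbox.
    """
    return {
        "type": "Feature",
        "stac_version": "1.0.0",
        "id": granule_id,
        "collection": "GPM_3IMERGHHE.06",  # collection name as used in this thread
        "geometry": {
            "type": "Polygon",
            "coordinates": [
                [[-180, -90], [180, -90], [180, 90], [-180, 90], [-180, -90]]
            ],
        },
        "bbox": [-180, -90, 180, 90],
        "properties": {"datetime": dt.replace(tzinfo=timezone.utc).isoformat()},
        "assets": {
            "cog_default": {
                "href": cog_href,
                "type": "image/tiff; application=geotiff; profile=cloud-optimized",
            }
        },
        "links": [],
    }

item = imerg_stac_item(
    "3B-HHR-E.MS.MRG.3IMERG.20220101-S000000-E002959.0000.V06B",
    "s3://veda-data-store-staging/GPM_3IMERGHHE.06/example.tif",
    datetime(2022, 1, 1, 0, 0),
)
print(item["id"], item["properties"]["datetime"])
```

In practice the item would be built and validated with pystac rather than by hand, but the dict shows the minimum each granule needs: a stable id, a datetime, and an asset pointing at the COG.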

Acceptance Criteria:

sharkinsspatial commented 2 years ago

@abarciauskas-bgse Out of curiosity what are the plans for COG layout for the IMERG variables? Will you create multi-band COGs with variables or a host of single band COGs with variable naming conventions? If there is a consideration for generating COGs for large numbers of netCDF files it might be worthwhile to consult with the user community as we’ll be diverging from the commonly accepted CF Conventions https://cfconventions.org/ which most scientific producers and consumers try to adhere to. For a reference example of working with the IMERG data here is the recipe we developed for pangeo-forge https://github.com/pangeo-forge/gpm-imerge-hhr-feedstock/blob/main/feedstock/recipe.py
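To make the layout question concrete, here is a toy numpy sketch of the two options. The variable names are real IMERG variables, but the array sizes and the single-band filename convention are made up for illustration:

```python
import numpy as np

# Tiny stand-ins for three IMERG variables on the same grid
variables = {
    "precipitationCal": np.random.rand(4, 8).astype("float32"),
    "precipitationUncal": np.random.rand(4, 8).astype("float32"),
    "randomError": np.random.rand(4, 8).astype("float32"),
}

# Option A: one multi-band COG -- stack the variables along a band axis,
# with band-to-variable mapping carried in metadata
multiband = np.stack(list(variables.values()), axis=0)  # (bands, rows, cols)

# Option B: one single-band COG per variable, with the variable name
# encoded in the filename (convention here is hypothetical)
singleband = {
    f"3B-HHR-E.MS.MRG.3IMERG.20220101-S000000-E002959.0000.V06B_{name}.tif": arr
    for name, arr in variables.items()
}

print(multiband.shape, sorted(singleband)[0])
```

Option A keeps one file per granule but needs band metadata to stay meaningful; Option B triples the object count but lets consumers fetch exactly one variable, which is the trade-off the CF-conventions question is really about.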

Another consideration is the update strategy. We are still considering our incremental append strategy for pangeo-forge but we should have something well defined in the next few sprints. But this is a question that has been brought up recently in relation to the IMERG data https://github.com/pangeo-forge/gpm-imerge-hhr-feedstock/issues/2

abarciauskas-bgse commented 2 years ago

@sharkinsspatial these are all good questions.

In general, I want to centralize questions and answers about generating cloud-optimized (analysis-ready?) data. So far @wildintellect has helped start these documents:

I would be interested to know what you @sharkinsspatial think about the layout and content so far in those documents. I know there are a lot of resources on COG and Zarr out there, but I think the intention with these documents is to be able to point our stakeholders somewhere when they are looking for guidance in creating COGs or Zarr.

ingalls commented 2 years ago

@abarciauskas-bgse Current codebase is here as we sketch this out: https://github.com/developmentseed/raster-uploader/

Current API Location: raster-uploader-prod-1759918000.us-east-1.elb.amazonaws.com Username: default Password: [DM Me]

(Two screenshots of the raster-uploader UI, from 2022-06-14.)

abarciauskas-bgse commented 2 years ago

@ingalls got this working today (I believe, still looking at the result and making sure it looks correct)

https://github.com/NASA-IMPACT/cloud-optimized-data-pipelines/tree/ab/updates-for-imerg/docker/hdf5-to-cog#gpm-imerg-example

so will generate a few samples tomorrow to send to the ADC team

abarciauskas-bgse commented 2 years ago

@ingalls can you share the IMERG COG output you generated with raster-uploader, along with what was the source NetCDF and the config you used to generate it? I want to compare it with the one I produced and previously shared with the ADC team.

ingalls commented 2 years ago

@abarciauskas-bgse The general directory can be found here:

aws s3 ls s3://raster-uploader-prod-853558080719-us-east-1/uploads/67/

The input file exists here:

aws s3 ls s3://raster-uploader-prod-853558080719-us-east-1/uploads/67/imerg_test.nc

And the precipitationCal output exists here:

aws s3 ls s3://raster-uploader-prod-853558080719-us-east-1/uploads/67/step/77/final.tif

I just grabbed a random IMERG dataset to use for testing. Would be happy to get some time on the calendar and run through your process vs mine with the same input file. Alternatively, happy to do it async if you can provide an input file that you used, to make sure we have parity.
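For the async parity check, once both COGs are read into arrays (e.g. with rasterio, not shown here), the comparison itself is simple. A sketch with synthetic stand-in arrays:

```python
import numpy as np

def report_parity(a: np.ndarray, b: np.ndarray, atol: float = 1e-6) -> dict:
    """Compare two raster arrays, e.g. ones read from the two COGs.

    Returns a small report rather than a bare bool so shape mismatches
    (a likely symptom of a flipped/transposed conversion) are visible.
    """
    if a.shape != b.shape:
        return {"match": False, "reason": f"shape {a.shape} vs {b.shape}"}
    diff = np.abs(a - b)
    return {
        "match": bool(np.allclose(a, b, atol=atol, equal_nan=True)),
        "max_abs_diff": float(np.nanmax(diff)),
    }

# Synthetic stand-ins for the two conversion outputs
a = np.random.rand(10, 20).astype("float32")
print(report_parity(a, a.copy()))  # identical -> match
print(report_parity(a, a.T))       # transposed -> shape mismatch
```

Comparing nodata masks and the geotransforms of the two files would be the other half of a real parity check, but that requires the rasterio read step omitted above.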

abarciauskas-bgse commented 2 years ago

I'm probably going to try this myself, but did you generate this before or after you added the flipping option? When I compare it with the sample I created, it looks like one is flipped and one is not, though that could depend on the source.

Comparing the one I generated:

(Screenshot, 2022-07-28 4:36 PM: preview of the COG I generated.)

https://ejd872yh78.execute-api.us-east-1.amazonaws.com/cog/preview?url=s3%3A%2F%2Fveda-data-store-staging%2FGPM_3IMERGHHE.06%2F3B-HHR-E.MS.MRG.3IMERG.20220101-S000000-E002959.0000.V06B.HDF5.tif&unscale=false&resampling=nearest&rescale=0%2C10&colormap_name=blues_r&return_mask=true

with the one linked above: (locally using rio viz)

(Screenshot, 2022-07-28 4:35 PM: the other COG, viewed locally with rio viz.)

For reference, I think the file you generated was https://gpm1.gesdisc.eosdis.nasa.gov/data/GPM_L3/GPM_3IMERGHHE.06/2022/167/3B-HHR-E.MS.MRG.3IMERG.20220616-S000000-E002959.0000.V06C.HDF5, judging by running gdalinfo on the netCDF file.
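For context on the flipping question: IMERG HDF5 grids are stored as (lon, lat) with latitude ascending (south to north), while a north-up raster expects (row, col) with the first row at the northernmost latitude. A numpy sketch of that reorientation; whether a given COG needs it depends on how the converter read the source array, so this is illustrative, not a claim about either file above:

```python
import numpy as np

def imerg_to_north_up(grid: np.ndarray) -> np.ndarray:
    """Reorient an IMERG (lon, lat) grid, latitude ascending, into a
    north-up (row, col) raster: transpose, then flip rows so the first
    row holds the northernmost latitude."""
    return np.flipud(grid.T)

# Toy grid: 4 longitudes x 3 latitudes, value = latitude index
# (0 = southernmost, 2 = northernmost)
lon_lat = np.tile(np.arange(3), (4, 1))   # shape (lon=4, lat=3)
north_up = imerg_to_north_up(lon_lat)     # shape (lat=3, lon=4)
print(north_up[0])  # → [2 2 2 2]  (top row = northernmost latitude)
```

If one conversion applied this and the other did not, the two previews would look mirrored about the equator, which matches the symptom described above.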

abarciauskas-bgse commented 2 years ago

Just also noting some of the conversation from email and slack:

wildintellect commented 2 years ago

There's another HTTPS way to access IMERG that does not use EDL, which we used in the Pangeo-Forge recipe (@sharkinsspatial and I wrote it). Also, the naming pattern is very well known - there's no need to discover it once you know the date range and product you want. https://github.com/pangeo-forge/staged-recipes/blob/b3f80f1e23ff9df1a1cf9622a7d7fa9107305754/recipes/gpm-imerg/recipe.py#L11-L26

I believe this access method might allow for fsspec (or s3fs) access to the files without pre-download.

cc: @abarciauskas-bgse @ingalls @sharkinsspatial
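Since the naming pattern is deterministic, the half-hourly early-run granule names for a day can be generated directly rather than crawled. A sketch inferred from the granule names quoted earlier in this thread; the exact template and version string (V06B here) should be verified against the archive:

```python
from datetime import datetime, timedelta

def imerg_hhr_names(day: datetime, version: str = "V06B") -> list:
    """Generate the 48 half-hourly IMERG early-run granule names for one day.

    Template inferred from granule names in this thread: the S/E fields
    are the slot's start and end time, and the 4-digit field before the
    version is the slot's start minute-of-day.
    """
    names = []
    for slot in range(48):
        start = day + timedelta(minutes=30 * slot)
        end = start + timedelta(minutes=29, seconds=59)
        names.append(
            "3B-HHR-E.MS.MRG.3IMERG."
            f"{day:%Y%m%d}-S{start:%H%M%S}-E{end:%H%M%S}."
            f"{30 * slot:04d}.{version}.HDF5"
        )
    return names

names = imerg_hhr_names(datetime(2022, 1, 1))
print(names[0])
# → 3B-HHR-E.MS.MRG.3IMERG.20220101-S000000-E002959.0000.V06B.HDF5
```

Generated names like these could then be prefixed with the arthurhou HTTPS base path (or handed to fsspec) without any listing step.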

wildintellect commented 2 years ago

Here's the bulk access instructions https://gpm.nasa.gov/sites/default/files/2021-01/arthurhouhttps_retrieval.pdf

abarciauskas-bgse commented 2 years ago

I picked this up again and started deploying and testing it, and everything is going smoothly - kudos @slesaad @xhagrg for the veda-data-pipelines refactor. Work is in https://github.com/NASA-IMPACT/veda-data-pipelines/tree/ab/deploy-for-imerg

Work to go:

smohiudd commented 1 year ago

I uploaded around 50 COG samples to s3://climatedashboard-data/GPM_3IMERGHHE/

@abarciauskas-bgse can we send this to Owen and George for review?

abarciauskas-bgse commented 1 year ago

Thanks @smohiudd - sorry if this wasn't clear, but we should put them in s3://veda-data-store-staging before sending them over, so the ADC team can confirm they can access the files from an "official" staging bucket (eventually this data should live in s3://veda-data-store).

j08lue commented 1 year ago

The GPM IMERG data is also available as Zarr - it does not help us for viz, but it is relevant to include in our catalog anyway.

j08lue commented 1 year ago

Stale