Closed: @abarciauskas-bgse closed this issue 1 year ago.
@abarciauskas-bgse Out of curiosity, what are the plans for the COG layout of the IMERG variables? Will you create multi-band COGs containing the variables, or a host of single-band COGs with a variable naming convention? If there is a consideration for generating COGs for large numbers of NetCDF files, it might be worthwhile to consult with the user community, since we'll be diverging from the commonly accepted CF Conventions (https://cfconventions.org/), which most scientific producers and consumers try to adhere to. For a reference example of working with the IMERG data, here is the recipe we developed for pangeo-forge:
https://github.com/pangeo-forge/gpm-imerge-hhr-feedstock/blob/main/feedstock/recipe.py
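If the one-COG-per-variable layout wins out, the variable name has to land in the output filename somewhere. A minimal sketch of one possible convention (the helper and the pattern are assumptions for illustration, not anything agreed in this thread):

```python
# Hypothetical naming helper for a single-band-COG-per-variable layout.
# The "<granule stem>.<variable>.tif" pattern is an assumed convention.
def cog_name(granule: str, variable: str) -> str:
    """Derive a single-band COG filename that embeds the variable name."""
    stem = granule.rsplit("/", 1)[-1]            # drop any path prefix
    for ext in (".HDF5", ".nc4", ".nc"):         # common IMERG extensions
        if stem.endswith(ext):
            stem = stem[: -len(ext)]
            break
    return f"{stem}.{variable}.tif"

print(cog_name(
    "3B-HHR-E.MS.MRG.3IMERG.20220616-S000000-E002959.0000.V06C.HDF5",
    "precipitationCal",
))
```

Keeping the full granule stem means the timestamp and version survive into the COG name, so files from different half-hours never collide in a flat prefix.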
Another consideration is the update strategy. We are still working out our incremental-append strategy for pangeo-forge, but we should have something well defined in the next few sprints. This question has also come up recently in relation to the IMERG data: https://github.com/pangeo-forge/gpm-imerge-hhr-feedstock/issues/2
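For illustration, an incremental append typically starts by planning which timesteps are missing from the store. A sketch of just that planning step, assuming a fixed half-hourly cadence (the helper is hypothetical, not the pangeo-forge implementation):

```python
from datetime import datetime, timedelta

# Hypothetical incremental-append planner: given the last timestep already
# written, list the half-hourly IMERG timesteps that still need appending.
def pending_steps(last_written: datetime, now: datetime,
                  step: timedelta = timedelta(minutes=30)) -> list:
    steps = []
    t = last_written + step
    while t <= now:
        steps.append(t)
        t += step
    return steps

todo = pending_steps(datetime(2022, 6, 16, 0, 0), datetime(2022, 6, 16, 2, 0))
print(len(todo))  # 4 half-hour steps: 00:30, 01:00, 01:30, 02:00
```

Each pending timestep would then map to one granule to fetch and one append to the store; making the planner idempotent keeps a failed run safe to retry.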
@sharkinsspatial these are all good questions.
For IMERG, I think @ingalls is starting by creating an API and UI so it is easy to modify the configuration for how variables are selected and named. @ingalls, have you considered how to specify things like which variables correspond to which bands, whether output goes to one file or many, and the option for variable-based file naming? I'm assuming that if one wishes to store a different variable in each output COG, the generation would be configured to name each output file with a substring that includes the band/variable name.
I need to read up on CF Conventions, so I will have to get back to you on the question of how we can adhere to them for IMERG and future collections.
In general, I want to centralize questions and answers about generating cloud-optimized (analysis-ready?) data. So far @wildintellect has helped start these documents:
I would be interested to know what you think, @sharkinsspatial, about the layout and content so far in those documents. I know there are a lot of resources on COG and Zarr out there, but the intention with these documents is to have somewhere to point our stakeholders when they are looking for guidance in creating COGs or Zarr.
@abarciauskas-bgse Current codebase is here as we sketch this out: https://github.com/developmentseed/raster-uploader/
Current API Location: raster-uploader-prod-1759918000.us-east-1.elb.amazonaws.com
Username: default
Password: [DM Me]
@ingalls got this working today (I believe; still looking at the result and making sure it looks correct), so will generate a few samples tomorrow to send to the ADC team.
@ingalls can you share the IMERG COG output you generated with raster-uploader, along with the source NetCDF and the config you used to generate it? I want to compare it with the one I produced and previously shared with the ADC team.
@abarciauskas-bgse The general directory can be found here:
aws s3 ls s3://raster-uploader-prod-853558080719-us-east-1/uploads/67/
The input file exists here:
aws s3 ls s3://raster-uploader-prod-853558080719-us-east-1/uploads/67/imerg_test.nc
And the precipitationCal output exists here:
aws s3 ls s3://raster-uploader-prod-853558080719-us-east-1/uploads/67/step/77/final.tif
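As a quick sanity check on outputs like final.tif (a sketch only; real COG validation needs something like rio-cogeo's cog_validate), the first four bytes at least tell you whether GDAL wrote a TIFF at all or the conversion silently produced something else:

```python
# Minimal TIFF-header check (not a full COG validation): verify the file
# starts with one of the TIFF/BigTIFF magic sequences before handing the
# output off for review.
def looks_like_tiff(first_bytes: bytes) -> bool:
    return first_bytes[:4] in (
        b"II*\x00",   # little-endian classic TIFF (GDAL's default)
        b"MM\x00*",   # big-endian classic TIFF
        b"II+\x00",   # little-endian BigTIFF
    )

print(looks_like_tiff(b"II*\x00" + b"\x00" * 4))  # True
```

You could feed this the first bytes of the S3 object (e.g. via a ranged GET) without downloading the whole file.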
I just grabbed a random IMERG dataset to use for testing. Would be happy to get some time on the calendar and run through your process vs. mine with the same input file. Alternatively, happy to do it async if you can provide the input file you used, so we can make sure we have parity.
I'm probably going to try this myself, but did you generate this before or after you added the flipping option? When I compare it to the sample I created, it looks like one is flipped and one is not, but that could depend on the source.
Comparing the one I generated:
with the one linked above (locally, using rio viz):
For reference, I think the file you generated was from https://gpm1.gesdisc.eosdis.nasa.gov/data/GPM_L3/GPM_3IMERGHHE.06/2022/167/3B-HHR-E.MS.MRG.3IMERG.20220616-S000000-E002959.0000.V06C.HDF5, determined by running gdalinfo on the NetCDF file.
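On the flipping question: IMERG source arrays store latitude ascending (south to north), while GeoTIFF convention is north-up, so whether a flip is needed shows up in the sign of the geotransform's row step. A minimal pure-Python sketch of that check (the function name and list-of-lists representation are mine, not raster-uploader's):

```python
# Sketch: normalize a grid to north-up row order based on the sign of the
# geotransform row step (GDAL geotransform term "e"). A positive row step
# means rows run south-to-north and must be reversed for a north-up GeoTIFF.
def ensure_north_up(rows, row_step):
    """rows: 2-D grid as a list of row lists; row_step: geotransform e term."""
    if row_step > 0:               # south-up: row 0 is the southernmost
        return rows[::-1], -row_step
    return rows, row_step          # already north-up; leave untouched

data = [[1, 2], [3, 4]]            # row 0 = southern edge in source order
flipped, step = ensure_north_up(data, 0.1)
print(flipped, step)  # [[3, 4], [1, 2]] -0.1
```

Running both outputs through a check like this (or simply comparing gdalinfo geotransforms) would tell you which of the two COGs is inverted.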
Just also noting some of the conversation from email and slack:
There's another HTTPS way to access IMERG that does not use EDL, which we used in the Pangeo-Forge recipe (which @sharkinsspatial and I wrote). Also, the naming pattern is very well known; there is no need to discover it once you know the date range and product you want. https://github.com/pangeo-forge/staged-recipes/blob/b3f80f1e23ff9df1a1cf9622a7d7fa9107305754/recipes/gpm-imerg/recipe.py#L11-L26
I believe this access method might allow fsspec (or s3fs) access to the files without pre-downloading them.
cc: @abarciauskas-bgse @ingalls @sharkinsspatial
Here are the bulk access instructions: https://gpm.nasa.gov/sites/default/files/2021-01/arthurhouhttps_retrieval.pdf
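Since the naming pattern is deterministic, granule URLs can be generated rather than discovered. A sketch that reconstructs the GES DISC path from a timestamp (the base prefix mirrors the example URL earlier in the thread, but treat the exact template as an assumption):

```python
from datetime import datetime, timedelta

# Assumed GES DISC layout for the half-hourly early-run product, modeled
# on the example granule URL referenced in this thread.
BASE = "https://gpm1.gesdisc.eosdis.nasa.gov/data/GPM_L3/GPM_3IMERGHHE.06"

def granule_url(start: datetime) -> str:
    """Build the V06C half-hourly granule URL for a given start time."""
    end = start + timedelta(minutes=29, seconds=59)
    minutes = start.hour * 60 + start.minute      # minutes since midnight
    doy = start.timetuple().tm_yday               # day of year for the path
    name = (f"3B-HHR-E.MS.MRG.3IMERG.{start:%Y%m%d}"
            f"-S{start:%H%M%S}-E{end:%H%M%S}.{minutes:04d}.V06C.HDF5")
    return f"{BASE}/{start:%Y}/{doy:03d}/{name}"

print(granule_url(datetime(2022, 6, 16, 0, 0)))
```

A URL builder like this pairs naturally with fsspec's HTTP filesystem for opening granules without a pre-download step.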
I picked this up again and started deploying and testing it, and everything is going smoothly; kudos to @slesaad and @xhagrg for the veda-data-pipelines refactor. Work is in https://github.com/NASA-IMPACT/veda-data-pipelines/tree/ab/deploy-for-imerg
Work to go:
I uploaded around 50 COG samples to s3://climatedashboard-data/GPM_3IMERGHHE/
@abarciauskas-bgse can we send this to Owen and George for review?
Thanks @smohiudd. Sorry if this wasn't clear, but we should put them in s3://veda-data-store-staging before sending them to Owen and George, so they can confirm they can access the files while they are in an "official" staging bucket (though the data should eventually live in s3://veda-data-store).
The GPM IMERG data is also available as Zarr. That does not help us for visualization, but it is relevant to include in our catalog anyway.
Stale
Epic
None, but to support the ArcGIS Enterprise in the Cloud Effort
Description
Convert the half-hour product to COG for use by the ADC initiative.
Background
Brian Tisdale, who is leading the ArcGIS Enterprise in the Cloud effort, reached out on Slack:
I sent Brian an email message: If I understand correctly, to support the ADC (or is it a different acronym now "ArcGIS Enterprise in the Cloud"?) we want to:
GPM IMERG is a high-value first example of executing the above steps, but many other datasets will follow a similar model.
Acceptance Criteria: