azavea / noaa-hydro-data

NOAA Phase 2 Hydrological Data Processing

Workflow to rechunk NWM retrospective zarr data #122

Closed jpolchlo closed 1 year ago

jpolchlo commented 1 year ago

Overview

We have wanted to rechunk the NWM retrospective data to be optimized for time-series queries over a limited spatial extent, which seems to be a more common use case. Previous studies done by Azavea have shown that this style of query is faster on a rechunked data set.
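The speedup has a simple mechanical explanation: a time-series query over a small spatial extent touches far fewer chunks when chunks are long in the time dimension. A toy sketch of that chunk-count arithmetic (all shapes and chunk sizes below are illustrative, not the actual chrtout.zarr layout):

```python
# Toy model of why chunk layout matters: count how many chunks a query
# must read. Shapes and chunk sizes are illustrative, not the real NWM
# retrospective zarr values.
from math import ceil

def chunks_read(sel, chunks):
    """Number of chunks touched by a selection.

    `sel` and `chunks` are per-dimension extents and chunk sizes."""
    total = 1
    for extent, chunk in zip(sel, chunks):
        total *= ceil(extent / chunk)
    return total

n_time, n_feat = 100_000, 2_700_000        # hypothetical array shape
query = (n_time, 1)                        # full time series, one reach

map_friendly = (672, 30_000)               # wide spatial slabs (illustrative)
series_friendly = (n_time, 10)             # tall, narrow time chunks

print(chunks_read(query, map_friendly))     # many chunk reads
print(chunks_read(query, series_friendly))  # a single chunk read
```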

When we run this process on a single EC2 node, we run out of memory for the job. This PR presents a solution based on a Dask cluster running in Kubernetes. Using an Argo workflow, we are able to execute the included python script (rechunk-retro-data.py) on an arbitrarily-sized cluster to perform the rechunking operation. In my test, I used 48 workers with 8GB of RAM. The result can be found at s3://azavea-noaa-hydro-data/experiments/jp/rechunk/output.zarr.

Closes #119

Checklist

Notes

This is built on top of the contents of #120; it was a draft until that PR merged, and has since been rebased and is ready for review.

Testing Instructions

jpolchlo commented 1 year ago

This is ready for review, but I don't necessarily intend for you to actually run it. I was thinking that the review would involve checking the zarr file I created against the original, NOAA-provided zarr file. If there are benchmarks from the ESIP work that can be rerun simply by changing a couple of URIs, that would be a good idea.
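A spot check along those lines might look like the following sketch, which compares a few randomly chosen feature time series between two datasets. The variable name "streamflow" is illustrative; in practice `orig` and `new` would come from `xr.open_zarr` against the NOAA store and s3://azavea-noaa-hydro-data/experiments/jp/rechunk/output.zarr respectively:

```python
# Hedged sketch of a review spot check: compare a handful of feature
# time series (values and dtype) between the original and rechunked
# datasets. Variable/dimension names are assumptions.
import numpy as np
import xarray as xr

def spot_check(orig: xr.Dataset, new: xr.Dataset, var: str, n: int = 5) -> bool:
    """True if `n` sampled feature time series match exactly in both."""
    rng = np.random.default_rng(0)
    idx = rng.integers(0, orig.sizes["feature_id"], size=n)
    a = orig[var].isel(feature_id=idx).load()
    b = new[var].isel(feature_id=idx).load()
    return bool(np.array_equal(a.values, b.values)) and a.dtype == b.dtype

# Tiny in-memory demonstration:
ds = xr.Dataset(
    {"streamflow": (("time", "feature_id"), np.arange(12.0).reshape(3, 4))}
)
print(spot_check(ds, ds, "streamflow"))  # an identical dataset passes
```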

jpolchlo commented 1 year ago

Pushing the go button here. I had wanted positive confirmation that the generated zarr shows the same speedup as our ESIP tests, but that shouldn't hold this up any longer. When we do run that confirmation test, we can revisit the script contributed here if there are problems.

jpolchlo commented 1 year ago

For posterity: this job required two r5.8xlarge and one r5.xlarge for two hours, which should have cost about $2.60 in additional compute at spot-market prices at the time of execution. I can't say how much additional S3 cost was incurred.
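For reference, the estimate above can be reproduced back-of-envelope. The spot prices below are assumptions for illustration (spot prices vary by time and availability zone), not the actual prices paid:

```python
# Back-of-envelope check of the ~$2.60 figure, using assumed spot prices
# (roughly typical for r5 instances at the time; not the actual rates).
hours = 2
spot_price = {"r5.8xlarge": 0.60, "r5.xlarge": 0.075}  # assumed $/hr

cost = hours * (2 * spot_price["r5.8xlarge"] + 1 * spot_price["r5.xlarge"])
print(f"${cost:.2f}")  # in the same ballpark as the quoted estimate
```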

rajadain commented 1 year ago

Also, we checked the size of the generated dataset:

aws --profile=noaa s3 ls --human-readable --summarize --recursive s3://azavea-noaa-hydro-data/experiments/jp/rechunk/output.zarr

...

Total Objects: 132293
   Total Size: 638.6 GiB
jpolchlo commented 1 year ago

It's important to put the above in context:

aws s3 ls --recursive --summarize --human-readable s3://noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr

...

Total Objects: 102330
   Total Size: 1.3 TiB

We don't yet have an explanation for the roughly factor-of-two reduction in size (638.6 GiB vs. 1.3 TiB), though an obvious thing to check is whether we inadvertently coerced the data type (e.g. float64 to float32).
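One quick way to test the dtype hypothesis is to diff the per-variable dtypes of the two stores: a float64-to-float32 coercion alone would halve the size, which lines up fairly well with 638.6 GiB vs. 1.3 TiB (~1331 GiB). A minimal sketch (the variable names are illustrative; the mappings can be built from a zarr group with `{name: arr.dtype for name, arr in group.arrays()}`):

```python
# Diff per-variable dtypes between two stores to test for inadvertent
# type coercion. Works on plain {name: dtype} mappings so it is agnostic
# to how the stores are opened; variable names here are illustrative.
import numpy as np

def dtype_diff(a, b):
    """Variables present in both mappings whose dtypes differ."""
    return {k: (a[k], b[k]) for k in a.keys() & b.keys() if a[k] != b[k]}

orig = {"streamflow": np.dtype("f8"), "velocity": np.dtype("f8")}
new = {"streamflow": np.dtype("f4"), "velocity": np.dtype("f8")}
print(dtype_diff(orig, new))  # flags any coerced variable
```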