Closed: jpolchlo closed this 1 year ago.
This is ready for review, but I don't necessarily intend for you to actually run it. I was thinking that the review would involve checking the zarr file I created against the original, NOAA-provided zarr file. If there are benchmarks from the ESIP work that can be rerun simply by changing a couple of URIs, that would be a good idea.
Pushing the go button here. I had wanted positive confirmation that the generated zarr shows the same speedup we saw in our ESIP tests, but that shouldn't hold this up any longer. When we run the test to confirm, we can revisit the script contributed here if there are problems.
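For when we do run that confirmation, here is a minimal sketch of the comparison I have in mind. It is not the ESIP benchmark code: it assumes xarray with s3fs installed, appropriate read access to both buckets, and that the variable of interest is `streamflow` with dimensions `(time, feature_id)`.

```python
# Sketch only: time the same single-reach time-series read against the original and
# rechunked stores. Variable/dimension names and the choice of reach are assumptions.
import time

import xarray as xr

STORES = {
    "original": "s3://noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr",
    "rechunked": "s3://azavea-noaa-hydro-data/experiments/jp/rechunk/output.zarr",
}


def time_series_read(uri, reach_index=1000):
    """Read the full streamflow time series for one reach and report the wall time."""
    ds = xr.open_zarr(uri)
    start = time.perf_counter()
    ds["streamflow"].isel(feature_id=reach_index).compute()
    return time.perf_counter() - start


for name, uri in STORES.items():
    print(f"{name}: {time_series_read(uri):.1f} s")
```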
For posterity: this job required two r5.8xlarge instances and one r5.xlarge instance for two hours, which should have burned about $2.60 in additional compute costs at the spot-market prices in effect at the time of execution. I can't say how much additional S3 cost was incurred.
Also, we checked out the size of the generated dataset:
aws --profile=noaa s3 ls --human-readable --summarize --recursive s3://azavea-noaa-hydro-data/experiments/jp/rechunk/output.zarr
...
Total Objects: 132293
Total Size: 638.6 GiB
It's important to put the above in context:
aws s3 ls --recursive --summarize --human-readable s3://noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr
...
Total Objects: 102330
Total Size: 1.3 TiB
We don't yet have an explanation for the roughly factor-of-two reduction in size, though an obvious thing to check is whether we inadvertently coerced the data type.
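A quick way to run that check (sketch only; assumes xarray/s3fs and read access to both buckets) is to compare each variable's dtype and on-disk encoding between the two stores:

```python
# Compare dtypes and on-disk encodings between the original and rechunked stores to
# see whether a dtype coercion (or a change of compressor) explains the size difference.
import xarray as xr

original = xr.open_zarr("s3://noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr")
rechunked = xr.open_zarr("s3://azavea-noaa-hydro-data/experiments/jp/rechunk/output.zarr")

for name, orig_var in original.data_vars.items():
    if name not in rechunked:
        print(f"{name}: missing from rechunked output")
        continue
    new_var = rechunked[name]
    print(
        f"{name}: dtype {orig_var.encoding.get('dtype', orig_var.dtype)} -> "
        f"{new_var.encoding.get('dtype', new_var.dtype)}, "
        f"compressor {orig_var.encoding.get('compressor')} -> {new_var.encoding.get('compressor')}"
    )
```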
Overview
We have wanted to rechunk the NWM retrospective data to be optimized for time-series queries over a limited spatial extent, which seems to be a more common use case. Previous studies done by Azavea have shown that this style of query is faster on a rechunked data set.
When we run this process on a single EC2 node, the job runs out of memory. This PR presents a solution based on a Dask cluster running in Kubernetes. Using an Argo workflow, we are able to execute the included Python script (`rechunk-retro-data.py`) on an arbitrarily sized cluster to perform the rechunking operation. In my test, I used 48 workers with 8 GB of RAM. The result can be found at s3://azavea-noaa-hydro-data/experiments/jp/rechunk/output.zarr.
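For reviewers who would rather not open the script, below is a minimal sketch of the style of rechunking it performs. This is not the script itself: the use of the rechunker library, the scheduler address, the temp-store URI, the `max_mem` value, and the target chunk sizes are all illustrative assumptions.

```python
# Sketch of a Dask-backed rechunking job, not the actual rechunk-retro-data.py.
import fsspec
import xarray as xr
from dask.distributed import Client
from rechunker import rechunk

# Connect to the Dask scheduler running in the Kubernetes cluster (address is an assumption).
client = Client("tcp://dask-scheduler:8786")

source = xr.open_zarr("s3://noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr")

# Keep just streamflow and its index coordinates to keep the sketch short.
subset = source[["streamflow"]].reset_coords(drop=True)

# Favor long time-series reads over wide spatial reads: large time chunks,
# comparatively small feature_id chunks (sizes here are illustrative).
target_chunks = {
    "streamflow": {"time": 8760, "feature_id": 1000},
    "time": None,        # leave the coordinate arrays' chunking alone
    "feature_id": None,
}

plan = rechunk(
    subset,
    target_chunks=target_chunks,
    max_mem="6GB",  # stay comfortably below the 8 GB available per worker
    target_store=fsspec.get_mapper("s3://azavea-noaa-hydro-data/experiments/jp/rechunk/output.zarr"),
    temp_store=fsspec.get_mapper("s3://azavea-noaa-hydro-data/experiments/jp/rechunk/temp.zarr"),
)
plan.execute()
```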
Closes #119
Checklist
- Ran `nbautoexport export .` in `/opt/src/notebooks` and committed the generated scripts. This is to make reviewing notebooks easier. (Note: the export will happen automatically after saving notebooks from the Jupyter web app.)
Notes
This is built on top of the contents of #120. This PR should be considered a draft until that PR is merged. (Update: rebased and ready.)
Testing Instructions
- Run the `run-dask-job.yaml` workflow template.
- Set the `script-location` parameter to a version of `rechunk-retro-data.py` that sets the proper output URI location (figuring out a clean way to pass arguments via the Argo UI is a task for the future).
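Once the workflow finishes, a quick sanity check of the output (sketch only; the exact variables present depend on what the script writes) is to confirm the new chunking and dtypes actually landed in the store:

```python
# Inspect the rechunked store: confirm the dimensions survived and look at the
# on-disk chunk shape and dtype recorded for each variable.
import xarray as xr

out = xr.open_zarr("s3://azavea-noaa-hydro-data/experiments/jp/rechunk/output.zarr")
print(out)  # dimensions and coordinates should match the source dataset

for name, var in out.data_vars.items():
    # encoding["chunks"] is the zarr chunk shape written to disk for this variable
    print(name, var.encoding.get("chunks"), var.dtype)
```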