azavea / noaa-hydro-data

NOAA Phase 2 Hydrological Data Processing

Workflow to rechunk NWM retrospective zarr data #122

Closed jpolchlo closed 1 year ago

jpolchlo commented 1 year ago

Overview

We have wanted to rechunk the NWM retrospective data to be optimized for time-series queries over a limited spatial extent, which seems to be a more common use case. Previous studies done by Azavea have shown that this style of query is faster on a rechunked data set.
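The speedup has a simple mechanical explanation: a time-series query over a small spatial extent touches far fewer chunks when chunks are long in the time dimension. A toy sketch of that chunk-count arithmetic (all shapes and chunk sizes below are illustrative, not the actual chrtout.zarr layout):

```python
# Toy model of why chunk layout matters: count how many chunks a query
# must read. Shapes and chunk sizes are illustrative, not the real NWM
# retrospective zarr values.
from math import ceil

def chunks_read(sel, chunks):
    """Number of chunks touched by a selection.

    `sel` and `chunks` are per-dimension extents and chunk sizes."""
    total = 1
    for extent, chunk in zip(sel, chunks):
        total *= ceil(extent / chunk)
    return total

n_time, n_feat = 100_000, 2_700_000        # hypothetical array shape
query = (n_time, 1)                        # full time series, one reach

map_friendly = (672, 30_000)               # wide spatial slabs (illustrative)
series_friendly = (n_time, 10)             # tall, narrow time chunks

print(chunks_read(query, map_friendly))     # many chunk reads
print(chunks_read(query, series_friendly))  # a single chunk read
```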

When we run this process on a single EC2 node, we run out of memory for the job. This PR presents a solution based on a Dask cluster running in Kubernetes. Using an Argo workflow, we are able to execute the included python script (rechunk-retro-data.py) on an arbitrarily-sized cluster to perform the rechunking operation. In my test, I used 48 workers with 8GB of RAM. The result can be found at s3://azavea-noaa-hydro-data/experiments/jp/rechunk/output.zarr.

Closes #119

Checklist

Notes

This is built on top of the contents of #120; it was a draft until that PR merged, and has since been rebased and is ready for review.

Testing Instructions

jpolchlo commented 1 year ago

This is ready for review, but I don't necessarily intend for you to actually run it. I was thinking that the review would involve checking the zarr file I created against the original, NOAA-provided zarr file. If there are benchmarks from the ESIP work that can be rerun simply by changing a couple of URIs, that would be a good idea.
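A spot check along those lines might look like the following sketch, which compares a few randomly chosen feature time series between two datasets. The variable name "streamflow" is illustrative; in practice `orig` and `new` would come from `xr.open_zarr` against the NOAA store and s3://azavea-noaa-hydro-data/experiments/jp/rechunk/output.zarr respectively:

```python
# Hedged sketch of a review spot check: compare a handful of feature
# time series (values and dtype) between the original and rechunked
# datasets. Variable/dimension names are assumptions.
import numpy as np
import xarray as xr

def spot_check(orig: xr.Dataset, new: xr.Dataset, var: str, n: int = 5) -> bool:
    """True if `n` sampled feature time series match exactly in both."""
    rng = np.random.default_rng(0)
    idx = rng.integers(0, orig.sizes["feature_id"], size=n)
    a = orig[var].isel(feature_id=idx).load()
    b = new[var].isel(feature_id=idx).load()
    return bool(np.array_equal(a.values, b.values)) and a.dtype == b.dtype

# Tiny in-memory demonstration:
ds = xr.Dataset(
    {"streamflow": (("time", "feature_id"), np.arange(12.0).reshape(3, 4))}
)
print(spot_check(ds, ds, "streamflow"))  # an identical dataset passes
```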

jpolchlo commented 1 year ago

Pushing the go button here. I had wanted positive confirmation that the generated zarr shows the same speedup as our ESIP tests, but that shouldn't hold this up any longer. When we do run that confirmation test, we can revisit the script contributed here if there are problems.

jpolchlo commented 1 year ago

For posterity: this job required two r5.8xlarge and one r5.xlarge for two hours, which should have cost about $2.60 in additional compute at spot-market prices at the time of execution. I can't say how much additional S3 cost was incurred.
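For reference, the estimate above can be reproduced back-of-envelope. The spot prices below are assumptions for illustration (spot prices vary by time and availability zone), not the actual prices paid:

```python
# Back-of-envelope check of the ~$2.60 figure, using assumed spot prices
# (roughly typical for r5 instances at the time; not the actual rates).
hours = 2
spot_price = {"r5.8xlarge": 0.60, "r5.xlarge": 0.075}  # assumed $/hr

cost = hours * (2 * spot_price["r5.8xlarge"] + 1 * spot_price["r5.xlarge"])
print(f"${cost:.2f}")  # in the same ballpark as the quoted estimate
```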

rajadain commented 1 year ago

Also, we checked the size of the generated dataset:

aws --profile=noaa s3 ls --human-readable --summarize --recursive s3://azavea-noaa-hydro-data/experiments/jp/rechunk/output.zarr

...

Total Objects: 132293
   Total Size: 638.6 GiB
jpolchlo commented 1 year ago

It's important to put the above in context:

aws s3 ls --recursive --summarize --human-readable s3://noaa-nwm-retrospective-2-1-zarr-pds/chrtout.zarr

...

Total Objects: 102330
   Total Size: 1.3 TiB

We don't yet have an explanation for the roughly factor-of-two reduction in size (638.6 GiB vs. 1.3 TiB), though an obvious thing to check is whether we inadvertently coerced the data type (e.g. float64 to float32).
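One quick way to test the dtype hypothesis is to diff the per-variable dtypes of the two stores: a float64-to-float32 coercion alone would halve the size, which lines up fairly well with 638.6 GiB vs. 1.3 TiB (~1331 GiB). A minimal sketch (the variable names are illustrative; the mappings can be built from a zarr group with `{name: arr.dtype for name, arr in group.arrays()}`):

```python
# Diff per-variable dtypes between two stores to test for inadvertent
# type coercion. Works on plain {name: dtype} mappings so it is agnostic
# to how the stores are opened; variable names here are illustrative.
import numpy as np

def dtype_diff(a, b):
    """Variables present in both mappings whose dtypes differ."""
    return {k: (a[k], b[k]) for k in a.keys() & b.keys() if a[k] != b[k]}

orig = {"streamflow": np.dtype("f8"), "velocity": np.dtype("f8")}
new = {"streamflow": np.dtype("f4"), "velocity": np.dtype("f8")}
print(dtype_diff(orig, new))  # flags any coerced variable
```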