bidhya / verse


Replace NetCDF files with ZARR #10

Closed bidhya closed 6 months ago

bidhya commented 6 months ago

The final input file that combines the 4 LIS variables and 1 MODIS SCF variable has a combined uncompressed size of ~666 GB. This cannot be loaded on any of the available compute nodes (the maximum available memory, on the latest Milan node, is 488 GB). With compression, the 666 GB is reduced to ~90 GB.

Big NetCDF files take a long time to serialize (save to hard drive); sometimes it can take 10 to 20 hours! This is a huge bottleneck in the existing workflow, which was updated from the 5 km run to the 1 km run.

The proposal here is to change the NetCDF files to Zarr format, at least for the intermediate outputs. I want to retain the final output in NetCDF because the Blender workflow in Julia already uses it. Also, Zarr in Julia is somewhat untested territory .... though it might just work.

bidhya commented 6 months ago

The first script (a Jupyter notebook) processing the new LIS files now outputs only Zarr files. I am also using a Dask local cluster and chunks. Overall, this gives a significant improvement in processing the LIS files: an entire water year can be processed in less than an hour with 30 cores.

This is a very preliminary result. More concrete benchmarking may be performed in the future.

Saving a large NetCDF file takes a long time, so switching to Zarr saved a non-trivial amount of processing time.