NCAR / geocat-comp

GeoCAT-comp provides implementations of computational functions for operating on geosciences data. Many of these functions originated in NCL and were translated into Python.
https://geocat-comp.readthedocs.io
Apache License 2.0

interp_hybrid_to_pressure failing on large datasets #633

Open kafitzgerald opened 1 month ago

kafitzgerald commented 1 month ago

Noting this here for tracking purposes.

While some performance improvements were made in #592, we've gotten another report about problems with interp_hybrid_to_pressure. In particular, it's failing while operating on a larger dataset on Casper. I've been able to replicate the issue and it seems like it has to do with a very large Dask task graph. I'm guessing it's a combination of our internal usage of Dask in geocat-comp and the size of the dataset. We didn't see this failure while testing before because the dataset was much smaller and indeed when you subset the dataset temporally and run the function again it executes successfully.

I've followed up with a temporary workaround, but it'd be good to prioritize this. This function gets a good bit of use and it's likely to come up again. I also suspect there's still a lot of performance improvements to be made here.
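For anyone else debugging this, a quick way to check whether the graph itself is the problem is to count its keys before computing anything. A minimal sketch with a synthetic Dask array standing in for the interpolation output (the shapes and chunking here are made up for illustration):

```python
import dask.array as da

# Synthetic stand-in for a lazily built interpolation result:
# (time, lev, lat, lon), chunked along time.
result = da.random.random((14600, 1, 128, 256), chunks=(100, 1, 128, 256))

# Number of keys in the task graph; graphs with millions of keys
# can overwhelm the scheduler before any real work starts.
n_tasks = len(result.__dask_graph__())
print(n_tasks)
```

On the real workflow you'd call this on the object returned by `interp_hybrid_to_pressure` before `.compute()` or `.to_netcdf()`.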

sudhansu-s-rath commented 3 weeks ago

Hi, it seems the same thing is happening to me. I can't use the workaround with smaller data. I would greatly appreciate any solution to this.

Original post: https://discourse.pangeo.io/t/code-hangs-while-saving-dataset-to-disk-using-to-netcdf/4413

I tried saving the data in both NetCDF and Zarr format, with Dask clients too; it's still the same issue: the code never finishes and keeps on running. I saved precip data with the same procedure earlier (~117 GB for each model file). I guess the problem is with 3D pressure-level variables. If anyone wants to reproduce the error:

I am trying to save MIROC6 historical data from gs://cmip6/CMIP6/CMIP/MIROC/MIROC6/historical/r1i1p1f1/6hrLev/ta/gn/v20191114/

After reading the above as ds1, I interpolate to a single pressure level:

```python
import numpy as np
import geocat.comp as gc

ta = ds1.ta
ps = ds1.ps
hyam = ds1.a
hybm = ds1.b
p0 = ds1.p0

new_levels = np.array([85000])

ta_850 = gc.interpolation.interp_hybrid_to_pressure(
    ta, ps, hyam, hybm, p0,
    new_levels=new_levels, lev_dim=None, method='linear',
    extrapolate=False, variable=None, t_bot=None, phi_sfc=None,
)
```

Now I want to save the dataset on my local cluster, with either `ta_850.to_zarr('/path/miroc6.zarr', consolidated=True)` or `ta_850.to_netcdf('/path/miroc6.nc')`.

The code runs forever and the process is not complete. Any help in solving this will be appreciated. Thank you so much for your attention and participation.

kafitzgerald commented 3 weeks ago

Hi, @sudhansu-s-rath thanks for the note!

It's interesting that you're seeing this with extrapolate=False as well. It looks like the time dimension on your dataset is quite large though and you're working with a lot of data.

Could you share what version of geocat-comp you're using and a bit more info about where you're running this?

If you're looking for a workaround, you might try working on smaller temporal subsets of the data.
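As a sketch of what I mean, you can loop over blocks of the time dimension, compute each block eagerly, and write it out before moving on. Everything below is synthetic: `interp_one_block` is a placeholder for the real call to `interp_hybrid_to_pressure` plus a write to disk, and the block size is arbitrary:

```python
import numpy as np

def interp_one_block(block):
    # Placeholder for the real per-block work, e.g.
    #   gc.interpolation.interp_hybrid_to_pressure(...).compute()
    return block.mean(axis=1, keepdims=True)

# Synthetic (time, lev, lat, lon) data standing in for ta.
data = np.random.rand(365, 4, 8, 16)
block_size = 90  # e.g. process ~one season of steps at a time

outputs = []
for start in range(0, data.shape[0], block_size):
    block = data[start:start + block_size]
    out = interp_one_block(block)
    # In the real workflow you'd write each block to its own file here
    # (e.g. with to_netcdf) instead of keeping everything in memory.
    outputs.append(out)

result = np.concatenate(outputs, axis=0)
print(result.shape)
```

Each iteration then only needs memory and a task graph for one block rather than the full simulation.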

If you're up for some deeper troubleshooting, I'd take a look at some of the diagnostics and task graphs from Dask if you haven't already.
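If it helps, here's a minimal self-contained example of that kind of profiling with the local threaded scheduler (the array is synthetic; with a distributed cluster the dashboard gives you the same information):

```python
import dask.array as da
from dask.diagnostics import ProgressBar, Profiler

# Synthetic lazy computation standing in for the interpolation.
x = da.random.random((2000, 200), chunks=(100, 200))
y = (x - x.mean(axis=0)).std()

# ProgressBar shows overall progress; Profiler records per-task timings.
with ProgressBar(), Profiler() as prof:
    value = y.compute()

# Each record in prof.results is one executed task
# (key, task, start time, end time, worker thread id).
print(len(prof.results), float(value))
```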

I hope to do some additional profiling of this function this week (it really hasn't been tested at scale) and might have some better suggestions then as well.

sudhansu-s-rath commented 3 weeks ago

Hi @kafitzgerald ,

Thanks for your response.

  1. The output of `gc.version` is `'2024.4.0'`.

  2. I am running this on my department cluster with the following resources in my batch file:

```shell
#!/bin/bash
#SBATCH --job-name=cmipdload_ta
#SBATCH -p node
#SBATCH -n 4
#SBATCH --time=96:00:00
#SBATCH --mem-per-cpu=20g
#SBATCH --output=cmipdload_ta.log
#SBATCH --mail-user=emailXXXX@illinois.edu

~/miniconda3/envs/g_env/bin/python /Databatch/dloadscript_ta.py
```

  3. I need to compute this for the entire model period, so I can't restrict the computation to a small slice. When I tested with a small slice of a few days, the code worked fine.

  4. Let me know how I can provide more info to help you troubleshoot this; I have been struggling with it for a long time.

  5. I tried this with the variable ta: the code completed in 20 hours and the output was ~19 GB. But when I tried the same with ua or va, the code kept running forever; even after reaching the timeout, the output file contained mostly blank data.

  6. I also tried a Dask cluster in a Jupyter notebook, with no success.

kafitzgerald commented 1 week ago

@sudhansu-s-rath I'm still looking into this, but wanted to get back to you with a few suggestions at least.

It appears the dataset you're working with is quite large: when processing the full dataset, the memory usage is significant (often spilling to disk in my case, which slows things down considerably) and the Dask task graph is rather large. It's possible data transfer from cloud storage is playing a role here as well. I did have some luck using dask.optimize() to help with the task graph, but it's still a large, memory-intensive workflow.
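For reference, `dask.optimize` works on any lazy collection and can fuse and prune the graph before compute; a toy example with a synthetic array (not the real workflow):

```python
import dask
import dask.array as da

# Synthetic lazy computation with several elementwise layers.
x = da.random.random((1000, 100), chunks=(50, 100))
y = ((x + 1) * 2).sum(axis=0)

# dask.optimize returns a tuple of optimized collections that
# compute the same results, typically from a smaller, fused graph.
(y_opt,) = dask.optimize(y)

print(len(y.__dask_graph__()), len(y_opt.__dask_graph__()))
```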

It sounds like you want to run this for the full dataset/simulation, but breaking it down into processing steps and then appending to a file, or writing multiple intermediate files, might be a good approach here (the calculations are largely independent in time) and would avoid a lot of the memory-related issues. I think that's probably the most efficient approach currently available, especially if subsetting to a specific area or time of interest first isn't an option.
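One way to structure that: compute and write one time block per file, then lazily recombine with `open_mfdataset` afterwards. A sketch with a small synthetic dataset (paths and block size are placeholders, and `engine="scipy"` is only used to keep the example dependency-light):

```python
import os
import tempfile

import numpy as np
import xarray as xr

# Synthetic stand-in for an interpolated (time, lat, lon) result.
ds = xr.Dataset(
    {"ta_850": (("time", "lat", "lon"), np.random.rand(20, 4, 8))},
    coords={"time": np.arange(20.0)},
)

outdir = tempfile.mkdtemp()
block = 5  # in the real workflow, e.g. one year of steps per file
paths = []
for start in range(0, ds.sizes["time"], block):
    sub = ds.isel(time=slice(start, start + block))
    path = os.path.join(outdir, f"ta_850_{start:04d}.nc")
    # Each write only needs memory for one block, not the whole series.
    sub.to_netcdf(path, engine="scipy")
    paths.append(path)

# Recombine lazily along the time coordinate.
combined = xr.open_mfdataset(paths, combine="by_coords")
print(combined.sizes["time"])
```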

You may also find some relevant tips/resources for optimization and profiling here: https://docs.xarray.dev/en/latest/user-guide/dask.html#optimization-tips