Ouranosinc / xclim

Library of derived climate variables, i.e., climate indicators, based on xarray.
https://xclim.readthedocs.io/en/stable/
Apache License 2.0

"xarray.to_netcdf" needs too much memory for long datasets(e.g., R95P based on the ERA5 data) #1762

Open millet5818 opened 4 months ago

millet5818 commented 4 months ago

Generic Issue

Description

Dear developers: I want to calculate R95p based on ERA5 data from 1950 to 2023. The function create_ensemble or open_mfdataset is used to load the dataset, and ensemble_percentiles or percentile_doy is used to calculate the percentile for each day of the year. R95p is then computed with xclim.indicators.icclim.R95p. However, the computer's memory explodes when the R95p results are exported. On a separate note, I'm confused about the difference between the function quantile and the function percentile_doy.

Code

self.FileNameList = ["1950.nc", "1951.nc", ..., "2023.nc"]
self.Files = open_mfdataset(self.FileNameList, concat_dim="time", combine="nested",
                            data_vars='minimal', coords='minimal', compat='override')
self.Variable = 'tp'
self.Files[self.Variable] = xclim.core.units.amount2rate(self.Files[self.Variable], out_units="mm/d")
wetdays_Array = self.Files[self.Variable].where(self.Files[self.Variable] >= 1)
a = xclim.core.calendar.percentile_doy(wetdays_Array, per=95, window=5)
results_R95P = xclim.indicators.icclim.R95p(pr='tp', pr_per=a, freq='YS', ds=self.Files)
results_R95P.to_netcdf('../R95P.nc', format='NETCDF4', engine='netcdf4')

The computer's memory explodes at this point: results_R95P.to_netcdf('../R95P.nc', format='NETCDF4', engine='netcdf4')

What I Did

I initially guessed that the computer's memory was too small, so I loaded the data for each grid row into memory separately and finally concatenated all the pieces, but this approach made the calculation far too time-consuming. Do you have a better way?

A simple example of my workaround

RD_N_Y_data = []  # collect results outside the loop (initializing inside would discard them)
self.Rows = self.Files[self.Variable].shape[1]  # TODO: longitude or latitude
for i in range(self.Rows):
    data = self.Files.tp[:, i, :].load()
    File_Block = xclim.core.units.amount2rate(data, out_units="mm/d")
    wetdays_Array = File_Block.where(File_Block >= 1)
    RNT = wetdays_Array.quantile([0.95], dim='time', keep_attrs=True)  # 95th percentile threshold for R95p
    RD_N_Y = xclim.indicators.icclim.R95p(File_Block, pr_per=RNT, freq='YS')
    RD_N_Y_data.append(RD_N_Y)
xr.concat(RD_N_Y_data, dim='latitude').to_netcdf('../R95P.nc', format='NETCDF4', engine='netcdf4')

Thank you very much for your help, and I look forward to your reply!


bzah commented 3 months ago

Hi @millet5818

On the other hand, i'm confused about the difference between the function quantile and the function percentile_doy.

percentile_doy computes percentiles for each day of the year, so you get 365 values per grid cell. The result of percentile_doy may then be used as a threshold to compute, for example, the number of days where the doy threshold is reached; this is typically what a climate index like TX90p expects.

On the other hand, quantile, when computed along the time axis, produces one value per grid cell. These values can then be used as thresholds to compute exceedances in indices such as R95p.
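A quick way to see the difference is to compare output shapes. The sketch below uses plain numpy on synthetic data; it is only an illustration, not xclim's actual implementation (percentile_doy additionally applies a moving window and handles calendars):

```python
import numpy as np

# Synthetic daily precipitation: 4 non-leap years on a 3x3 grid.
rng = np.random.default_rng(0)
pr = rng.gamma(shape=2.0, scale=3.0, size=(4 * 365, 3, 3))

# "Period" percentile, as quantile(dim="time") would do:
# one threshold per grid cell.
period_p95 = np.percentile(pr, 95, axis=0)
assert period_p95.shape == (3, 3)

# Day-of-year percentiles, in the spirit of percentile_doy:
# one threshold per calendar day per grid cell.
doy = np.tile(np.arange(365), 4)
doy_p95 = np.stack(
    [np.percentile(pr[doy == d], 95, axis=0) for d in range(365)]
)
assert doy_p95.shape == (365, 3, 3)
```

The 365-times-larger output (and the per-day grouping behind it) is also why doy percentiles are so much more expensive to compute.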

I would also like to draw your attention to the fact that, according to the ECA&D ATBD, R95p should be computed on "period" percentiles and not on doy percentiles as shown in your example. See https://knmi-ecad-assets-prd.s3.amazonaws.com/documents/atbd.pdf

Regarding performance

First, if you compute period percentiles instead of doy percentiles, you may not have any performance issue at all, because computing doy percentiles requires many more operations than period percentiles. If you still have performance issues, read on. xclim relies on dask to handle datasets that do not fit in memory. Basically, dask does what you attempted by hand: it chunks the dataset into small parts that fit in memory. You may not have noticed, but you are already using dask via xarray's open_mfdataset(...) function. So the chunking should already be taking place, which means you need to dive deeper into dask to be able to run this computation on your machine.
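To see the lazy, chunked model in action, here is a minimal dask-only sketch; the random array stands in for an ERA5 variable, and the sizes and chunking are illustrative, not a recommendation:

```python
import dask.array as da

# Stand-in for ~10 years of daily data on a 50x50 grid, chunked by year
# so only one chunk needs to be materialized in memory at a time.
pr = da.random.random(size=(365 * 10, 50, 50), chunks=(365, 50, 50))

# Building the computation is cheap: nothing is read or computed yet,
# dask only records a task graph.
p_mean = pr.mean(axis=0)

# .compute() executes the graph chunk by chunk; with xarray,
# .to_netcdf() triggers the same machinery.
result = p_mean.compute()
```

This is exactly what happens behind open_mfdataset: the arrays are dask-backed, and the heavy work only starts when you write or compute the result.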

I suggest that you try the distributed scheduler of dask, it gives much more control over the memory management of the computation. Have a look at the quickstart here: https://distributed.dask.org/en/latest/quickstart.html

In short, you first need to install it with pip or conda, e.g. pip install dask distributed

Then you can set up dask's LocalCluster with:

from distributed import Client

client = Client(memory_limit="20GB", n_workers=1, threads_per_worker=16)

(Adapt the memory and thread counts to your machine.) Note that on a laptop I would recommend sticking with a single worker and adding threads, to minimize inter-process communication.

You can then run your computation in the same Python process (typically the same notebook). Dask will trigger the computation either when you call .compute() on the resulting xarray object or when you run .to_netcdf(). You can even monitor the computation in your browser at http://localhost:8787/status (the default port) once the client object has been created.
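Put together, the pattern might look like the following. This is a sketch: the client here runs threads in-process just to keep the example self-contained, and the random array stands in for your opened dataset; on a real run you would use the memory_limit / n_workers / threads_per_worker settings above and call .to_netcdf() on your xarray result instead.

```python
from distributed import Client
import dask.array as da

# In-process, thread-based client; on a real machine prefer
# Client(memory_limit="20GB", n_workers=1, threads_per_worker=16).
client = Client(processes=False)

# Placeholder for the dask-backed variable from open_mfdataset.
x = da.random.random((2000, 100, 100), chunks=(365, 100, 100))
yearly_mean = x.mean(axis=0)

# While this runs, the dashboard at http://localhost:8787/status
# (default port) shows the tasks being scheduled and executed.
result = yearly_mean.compute()
client.close()
```

The key point is that creating the Client is enough: any subsequent .compute() or .to_netcdf() in the same process is routed through the distributed scheduler and its memory management.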

I hope this helps!