Ouranosinc / xclim

Library of derived climate variables, i.e., climate indicators, based on xarray.
https://xclim.readthedocs.io/en/stable/
Apache License 2.0

"xarray.to_netcdf" needs too much memory for long datasets(e.g., R95P based on the ERA5 data) #1762

Open millet5818 opened 4 months ago

millet5818 commented 4 months ago

Generic Issue

Description

Dear developers: I want to calculate R95p based on ERA5 data from 1950 to 2023. The function create_ensemble or open_mfdataset is used to load the dataset, and ensemble_percentiles or percentile_doy is used to calculate the percentile for each day of the year. R95p is then computed with xclim.indicators.icclim.R95p. However, the computer's memory explodes when the R95p results are exported. On a separate note, I'm confused about the difference between the function quantile and the function percentile_doy.

Code

self.FileNameList = ["1950.nc", "1951.nc", ..., "2023.nc"]
self.Files = open_mfdataset(self.FileNameList, concat_dim="time", combine="nested",
                            data_vars='minimal', coords='minimal', compat='override')
self.Variable = 'tp'
self.Files[self.Variable] = xclim.core.units.amount2rate(self.Files[self.Variable], out_units="mm/d")
wetdays_Array = self.Files[self.Variable].where(self.Files[self.Variable] >= 1)
a = xclim.core.calendar.percentile_doy(wetdays_Array, per=95, window=5)
results_R95P = xclim.indicators.icclim.R95p(pr='tp', pr_per=a, freq='YS', ds=self.Files)
results_R95P.to_netcdf('../R95P.nc', format='NETCDF4', engine='netcdf4')

The computer's memory explodes at this point: results_R95P.to_netcdf('../R95P.nc', format='NETCDF4', engine='netcdf4')

What I Did

I initially guessed that the computer's memory was too small, so I loaded the data for each grid row into memory separately and finally concatenated all the pieces, but this approach made the calculation far too time-consuming. Do you have a better way?

A simple example of my workaround

RD_N_Y_data = []  # collect results outside the loop (initializing inside would discard them)
self.Rows = self.Files[self.Variable].shape[1]  # TODO: longitude or latitude
for i in range(self.Rows):
    data = self.Files.tp[:, i, :].load()
    File_Block = xclim.core.units.amount2rate(data, out_units="mm/d")
    wetdays_Array = File_Block.where(File_Block >= 1)
    RNT = wetdays_Array.quantile([0.95], dim='time', keep_attrs=True)  # 95th percentile threshold for R95p
    RD_N_Y = xclim.indicators.icclim.R95p(File_Block, pr_per=RNT, freq='YS')
    RD_N_Y_data.append(RD_N_Y)
xr.concat(RD_N_Y_data, dim='latitude').to_netcdf('../R95P.nc', format='NETCDF4', engine='netcdf4')

Thank you very much for your help, and I look forward to your reply!


bzah commented 3 months ago

Hi @millet5818

On the other hand, i'm confused about the difference between the function quantile and the function percentile_doy.

percentile_doy computes percentiles for each day of the year, so you get 365 values per grid cell. The result of percentile_doy may then be used as a threshold to compute, for example, the number of days where the doy threshold is reached; this is typically what a climate index like TX90p expects.

On the other hand, quantile, when computed along the time axis, produces one value per grid cell. These values can then be used as thresholds to compute exceedances in indices such as R95p.
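A quick way to see the difference is to compare output shapes. The sketch below uses plain numpy on synthetic data; it is only an illustration, not xclim's actual implementation (percentile_doy additionally applies a moving window and handles calendars):

```python
import numpy as np

# Synthetic daily precipitation: 4 non-leap years on a 3x3 grid.
rng = np.random.default_rng(0)
pr = rng.gamma(shape=2.0, scale=3.0, size=(4 * 365, 3, 3))

# "Period" percentile, as quantile(dim="time") would do:
# one threshold per grid cell.
period_p95 = np.percentile(pr, 95, axis=0)
assert period_p95.shape == (3, 3)

# Day-of-year percentiles, in the spirit of percentile_doy:
# one threshold per calendar day per grid cell.
doy = np.tile(np.arange(365), 4)
doy_p95 = np.stack(
    [np.percentile(pr[doy == d], 95, axis=0) for d in range(365)]
)
assert doy_p95.shape == (365, 3, 3)
```

The 365-times-larger output (and the per-day grouping behind it) is also why doy percentiles are so much more expensive to compute.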

I would also like to draw your attention to the fact that, according to the ECA&D ATBD, R95p should be computed on "period" percentiles and not on doy percentiles as shown in your example. See https://knmi-ecad-assets-prd.s3.amazonaws.com/documents/atbd.pdf

Regarding performance

First, if you compute period percentiles instead of doy percentiles, you may not have any performance issue at all, because computing doy percentiles requires many more operations than period percentiles. If you still have performance issues, read on. xclim relies on dask to handle datasets that do not fit in memory. Basically, dask does what you attempted by hand: it chunks the dataset into small parts that fit in memory. You may not have noticed, but you are already using dask via xarray's open_mfdataset(...) function. So the chunking should already be taking place, which means you need to dive deeper into dask to be able to run this computation on your machine.
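To see the lazy, chunked model in action, here is a minimal dask-only sketch; the random array stands in for an ERA5 variable, and the sizes and chunking are illustrative, not a recommendation:

```python
import dask.array as da

# Stand-in for ~10 years of daily data on a 50x50 grid, chunked by year
# so only one chunk needs to be materialized in memory at a time.
pr = da.random.random(size=(365 * 10, 50, 50), chunks=(365, 50, 50))

# Building the computation is cheap: nothing is read or computed yet,
# dask only records a task graph.
p_mean = pr.mean(axis=0)

# .compute() executes the graph chunk by chunk; with xarray,
# .to_netcdf() triggers the same machinery.
result = p_mean.compute()
```

This is exactly what happens behind open_mfdataset: the arrays are dask-backed, and the heavy work only starts when you write or compute the result.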

I suggest that you try the distributed scheduler of dask, it gives much more control over the memory management of the computation. Have a look at the quickstart here: https://distributed.dask.org/en/latest/quickstart.html

In short, you first need to install it with pip or conda, e.g. pip install dask distributed

Then you can set up dask's LocalCluster with:

from distributed import Client

client = Client(memory_limit="20GB", n_workers=1, threads_per_worker=16)

(Adapt the memory and thread counts to your machine.) Note that on a laptop I would recommend sticking with a single worker and adding threads, to minimize inter-process communication.

You can then run your computation in the same Python process (typically the same notebook). Dask will trigger the computation either when you call .compute() on the resulting xarray object or when you run .to_netcdf(). You can even monitor the computation in your browser at http://localhost:8787/status (the default port) once the client object has been created.
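Put together, the pattern might look like the following. This is a sketch: the client here runs threads in-process just to keep the example self-contained, and the random array stands in for your opened dataset; on a real run you would use the memory_limit / n_workers / threads_per_worker settings above and call .to_netcdf() on your xarray result instead.

```python
from distributed import Client
import dask.array as da

# In-process, thread-based client; on a real machine prefer
# Client(memory_limit="20GB", n_workers=1, threads_per_worker=16).
client = Client(processes=False)

# Placeholder for the dask-backed variable from open_mfdataset.
x = da.random.random((2000, 100, 100), chunks=(365, 100, 100))
yearly_mean = x.mean(axis=0)

# While this runs, the dashboard at http://localhost:8787/status
# (default port) shows the tasks being scheduled and executed.
result = yearly_mean.compute()
client.close()
```

The key point is that creating the Client is enough: any subsequent .compute() or .to_netcdf() in the same process is routed through the distributed scheduler and its memory management.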

I hope this helps!