Ouranosinc / xclim

Library of derived climate variables, ie climate indicators, based on xarray.
https://xclim.readthedocs.io/en/stable/
Apache License 2.0
333 stars 59 forks source link

Chunk issue with sdba.MBCn.train #1903

Open sylvainmarchi opened 2 months ago

sylvainmarchi commented 2 months ago

Generic Issue

Xclim version: 0.52.0 Python version: 3.10.14 Operating System: Linux

Description

I apply the MBCn algorithm for bias-correcting Euro CORDEX data. I successfully reproduced the example presented in the notebook. Now I follow the same procedure using my data.

What I Did

I loaded my files using open_mfdataset and merged them (one file per variable) to create dref and dhist. The reference and historical datasets are made of a single chunk of size 7305 for the time coordinate. This is my reference file:

<xarray.DataArray 'multivariate' (multivar: 2, time: 7305, lat: 25, lon: 41)> Size: 120MB
dask.array<rechunk-merge, shape=(2, 7305, 25, 41), dtype=float64, chunksize=(2, 7305, 25, 41), chunktype=numpy.ndarray>
Coordinates:
  * time      (time) object 58kB 1986-01-01 00:00:00 ... 2005-12-31 00:00:00
  * multivar  (multivar) <U3 24B 'pr' 'tas'
Dimensions without coordinates: lat, lon
Attributes:
    CDI:            Climate Data Interface version 1.9.5 (http://mpimet.mpg.d...
    Conventions:    CF-1.6
    history:        Thu Jul 11 11:29:46 2024: cdo remapcon,grid_description_0...
    creation_date:  25-06-2024
    creators:       Ghilain N., Van Schaeybroeck B., Vanderkelen I.
    contact:        inne.vanderkelen@meteo.be
    version:        1.1
    affiliation:    Royal Meteorological Institute of Belgium
    projection:     +proj=lcc +lat_2=50.569898649999999 +lat_1=50.56989864999...
    CDO:            Climate Data Operators version 1.9.5 (http://mpimet.mpg.d...
    units:

This is my historical file:

<xarray.DataArray 'multivariate' (multivar: 2, time: 7305, lat: 25, lon: 41)> Size: 60MB
dask.array<rechunk-merge, shape=(2, 7305, 25, 41), dtype=float32, chunksize=(2, 1827, 25, 41), chunktype=numpy.ndarray>
Coordinates:
  * time      (time) object 58kB 1986-01-01 12:00:00 ... 2005-12-31 12:00:00
    height    float64 8B ...
  * multivar  (multivar) <U3 24B 'pr' 'tas'
Dimensions without coordinates: lat, lon
Attributes: (12/32)
    CDI:                            Climate Data Interface version 1.9.6 (htt...
    history:                        Tue May 19 20:33:10 2020: cdo -z szip rem...
    source:                         CLMcom-CCLM4-8-17
    institution:                    Climate Limited-area Modelling Community ...
    Conventions:                    CF-1.4
    institute_id:                   CLMcom
    ...                             ...
    table_id:                       Table day (Sept 2013) 0cf1782745489246c9f...
    modeling_realm:                 atmos
    realization:                    1
    cmor_version:                   2.9.1
    CDO:                            Climate Data Operators version 1.9.6 (htt...
    units:

What I Received

I got the following error message when I executed:

group = 'time'
ADJ = sdba.MBCn.train(
    dref.chunk(dict(time=-1)),
    dhist,
    base_kws={"nquantiles": 20, "group": group},
    adj_kws={"interp": "nearest", "extrapolation": "constant"},
    n_iter=20,  # perform 20 iteration
    n_escore=1000,  # only send 1000 points to the escore metric
)

File /.../conda/envs/xclim_metpy/lib/python3.10/site-packages/xarray/core/computation.py:767, in apply_variable_ufunc(func, signature, exclude_dims, dask, output_dtypes, vectorize, keep_attrs, dask_gufunc_kwargs, *args) 765 for axis, dim in enumerate(core_dims, start=-len(core_dims)): 766 if len(data.chunks[axis]) != 1: --> 767 raise ValueError( 768 f"dimension {dim} on {n}th function argument to " 769 "apply_ufunc with dask='parallelized' consists of " 770 "multiple chunks, but is also a core dimension. To " 771 "fix, either rechunk into a single array chunk along " 772 f"this dimension, i.e., .chunk(dict({dim}=-1)), or " 773 "pass allow_rechunk=True in dask_gufunc_kwargs " 774 "but beware that this may significantly increase memory usage." 775 ) 776 dask_gufunc_kwargs["allow_rechunk"] = True 778 output_sizes = dask_gufunc_kwargs.pop("output_sizes", {}) ValueError: dimension time on 1th function argument to apply_ufunc with dask='parallelized' consists of multiple chunks, but is also a core dimension. To fix, either rechunk into a single array chunk along this dimension, i.e., .chunk(dict(time=-1)), or pass allow_rechunk=True in dask_gufunc_kwargs but beware that this may significantly increase memory usage.

What is going wrong here?

Code of Conduct

sylvainmarchi commented 2 months ago

It seems to be related to the group attribute. I do not face this error when I use

group = sdba.Grouper("time.dayofyear", window=31)
coxipi commented 2 months ago

Hi!

The reference and historical datasets are made of a single chunk of size 7305 for the time coordinate.

Are you sure? That doesn't seem to be the case in the code snippet you shared, for your historical dataset you have chunksize != size of time, 1827 != 7305.

<xarray.DataArray 'multivariate' (multivar: 2, time: 7305 , lat: 25, lon: 41)> Size: 60MB dask.array<rechunk-merge, shape=(2, 7305, 25, 41), dtype=float32, chunksize=(2, 1827, 25, 41), chunktype=numpy.ndarray>

Try running

group = 'time'
ADJ = sdba.MBCn.train(
    dref.chunk(dict(time=-1)),
    dhist.chunk(dict(time=-1)),
    base_kws={"nquantiles": 20, "group": group},
    adj_kws={"interp": "nearest", "extrapolation": "constant"},
    n_iter=20,  # perform 20 iteration
    n_escore=1000,  # only send 1000 points to the escore metric
)

Cheers, Éric

sylvainmarchi commented 2 months ago

Hello Éric, My bad. It now works. Unfortunately, I am stuck at the next step. Maybe I should create a new "issue" or discussion. When executing the next piece of code

scenh, scens = (
    ADJ.adjust(
        sim=ds,
        ref=dref,
        hist=dhist,
        base=sdba.QuantileDeltaMapping,
        base_kws_vars={
            "pr": {
                "kind": "*",
                "jitter_under_thresh_value": "0.01 mm d-1", # Trick to remove zeros, as they can be problematic for bias-adusting (as far as I understood)
                "adapt_freq_thresh": "0.1 mm d-1", # Value used to biascorrect the frequency of dry days, that is days with almost no precipitation.
            },
            "tas": {"kind": "+"},
        },
        adj_kws={"interp": "nearest", "extrapolation": "constant"},
    )
    for ds in (dhist, dsim)
)

I get: ValueError: The dimension(s) over which we group, reduce or interpolate cannot be chunked ({'time': (7305, 7305)}).

Below is how my inputs look like. dref

<xarray.DataArray 'multivariate' (multivar: 2, time: 7305, lat: 25, lon: 41)> Size: 120MB
dask.array<rechunk-merge, shape=(2, 7305, 25, 41), dtype=float64, chunksize=(2, 7305, 25, 41), chunktype=numpy.ndarray>
Coordinates:
  * time      (time) object 58kB 1986-01-01 00:00:00 ... 2005-12-31 00:00:00
  * multivar  (multivar) <U3 24B 'pr' 'tas'
Dimensions without coordinates: lat, lon
Attributes:
    CDI:            Climate Data Interface version 1.9.5 (http://mpimet.mpg.d...
    Conventions:    CF-1.6
    history:        Thu Jul 11 11:29:46 2024: cdo remapcon,grid_description_0...
    creation_date:  25-06-2024
    creators:       Ghilain N., Van Schaeybroeck B., Vanderkelen I.
    contact:        inne.vanderkelen@meteo.be
    version:        1.1
    affiliation:    Royal Meteorological Institute of Belgium
    projection:     +proj=lcc +lat_2=50.569898649999999 +lat_1=50.56989864999...
    CDO:            Climate Data Operators version 1.9.5 (http://mpimet.mpg.d...
    units:

dhist

<xarray.DataArray 'multivariate' (multivar: 2, time: 7305, lat: 25, lon: 41)> Size: 60MB
dask.array<rechunk-merge, shape=(2, 7305, 25, 41), dtype=float32, chunksize=(2, 7305, 25, 41), chunktype=numpy.ndarray>
Coordinates:
  * time      (time) object 58kB 1986-01-01 12:00:00 ... 2005-12-31 12:00:00
    height    float64 8B ...
  * multivar  (multivar) <U3 24B 'pr' 'tas'
Dimensions without coordinates: lat, lon
Attributes: (12/32)
    CDI:                            Climate Data Interface version 1.9.6 (htt...
    history:                        Tue May 19 20:33:10 2020: cdo -z szip rem...
    source:                         CLMcom-CCLM4-8-17
    institution:                    Climate Limited-area Modelling Community ...
    Conventions:                    CF-1.4
    institute_id:                   CLMcom
    ...                             ...
    table_id:                       Table day (Sept 2013) 0cf1782745489246c9f...
    modeling_realm:                 atmos
    realization:                    1
    cmor_version:                   2.9.1
    CDO:                            Climate Data Operators version 1.9.6 (htt...
    units:

dsim

<xarray.DataArray 'multivariate' (multivar: 2, time: 7305, lat: 25, lon: 41)> Size: 60MB
dask.array<rechunk-merge, shape=(2, 7305, 25, 41), dtype=float32, chunksize=(2, 7305, 25, 41), chunktype=numpy.ndarray>
Coordinates:
  * time      (time) object 58kB 2036-01-01 12:00:00 ... 2055-12-31 12:00:00
    height    float64 8B ...
  * multivar  (multivar) <U3 24B 'pr' 'tas'
Dimensions without coordinates: lat, lon
Attributes: (12/32)
    CDI:                            Climate Data Interface version 1.9.6 (htt...
    history:                        Tue May 19 21:32:31 2020: cdo -z szip rem...
    source:                         CLMcom-CCLM4-8-17
    institution:                    Climate Limited-area Modelling Community ...
    Conventions:                    CF-1.4
    institute_id:                   CLMcom
    ...                             ...
    table_id:                       Table day (Sept 2013) 0cf1782745489246c9f...
    modeling_realm:                 atmos
    realization:                    1
    cmor_version:                   2.9.1
    CDO:                            Climate Data Operators version 1.9.6 (htt...
    units:

ADJ.ds

<xarray.Dataset> Size: 3MB
Dimensions:         (multivar_prime: 2, win_dim0: 1, lat: 25, lon: 41,
                     iterations: 20, quantiles: 20, multivar: 2)
Coordinates:
    height          float64 8B 2.0
  * multivar_prime  (multivar_prime) <U3 24B 'pr' 'tas'
  * quantiles       (quantiles) float64 160B 0.025 0.075 0.125 ... 0.925 0.975
  * win_dim0        (win_dim0) int64 8B -1
  * multivar        (multivar) <U3 24B 'pr' 'tas'
Dimensions without coordinates: lat, lon, iterations
Data variables:
    af_q            (win_dim0, lat, lon, iterations, multivar_prime, quantiles) float32 3MB dask.array<chunksize=(1, 25, 41, 20, 2, 20), meta=np.ndarray>
    escores         (win_dim0, lat, lon, iterations) float32 82kB dask.array<chunksize=(1, 25, 41, 20), meta=np.ndarray>
    rot_matrices    (iterations, multivar, multivar_prime) float32 320B 0.995...
Attributes:
    _xclim_adjustment:  {"py/object": "xclim.sdba.adjustment.MBCn", "py/state...
    adj_params:         MBCn(quantiles=array([0.025, 0.075, 0.125, 0.175, 0.2...

Is it OK to use proleptic gregorian calendar and 2 dimensions for space?

coxipi commented 2 months ago

No worries!

Space dimensions are not an issue. I will look into it, but if you're using group = "time.dayofyear", you will need a regular calendar yes. I think this is assumed for other sdba methods as well. For group="time" this should not be a constraint.

Can you try:

scenh, scens = (
    ADJ.adjust(
        sim=ds.convert_calendar("noleap"),
        ref=dref.convert_calendar("noleap"),
        hist=dhist.convert_calendar("noleap"),
        base=sdba.QuantileDeltaMapping,
        base_kws_vars={
            "pr": {
                "kind": "*",
                "jitter_under_thresh_value": "0.01 mm d-1", # Trick to remove zeros, as they can be problematic for bias-adusting (as far as I understood)
                "adapt_freq_thresh": "0.1 mm d-1", # Value used to biascorrect the frequency of dry days, that is days with almost no precipitation.
            },
            "tas": {"kind": "+"},
        },
        adj_kws={"interp": "nearest", "extrapolation": "constant"},
    )
    for ds in (dhist, dsim)
)

and see if this removes the error messages? In the meantime I will investigate.

coxipi commented 2 months ago

Can you share more information on the error? If you could share 3 .nc files for dref, dhist,dsim with a single lon/lat point it would be helpful so I can troubleshoot.

sylvainmarchi commented 2 months ago

I still have the same error message after converting calendars. Please find my netCDF files as well as my python script at (.py and .nc files are not supported): https://cloud.meteo.be/s/8c9KMpcwxGb4gE2

coxipi commented 2 months ago

The python script in your folder is for the old way of working things with NpdfTransform? Do you have the script you used for this above?

sylvainmarchi commented 2 months ago

I have updated the file.

coxipi commented 2 months ago

Thanks! The problem is that you have different time coordinates:

dref.time 
>>> '1986-01-01T00:00:00.000000000', '1986-01-02T00:00:00.000000000',
       '1986-01-03T00:00:00.000000000', ...,
dhist.time 
>>> '1986-01-01T12:00:00.000000000', '1986-01-02T12:00:00.000000000',
       '1986-01-03T12:00:00.000000000', ...,

for instance if you try sdba.QuantileDeltaMapping.train(ref=dref, hist=dhist) you will get the same error. If you change the time in dhist:

dhist["time"] = dref.time

the error disappears.

I think we could control this earlier in the process since our methods expect equal time coordinates and give a clearer error message. Thanks for this feedback.

Let me know if there are other problems after that.

coxipi commented 2 months ago

For xclim devs: We could have a similar function to _harmonize_units which simply checks if times match? This would look more like _check_inputs, but it could not really fit in _check_inputs since the input list is fairly general, inputs could be [ref, hist] which would work for our needs of matching times, but it can also be [ref, hist, sim] or [sim].