Ouranosinc / xscen

A climate change scenario-building analysis framework.
https://xscen.readthedocs.io/
Apache License 2.0

Attributes incorrectly labeled with extract_dataset within a Dask client in a Jupyter Notebook #176

Open mccrayc opened 1 year ago

mccrayc commented 1 year ago

Description

A pretty specific issue here: when xs.extract_dataset is called inside a Dask client in JupyterLab, the attributes of the resulting dataset are prefixed with "intake_esm_attrs:" rather than "cat:". Running the same code from a script produces the correct attribute names. Without a Dask client, everything works as expected, both in a script and in a Notebook.

Steps To Reproduce

Simple example. Attributes are correct for this version (e.g.: 'cat:id':'CMIP6_ScenarioMIP_CCCma_CanESM5_ssp370_r1i1p1f1_global'):

import xscen as xs

cat = '/tank/scenario/catalogues/simulation.json'
variables_and_freqs = {'tas': 'MS'}
other_search_criteria = {'source': 'CanESM5*', 'experiment': 'ssp370', 'member': 'r1i1p1f1'}
xr_combine_kwargs = {'coords': 'minimal',
                     'data_vars': 'minimal',
                     'compat': 'override'}

periods = [2028, 2029]

# Search the catalogue for the matching simulation.
cat_ref = xs.search_data_catalogs(data_catalogs=[cat],
                                  variables_and_freqs=variables_and_freqs,
                                  other_search_criteria=other_search_criteria,
                                  periods=periods,
                                  allow_resampling=True,
                                  allow_conversion=True)

# Extract without a Dask client: the attribute keys start with 'cat:'.
ds_dict = xs.extract_dataset(cat_ref['CMIP6_ScenarioMIP_CCCma_CanESM5_ssp370_r1i1p1f1_global'],
                             variables_and_freqs=variables_and_freqs,
                             xr_combine_kwargs=xr_combine_kwargs)

However, if xs.extract_dataset is wrapped in a Dask client, the attribute keys are incorrect (e.g., 'intake_esm_attrs:id': 'CMIP6_ScenarioMIP_CCCma_CanESM5_ssp370_r1i1p1f1_global'):

from dask.distributed import Client

with Client(n_workers=1, threads_per_worker=4, memory_limit='12GB') as client:
    ds_dict = xs.extract_dataset(cat_ref['CMIP6_ScenarioMIP_CCCma_CanESM5_ssp370_r1i1p1f1_global'],
                                 variables_and_freqs=variables_and_freqs,
                                 xr_combine_kwargs=xr_combine_kwargs)
    print(ds_dict['MS'].attrs)

Additional context

Workaround suggested by @aulemahal:

Insert client.run(lambda: xs.__version__) right after the "with Client(...) as client:" line, before the call to xs.extract_dataset.
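
For reference, here is a minimal sketch of the same idea wrapped in a small helper. The name warm_up_workers is hypothetical (it is not part of xscen or Dask), and the sketch assumes that importing xscen on each worker is what restores the "cat:" prefix, which is how the one-liner above appears to work. Client.run is the existing dask.distributed method that executes a function on every worker.

```python
import importlib


def warm_up_workers(client, module_name="xscen"):
    """Import `module_name` on every worker and return its version per worker.

    `client` is expected to be a dask.distributed.Client. This helper is a
    hypothetical illustration, not an xscen or Dask function.
    """

    def _import_and_report():
        # Runs on the worker: importing the module executes its import-time setup.
        module = importlib.import_module(module_name)
        return getattr(module, "__version__", None)

    # Client.run executes the function on each worker, outside the task
    # scheduler, and returns a dict keyed by worker address.
    return client.run(_import_and_report)


# Usage in the notebook from this issue (not executed here, needs dask and xscen):
# with Client(n_workers=1, threads_per_worker=4, memory_limit='12GB') as client:
#     warm_up_workers(client)
#     ds_dict = xs.extract_dataset(cat_ref['CMIP6_ScenarioMIP_CCCma_CanESM5_ssp370_r1i1p1f1_global'],
#                                  variables_and_freqs=variables_and_freqs,
#                                  xr_combine_kwargs=xr_combine_kwargs)
```

The returned dict also gives a quick check that every worker actually imported the module.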

aulemahal commented 1 year ago

https://stackoverflow.com/questions/75837897/dask-worker-has-different-imports-than-main-thread

aulemahal commented 1 year ago

Seems related to how the client workers are created, and then to how the dask functions are pickled and sent to them.

In a script, this happens if xscen is imported after the if __name__ == '__main__': guard, which we usually don't do, of course. In a notebook, however, all code is effectively executed after the equivalent of that guard.

aulemahal commented 1 year ago

From the answer I got on StackOverflow, I don't see any easy xscen-level solution. The issue is that intake-esm makes use of a global state variable (the options) within a dask Delayed function, although "Dask is generally aiming to be functional/stateless, so that each function call produces results based only on the arguments it is supplied".

Passing the options as an argument to the Delayed would fix that. On our side, the Client.run hack seems to be good enough.
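
To make the difference concrete, here is a small pure-Python sketch with no Dask involved. Everything in it is a hypothetical stand-in: OPTIONS mimics a library's module-level options (it is not intake-esm's real object), run_on_fresh_worker only simulates a worker process that still holds the defaults, and the "cat" override at import time is an assumption about what xscen does.

```python
# Hypothetical stand-in for a library's module-level options.
OPTIONS = {"attrs_prefix": "intake_esm_attrs"}


def label_with_global(key):
    """Stateful version: reads the prefix from global state at call time."""
    return f"{OPTIONS['attrs_prefix']}:{key}"


def label_with_argument(key, attrs_prefix):
    """Stateless version: the prefix travels with the call as an argument."""
    return f"{attrs_prefix}:{key}"


def run_on_fresh_worker(func, *args):
    """Simulate a worker whose copy of the library still has default options."""
    saved = dict(OPTIONS)
    OPTIONS["attrs_prefix"] = "intake_esm_attrs"  # worker never saw the override
    try:
        return func(*args)
    finally:
        OPTIONS.update(saved)  # restore the main-process state


# The main process customises the option (assumed to happen when xscen is imported).
OPTIONS["attrs_prefix"] = "cat"

local = label_with_global("id")
remote_global = run_on_fresh_worker(label_with_global, "id")
remote_arg = run_on_fresh_worker(label_with_argument, "id", OPTIONS["attrs_prefix"])

print(local)          # cat:id
print(remote_global)  # intake_esm_attrs:id
print(remote_arg)     # cat:id
```

The stateless version gives the same result wherever it runs, which is why passing the options into the Delayed would fix the issue at the source.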