Ouranosinc / xscen

A climate change scenario-building analysis framework.
https://xscen.readthedocs.io/
Apache License 2.0

Faster search #127

Closed. aulemahal closed 1 year ago

aulemahal commented 1 year ago

Pull Request Checklist:

The issue

search_data_catalogs and extract_dataset are very slow when the catalogs are very big. The base case for this PR was raised by @coxipi and replicated by me: a search_data_catalogs call over the MRCC5, with a selection that returned 0 datasets, took 12 min. The same process, coded through DataCatalog.search(), took 2 min.

What kind of change does this PR introduce?

Faster search_data_catalogs and extract_dataset through:

Does this PR introduce a breaking change?

I ran the getting_started notebook and got no errors. QED.

Seriously, I don't think so. The error raised when we have invalid date strings in catalogs may have changed, but it is still explicit.

Other information:

The unique() improvement could be moved to intake_esm but I don't have the energy.

aulemahal commented 1 year ago

@RondeauG do you have an idea about the failures in my tests? It seems to be the "ensemble reduction" notebook that doesn't work as expected...

aulemahal commented 1 year ago

Woups my bad forget it. It's from a change I made in DataCatalog.unique()

RondeauG commented 1 year ago

> Woups my bad forget it. It's from a change I made in DataCatalog.unique()

Yeah, I think xrfreqs = ds_dict.unique("xrfreq") might be crashing because you changed the type of the output?

Edit: It's almost as if we really need to implement testing... 🙄
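
For what it's worth, a regression test for this kind of return-type change could look roughly like the sketch below. It is only an illustration: TinyCatalog is a hypothetical stand-in written for this example, not the xscen or intake-esm DataCatalog, and the assumption that callers expect a plain list from unique() is made up for the sketch rather than taken from the real API.

```python
from typing import Iterable


class TinyCatalog:
    """Hypothetical stand-in for a data catalog (NOT the xscen/intake-esm class)."""

    def __init__(self, rows: Iterable[dict]):
        self.rows = list(rows)

    def unique(self, column: str) -> list:
        """Return the sorted distinct non-null values of one column as a plain list."""
        return sorted(
            {row[column] for row in self.rows if row.get(column) is not None}
        )


def check_unique_contract(catalog: TinyCatalog) -> list:
    """Regression check: downstream code is assumed to rely on a plain list."""
    xrfreqs = catalog.unique("xrfreq")
    assert isinstance(xrfreqs, list), f"expected list, got {type(xrfreqs).__name__}"
    return xrfreqs


# Small in-memory catalog with a duplicate and a missing value.
catalog = TinyCatalog(
    [
        {"id": "a", "xrfreq": "D"},
        {"id": "b", "xrfreq": "MS"},
        {"id": "c", "xrfreq": "D"},
        {"id": "d", "xrfreq": None},
    ]
)
result = check_unique_contract(catalog)
```

Pinning the return type in a test like this would make a change in unique() fail at the source instead of in a notebook further down the workflow.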