Faster search - Githubissues

aulemahal commented 1 year ago

Pull Request Checklist:

[x] pre-commit hooks are installed/active in my local clone ($ pre-commit install)
[x] This PR addresses an already opened issue (for bug fixes / features)
- This PR fixes #
[x] (If applicable) Documentation has been added / updated (for bug fixes / features)
[ ] If a merge request has been made in parallel to this PR in xscen-notebooks, it is merged and the submodules have been updated.
[x] HISTORY.rst has been updated (with summary of main changes)
- [x] Link to issue (:issue:number) and pull request (:pull:number) has been added

The issue

search_data_catalogs and extract_dataset are very slow when the catalogs are very big. The base case for this PR was raised by @coxipi and replicated by me: search_data_catalogs over the MRCC5, with a selection that returned 0 datasets, took 12 min. The same process, coded directly through DataCatalog.search(), took 2 min.

What kind of change does this PR introduce?

Faster search_data_catalogs and extract_dataset through:

- Faster DataCatalog.unique: I copied the cat.unique() implementation from intake-esm, but instead of computing the unique values of ALL columns and then extracting the column(s) I want, I only compute them for the requested columns.
- Removed a logging call that used DataCatalog.nunique(): for the same reason as above, instead of logging catalog.nunique()['id'], I switched to len(catalog.unique()['id']). That made an improvement of more than 5 min. LOL.
- Faster date parsing: yet another implementation of our custom parse_dates, optimized for the case where some dates fall outside the datetime64[ns] bounds. Parsing simulation.json went from 50 s to 3 s.
- Rewrite of the ensure_correct_time logic. This one is funny. Did you know that any(da > 1) is 10x slower than (da > 1).any()? I didn't. I also added a fast track for cases where infer_freq works (most of them, duh).
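For anyone curious about the any() point above, here is a minimal sketch of the difference. It uses a plain NumPy array as a stand-in for the DataArray (an assumption, to keep it dependency-light), and the function names are made up for the illustration. The mask is all False on purpose: the builtin short-circuits on the first True, so the slowdown only shows when it has to walk the whole array. The actual ratio will depend on the array size and the machine.

```python
import timeit

import numpy as np

# Stand-in for a DataArray: nothing here is > 1, so neither version can stop early.
da = np.zeros(200_000)


def builtin_any(arr):
    # Builtin any() iterates over the mask in Python, one boxed element at a time.
    return any(arr > 1)


def vectorized_any(arr):
    # The .any() method reduces the mask in compiled code.
    return bool((arr > 1).any())


# Both give the same answer; only the cost differs.
assert builtin_any(da) == vectorized_any(da)

t_builtin = timeit.timeit(lambda: builtin_any(da), number=5)
t_method = timeit.timeit(lambda: vectorized_any(da), number=5)
print(f"builtin any(): {t_builtin:.4f} s, .any(): {t_method:.4f} s")
```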

Does this PR introduce a breaking change?

I ran the getting_started notebook and got no error. QED.

Seriously, I don't think so. The error raised when we have invalid date strings in catalogs may have changed, but it is still explicit.
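To illustrate the idea behind the date parsing change and the explicit error, here is a rough sketch. It is not the xscen parse_dates implementation: parse_dates_sketch, its fmt argument, and the error message are all hypothetical. The assumption is that a single vectorized pandas call handles most strings, and only the leftovers (out-of-bounds or malformed) go through a slower per-element path that raises a clear ValueError.

```python
from datetime import datetime

import pandas as pd


def parse_dates_sketch(values, fmt="%Y-%m-%d"):
    """Hypothetical helper, NOT xscen's parse_dates."""
    raw = pd.Series(values, dtype="object")
    # Fast path: one vectorized call; whatever it cannot represent becomes NaT.
    fast = pd.to_datetime(raw, format=fmt, errors="coerce")
    failed = fast.isna() & raw.notna()
    if not failed.any():
        return fast
    # Slow path, only for the leftovers: datetime.strptime accepts years 1 to 9999
    # and raises ValueError on malformed strings.
    out = fast.astype("object")
    for idx in raw.index[failed]:
        try:
            out.loc[idx] = datetime.strptime(raw.loc[idx], fmt)
        except ValueError as err:
            raise ValueError(f"Invalid date string in catalog: {raw.loc[idx]!r}") from err
    return out
```

Something like parse_dates_sketch(["2001-01-01", "2500-01-01"]) should keep the year 2500 entry instead of dropping it, while a string such as "not-a-date" raises a ValueError naming the offending value.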

Other information:

The unique() improvement could be moved to intake_esm but I don't have the energy.
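For reference, the column-restricted unique() idea can be sketched as follows. This assumes the catalog is just a plain pandas DataFrame; unique_for_columns is a hypothetical helper, not the intake-esm or xscen code, and it ignores details such as columns that hold iterables.

```python
import pandas as pd


def unique_for_columns(df, columns=None):
    """Hypothetical helper: unique values computed only for the requested columns."""
    if isinstance(columns, str):
        columns = [columns]
    cols = list(df.columns) if columns is None else list(columns)
    # Only the requested columns are scanned; the others are never touched.
    return pd.Series({col: df[col].dropna().unique().tolist() for col in cols})


# Toy catalog with made-up entries.
catalog_df = pd.DataFrame(
    {
        "id": ["a", "a", "b"],
        "variable": ["tas", "pr", "tas"],
        "xrfreq": ["D", "D", "MS"],
    }
)

# Counting ids without computing unique values for every column.
n_ids = len(unique_for_columns(catalog_df, "id")["id"])
```

With a wide catalog, the saving comes from skipping every column that was not asked for, which is also why len(unique(...)['id']) can beat a full nunique().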

aulemahal commented 1 year ago

@RondeauG do you have an idea about the failures in my tests? It seems to be the "ensemble reduction" notebook that doesn't work as expected...

aulemahal commented 1 year ago

Woups my bad forget it. It's from a change I made in DataCatalog.unique()

RondeauG commented 1 year ago

> Woups my bad forget it. It's from a change I made in DataCatalog.unique()

Yeah, I think xrfreqs=ds_dict.unique("xrfreq") might be crashing because you changed the type of the output?

Edit: It's almost as if we really need to implement testing... 🙄

Ouranosinc / xscen

Faster search #127
