Closed aulemahal closed 1 year ago
@RondeauG do you have an idea about the failures in my tests? It seems to be the "ensemble reduction" notebook that doesn't work as expected...
Woups my bad forget it. It's from a change I made in DataCatalog.unique()
Woups my bad forget it. It's from a change I made in
DataCatalog.unique()
Yeah, I think xrfreqs=ds_dict.unique("xrfreq"),
might be crashing because you changed the type of the output?
Edit: It's almost as if we really need to implement testing... 🙄
Pull Request Checklist:
pre-commit
hooks are installed/active in my local clone ($ pre-commit install
)number
) and pull request (:pull:number
) has been addedThe issue
search_data_catalogs
andextract_dataset
are very slow when the catalogs are very big. The base case for this PR was raised by @coxipi and replicated by me :search_data_catalog
over the MRCC5, with a selection that returned 0 datasets, took 12 min. The same process, but coded throughDataCatalog.search()
took 2 min.What kind of change does this PR introduce?
Faster
search_data_catalogs
andextract_dataset
through:DataCatalog.unique
: I copied thecat.unique()
implementation fromintake-esm
but instead of computing ALL unique values and then extracting those from the column(s) I want, I only compute for those columns.DataCatalog.nunique()
: for the same reason as above, instead of logging :catalog.nunique()['id']
, I switched tolen(catalog.unique()['id'])
. That made an improvement of more than 5 min. LOL.parse_dates
to optimize the case when we have dates out of thedatetime64[ns]
bounds. I went from 50 s to 3 s forsimulation.json
.ensure_correct_time
logic. This one is funny. Did you know thatany(da > 1)
is 10x slower than(da > 1).any()
. I didn't. I also added a fast-track for cases whereinfer_freq
works (most of them duh).Does this PR introduce a breaking change?
I ran the
getting_started
notebook and got no error. CQFD,Seriously, I don't think so. The error raised when we have invalid date strings in catalogs may have changed, but it is still explicit.
Other information:
The
unique()
improvement could be moved tointake_esm
but I don't have the energy.