[ENH] Investigate/improve performance of data checks and pd.MultiIndex operations/iterations

aeon-toolkit / aeon

A toolkit for machine learning from time series

https://aeon-toolkit.org/

BSD 3-Clause "New" or "Revised" License


[ENH] Investigate/improve performance of data checks and pd.MultiIndex operations/iterations #37

Closed aiwalter closed 1 month ago

aiwalter commented 1 year ago

Is your feature request related to a problem? Please describe.

Some data checks and pd.MultiIndex operations appear to have significant performance problems.

Describe the solution you'd like

Related: https://github.com/sktime/sktime/issues/4139

TonyBagnall commented 1 year ago

Does anyone use pd.MultiIndex with forecasting? It's not relevant for classification/regression/clustering.

aiwalter commented 1 year ago

MultiIndex is used all over the place for forecasting and transformations with panel data. Such panel data is very common in industry. We also have a lot of it at my day job, and I see severe performance issues with it there.
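For context, here is a minimal sketch of the kind of per-instance iteration over a pd.MultiIndex panel that can become costly. The panel layout, the level names (`instance`, `time`) and the column `y` are assumptions made for illustration, not code from aeon. It contrasts one index lookup per instance with a single `groupby` pass; which one is faster on real data would need to be measured.

```python
import numpy as np
import pandas as pd

# Hypothetical panel: 3 instances with 4 time points each, indexed by (instance, time).
n_instances, n_timepoints = 3, 4
index = pd.MultiIndex.from_product(
    [range(n_instances), range(n_timepoints)], names=["instance", "time"]
)
panel = pd.DataFrame(
    {"y": np.arange(n_instances * n_timepoints, dtype=float)}, index=index
)

# Pattern 1: one cross-section lookup per instance (repeated index lookups).
instance_ids = panel.index.get_level_values("instance").unique()
means_lookup = {
    i: panel.xs(i, level="instance")["y"].mean() for i in instance_ids
}

# Pattern 2: a single groupby pass over the instance level.
means_groupby = panel.groupby(level="instance")["y"].mean().to_dict()

print(means_lookup)
print(means_groupby)
```

Both patterns compute the same per-instance means; the second avoids slicing the index once per instance.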

aiwalter commented 1 year ago

@ltsaprounis shared this link in Slack: https://scikit-learn.org/stable/developers/performance.html#profiling-python-code

It could be a good starting point.
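As a rough sketch of what profiling a check could look like, the snippet below runs the standard-library `cProfile` over a stand-in function. `check_panel` and the toy panel are hypothetical and only illustrate the workflow; they are not an actual aeon check.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd


def check_panel(df):
    """Hypothetical stand-in for a data check: validate each instance separately."""
    for _, instance in df.groupby(level=0):
        times = instance.index.get_level_values(-1)
        if not times.is_monotonic_increasing:
            raise ValueError("time index must be sorted within each instance")
    return True


# Toy panel: 50 instances with 20 time points each.
index = pd.MultiIndex.from_product(
    [range(50), range(20)], names=["instance", "time"]
)
df = pd.DataFrame({"y": np.zeros(len(index))}, index=index)

# Profile only the call we are interested in.
profiler = cProfile.Profile()
profiler.enable()
result = check_panel(df)
profiler.disable()

# Collect the ten most expensive entries by cumulative time into a string.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
report = stream.getvalue()
print(report)
```

Sorting by cumulative time shows which calls inside the check dominate, which is usually the first thing to look at before optimising anything.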

aiwalter commented 1 year ago

Possibly we have to implement some config that records which checks have already been done, so that they don't need to be repeated X times. We could have a look at the newly introduced config in scikit-learn: https://scikit-learn.org/dev/modules/generated/sklearn.set_config.html#sklearn.set_config
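A minimal sketch of how such a config could work, loosely modelled on the `set_config` / `config_context` pattern in scikit-learn. Everything here is hypothetical: the key `skip_input_checks`, the helper names and the toy check are assumptions for illustration, not an existing aeon or scikit-learn API.

```python
from contextlib import contextmanager

# Hypothetical global config; the key name is an assumption.
_config = {"skip_input_checks": False}

# Counter used only to show how often the expensive check really runs.
check_calls = {"count": 0}


def get_config():
    """Return a copy of the current config."""
    return dict(_config)


def set_config(**kwargs):
    """Update config values, rejecting unknown keys."""
    unknown = set(kwargs) - set(_config)
    if unknown:
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    _config.update(kwargs)


@contextmanager
def config_context(**kwargs):
    """Temporarily change the config and restore it afterwards."""
    old = get_config()
    set_config(**kwargs)
    try:
        yield
    finally:
        _config.update(old)


def check_input(series_by_instance):
    """Hypothetical expensive check, skipped when the config says so."""
    if _config["skip_input_checks"]:
        return series_by_instance
    check_calls["count"] += 1
    lengths = {len(values) for values in series_by_instance.values()}
    if len(lengths) > 1:
        raise ValueError("all instances must have the same length")
    return series_by_instance


data = {"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]}

check_input(data)  # validated once at the entry point
with config_context(skip_input_checks=True):
    for _ in range(5):
        check_input(data)  # nested calls skip the repeated validation

print(check_calls["count"])
```

The outer call validates the data once; calls made inside the context manager trust that result, and the previous setting is restored when the block exits.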

aiwalter commented 1 year ago

Todo: additionally improve the check exception messages. Currently the checks are called in base classes, so the user cannot see directly in which estimator the check failed. This could probably be improved by handing over the parent class name(s), or by tracking that automatically with some inspect magic?
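One possible shape for this, as a sketch only: because the check runs as a method on the estimator, `type(self).__name__` already gives the concrete class name without any `inspect` usage. The class names `BaseEstimatorSketch` and `MyForecaster` and the method `_check_X` are hypothetical and do not mirror aeon's actual base classes.

```python
class BaseEstimatorSketch:
    """Hypothetical base class that owns the input check."""

    def _check_X(self, X):
        # type(self) is the concrete subclass, even though the check lives here.
        if not isinstance(X, dict):
            raise TypeError(
                f"{type(self).__name__}.fit: X must be a dict of series, "
                f"got {type(X).__name__}"
            )
        return X

    def fit(self, X):
        self._check_X(X)
        return self


class MyForecaster(BaseEstimatorSketch):
    """Hypothetical estimator that inherits the check unchanged."""


try:
    MyForecaster().fit([1, 2, 3])
except TypeError as err:
    message = str(err)

print(message)
```

The message names `MyForecaster` rather than the base class, which is the information the user needs to locate the failing estimator.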