Closed TonyBagnall closed 1 year ago
Thanks for writing it up with a specific case, it helps to highlight the issue. I'm with you on this being a pot of spaghetti without a clear reason for existing. I see two separate functionalities that we could use to split it up.
I would be in favor of drastically reducing the number of supported formats and delegating conversions elsewhere. With fewer formats we could have predictable input/output types and most probably improve performance by skipping all the introspection/conversion.
Describe the bug
I've had a look at the Transformation module, which tbh I have not looked at much since it was completely refactored. tl;dr: it is bloated, hard to understand and very inefficient. I think we should design around it, then remove it. So this is my journey through. Let's suppose I build a ShapeletTransformClassifier which extends BaseClassifier. This consists of a pipeline of ShapeletTransform (extends BaseTransform) and RotationForest (a scikit-learn-like classifier in our package).
fit goes to the BaseClassifier fit. This checks the input and finds the metadata, which includes info on data characteristics (number of dimensions, has missing values, unequal length).
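For concreteness, a metadata scan of this kind might look like the following sketch. This is a hypothetical helper, not the actual BaseClassifier code; the function name and the metadata keys are assumptions for illustration.

```python
import numpy as np

def compute_metadata(X):
    """Illustrative metadata scan over a collection of series.

    X is assumed to be a list of 2D numpy arrays, each of shape
    (n_channels, series_length). Note the has_missing check touches
    every value, which is where the O(ndm) cost comes from.
    """
    n_dims = X[0].shape[0]
    lengths = {x.shape[1] for x in X}
    return {
        "multivariate": n_dims > 1,
        "has_missing": any(np.isnan(x).any() for x in X),  # full scan
        "unequal_length": len(lengths) > 1,
    }
```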
Computing this metadata necessarily requires looking at all the data, so if the data has n instances, d dimensions and series length m, this is an O(ndm) operation. Fair enough, a price worth paying I think to check input. Data is also converted here if necessary. Then the private fit is called in ShapeletTransformClassifier. This basically does this (with some other bits), where the transformer is ShapeletTransform and the estimator is RotationForest.
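Based on that description, the private fit presumably amounts to something like the sketch below. The class and method names here are stand-ins for illustration, not the real ShapeletTransformClassifier code.

```python
class IdentityTransform:
    """Stand-in for ShapeletTransform: fit_transform returns features."""
    def fit_transform(self, X, y=None):
        return X

class MajorityClassifier:
    """Stand-in for RotationForest: predicts the most common label."""
    def fit(self, X, y):
        self.prediction_ = max(set(y), key=list(y).count)
        return self
    def predict(self, X):
        return [self.prediction_] * len(X)

class PipelineClassifierSketch:
    """Sketch of a transform-then-classify private fit."""
    def __init__(self, transformer, estimator):
        self.transformer = transformer
        self.estimator = estimator
    def _fit(self, X, y):
        # transform the input, then fit the downstream estimator on it
        Xt = self.transformer.fit_transform(X, y)
        self.estimator.fit(Xt, y)
        return self
```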
fit_transform goes to the BaseTransformer fit_transform, which calls the BaseTransformer fit() and then transform(). It gets messy now: fit calls this
Now, _check_X_y is a 250-line function with this docstring
uuuhhhh, oookaaay ..... stepping away from the logic of all that .... two things it does are retrieve the metadata again and clone the data (I think). So we then get out of the base fit, into the ShapeletTransform fit, do the actual work, and then go to transform in the base class. This calls _check_X_y again, repeating all that convoluted checking, metadata retrieval and data cloning. So we have passed through a mass of confusing code that uses non-standard terminology, calls obscure spaghetti code with unknown terms like scitype/mtype, cloned the data twice and performed a linear scan three times, before we actually build a model. And this is even before we consider the mess that is the so-called Vectorisation! Personally, I do not want to use this code.
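The double checking described above can be reproduced with a minimal sketch. This is not the actual base-class code; the class and method names are assumptions, but the structure shows why one fit_transform call triggers the full check-and-copy twice.

```python
class BaseTransformerSketch:
    """Minimal sketch of a base class whose fit and transform each
    re-run the same input check and copy the data."""
    def _check(self, X):
        # count invocations to expose the repeated work
        self.check_count = getattr(self, "check_count", 0) + 1
        return [list(x) for x in X]  # validate + clone on every call
    def fit(self, X):
        X = self._check(X)
        self._fit(X)
        return self
    def transform(self, X):
        X = self._check(X)  # same full check and copy, again
        return self._transform(X)
    def fit_transform(self, X):
        return self.fit(X).transform(X)

class NoOpTransform(BaseTransformerSketch):
    def _fit(self, X):
        pass
    def _transform(self, X):
        return X
```

One obvious fix this structure suggests is for fit_transform to check once and pass the already-validated data to private _fit and _transform methods.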