aeon-toolkit / aeon

A toolkit for machine learning from time series
https://aeon-toolkit.org/
BSD 3-Clause "New" or "Revised" License

[ENH] Improve the efficiency of the Transformer base class #100

Closed TonyBagnall closed 1 year ago

TonyBagnall commented 1 year ago

Describe the bug

I've had a look at the Transformation module, which tbh I have not looked at much since it was completely refactored. tl;dr: it is bloated, hard to understand and very inefficient. I think we should design around it, then remove it. So this is my journey through it. Let's suppose I build a ShapeletTransformClassifier, which extends BaseClassifier. This consists of a pipeline of ShapeletTransform (extends BaseTransformer) and RotationForest (a scikit-learn-like classifier in our package).

from sktime.classification.shapelet_based import ShapeletTransformClassifier
from sktime.datasets import load_unit_test

trainx, trainy = load_unit_test(return_type="numpy2d", split="train")
stc = ShapeletTransformClassifier()
stc.fit(trainx, trainy)

fit goes to the BaseClassifier fit. This checks the input and extracts the metadata, which includes info on data characteristics (number of dimensions, missing values, unequal length).

        X, y = self._internal_convert(X, y)
        X_metadata = self._check_classifier_input(X, y)
        missing = X_metadata["has_nans"]
        multivariate = not X_metadata["is_univariate"]
        unequal = not X_metadata["is_equal_length"]
        self._X_metadata = X_metadata
        # Convert data as dictated by the classifier tags
        X = self._convert_X(X)

        # Check this classifier can handle characteristics
        self._check_capabilities(missing, multivariate, unequal)
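
The metadata extraction above amounts to a full pass over the data. A minimal sketch of what it computes for a 3D numpy input (hypothetical `scan_metadata` helper, not the actual check code):

```python
import numpy as np

def scan_metadata(X):
    """Sketch of the metadata scan: one full pass over an
    (n_instances, n_dims, series_length) array, hence O(n*d*m)."""
    return {
        "has_nans": bool(np.isnan(X).any()),
        "is_univariate": X.shape[1] == 1,
        "is_equal_length": True,  # a 3D numpy array is equal length by construction
    }

X = np.zeros((10, 1, 24))
meta = scan_metadata(X)
```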

this necessarily requires looking at all data, so if the data has n instances, d dimensions and series length m, this is an O(ndm) operation. Fair enough; a price worth paying, I think, to check input. Data is also converted here if necessary. Then the private fit is called in ShapeletTransformClassifier. It basically does this (with some other bits), where the transformer is ShapeletTransform and the estimator is RotationForest.

        X_t = self._transformer.fit_transform(X, y).to_numpy()
        self._estimator.fit(X_t, y)
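
Those two lines are the whole algorithmic content of the private fit. With sklearn stand-ins (StandardScaler in place of ShapeletTransform, RandomForestClassifier in place of RotationForest; just to show the pipeline shape, not the actual components), the pattern is:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

X = np.random.RandomState(0).rand(20, 24)  # 20 series of length 24, as a 2D array
y = np.array([0, 1] * 10)

transformer = StandardScaler()  # stand-in for ShapeletTransform
estimator = RandomForestClassifier(n_estimators=5, random_state=0)  # stand-in for RotationForest

X_t = transformer.fit_transform(X, y)  # transform: one pass over the data
estimator.fit(X_t, y)                  # model building on the transformed features
```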

fit_transform goes to BaseTransformer fit_transform, which calls BaseTransformer fit() then transform(). It gets messy now. fit calls this

        # check and convert X/y
        X_inner, y_inner = self._check_X_y(X=X, y=y)

now _check_X_y is a 250-line function with this docstring:

        Returns
        -------
        X_inner : Series, Panel, or Hierarchical object, or VectorizedDF
                compatible with self.get_tag("X_inner_mtype") format
            Case 1: self.get_tag("X_inner_mtype") supports scitype of X, then
                converted/coerced version of X, mtype determined by "X_inner_mtype" tag
            Case 2: self.get_tag("X_inner_mtype") supports *higher* scitype than X
                then X converted to "one-Series" or "one-Panel" sub-case of that scitype
                always pd-multiindex (Panel) or pd_multiindex_hier (Hierarchical)
            Case 3: self.get_tag("X_inner_mtype") supports only *simpler* scitype than X
                then VectorizedDF of X, iterated as the most complex supported scitype
        y_inner : Series, Panel, or Hierarchical object, or VectorizedDF
                compatible with self.get_tag("y_inner_mtype") format
            Case 1: self.get_tag("y_inner_mtype") supports scitype of y, then
                converted/coerced version of y, mtype determined by "y_inner_mtype" tag
            Case 2: self.get_tag("y_inner_mtype") supports *higher* scitype than y
                then X converted to "one-Series" or "one-Panel" sub-case of that scitype
                always pd-multiindex (Panel) or pd_multiindex_hier (Hierarchical)
            Case 3: self.get_tag("y_inner_mtype") supports only *simpler* scitype than y
                then VectorizedDF of X, iterated as the most complex supported scitype
            Case 4: None if y was None, or self.get_tag("y_inner_mtype") is "None"

            Complexity order above: Hierarchical > Panel > Series

        metadata : dict, returned only if return_metadata=True
            dictionary with str keys, contents as follows
            _converter_store_X : dict, metadata from X conversion, for back-conversion
            _X_mtype_last_seen : str, mtype of X seen last
            _X_input_scitype : str, scitype of X seen last
            _convert_case : str, conversion case (see above), one of
                "case 1: scitype supported"
                "case 2: higher scitype supported"
                "case 3: requires vectorization"

uuuhhhh, oookaaay ..... stepping away from the logic of all that .... two things it does are retrieve the metadata again and (I think) clone the data. So we then get out of the base fit, into ShapeletTransform fit, do the actual work, and then go to transform in the base class. This calls _check_X_y again, repeating all that convoluted checking, metadata gathering and cloning of data. So we have passed through a mass of confusing code that uses non-standard terminology,

      Returns
        -------
        transformed version of X
        type depends on type of X and scitype:transform-output tag:
            |          | `transform`  |                        |
            |   `X`    |  `-output`   |     type of return     |
            |----------|--------------|------------------------|
            | `Series` | `Primitives` | `pd.DataFrame` (1-row) |
            | `Panel`  | `Primitives` | `pd.DataFrame`         |
            | `Series` | `Series`     | `Series`               |
            | `Panel`  | `Series`     | `Panel`                |
            | `Series` | `Panel`      | `Panel`                |
        instances in return correspond to instances in `X`
        combinations not in the table are currently not supported

calls obscure spaghetti code with unknown terms like scitype/mtype, clones the data twice and performs a linear scan three times, before we actually build a model. And this is even before we consider the mess that is the so-called vectorisation! Personally, I do not want to use this code.

    def _check_X_y(self, X=None, y=None, return_metadata=False):
        if X is None:
            raise TypeError("X cannot be None, but found None")

        metadata = dict()
        metadata["_converter_store_X"] = dict()

        def _most_complex_scitype(scitypes, smaller_equal_than=None):
            """Return most complex scitype in a list of str."""
            if "Hierarchical" in scitypes and smaller_equal_than == "Hierarchical":
                return "Hierarchical"
            elif "Panel" in scitypes and smaller_equal_than != "Series":
                return "Panel"
            elif "Series" in scitypes:
                return "Series"
            elif smaller_equal_than is not None:
                return _most_complex_scitype(scitypes)
            else:
                raise ValueError("no series scitypes supported, bug in estimator")

        def _scitype_A_higher_B(scitypeA, scitypeB):
            """Compare two scitypes regarding complexity."""
            if scitypeA == "Series":
                return False
            if scitypeA == "Panel" and scitypeB == "Series":
                return True
            if scitypeA == "Hierarchical" and scitypeB != "Hierarchical":
                return True
            return False

        # retrieve supported mtypes
        X_inner_mtype = _coerce_to_list(self.get_tag("X_inner_mtype"))
        y_inner_mtype = _coerce_to_list(self.get_tag("y_inner_mtype"))
        X_inner_scitype = mtype_to_scitype(X_inner_mtype, return_unique=True)
        y_inner_scitype = mtype_to_scitype(y_inner_mtype, return_unique=True)

        ALLOWED_SCITYPES = ["Series", "Panel", "Hierarchical"]
        ALLOWED_MTYPES = self.ALLOWED_INPUT_MTYPES

        # checking X
        X_valid, msg, X_metadata = check_is_scitype(
            X,
            scitype=ALLOWED_SCITYPES,
            return_metadata=True,
            var_name="X",
        )
        X_scitype = X_metadata["scitype"]
        X_mtype = X_metadata["mtype"]
        # remember these for potential back-conversion (in transform etc)
        metadata["_X_mtype_last_seen"] = X_mtype
        metadata["_X_input_scitype"] = X_scitype

        if X_mtype not in ALLOWED_MTYPES:
            raise TypeError("X " + msg_invalid_input)

        if X_scitype in X_inner_scitype:
            case = "case 1: scitype supported"
            req_vec_because_rows = False
        elif any(_scitype_A_higher_B(x, X_scitype) for x in X_inner_scitype):
            case = "case 2: higher scitype supported"
            req_vec_because_rows = False
        else:
            case = "case 3: requires vectorization"
            req_vec_because_rows = True
        metadata["_convert_case"] = case

        # checking X vs tags
        inner_univariate = self.get_tag("univariate-only")
        # we remember whether we need to vectorize over columns, and at all
        req_vec_because_cols = inner_univariate and not X_metadata["is_univariate"]
        requires_vectorization = req_vec_because_rows or req_vec_because_cols
        # end checking X

        if y_inner_mtype != ["None"] and y is not None:

            if "Table" in y_inner_scitype:
                y_possible_scitypes = "Table"
            elif X_scitype == "Series":
                y_possible_scitypes = "Series"
            elif X_scitype == "Panel":
                y_possible_scitypes = "Panel"
            elif X_scitype == "Hierarchical":
                y_possible_scitypes = ["Panel", "Hierarchical"]

            y_valid, _, y_metadata = check_is_scitype(
                y, scitype=y_possible_scitypes, return_metadata=True, var_name="y"
            )
            if not y_valid:
                raise TypeError("y " + msg_invalid_input)

            y_scitype = y_metadata["scitype"]

        else:
            # y_scitype is used below - set to None if y is None
            y_scitype = None
        # end checking y

        # no compatibility checks between X and y
        # end compatibility checking X and y

        # convert X & y to supported inner type, if necessary
        #####################################################

        # convert X and y to a supported internal mtype
        #  if X/y mtype is already supported, no conversion takes place
        #  if X/y is None, then no conversion takes place (returns None)
        #  if vectorization is required, we wrap in VectorizedDF

        # case 2. internal only has higher scitype, e.g., inner is Panel and X Series
        #       or inner is Hierarchical and X is Panel or Series
        #   then, consider X as one-instance Panel or Hierarchical
        if case == "case 2: higher scitype supported":
            if X_scitype == "Series" and "Panel" in X_inner_scitype:
                as_scitype = "Panel"
            else:
                as_scitype = "Hierarchical"
            X = convert_to_scitype(X, to_scitype=as_scitype, from_scitype=X_scitype)
            X_scitype = as_scitype
            # then pass to case 1, which we've reduced to, X now has inner scitype

        # case 1. scitype of X is supported internally
        # case in ["case 1: scitype supported", "case 2: higher scitype supported"]
        #   and does not require vectorization because of cols (multivariate)
        if not requires_vectorization:
            # converts X
            X_inner = convert_to(
                X,
                to_type=X_inner_mtype,
                store=metadata["_converter_store_X"],
                store_behaviour="reset",
            )

            # converts y, returns None if y is None
            if y_inner_mtype != ["None"] and y is not None:
                y_inner = convert_to(
                    y,
                    to_type=y_inner_mtype,
                    as_scitype=y_scitype,
                )
            else:
                y_inner = None

        # case 3. scitype of X is not supported, only lower complexity one is
        #   then apply vectorization, loop method execution over series/panels
        # elif case == "case 3: requires vectorization":
        else:  # if requires_vectorization
            iterate_X = _most_complex_scitype(X_inner_scitype, X_scitype)
            X_inner = VectorizedDF(
                X=X,
                iterate_as=iterate_X,
                is_scitype=X_scitype,
                iterate_cols=req_vec_because_cols,
            )
            # we also assume that y must be vectorized in this case
            if y_inner_mtype != ["None"] and y is not None:
                # raise ValueError(
                #     f"{type(self).__name__} does not support Panel X if y is not "
                #     f"None, since {type(self).__name__} supports only Series. "
                #     "Auto-vectorization to extend Series X to Panel X can only be "
                #     'carried out if y is None, or "y_inner_mtype" tag is "None". '
                #     "Consider extending _fit and _transform to handle the following "
                #     "input types natively: Panel X and non-None y."
                # )
                iterate_y = _most_complex_scitype(y_inner_scitype, y_scitype)
                y_inner = VectorizedDF(X=y, iterate_as=iterate_y, is_scitype=y_scitype)
            else:
                y_inner = None

        if return_metadata:
            return X_inner, y_inner, metadata
        else:
            return X_inner, y_inner
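
By contrast, the "check and convert once at the classifier boundary, then trust the format downstream" approach could be very small. A hypothetical sketch (not existing aeon API), assuming a single internal 3D numpy layout:

```python
import numpy as np

def check_and_convert_once(X, y):
    """Hypothetical single checkpoint: validate and coerce to 3D numpy once,
    so the transformer and estimator can trust the format afterwards."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 2:  # (n_instances, series_length) -> insert a channel axis
        X = X[:, np.newaxis, :]
    if X.ndim != 3:
        raise TypeError(f"X must be 2D or 3D, found {X.ndim} dimensions")
    y = np.asarray(y)
    if len(y) != len(X):
        raise ValueError(f"X has {len(X)} instances but y has {len(y)} labels")
    return X, y

X, y = check_and_convert_once(np.zeros((8, 24)), np.arange(8))
```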
lmmentel commented 1 year ago

Thanks for writing it up with a specific case, it helps to highlight the issue. I'm with you on this being a pot of spaghetti without a clear reason for existing. I see two separate functionalities that we could use to split it up.

  1. data transformation - in the sense of sklearn transformers, applying transformations to the data itself, e.g. z-score normalization
  2. format conversion - all the complex scitype/mtype machinery and the rest of the made-up terms that do the introspection and format conversions

I would be in favor of drastically reducing the number of supported formats and delegating conversions elsewhere. With fewer formats we could have predictable input/output types and would most probably improve performance by skipping all the introspection/conversion.
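
A sketch of that split, with hypothetical names: the value-level transformation assumes one fixed numpy layout, and format conversion is one explicit function rather than introspection threaded through the base class:

```python
import numpy as np

def to_numpy3d(X):
    """Format conversion: one explicit entry point to the internal layout."""
    X = np.asarray(X, dtype=float)
    return X[:, np.newaxis, :] if X.ndim == 2 else X

def zscore(X):
    """Data transformation: z-score normalise each series in an
    (n_instances, n_dims, series_length) array."""
    mean = X.mean(axis=-1, keepdims=True)
    std = X.std(axis=-1, keepdims=True)
    return (X - mean) / np.where(std == 0.0, 1.0, std)

X_t = zscore(to_numpy3d(np.random.RandomState(0).rand(5, 30)))
```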