[ENH] Transformer interaction with 2D arrays/datatypes

MatthewMiddlehurst commented 1 year ago

The transformations module does not currently handle 2D arrays in an intuitive way for classification/regression/clustering tasks. For example, a 2D numpy array is currently treated as a single multivariate series (n_timepoints, n_series), but someone coming from sklearn who is familiar with its framework will assume it is a collection of univariate series (n_cases, n_timepoints).
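
To make the ambiguity concrete, here is a minimal numpy-only sketch (it does not call any aeon code, and the variable names are just for illustration). The same 2D array gives a different number of "series" depending on which reading is assumed:

```python
import numpy as np

# A 2D array with shape (3, 4); nothing in the array itself says which axis is time.
X = np.arange(12).reshape(3, 4)

# Reading 1: a single multivariate series, shape (n_timepoints, n_series).
# Each column is one series, so a per-series summary reduces over axis 0.
means_as_single_series = X.mean(axis=0)  # one value per column: 4 series

# Reading 2: sklearn-style collection, shape (n_cases, n_timepoints).
# Each row is one series, so a per-series summary reduces over axis 1.
means_as_collection = X.mean(axis=1)  # one value per row: 3 series

print(means_as_single_series.shape, means_as_collection.shape)
```

Both readings are valid for the same input, which is why the array shape alone cannot tell a transformer which one the user meant.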

If this mistake is made, there may be no indication of any problem, as the base class will convert the input to a usable format regardless of intention. For example, this can result in multiple TSFreshRelevantFeatureExtractor objects being fitted on many single series, which makes no sense at all. Even in cases where the output is not affected, e.g. ROCKET, it still makes the transformation grossly inefficient.

In my opinion, the growth and usability of the module is currently constrained by trying to force two distinct learning tasks into a single framework. It is not sensible to have the class infer which task the input is for when the tasks share valid input datatypes but use them in different ways.

This still needs further discussion on what actions to take (if any). In the last developer meeting, there was general agreement that the current implicit conversion of 2D data is not the design we want. A few options:

1. Leave it as it is, with 2D input being invalid for classification/regression/clustering.
   - IMO, if we want to position ourselves as sklearn-compatible and friendly to its users, 2D input for the tasks the packages share is an important feature, and we should avoid this option.
2. Require a flag from the user telling transformers how to process 2D data when it is input.
   - This would only impact the 2D input datatypes that all learning tasks share. I'm not sure exactly how this flag would be given, or how usable it would be for pipelines etc.
3. Split ML and forecasting tasks by datatype, i.e. ML uses numpy and forecasting uses pandas exclusively.
   - Forecasting seems to use pandas exclusively while ML tasks seem to use numpy. This may not be the case for users, however. Making this split would lose a lot of valid input types for both tasks and would still be rather confusing to new users IMO.
4. Split ML and forecasting tasks by transformer, i.e. have a separate package for each task and require explicit conversion between them.
   - This would be similar to how transformers used to be split into panel and series transformers (names can be changed), each with their own acceptable input types and task-specific actions. While extra effort would be required to use these transformers for the opposite task, it should be possible to implement converters between them so that they are still usable.
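
As a rough illustration of what the flag in the second option could look like, here is a sketch under stated assumptions: `to_3d_collection` and its `layout` argument are hypothetical names made up for this example and are not part of aeon. The idea is only that the user states the meaning of a 2D array explicitly rather than the base class guessing:

```python
import numpy as np


def to_3d_collection(X, layout):
    """Hypothetical helper: convert a 2D array to (n_cases, n_channels, n_timepoints).

    layout="collection": X is (n_cases, n_timepoints), many univariate series.
    layout="series":     X is (n_timepoints, n_series), one multivariate series.
    """
    X = np.asarray(X)
    if X.ndim != 2:
        raise ValueError(f"expected a 2D array, got {X.ndim} dimensions")
    if layout == "collection":
        # (n_cases, n_timepoints) -> (n_cases, 1, n_timepoints)
        return X[:, np.newaxis, :]
    if layout == "series":
        # (n_timepoints, n_series) -> (1, n_series, n_timepoints)
        return X.T[np.newaxis, :, :]
    raise ValueError(f"unknown layout: {layout!r}")


X = np.zeros((3, 4))
print(to_3d_collection(X, "collection").shape)
print(to_3d_collection(X, "series").shape)
```

How such a flag would be passed through pipelines is exactly the open question raised above; this sketch does not address it.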

TonyBagnall commented 1 year ago

Good summary, I will form lists of affected transformers. We could take the opportunity to change the name panel too ...

hadifawaz1999 commented 1 year ago

The same constraint applies when using the SAX algorithm. If the user chooses to use the algorithm with a 2D numpy array, they have to add an axis themselves in between to make the shape (n_cases, n_channels, time_series_length). For deep learners, I remember there was an automated way to do that internally. Also, support for pandas Series could be removed; I think most people will use numpy here, especially since there is no option to return a numpy array.
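
For reference, the manual workaround looks roughly like this with plain numpy (no SAX call is shown, and the array sizes are arbitrary example values):

```python
import numpy as np

# 10 univariate series of length 50, stored sklearn-style as (n_cases, n_timepoints).
X_2d = np.random.default_rng(0).normal(size=(10, 50))

# Insert a channel axis to get (n_cases, n_channels, time_series_length).
X_3d = np.expand_dims(X_2d, axis=1)

print(X_3d.shape)

# Dropping the singleton channel axis recovers the original 2D layout.
X_back = np.squeeze(X_3d, axis=1)
```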

TonyBagnall commented 11 months ago

This has been resolved by #709. I think any residual issues related to this should be raised as specific, separate issues.

aeon-toolkit / aeon

[ENH] Transformer interaction with 2D arrays/datatypes #149