angus924 / minirocket

MINIROCKET: A Very Fast (Almost) Deterministic Transform for Time Series Classification
GNU General Public License v3.0
286 stars 32 forks

Feature Transformation #24

Open rubbiyasultan opened 1 year ago

rubbiyasultan commented 1 year ago

Hello,

I am trying to run MiniRocket on my dataset, a SCADA dataset containing data from multiple sensors over a period of time. It's a multivariate time series, so I am using the multivariate version of MiniRocket from sktime. However, the features are not being transformed the way they are supposed to be.

Initially, I ran the following chunk of code on my personal SCADA dataset:

minirocket_multi = MiniRocketMultivariate()
X_train_transform = minirocket_multi.fit_transform(X_train)
X_test_transform = minirocket_multi.transform(X_test)

This is the output that I am getting,

----------------------Before Transformation------------------------------
X_train: (34992, 25)
X_test: (17472, 25)
----------------------After Transformation------------------------------
X_train: (1, 9996)
X_test: (1, 9996)

However, I think that after transformation the shapes of X_train and X_test should be (34992, 9996) and (17472, 9996). Could you please help me with this? Why is it transforming just one single sample and not the rest?

Also, I would like to mention that I loaded the data from a pickle file, which contains the data in the form of a pandas DataFrame.

import pickle

with open(train_file, "rb") as f:
    data_train = pickle.load(f)
X_train_wt = data_train.iloc[:, :-1]  # All columns except the last
y_train_wt = data_train.iloc[:, -1]  # Last column

Sandy4321 commented 1 year ago

good question

angus924 commented 1 year ago

From what you have said, my understanding is that you have 34,992 time series in your training set, each of length 25 (and, likewise, 17,472 time series in your test set, each of length 25). If so, as you say, you should expect an output shape of [34,992, 9,996] (and [17,472, 9,996]). This suggests that the dataset is univariate, as otherwise the input shape would presumably be [34,992, c, 25] (e.g., for c channels), etc.

If this is correct, you should be using the univariate version of MiniRocket.

However, you also say:

Its a multivariate time series

If this is the case, I would interpret your input dimensions as representing a single time series of length 34,992 with 25 channels (in which case your input should be shaped [1, 25, 34,992], etc).
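The two readings of a (34992, 25) table described in this reply can be sketched with NumPy. This is only an illustration under assumptions: the array below is random stand-in data, and the 3D layout (instances, channels, timepoints) is taken from the shapes quoted in the reply, not from the poster's actual data.

```python
import numpy as np

# Stand-in for the SCADA table described above: 34,992 rows and 25 columns.
# (Random values; the real data would come from the pickled DataFrame.)
rows, cols = 34_992, 25
table = np.random.default_rng(0).normal(size=(rows, cols)).astype(np.float32)

# Interpretation 1: 34,992 univariate series, each of length 25.
# Layout: (n_instances, n_channels=1, n_timepoints=25)
as_univariate = table[:, np.newaxis, :]

# Interpretation 2: one multivariate series with 25 channels and 34,992 timepoints.
# Layout: (n_instances=1, n_channels=25, n_timepoints=34,992)
as_single_multivariate = table.T[np.newaxis, :, :]

print(as_univariate.shape)           # (34992, 1, 25)
print(as_single_multivariate.shape)  # (1, 25, 34992)
```

Under the second reading there is only one instance, which would be consistent with a transformed output containing a single row.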

Basically, we need to clarify the exact format and shape of your data.

Does this help at all?

rubbiyasultan commented 1 year ago

Thank you for your answer. However, I don't understand the input shape part. My time series data has 34992 rows/samples and 25 columns/features. I am also trying to run the multivariate MiniRocket on the benchmark dataset PenDigits, but I am still getting a lot of errors. I am sharing my code with you. Maybe you could help me out?


from sktime.datasets import load_from_tsfile_to_dataframe
from sktime.transformations.panel.rocket import MiniRocketMultivariate

# Specify the path to the .ts file
file_path = "data_benchmark/PenDigits/PenDigits_TRAIN.ts"

# Load the data from the .ts file into a pandas DataFrame
X_train, y_train = load_from_tsfile_to_dataframe(file_path)

# Print the data and target shapes
print("X_train:", X_train.shape)
print("y_train:", y_train.shape)

# Specify the path to the .ts file
file_path = "data_benchmark/PenDigits/PenDigits_TEST.ts"

# Load the data from the .ts file into a pandas DataFrame
X_test, y_test = load_from_tsfile_to_dataframe(file_path)

minirocket_multi = MiniRocketMultivariate()
X_train_transform = minirocket_multi.fit_transform(X_train)
# X_test_transform = minirocket_multi.transform(X_test)

This implementation is similar to what you have provided in the documentation https://github.com/sktime/sktime/blob/main/examples/minirocket.ipynb.

However, I am still getting errors.

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[331], line 3
      1 # MiniRocket transformation
      2 minirocket_multi = MiniRocketMultivariate()
----> 3 X_train_transform = minirocket_multi.fit_transform(X_train)

File ~/.local/lib/python3.10/site-packages/sktime/transformations/base.py:620, in BaseTransformer.fit_transform(self, X, y)
    555 """Fit to data, then transform it.
    556 
    557 Fits the transformer to X and y and returns a transformed version of X.
   (...)
    616         Example: i-th instance of the output is the i-th window running over `X`
    617 """
    618 # Non-optimized default implementation; override when a better
    619 # method is possible for a given algorithm.
--> 620 return self.fit(X, y).transform(X, y)

File ~/.local/lib/python3.10/site-packages/sktime/transformations/base.py:439, in BaseTransformer.fit(self, X, y)
    437 # we call the ordinary _fit if no looping/vectorization needed
    438 if not vectorization_needed:
--> 439     self._fit(X=X_inner, y=y_inner)
    440 else:
    441     # otherwise we call the vectorized version of fit
    442     self._vectorize("fit", X=X_inner, y=y_inner)

File ~/.local/lib/python3.10/site-packages/sktime/transformations/panel/rocket/_minirocket_multivariate.py:117, in MiniRocketMultivariate._fit(self, X, y)
    115 *_, n_timepoints = X.shape
    116 if n_timepoints < 9:
--> 117     raise ValueError(
    118         (
    119             f"n_timepoints must be >= 9, but found {n_timepoints};"
    120             " zero pad shorter series so that n_timepoints == 9"
    121         )
    122     )
    123 self.parameters = _fit_multi(
    124     X, self.num_kernels, self.max_dilations_per_kernel, self.random_state_
    125 )
    126 return self

ValueError: n_timepoints must be >= 9, but found 8; zero pad shorter series so that n_timepoints == 9

Also, I would highly appreciate it if you could provide a detailed methodology/documentation for MiniRocket on multivariate time series.

Link to PenDigits dataset: http://www.timeseriesclassification.com/description.php?Dataset=PenDigits

Sandy4321 commented 1 year ago

This is how the data looks in the original code (image).

Sandy4321 commented 1 year ago

Maybe

# Load the data from the .ts file into a pandas DataFrame
X_train, y_train = load_from_tsfile_to_dataframe(file_path)

provides a different format?

Sandy4321 commented 1 year ago

Could you share data similar to your SCADA data (especially in the same format)? For example: https://data.world/datasets/scada

Sandy4321 commented 1 year ago

OK, I fixed the issue

ValueError: n_timepoints must be >= 9, but found 8; zero pad shorter series so that n_timepoints == 9

You need to pad the data so that each time series has more than 8 samples (image).

This is the padded data (image).
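The padding code itself was not posted in the thread. A minimal sketch of zero padding along the time axis, assuming the data is already a 3D NumPy array laid out as (instances, channels, timepoints), might look like the following. The helper name `zero_pad_to_min_length` and the toy array shape are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical stand-in for short series: 5 instances, 2 channels,
# 8 timepoints each (the shape is an assumption for illustration).
X = np.arange(5 * 2 * 8, dtype=np.float32).reshape(5, 2, 8)

MIN_TIMEPOINTS = 9  # minimum length named in the error message


def zero_pad_to_min_length(X, min_length=MIN_TIMEPOINTS):
    """Right-pad the time axis (last axis) with zeros up to min_length."""
    shortfall = max(0, min_length - X.shape[-1])
    # Pad only the last axis; leave the instance and channel axes untouched.
    pad_width = [(0, 0)] * (X.ndim - 1) + [(0, shortfall)]
    return np.pad(X, pad_width, mode="constant", constant_values=0)


X_padded = zero_pad_to_min_length(X)
print(X.shape, "->", X_padded.shape)  # (5, 2, 8) -> (5, 2, 9)
```

Series that already have 9 or more timepoints pass through unchanged, so the same helper can be applied to both the training and the test split.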

Sandy4321 commented 1 year ago

By the way, is there any chance of using data with categorical values? For example:

green, red, black, brown

rubbiyasultan commented 1 year ago

OK, I fixed the issue

ValueError: n_timepoints must be >= 9, but found 8; zero pad shorter series so that n_timepoints == 9

You need to pad the data so that each time series has more than 8 samples (image).

This is the padded data (image).

Thank you! But did you pad it manually?

rubbiyasultan commented 1 year ago

By the way, is there any chance of using data with categorical values? For example:

green, red, black, brown

Yes, you can use an encoder from sklearn (for example, OneHotEncoder or OrdinalEncoder) to transform the categorical values.
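No encoding code was shared in the thread. As a rough sketch of the idea, a categorical channel can be turned into one binary indicator channel per category. The example below uses plain NumPy to stay dependency-free and only illustrates the encoding step; the toy sequence is made up, and whether indicator channels work well with MiniRocket is not established in this discussion.

```python
import numpy as np

# A categorical channel observed over 6 timepoints (values from the question).
colours = np.array(["green", "red", "black", "brown", "red", "green"])

# Sorted unique categories and, for each timepoint, the index of its category.
categories, codes = np.unique(colours, return_inverse=True)

# One row per category, one column per timepoint:
# shape (n_categories, n_timepoints), i.e. one indicator channel per category.
one_hot = np.zeros((categories.size, colours.size), dtype=np.float32)
one_hot[codes, np.arange(colours.size)] = 1.0

print(categories)
print(one_hot.shape)  # (4, 6)
```

The resulting indicator channels could then be stacked with the continuous channels along the channel axis before the transform.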

Sandy4321 commented 1 year ago

Thank you! But did you pad it manually?

Yes, only 3 lines.

Yes you can use encode command from sklearn tpu transform the categorical values

Cool, thanks. Could you share a code example?

Sandy4321 commented 1 year ago

At least, what is "tpu"? But if you have a code example with a dataset to try for multivariate time series with a mixture of continuous and categorical values, please share it.

rubbiyasultan commented 1 year ago

@angus924 could you please look into this? Also, I tried running MiniRocket on the SCADA data, and it is giving me poor accuracy on the test data (around 55%). I am planning to change the classifier to a non-linear one, maybe an LSTM. Do you think that would be the right approach: to apply the feature transformation using MiniRocket and then run an LSTM on it?

Also, I need to understand the feature transformation for multivariate time series. I am running the BasicMotions dataset, and this is what I get:

from sktime.datasets import load_basic_motions
from sktime.transformations.panel.rocket import MiniRocketMultivariate

# Load the data
X_train, y_train = load_basic_motions(split="train", return_X_y=True)
X_test, y_test = load_basic_motions(split="test", return_X_y=True)
print("-------------before transformation--------")
print(X_train.shape)
print(X_test.shape)

# MiniRocket transformation
minirocket_multi = MiniRocketMultivariate()
X_train_transform = minirocket_multi.fit_transform(X_train)
X_test_transform = minirocket_multi.transform(X_test)

print("-------------after transformation--------")
print(X_train_transform.shape)
print(X_test_transform.shape)

Output:

-------------before transformation--------
(40, 6)
(40, 6)
-------------after transformation--------
(40, 9996)
(40, 9996)

The BasicMotions dataset has 40 rows (samples) and 6 columns (features), and it is transformed into (40, 9996). The kernels are applied to each feature, right?
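For reference, the 9996 in the output shape can be reproduced with simple arithmetic. The sketch below assumes the default request of 10,000 features and MiniRocket's fixed set of 84 kernels, with the feature count rounded down to a multiple of 84. The variable names are illustrative and are not taken from the library.

```python
# MiniRocket uses a fixed set of 84 kernels; the requested number of features
# (10,000 by default) is rounded down to the nearest multiple of 84.
NUM_KERNELS = 84
requested_features = 10_000

features_per_kernel = requested_features // NUM_KERNELS  # 119
n_features = NUM_KERNELS * features_per_kernel           # 9996

# BasicMotions: 40 instances; each of the 6 columns holds a whole series,
# so (40, 6) is (instances, channels), not (samples, scalar features).
n_instances, n_channels = 40, 6
print((n_instances, n_channels), "->", (n_instances, n_features))
# (40, 6) -> (40, 9996)
```

The number of output columns therefore depends on the kernel count, not on the number of channels or the series length.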