dmbee / seglearn

Python module for machine learning time series:
https://dmbee.github.io/seglearn/
BSD 3-Clause "New" or "Revised" License
571 stars 63 forks

How are multivariate time series handled? #43

Closed emial637 closed 4 years ago

emial637 commented 4 years ago

Hi! I'm classifying multivariate time series with seglearn and it works great! Right now I'm trying to learn more about this topic, and I would like to know how the multivariate aspect is handled in seglearn. In the docs, the sliding window method is mentioned, but I'm not able to find any more information on my own. I'd be very thankful if someone could help me out :)

If it is easier to discuss a specific case, this is what I'm doing at the moment:

Data: n_samples = 6000, n_dimensions = 128, variable time lengths per sample (300-700)

Labels: 0 / 1

Classifier:

from sklearn import svm
from sklearn.preprocessing import StandardScaler
from seglearn.pipe import Pype
from seglearn.transform import PadTrunc, FeatureRep

clf = Pype([('segment', PadTrunc(width=700)),
            ('features', FeatureRep()),
            ('scaler', StandardScaler()),
            ('svc', svm.LinearSVC(class_weight='balanced', max_iter=2000))])
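For context, PadTrunc turns the variable-length series into a fixed-width 3D array by zero-padding short series and truncating long ones. A rough numpy sketch of that idea (not seglearn's actual implementation):

```python
import numpy as np

def pad_trunc(series_list, width):
    """Pad with zeros or truncate each (length, channels) series to `width`."""
    n_channels = series_list[0].shape[1]
    out = np.zeros((len(series_list), width, n_channels))
    for i, s in enumerate(series_list):
        t = min(len(s), width)
        out[i, :t] = s[:t]
    return out

# Variable-length multichannel series, like the 300-700 step samples above
X = [np.ones((300, 2)), np.ones((700, 2)), np.ones((900, 2))]
Xt = pad_trunc(X, width=700)
print(Xt.shape)  # (3, 700, 2)
```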

To my understanding, I get 11 features per dimension with FeatureRep(), which would result in an 11x128 matrix per sample; that won't match the standard SVM input. In other words, how are the 128-dimensional time series condensed into something that can be fed to the SVM?

emial637 commented 4 years ago

I did some debugging, and it seems it is simply the flattened version of the matrix that is passed to the SVM. I.e. if I use n signal features on m-dimensional data, I get a vector representation of length n*m. Please correct me if I'm missing something.
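A quick numpy sketch of that flattening, using two hypothetical feature functions (mean and std) rather than seglearn's actual feature set:

```python
import numpy as np

# Stand-in for segmented data: 10 samples, window width 700, 4 channels
X = np.random.rand(10, 700, 4)

# Each feature maps (samples, width, channels) -> (samples, channels)
feats = [np.mean, np.std]

# Stacking the per-channel outputs side by side flattens the
# (n_features x n_channels) matrix into one row per sample
F = np.column_stack([f(X, axis=1) for f in feats])
print(F.shape)  # (10, 8), i.e. n_features * n_channels = 2 * 4
```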

dmbee commented 4 years ago

You are correct. The purpose of the FeatureRep transform is to vectorize the data. The code for this is in FeatureRep.transform:

# Stack each feature's (samples x channels) output side by side,
# giving one flat feature vector per sample
fts = np.column_stack([self.features[f](Xt) for f in self.features])
if Xc is not None:
    # Append any static (contextual) variables to the feature vectors
    fts = np.column_stack([fts, Xc])

The Segment transforms pass the data on as a 3D tensor [samples x window time x channels]. FeatureRep then vectorizes it to [samples x features], which is suitable for estimators like SVC.

One suggestion: you might get better performance if you add feature selection to your pipeline, between scaler and svc, since you have so many channels / features.

sklearn has feature selection transformers you can put right into the seglearn pipeline: https://scikit-learn.org/stable/modules/feature_selection.html

Good luck,
David