dmbee / seglearn

Python module for machine learning time series:
https://dmbee.github.io/seglearn/
BSD 3-Clause "New" or "Revised" License

What the segment class is used for #52

Closed: orenpapers closed this issue 3 years ago

orenpapers commented 3 years ago

Hello, I can't understand the usage of the segment class: in what cases do I need to use this transform, and how does it help? I also couldn't find an example of how to incorporate contextual variables. When I run it on toy data, it is very unclear what happened, since X is unchanged but y was reduced to a single value:

import numpy as np
from seglearn.transform import Segment

# Single multivariate time series with 3 samples of 4 variables
X = [np.array([[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]])]
# Time series target
y = [np.array([True, False, False])]
print("X: " , X)
print("y: " ,y)
segment = Segment(width=3, overlap=1)
X, y, _ = segment.fit_transform(X, y)
print('After segmentation:')
print("X:", X)
print("y: ", y)
X : [array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])]
y : [array([ True, False, False])]
After segmentation:
X: [[[ 0  1  2  3]
  [ 4  5  6  7]
  [ 8  9 10 11]]]
y:  [False]
dmbee commented 3 years ago

There are multiple examples with context data. For instance:

https://dmbee.github.io/seglearn/auto_examples/plot_feature_rep.html#sphx-glr-auto-examples-plot-feature-rep-py
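For concreteness, here is a minimal sketch in the spirit of that example (not copied from it), assuming seglearn's TS_Data container, the Pype pipeline, and the Segment/FeatureRep transforms; the data, segment width, and estimator are made up for illustration:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from seglearn.base import TS_Data
from seglearn.pipe import Pype
from seglearn.transform import FeatureRep, Segment

# 5 multivariate series (100 samples x 3 variables each) with a
# time-series target, plus 2 context variables per series
ts = [np.random.rand(100, 3) for _ in range(5)]
context = np.random.rand(5, 2)
y = [np.random.rand(100) > 0.5 for _ in range(5)]

X = TS_Data(ts, context)  # bundle the series with their context data

clf = Pype([('seg', Segment(width=20, overlap=0.5)),
            ('features', FeatureRep()),
            ('rf', RandomForestClassifier(n_estimators=20))])
clf.fit(X, y)
print(clf.score(X, y))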

dmbee commented 3 years ago

I suggest you look at some of the other examples and the API documentation, and play around with your toy data to get a sense of how it's working. For instance, if you segment a time series with 10 samples, using a width of 3 and no overlap, you will get 3 separate segments and 3 target values (assuming y is also a time series). Many ML algorithms require fixed-length time series for classification/regression, and with this package you can use sliding window segmentation or padding/truncation to achieve that in your time series pipeline.
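To make that 10-sample case concrete, a small sketch using the same Segment transform as in the original post (how the per-segment target is chosen depends on Segment's y_func):

import numpy as np
from seglearn.transform import Segment

# one univariate series with 10 samples and a time-series target
X = [np.arange(10).reshape(-1, 1)]
y = [np.arange(10) % 2 == 0]

Xs, ys, _ = Segment(width=3, overlap=0).fit_transform(X, y)

print(Xs.shape)  # (3, 3, 1): 3 segments of width 3, 1 variable
print(ys.shape)  # (3,): one target value per segment

In the toy example at the top of the thread the series has only 3 samples, so width=3 yields exactly one segment, which is why a single y value comes back.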

orenpapers commented 3 years ago

@dmbee Thanks, I tried mocking up many types of data, but I can't understand the relation between the outputs and the documentation. Can you please explain the outputs of these examples?

[array([[ 0,  1,  2,  3,  4,  6],
       [ 4,  5,  6,  7,  8,  9],
       [ 8,  9, 10, 11,  3,  4]])]
[array([ True, False, False])]
After segmentation (width=5, overlap=0.33):
X: []
y:  []

Another example:

[array([[ 0,  1,  2,  3,  4,  6],
       [ 4,  5,  6,  7,  8,  9],
       [ 8,  9, 10, 11,  3,  4]])]
[array([ True, False, False])]
After segmentation (width=3, overlap=0.2):
X: [[[ 0  1  2  3  4  6]
  [ 4  5  6  7  8  9]
  [ 8  9 10 11  3  4]]]
y:  [False]

Why is the width 6 and not 3? Why is there only 1 y value?

orenpapers commented 3 years ago

There are multiple examples with context data. For instance.

https://dmbee.github.io/seglearn/auto_examples/plot_feature_rep.html#sphx-glr-auto-examples-plot-feature-rep-py

@dmbee I saw this example, but I am not sure why this is considered contextual features rather than just stacking/adding more features: you stack the contextual features next to the data features, which means the feature vector is flattened. So if you have 15 data features and 5 context features, you end up with 20 flat features that are treated equally, rather than 5 features added as context on top of the 15 data features (similarly to a conditional RNN: https://github.com/philipperemy/cond_rnn). Do you have a way to add the contextual features as a condition/context for the data features, rather than just stacking them as additional features?

dmbee commented 3 years ago

The context features are broadcast to every segment in the series and kept separate; they are not flattened together.
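A small sketch of what that broadcasting looks like after segmentation, assuming the TS_Data container and the get_ts_data_parts helper from seglearn.util (names worth checking against the installed version):

import numpy as np
from seglearn.base import TS_Data
from seglearn.transform import Segment
from seglearn.util import get_ts_data_parts

ts = [np.random.rand(10, 4)]        # one series: 10 samples x 4 variables
context = np.array([[7.0, 1.0]])    # 2 context variables for that series
y = [np.random.rand(10) > 0.5]

Xs, ys, _ = Segment(width=3, overlap=0).fit_transform(TS_Data(ts, context), y)

Xt, Xc = get_ts_data_parts(Xs)      # segment data and context stay in separate parts
print(np.asarray(Xt).shape)         # (3, 3, 4): 3 segments, width 3, 4 variables
print(np.asarray(Xc).shape)         # (3, 2): the 2 context variables, one row per segment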

orenpapers commented 3 years ago

@dmbee I know, I just mean that for each sample the contextual features are flattened together with the data features. So, to my understanding from the code (correct me if I'm wrong), if I have 15 data features and 5 context features, I will now have 20 features per vector, but the model won't treat them differently: it will see one vector of 20 features, not 15 data features plus 5 context features (as done, for example, in a conditional RNN). Right?

dmbee commented 3 years ago

I don't understand your question. I suggest you look at the documentation and the code; it's an open source project.