blue-yonder / tsfresh

Automatic extraction of relevant features from time series:
http://tsfresh.readthedocs.io
MIT License

Shapelet extraction #111

Closed GillesVandewiele closed 5 years ago

GillesVandewiele commented 7 years ago

One interesting feature with an explanatory ability is shapelet extraction.

Would it maybe be interesting to implement within this package? A far-from-optimal code example by me can be found here

MaxBenChrist commented 7 years ago

Looks interesting. Have you seen the part of our documentation that describes how to implement your own feature calculator?

MaxBenChrist commented 7 years ago

I don't have time to read the paper at the moment. But from a short glimpse at the paper, I am not sure if this is a good fit for tsfresh. Essentially, the paper describes a classifier, not a feature extraction algorithm. It uses subsequences to do a nearest neighbor search, a k-nearest-neighbor search. Am I right?

If we deploy this, tsfresh suddenly gets a state. This could be fine for the sklearn transformers, but it is not a good fit for our convenience functions extract_features and extract_relevant_features

GillesVandewiele commented 7 years ago

It tries to extract the subsequences of the time series that are most discriminative for a certain class. So it takes as input a set of time series with corresponding labels and returns shapelets that can be used as features for a classifier.

MaxBenChrist commented 7 years ago

We have the convention that features must be real-valued numbers. If I understand you correctly, those shapelets are subsequences of the original time series?

It is not clear to me how one would map those shapelets to a feature.

GillesVandewiele commented 7 years ago

For k classes, you could keep a 'best' shapelet for each class and then add the distance to each of these shapelets as features (k features), or add the index of the closest shapelet (1 feature).

The difference, maybe, is that here a whole dataset is required to extract the 'best' shapelets instead of just one time series.

MaxBenChrist commented 7 years ago

By this you get a feature calculator with a state. Essentially, you have to train the calculator (to get the shapelets) and then predict with it (to get the distance to the class).

This means that tsfresh would need separate train/test data sets. Up until now, users can extract features for their whole data and do the train/test split themselves.

Don't get me wrong, I am not saying that the shapelet feature is not helpful, but if we implement it, we would have to change our API (the extract_features method would be replaced by a fit and a transform).

MaxBenChrist commented 7 years ago

We already have sklearn-based transformers, so we could start with those.

However, the parallelization is incompatible with feature calculators that have a state. I don't see an easy solution for that at the moment.

jneuff commented 7 years ago

I haven't looked at the paper yet. tsfresh being stateless has many advantages, but sooner or later we might have to introduce state to support features like this one. We should start thinking about ways to do this. As @MaxBenChrist said, our sklearn-transformers are probably a good starting point.

MaxBenChrist commented 7 years ago

The biggest drawback will be that our APIs become more complicated. Eventually it would lead to removing the easy-to-use extract_features and extract_relevant_features methods and instead only offering the transformers.

earthgecko commented 7 years ago

@GillesVandewiele I have watched this one closely. That shapelet extraction looks very interesting; even if it is not totally suited to the statelessness of tsfresh, it is very interesting in terms of the specific time series domain, thanks for pointing it out. It will be interesting to see if it can be hooked into tsfresh via the sklearn-transformers as just another feature. As they say, where there is a will there is a way :)

But I would agree: do not make the tsfresh APIs more complicated; extension is often preferable to modification. I studied Botany, and when I originally saw this when you posted it, and looked at the Kaggle and the Eamonn Keogh endorsed paper, I actually wondered to myself...

Could every single timeseries be closest matched to a KNOWN leaf profile, there being a lot of leaves... Is there a data source for ALL KNOWN leaf profiles - images, other?

If there was... every leaf could be converted to a "timeseries", and the features for every leaf profile could then be calculated with tsfresh. Then you could compare the score of a time series to the scores of the leaf profiles (perhaps with some scale equalization), and so possibly compare time series to known leaves. Time series similarity comparisons. However, you can possibly do that with tsfresh alone, without shapelet extraction.

However, I think adding some known, natural, evolutionary baselines would be super cool :D

MaxBenChrist commented 7 years ago

We had some discussion about the statelessness before. Personally, I would love to have something like k-Nearest-Neighbour with dynamic time warping as an additional feature calculator.

But one thing we should not forget is that having a state also means that the parallelisation of tsfresh becomes way, way harder to implement; you would have to parallelize every single feature that has a state.

If we had more contributors, we could tackle this. But, at the moment, tsfresh already takes way too long for big data sets. We should focus on getting it faster and getting the parallelization right.
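For concreteness, the kNN-with-DTW idea floated above relies on a DTW distance. Below is a minimal, illustrative dynamic-programming implementation; it is a sketch, not tsfresh code, and a real deployment would want an optimized (e.g. C) implementation:

```python
import numpy as np

def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic-programming DTW distance."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three allowed warping moves
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

# identical series have distance 0; a time-shifted copy warps cheaply
print(dtw_distance([1, 2, 3], [1, 2, 3]))     # 0.0
print(dtw_distance([1, 2, 3], [1, 1, 2, 3]))  # 0.0
```

Unlike a pointwise Euclidean comparison, DTW tolerates local stretching and shifting, which is why it pairs so well with kNN on raw time series.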

earthgecko commented 7 years ago

Me too, but http://www.cs.ucr.edu/~eamonn/meaningless.pdf still applies to k-Nearest-Neighbour, I think... I honestly cannot totally remember; all I know is that time series are difficult. Time series with machine learning are difficult.

I would assert: possibly until tsfresh came along. A small step for...

earthgecko commented 7 years ago

@MaxBenChrist ^^ as I inferred, I forgot :)

earthgecko commented 7 years ago

@MaxBenChrist I reckon it can be hacked, if it was desirable (optionally), in a manner that does not change tsfresh's statelessness. I reckon tsfresh just made "one small step for..." timeseries and machine learning, imho.

I think tsfresh should remain stateless for now. However, it is not beyond the realms of possibility that tsfresh plugins could work and do g, f, d with tsfresh.

I would personally like to see what tsfresh in its current stateless state can do; however, I too can see a number of possibilities in terms of the opportunities tsfresh presents.

These are clever people :) @GillesVandewiele you did a Kaggle, I reckon you could probably do a plugin :)

GillesVandewiele commented 7 years ago

Could every single timeseries be closest matched to a KNOWN leaf profile, there being a lot of leaves... Is there a data source for ALL KNOWN leaf profiles - images, other?

The cool thing about shapelets is that they learn this leaf profile. They search for the subsequence in the time series that best distinguishes one certain leaf class from all other classes.

These are clever people :) @GillesVandewiele you did a Kaggle, I reckon you could probably do a plugin :)

I could indeed help with the implementation, but I could use some help as well. First off, my code is far from clean (I have never really written clean code in my life before). Moreover, many possible optimization techniques could be implemented to decrease the computational complexity, which is required when the dataset becomes large.

earthgecko commented 7 years ago

@GillesVandewiele well I will definitely be having a play around with Shapelet extraction myself whether around tsfresh or standalone, but probably not until next year now, so until then :)

GillesVandewiele commented 7 years ago

You can definitely start from my Kaggle code base for this; many possible optimizations are left (pruning techniques). Would be nice to see an evaluation on the leaf classification dataset: how much could you increase your predictive performance by incorporating shapelet features? Something I would like to investigate, but it is as yet impossible given the computational complexity of my current code base...

MaxBenChrist commented 7 years ago

At the moment we have some other, internal challenges to face, see #119.

As soon as we have solved them, we will discuss the possibility of stateful feature calculators. Shapelets seem like a good choice for a future feature calculator.

ClimbsRocks commented 7 years ago

Would love to see some shapelets in here eventually! But I'm happy to see the team able to make prioritization decisions and maintain focus; that bodes well for the health of this project long-term.

GillesVandewiele commented 7 years ago

Hey everyone,

Just to give everyone a heads-up. I've started implementing different algorithms to discover shapelets in timeseries.

https://github.com/GillesVandewiele/pyShapelets

Ezekiel-Kruglick commented 7 years ago

Just to jump in here.... I've long been a fan of the output of Keogh's group, but I don't see how to implement shapelets in the format of tsfresh - as I think multiple people have expressed above.

Having said that, one could imagine a pre-processing step doing shapelet discovery (K-NN using various distance metrics and sub-segments) which would generate a smaller number of features that we could then extract as part of the main extracting_features paradigm. Those features could be passed in as a parameter combination fed to a feature calculator that would look a lot like the ones already present. I'd be willing to help with that if we find a high level architectural vision we like.
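The split proposed above (stateful shapelet discovery as a pre-processing step, stateless distance features afterwards) could be sketched as a sklearn-style fit/transform pair. Everything below is hypothetical: the class name is made up, and the "discovery" in fit() is a naive placeholder (it just keeps windows of the first training series) rather than a real information-gain shapelet search:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class ShapeletDistanceTransformer(BaseEstimator, TransformerMixin):
    """fit() discovers candidate subsequences on the training set;
    transform() maps each series to real-valued min-distance features."""

    def __init__(self, width=3, n_shapelets=2):
        self.width = width
        self.n_shapelets = n_shapelets

    def fit(self, X, y=None):
        # placeholder "discovery": keep the first n_shapelets windows
        first = np.asarray(X[0], dtype=float)
        wins = [first[i:i + self.width]
                for i in range(len(first) - self.width + 1)]
        self.shapelets_ = wins[:self.n_shapelets]
        return self

    def transform(self, X):
        def min_dist(s, sh):
            s = np.asarray(s, dtype=float)
            return min(np.linalg.norm(s[i:i + len(sh)] - sh)
                       for i in range(len(s) - len(sh) + 1))
        return np.array([[min_dist(s, sh) for sh in self.shapelets_]
                         for s in X])

X_train = [[0, 1, 0, 2], [2, 0, 1, 0]]
t = ShapeletDistanceTransformer().fit(X_train)
print(t.transform(X_train))  # entries are 0.0 where a shapelet matches exactly
```

The key property is that all dataset-wide state lives in fit(); transform() then treats each series independently, which is compatible with the per-series feature-calculator paradigm discussed in this thread.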

MaxBenChrist commented 7 years ago

Just to jump in here.... I've long been a fan of the output of Keogh's group, but I don't see how to implement shapelets in the format of tsfresh - as I think multiple people have expressed above.

Me too. Dynamic time warping with kNN often achieves amazing accuracies in time series classification tasks. I have not used shapelet approaches so far.

Having said that, one could imagine a pre-processing step doing shapelet discovery (K-NN using various distance metrics and sub-segments) which would generate a smaller number of features that we could then extract as part of the main extracting_features paradigm. Those features could be passed in as a parameter combination fed to a feature calculator that would look a lot like the ones already present. I'd be willing to help with that if we find a high level architectural vision we like.

This sounds very interesting. I like the idea of splitting this task into first finding the shapelets and afterwards extracting features for the shapelets. That way it will not break our paradigm that the value of one feature does not depend on the other features.

Actually, I have been quietly working on tspreprocess, a package to preprocess the time series, e.g. denoising, compressing etc. It is not yet open source. We could add the finding of the shapelets to that package.

However, at the moment, I am not sure about the following things:

@Ezekiel-Kruglick are you familiar with the literature on shapelets?

Ezekiel-Kruglick commented 7 years ago

@MaxBenChrist @GillesVandewiele

I found this talk by Keogh in which he spends about half the time (the second half) talking about shapelets: https://www.youtube.com/watch?v=sD-vvN_st58&t

He notes that shapelets are a special form of unsupervised motif discovery once you've turned the shape into a pseudo time series.

I have used a variety of motif discovery algorithms. I think this gives us a potential skeleton:

def method_involving_motifs(x, params):

[params contains, for example, a number N equal to the number of top most important motifs you want to use and the scenario (motif counts, best match location...)]

  1. Run motif discovery to return N motifs (possibly store them somewhere for the user???)
  2. For each parameter and motif combination: generate a feature by finding the motif score for the chosen scenario (e.g. count of matches...)

This would fit with our current extractor architecture, doesn't require preprocessing, and uses our standard conventions for combination handling.

Is there a recommended way to pass out the motif details?

This would likely be compute-intensive for some parameter settings, but at least it avoids kNN (as it should, since this is unsupervised; although we will be finding the motifs that are best detected, not the motifs that are best for predicting cases). If we want the best prediction, we need access to the classifications at feature extraction... so that would probably imply preprocessing.
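The skeleton above could be fleshed out roughly as follows. This is a sketch under stated assumptions, not the implementation later contributed in this thread: motif discovery is brute force (the window whose nearest non-overlapping neighbour is closest), and match counting stands in for the "scenario" step; all names are hypothetical:

```python
import numpy as np

def sliding_windows(x, width):
    """All contiguous subsequences of the given width, as a 2-D array."""
    x = np.asarray(x, dtype=float)
    return np.array([x[i:i + width] for i in range(len(x) - width + 1)])

def discover_motifs(x, width, n_motifs, exclusion=None):
    """Brute-force motif discovery: rank windows by the distance to their
    nearest non-overlapping neighbour and keep the n_motifs best."""
    if exclusion is None:
        exclusion = width  # ignore trivial matches with overlapping windows
    w = sliding_windows(x, width)
    scored = []
    for i in range(len(w)):
        dists = [np.linalg.norm(w[i] - w[j])
                 for j in range(len(w)) if abs(i - j) >= exclusion]
        if dists:
            scored.append((min(dists), i))
    scored.sort()
    return [w[i] for _, i in scored[:n_motifs]]

def count_matches(x, motif, cutoff):
    """Example scenario feature: number of windows within `cutoff` of the motif."""
    w = sliding_windows(x, len(motif))
    return int(np.sum(np.linalg.norm(w - motif, axis=1) <= cutoff))

x = [0, 1, 0, 5, 0, 1, 0, 2]
motifs = discover_motifs(x, width=3, n_motifs=1)
print(count_matches(x, motifs[0], cutoff=0.5))  # 2
```

The O(n^2) pairwise search is the part a real implementation would optimize (this is where Keogh's group's fast algorithms come in); the feature step itself is a plain per-series calculation.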

Ezekiel-Kruglick commented 7 years ago

@MaxBenChrist - I almost missed your last question. I have used DTW and kNNs a lot and have done motifs, of which shapelets are a special case, but so far my specific exposure to "shapelets" has been only through reading papers from Keogh's group. I do have a collection of about 300MB of material from Keogh's group from back a few years when they were handing out data and code.

ClimbsRocks commented 7 years ago

tspreprocess sounds like a great idea! It doesn't clutter up the main package with unnecessary complexity, but gives quite a few advanced tools to the power users who seek them out.

@MaxBenChrist : out of curiosity, which "efficient python DTW + kNN implementation" did you end up going with? I'm facing that same struggle.

MaxBenChrist commented 7 years ago

@MaxBenChrist : out of curiosity, which "efficient python DTW + kNN implementation" did you end up going with? I'm facing that same struggle.

I use kNN from sklearn with a small wrapper, plus an implementation of DTW in C.

I will upload it and send you a link to the repo.
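The setup described here (sklearn's kNN with a wrapped DTW metric) might look roughly like the sketch below, with a slow pure-Python DTW standing in for the C implementation. The data layout and parameter choices are illustrative assumptions, not the actual wrapper:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def dtw(a, b):
    # simple dynamic-programming DTW; a C version would be much faster
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = abs(a[i - 1] - b[j - 1]) + min(
                cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

# equal-length series stacked as rows; a callable metric forces brute-force search
X = np.array([[0, 1, 2, 3], [0, 1, 1, 2],
              [5, 4, 3, 2], [6, 5, 4, 3]], dtype=float)
y = np.array([0, 0, 1, 1])

knn = KNeighborsClassifier(n_neighbors=1, algorithm="brute", metric=dtw)
knn.fit(X, y)
print(knn.predict([[0, 1, 2, 2]]))  # [0]
```

Passing a callable as `metric` is supported by sklearn but evaluates the Python function for every pair, which is exactly why a fast compiled DTW matters here.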

MaxBenChrist commented 7 years ago

I found this talk by Keogh in which he spends about half the time (the second half) talking about shapelets: https://www.youtube.com/watch?v=sD-vvN_st58&t

Interesting talk.

I do have a collection of about 300MB of material from Keogh's group from back a few years when they were handing out data and code.

What kind of code? Maybe some blazing fast C implementations? :D

@Ezekiel-Kruglick I agree with everything you said. Do you want to develop a small proof of concept (it does not yet need to be 100% in tsfresh format) and then we iterate from that?

Ezekiel-Kruglick commented 7 years ago

@MaxBenChrist Yes, sure, I'll work up a motif discovery and feature extractor algorithm.

Ezekiel-Kruglick commented 7 years ago

Update: I have a working motif extractor. That brought up other higher-level issues when I applied it to data, like removing duplicates and how to extract counts. I'm working on those higher-level issues and building support options now, like filtering based on having at least M occurrences over the time series, to keep from finding less-useful highly matching pairs.

Ezekiel-Kruglick commented 7 years ago

FYI: I had to code up the algorithm myself, as the other motif extractors I found using Python all had issues that would have prevented us from using them for features, or had special cases that were okay for those authors but not good enough for a general tool. Currently it's pretty slow.

Ezekiel-Kruglick commented 7 years ago

@MaxBenChrist I've sped up the code a bunch and have some tools around the main functionality. I don't see a dev branch; what is your preferred approach for large amounts of beta-level code? Probably ready to upload in a day or two.

nils-braun commented 7 years ago

Just do a fork and create a pull request - we will handle the rest! Looking forward to your PR :-)

GillesVandewiele commented 7 years ago

Hey guys, sorry for my inactivity. I was on holiday!

I think extracting features from the shapelets themselves will not be that useful, since the shapelets are already small subsequences, and the minimal distances from a time series to the different extracted shapelets are already quite discriminative features.

I like your motif extraction @Ezekiel-Kruglick! Do the motifs/subsequences need to occur multiple times EXACTLY in the timeserie or is the shape enough? Will investigate the code more thoroughly in the near future for sure!

If I'm right (correct me otherwise), the motif extraction differs from the shapelet extraction in that it only operates on one time series at a time (it just looks for patterns that occur multiple times in the same time series), while the shapelet extraction requires all (or a subset) of the time series to calculate information gain etc. on?

Ezekiel-Kruglick commented 7 years ago

@GillesVandewiele

"Do the motifs/subsequences need to occur multiple times EXACTLY in the timeserie or is the shape enough?"

Right now I'm using Euclidean distance, and the user can put in a distance cutoff of their preference when generating features. So the match need not be exact (an exact match would be distance zero), and it is adjustable to whatever the user wants. The distance metric computation is also an accessible property of the motif extractor, so it can be replaced by a user function if desired.
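As a toy illustration of that design (cutoff-based matching with a swappable distance metric): the class and attribute names below are made up for this sketch, not the actual code from the fork:

```python
import numpy as np

class MotifMatcher:
    """Counts approximate motif occurrences; the distance metric is a plain
    attribute, so a user can swap in their own function."""

    def __init__(self, cutoff, metric=None):
        self.cutoff = cutoff
        # default metric: Euclidean distance (an exact match scores 0.0)
        self.metric = metric or (lambda a, b: float(np.linalg.norm(a - b)))

    def matches(self, x, motif):
        x, motif = np.asarray(x, float), np.asarray(motif, float)
        w = len(motif)
        # count sliding windows whose distance to the motif is within cutoff
        return sum(self.metric(x[i:i + w], motif) <= self.cutoff
                   for i in range(len(x) - w + 1))

m = MotifMatcher(cutoff=0.5)
print(m.matches([0, 1, 0, 2, 0, 1, 0], [0, 1, 0]))  # 2 exact occurrences
```

Swapping the metric is then a one-liner, e.g. `m.metric = lambda a, b: float(np.max(np.abs(a - b)))` for a maximum-deviation criterion.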

"the motif extraction differs from the shapelet extraction in that it only operates on one time series at a time (it just looks for patterns that occur multiple times in the same time series), while the shapelet extraction requires all (or a subset) of the time series to calculate information gain etc. on"

It depends on how you look at it :) The shapelets paper is looking at all of the data at once. Keogh in talks describes shapelets as motifs but with input series based on shape descriptors so we can modify as we like.

This actually brings up an interesting point. @MaxBenChrist et al: do the feature extractors get passed chunked data for a single id at a time? I'm getting much better results applying it to the whole data series all at once first so that I can find motifs that are in multiple IDs. Is there a way to get the whole data column first to generate motifs? Then we would use those motifs for each id to do feature extraction.

MaxBenChrist commented 7 years ago

You can quote other people by prepending a ">" to the beginning of the line, see https://help.github.com/articles/about-writing-and-formatting-on-github/

This actually brings up an interesting point. @MaxBenChrist et al: do the feature extractors get passed chunked data for a single id at a time? I'm getting much better results applying it to the whole data series all at once first so that I can find motifs that are in multiple IDs. Is there a way to get the whole data column first to generate motifs? Then we would use those motifs for each id to do feature extraction.

I think you are mixing up the finding of the motifs with the calculation of the motif features. The calculation of the features is done inside tsfresh, for every time series independently. So we have fixed motifs and calculate the feature for every time series. But the finding of the motifs can be done in any fashion, even on the whole data set at once. I would add it to another package, maybe tspreprocess.

Is there a way to get the whole data column first to generate motifs?

I am not sure I understand that question. You could just extract the value column from the time series container.

Ezekiel-Kruglick commented 7 years ago

@MaxBenChrist - Sorry for using imprecise wording; I meant it the way you said it. "Finding" the features in the whole stream meant "picking out data that makes for good features". I think we mean the same thing.

I am not sure I understand that question. You could just extract the value column from the time series container.

Is there a best practice for that? I notice it can be passed in many forms, and I'm not 100% certain of my interpretation of where it gets stored, as I'm still learning the codebase.

Sorry for the delays; some work got in the way.

MaxBenChrist commented 7 years ago

Is there a best practice for that? I notice it can be passed in many forms, and I'm not 100% certain of my interpretation of where it gets stored, as I'm still learning the codebase.

No worries. We can later easily convert every tsfresh data format into the format that you picked. I feel that the flat data frame from https://tsfresh.readthedocs.io/en/latest/text/data_formats.html would be the best fit for this task.

Ezekiel-Kruglick commented 7 years ago

@MaxBenChrist I've put some work so far at this branch: https://github.com/Ezekiel-Kruglick/tsfresh/tree/motif_discovery

You'll see new code in the feature_extractors directory (motifs.py) and a set of tests for it. I can use that within an ipython notebook to do motif extraction and feature generation and have done so to test usefulness and default settings. Can you take a look at the bottom of feature_calculators.py and tell me what you would put in there to access that flat data frame? Following the data I don't see how it would get to there or how to pass it as a parameter.

MaxBenChrist commented 7 years ago

@Ezekiel-Kruglick

You'll see new code in the feature_extractors directory (motifs.py) and a set of tests for it. I can use that within an ipython notebook to do motif extraction and feature generation and have done so to test usefulness and default settings. Can you take a look at the bottom of feature_calculators.py and tell me what you would put in there to access that flat data frame? Following the data I don't see how it would get to there or how to pass it as a parameter.

The feature calculators only have access to a singular time series. So your motif_explorer can only access one singular time series. I would not put the explorer into tsfresh/feature_extraction/feature_calculators.py but into a separate module.

MaxBenChrist commented 7 years ago

The notebook with the example is not checked in, is it? @Ezekiel-Kruglick

I would like to try it. Also let me know once the tsfresh/feature_extraction/motifs.py is ready for a first review.

Ezekiel-Kruglick commented 7 years ago

@MaxBenChrist

The feature calculators only have access to a singular time series.

Okay, that's what I thought earlier; I guess we talked past each other a bit. Since there is a preprocessor being discussed but we don't have an implementation yet, would it be appropriate for me to put in function primitives that can be called manually for now? Then we can compose them into auto-extractors as we figure out the preprocessor format?

The notebook with the example is not checked in, is it?

My apologies, I'm testing on proprietary data for work so I can't check that in. I can make and check in a notebook once we resolve the above so I know what the API should look like for now.

Ezekiel-Kruglick commented 7 years ago

@MaxBenChrist Okay, there is now an example notebook over on my fork: https://github.com/Ezekiel-Kruglick/tsfresh/tree/motif_discovery

In theory this fork is ready for a pull request, as it shouldn't break anything, but it also won't start running by itself, as the only thing that calls the motif functions is the example notebook. This would put the motif code and example notebook in the repository so that you can play around with it in your preprocessor. If you think that's a good idea, say so and I'll issue the pull request.

I definitely think motif finding should be something the user has to at least select with a parameter as on long datasets it can be very slow.

I can separately think about other features using motifs; right now my only implemented example motif feature is counting how many times a given motif occurs in a strip of data. It might be fun to go through Keogh's papers and see what other features we can generate once motifs are available.

MaxBenChrist commented 7 years ago

@Ezekiel-Kruglick sorry for the delay.

Great work already.

There are a few things that I would change, e.g.:

  • save the found motifs as fc_settings; this allows the addition of found motifs to an existing kind_to_fc_parameters settings object
  • move the motif detection into another submodule

I think the best way to work on that would be to open a pr and then iteratively go over the code there. I want to code on this as well so maybe we should not target the master but a branch "motif" with that pr.

I can separately think about other features using motifs; right now my only implemented example motif feature is counting how many times a given motif occurs in a strip of data. It might be fun to go through Keogh's papers and see what other features we can generate once motifs are available.

Agree, once we have a motif detection, we can think of many feature calculators

Ezekiel-Kruglick commented 7 years ago

Sounds like a good plan. Can you please create the core motif branch? Bitbucket would allow me to issue a pull request to a new branch, but it appears GitHub does not have that.

Zeke


MaxBenChrist commented 7 years ago

@GillesVandewiele I created the motif branch. 👍

GillesVandewiele commented 7 years ago

Great! Let me know if I can help! Also, if you ever plan on going for shapelets, I'll gladly integrate my code...

MaxBenChrist commented 7 years ago

Great! Let me know if I can help! Also, if you ever plan on going for shapelets, I'll gladly integrate my code...

Sounds good.

Just to be on the same page: Shapelets are a special kind of Motif. They are those motifs that have the highest discriminative power in a classification task, right?

GillesVandewiele commented 7 years ago

Just to be on the same page: Shapelets are a special kind of Motif. They are those motifs that have the highest discriminative power in a classification task, right?

Yes, they are somewhat similar. Motifs are subsequences that occur multiple times; shapelets are subsequences that are very discriminative for time series of a specific class (they will probably occur in most time series of that class).

The main difference is that in order to extract shapelets, you must have the whole dataset. This is not a requirement for motif extraction.

I like the following image: here, we identified a subsequence for false nettles that distinguishes them from stinging nettles.

Or similarly for arrowheads --> EDIT: Somehow, the links point to a slideshare instead of images. I'm talking about slides 20 and 24 in that slideshow...

MaxBenChrist commented 7 years ago

Thanks for the explanation.

This sounds more like a direct classifier? Or can you build features from the shapelets?

GillesVandewiele commented 7 years ago

You store the shapelets in some kind of dictionary, and for each time series you measure the minimal distance to each of the shapelets in your dictionary.

You thus form a feature vector of length K, where K is the number of shapelets in your dictionary. Another option is to take the argmin of this vector to reduce the number of features.

Moreover, you can use these shapelets to transform your data: paper
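The mapping just described, from a shapelet dictionary to a length-K feature vector plus the argmin variant, can be sketched as follows; the helper names and the tiny hand-picked dictionary are hypothetical:

```python
import numpy as np

def min_dist(series, shapelet):
    """Minimal Euclidean distance between a shapelet and any window of the series."""
    s, sh = np.asarray(series, float), np.asarray(shapelet, float)
    w = len(sh)
    return min(np.linalg.norm(s[i:i + w] - sh) for i in range(len(s) - w + 1))

def shapelet_features(series, shapelet_dict):
    """Map one time series to K real-valued features, plus the argmin variant."""
    dists = np.array([min_dist(series, sh) for sh in shapelet_dict])
    return dists, int(np.argmin(dists))

shapelets = [[0, 1, 0], [3, 3, 3]]  # stand-in for a learned dictionary
dists, closest = shapelet_features([5, 0, 1, 0, 5], shapelets)
print(dists, closest)  # distances to each shapelet, and the index of the closest
```

Since each output is a plain real number computed from a single series (once the dictionary is fixed), this step fits the stateless per-series feature-calculator convention discussed earlier in the thread.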