blue-yonder / tsfresh

Automatic extraction of relevant features from time series:
http://tsfresh.readthedocs.io
MIT License

Shapelet extraction #111

Closed GillesVandewiele closed 5 years ago

GillesVandewiele commented 7 years ago

One interesting feature with explanatory power is shapelet extraction.

Would it maybe be interesting to implement within this package? A far-from-optimal code example by me can be found here

MaxBenChrist commented 7 years ago

You store the shapelets in some kind of dictionary, and for each time series you measure the minimal distance to each of the shapelets in your dictionary.

This sounds very similar to motifs. So, if I get it right: the difference lies not in calculating the features but in finding the shapelet (the subsequence with the best discriminative power) or the motif (the most frequently occurring subsequence). Later, when you call the feature calculator, it does the same thing: for a given subsequence (either motif or shapelet), it searches for occurrences of / the distance to the pattern in the given time series. Am I right?

This would imply that we should work on the same branch for motifs and shapelets.
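
For illustration, a minimal sketch of that shared feature calculator, computing the minimal distance from a time series to each stored pattern (the names are hypothetical, not tsfresh API):

import numpy as np

def min_distance(ts, pattern):
    # Minimal Euclidean distance between the pattern and every
    # subsequence of the time series that has the pattern's length.
    m = len(pattern)
    return min(
        np.linalg.norm(ts[start:start + m] - pattern)
        for start in range(len(ts) - m + 1)
    )

# One feature per stored pattern, whether it was found as a
# shapelet or as a motif.
ts = np.array([0.0, 1.0, 2.0, 1.0, 0.0, 1.0])
patterns = {"shapelet_0": np.array([1.0, 2.0, 1.0])}
features = {name: min_distance(ts, p) for name, p in patterns.items()}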

GillesVandewiele commented 7 years ago

@MaxBenChrist that's correct! But then again, there's the difference that a motif can be calculated per time series, while shapelets need all the time series. Won't that be a problem?

MaxBenChrist commented 7 years ago

I finally managed to spend a few hours to work on the motif branch.

@Ezekiel-Kruglick

when looking for motifs, you go over the sequence and select the current subsequence as the motif pattern. But then you do not look backwards for occurrences of that pattern.

See https://github.com/blue-yonder/tsfresh/blob/0dd40edb36e127aece70d664eabcbd3dc001471d/tsfresh/feature_extraction/motifs.py#L160

Is that intended?

Also, I have some questions; I added them as TODO lines.

Ezekiel-Kruglick commented 7 years ago

@MaxBenChrist - Yes, looking only forwards for the pattern is intended; otherwise you get doubles of all the matches. A match between two segments A and B is symmetric, so the distance metric is the same comparing A --> B and B --> A. If you also compared backwards, then for every closely matching segment pair you would get both orderings. Similarly, there is no need to compare a test segment X against any previous segment, because X has already been compared to every earlier segment when that segment looked forward and X was in its forward section. Thus removing backward looking during matching eliminates double occurrences and saves time without losing any information.
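
To make that concrete, a toy version of the forward-only matching, assuming the candidate subsequences are already extracted as equal-length numpy arrays:

import numpy as np

def forward_pair_distances(candidates):
    # Compare each candidate only against later ones. Since the
    # distance is symmetric (d(A, B) == d(B, A)), every pair is
    # still covered exactly once and no mirrored duplicates appear.
    dists = {}
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):  # forward only
            dists[(i, j)] = np.linalg.norm(candidates[i] - candidates[j])
    return dists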

I did not think of that at first; I had to think about it and only realized it when I was looking at all the doubled pairs of motifs.

I will go look at the TODO lines.

Ezekiel-Kruglick commented 7 years ago

@MaxBenChrist - I think this covers all the questions that weren't addressed above. Let me know when you aren't playing with the code anymore, and I can make changes as needed before any merge if you like.

TODO: can we somehow remove the dependence on as_strided? Even its own docstring warns not to use it...

return np.lib.stride_tricks.as_strided(data, shape=dimensions, strides=steplen)

Hah, yes, sorry. I suppose I've been using that trick for years. There are many ways to generate a sliding window, and any of them will work; it's preferable to do it in numpy for speed. I can fix this once things are stable from the other work you are doing, if you like.
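
For reference, newer numpy versions (>= 1.20) ship a safe wrapper, and a plain fancy-indexing construction works on any version at the cost of a copy; both are sketched here:

import numpy as np

data = np.arange(10.0)
window = 3

# numpy >= 1.20: a read-only view, no manual stride arithmetic needed.
windows = np.lib.stride_tricks.sliding_window_view(data, window)

# Version-independent alternative via fancy indexing (makes a copy,
# but cannot read out-of-bounds memory).
idx = np.arange(window) + np.arange(len(data) - window + 1)[:, None]
windows_copy = data[idx]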

TODO: the lists should not be constructed from scratch in every iteration; maybe work with append?

An excellent point, and I see how to do it, no problem.

TODO: why the factor 8? Why not 7 or 6?

if motif_length * 8 > len(data):

Keogh mentioned it in passing in one of his talks, so I tossed that constant in. We could make it an overridable parameter with a default value, perhaps? Whatever you like.
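
Something along these lines, with the keyword name min_length_factor purely illustrative:

# Hypothetical signature: the length heuristic becomes an overridable
# keyword argument, defaulting to Keogh's factor of 8.
def find_motifs(data, motif_length, min_length_factor=8):
    if motif_length * min_length_factor > len(data):
        raise ValueError("data must be at least %d times the motif length"
                         % min_length_factor)
    ...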

TODO: wouldn't 2 * motif_length be enough?

for start in range(len(data) - (3 * motif_length)):

I was finding that 2 * motif_length caused off-by-one errors in certain cases (I wound up with one too few data points to compare). Now that I look at it, though, I may have fixed that in two different ways, so it might actually work now. I would want to write some tests for the special cases before saying so for sure, but that might be changeable.

MaxBenChrist commented 7 years ago

@Ezekiel-Kruglick

thank you for the detailed answers.

Let me know when you aren't playing with the code anymore, and I can make changes as needed before any merge if you like.

From now on, I will also work with pull requests towards this branch. I will not push commits to the motif branch directly anymore. As only the two of us are working on that branch at the moment, I think it is fine to use those TODOs to organize ourselves; I see no need to open issues.

I would want to write some tests for the special cases before saying for sure but that might be changeable.

Yes, we should make sure that we test everything properly.

Regarding as_strided: I think it returns a view, so it could consume less memory for long time series than a constructed array. But as I said, its docstring frightened me a little bit :D

Ezekiel-Kruglick commented 7 years ago

@MaxBenChrist Sounds like a good plan. I'll grab a new branch tonight or Monday and work on it.

MaxBenChrist commented 7 years ago

@Ezekiel-Kruglick

I worked on the motif branch again. I have a good feeling about the current state.

Ezekiel-Kruglick commented 7 years ago

@MaxBenChrist Looking good, I will work from the new one when I add more (spoiler: work just got very busy). As an update, I did take a look at the _generate_candidates line with "for start in range(len(data) - 3 * motif_length):" and I am now able to replicate the corner case where "2 * motif_length" doesn't work, but it's a deep issue and I'm still working on understanding it.

I think normalization is probably desired as a default. There can be some special cases where the scale is known to be an important signal, so perhaps we can allow it to be turned off? In any case, a simple sklearn.preprocessing.normalize is likely all we'd need, unless you see something more complicated?

I defer to your opinion on where to put it :)

MaxBenChrist commented 7 years ago

Looking good, I will work from the new one when I add more (spoiler: work just got very busy). As an update, I did take a look at the _generate_candidates line with "for start in range(len(data) - 3 * motif_length):" and I am now able to replicate the corner case where "2 * motif_length" doesn't work, but it's a deep issue and I'm still working on understanding it.

Perfect

I think normalization is probably desired as a default. There can be some special cases where the scale is known to be an important signal, so perhaps we can allow it to be turned off? In any case, a simple sklearn.preprocessing.normalize is likely all we'd need, unless you see something more complicated?

The normalizing itself is not a problem, but I have not figured out where to normalize. Let's say that you have three time series a, b, c. Now you concatenate them, normalize the result, and then look for motifs:

x = a + b + c          # concatenate
x = normalize(x)
m = find_motifs(x)

Now, later, you look for those motifs inside the individual time series a, b, c, so you have to normalize those too, but you do that individually:

a = normalize(a)
n = count_motif(a, m)

You see, the normalization on x and on a will behave differently. Worst case, even though the motifs stem from a, you will not find them in a later on.
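
A quick numeric illustration of this mismatch, using plain z-normalization:

import numpy as np

def znorm(x):
    return (x - x.mean()) / x.std()

a = np.array([1.0, 2.0, 1.0, 2.0])
b = np.array([10.0, 20.0, 10.0, 20.0])

# Normalizing the concatenation squashes a's values by b's scale ...
x = znorm(np.concatenate([a, b]))

# ... so the slice of x that came from a looks nothing like a
# normalized on its own, and motifs found in x may not match in a.
print(x[:len(a)])  # approx. [-0.95, -0.82, -0.95, -0.82]
print(znorm(a))    # [-1.,  1., -1.,  1.]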

Is that topic touched in one of the papers?

Ezekiel-Kruglick commented 7 years ago

@MaxBenChrist To some degree, normalization is one of the reasons for and tradeoff drivers behind SAX, so there is a lot of discussion about how and when to do it across the work from Keogh's lab.

I think we have to normalize each of the time series a, b, c individually, so that the concatenation of [a, b, c] is a concatenation of normalized time series.

That would certainly make sense for the scenarios I've been thinking of and my planned usage. Inverting the question: can we think of any cases where that would mess things up?

MaxBenChrist commented 7 years ago

@MaxBenChrist To some degree, normalization is one of the reasons for and tradeoff drivers behind SAX, so there is a lot of discussion about how and when to do it across the work from Keogh's lab. I think we have to normalize each of the time series a, b, c individually, so that the concatenation of [a, b, c] is a concatenation of normalized time series. That would certainly make sense for the scenarios I've been thinking of and my planned usage. Inverting the question: can we think of any cases where that would mess things up?

I spent some time thinking about this. I think we should not only normalize each time series individually but also forbid motifs that span the array border.

Here is an example with non-normalized time series:

a = [1, 2]
b = [0, 1, 0, 0]
x = a + b              # x = [1, 2, 0, 1, 0, 0]

In this case, [2, 0, 1] is not a valid motif.

I am sure we can implement that quite easily.

Ezekiel-Kruglick commented 7 years ago

@MaxBenChrist Yes, forbidding motifs from spanning the border is necessary. Can you think of a fast way to do it? I was thinking that in the motif generator we could test for the index being within a motif length of the end, but testing is nontrivial considering that different samples may have different lengths.

Ezekiel-Kruglick commented 7 years ago

@MaxBenChrist Okay, I have found a reference in a tutorial by Keogh et al. that says, "Essentially all datasets must have every subsequence z-normalized." (They then mention, "There are a handful of occasions where it does not make sense to z-normalize, but in those cases, similarity search does not make sense either.")

So I would say the plan to normalize each time series agrees with best practices.

MaxBenChrist commented 7 years ago

So I would say the plan to normalize each time series agrees with best practices.

Alright. The normalization will be quite easy to add.

Yes, forbidding motifs from spanning the border is necessary. Can you think of a fast way to do it? I was thinking that in the motif generator we could test for the index being within a motif length of the end, but testing is nontrivial considering that different samples may have different lengths.

I started sketching a possible way to realize this. You can find it on the multimotifs branch. However, I have not tested or even tried to use it yet. Time is limited, as always :-/

In my solution, I am assuming that the data comes in as a list of time series. I flatten this list but save where the borders are. Then I use the existing methods on the flat data sequence.

What do you think?

Ezekiel-Kruglick commented 7 years ago

@MaxBenChrist Sorry for the long absence, the main job got in the way. I cloned and was looking at working on this; which branch should I work from? I see ongoing work in multimotifs that looks quite significant, is that where I should work?

MaxBenChrist commented 7 years ago

Sorry for the long absence, the main job got in the way. I cloned and was looking at working on this; which branch should I work from? I see ongoing work in multimotifs that looks quite significant, is that where I should work?

Nice to hear from you again. :D Yes, on the multimotifs branch I added a mechanism (based on the good_start array) that prevents motifs from spanning over borders. The good_start list contains all indices of the array of concatenated subsequences where a motif can start without spanning over a border.

At the moment it only works for subsequences of the same length, but we can easily extend that to differently sized sequences.
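
To illustrate the idea, here is a minimal reconstruction from the description above (not the actual multimotifs code):

import numpy as np

def flatten_with_good_starts(series_list, motif_length):
    # Concatenate all time series, but record which start indices keep
    # a window of motif_length entirely inside a single series.
    flat = np.concatenate(series_list)
    good_start = []
    offset = 0
    for s in series_list:
        good_start.extend(range(offset, offset + len(s) - motif_length + 1))
        offset += len(s)
    return flat, good_start

# With the earlier example a = [1, 2], b = [0, 1, 0, 0] and motif_length
# 3, good_start is [2, 3]: the border-spanning candidate [2, 0, 1]
# (start index 1) is excluded.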

Ezekiel-Kruglick commented 7 years ago

I think I have resolved the remaining TODOs. I am having trouble with your added good_start portion, though. The reason it breaks tests is that it expects the data to come in as a series of lists (assuming I have that right)? We pass data in as a single long list with sample IDs in another column; is it internally turned into a series of lists when passed to the feature generators? I didn't know that and will need to make some changes if that is the case.

MaxBenChrist commented 7 years ago

We pass data in as a single long list with sample IDs in another column; is it internally turned into a series of lists when passed to the feature generators? I didn't know that and will need to make some changes if that is the case.

Inside tsfresh, we actually operate on chunks, which are triples of id, kind, and the time series. So the long column with the time series data is split into sub-series to generate all the chunks.

Sooo, it would be beneficial if we also operated on such sub-series. However, it is not mandatory; we can also operate on a single time series with another series storing the id.
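
In terms of the long input format, generating the chunks roughly amounts to a group-by; a sketch with illustrative column names:

import pandas as pd

# Long format: one row per measurement.
df = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2, 2],
    "kind":  ["temperature"] * 6,
    "value": [1.0, 2.0, 1.0, 0.0, 1.0, 0.0],
})

# One chunk per (id, kind): the triple described above.
chunks = [
    (sample_id, kind, group["value"].reset_index(drop=True))
    for (sample_id, kind), group in df.groupby(["id", "kind"])
]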

Ezekiel-Kruglick commented 7 years ago

Okay, looks like I need to go learn more about that. Is each chunk guaranteed to be a single id and kind, and the whole of that single id and kind? So it's a single "sample" or "experiment", one might say?

MaxBenChrist commented 7 years ago

Okay, looks like I need to go learn more about that. Is each chunk guaranteed to be a single id and kind, and the whole of that single id and kind? So it's a single "sample" or "experiment", one might say?

Yes, exactly.

Ezekiel-Kruglick commented 7 years ago

@MaxBenChrist
I think perhaps we're talking past each other. Didn't we discuss and agree that the initial motif finding would be done on the whole data stream instead of on chunks, specifically because finding motifs shared by different samples is far more interesting than motifs constrained within a single sample? Then we need to pass those motifs into the main feature finding system, which breaks things up into chunks and does multiprocessing; at that point we are dealing with chunks and generating chunk-specific features determined by the motifs. Right?
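
In rough pseudocode, the two-phase flow just described (stub data, names purely illustrative):

import numpy as np

# Phase 1: motif finding runs once on the whole data stream; the
# list below stands in for the result of the real search.
motifs = [np.array([0.0, 1.0, 0.0])]

# Phase 2: the chunked, multiprocessed feature pass sees one
# (id, kind, series) triple at a time and derives one feature per
# motif, e.g. the minimal window distance to it.
chunks = [(1, "pressure", np.array([0.0, 1.0, 0.0, 0.5])),
          (2, "pressure", np.array([5.0, 5.0, 5.0, 5.0]))]

features = {}
for sample_id, kind, ts in chunks:
    for k, m in enumerate(motifs):
        dist = min(np.linalg.norm(ts[i:i + len(m)] - m)
                   for i in range(len(ts) - len(m) + 1))
        features[(sample_id, kind, "motif_%d" % k)] = dist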

But the good_start portion in the multimotifs branch assumes things are already chunked. So perhaps in the pre-processing module you've been working on (which I don't think I've seen) there is an intermediate stage where the passed-in data is a data structure containing all chunks of a particular type? Is that in the branch I have? If you point me at where you want to call the motif finding, that would be very useful.

MaxBenChrist commented 7 years ago

Ah, I see your point.

I think we should have a Hangouts session and discuss this in person. I would like to explain a few things, and writing everything down would take quite some time.

Are you free sometime next week? The mail address I use for Hangouts is max.christ@me.com

nils-braun commented 5 years ago

Is there any news on this? If yes, please reopen. If not, we'll leave it closed.