blue-yonder / tsfresh

Automatic extraction of relevant features from time series:
http://tsfresh.readthedocs.io
MIT License

Efficient rolling window of time series (e.g. while using temporary results) #86

Open xgdgsc opened 7 years ago

xgdgsc commented 7 years ago

I want to extract features from a rolling window over a table whose columns are several time series, and then make predictions based on the time series in that window. As far as I understand the docs, I currently have to extract the time series and tile them as in the example, which duplicates a lot of data because the windows overlap and doesn't seem memory efficient. Is there a rolling window API, or a better way to do this?

Thanks!

MaxBenChrist commented 7 years ago

We have not yet implemented a rolling window API to do that efficiently.

To implement such an API, one would have to decide for every feature calculator whether it can reuse the result of the previous window for the current window. For some features this is easy (maximum, mean, ...), but for others it is not trivially possible (median, wavelet coefficients, ...).
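
As a minimal sketch of what "reusing the result of the last window" could look like for the mean: the running sum is updated by adding the value that enters the window and subtracting the one that leaves it. The function names are hypothetical and this is plain Python, not tsfresh code.

```python
from collections import deque


def rolling_mean_incremental(values, window):
    """Yield the mean of each full window, reusing the previous window's sum."""
    buf = deque()
    total = 0.0
    for v in values:
        buf.append(v)
        total += v
        if len(buf) > window:
            total -= buf.popleft()  # subtract the value that left the window
        if len(buf) == window:
            yield total / window


def rolling_mean_naive(values, window):
    """Recompute every window's mean from scratch, for comparison."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


values = [11, 9, 67, 45, 11, 9]
incremental = list(rolling_mean_incremental(values, 3))
naive = rolling_mean_naive(values, 3)
```

Both versions produce the same means, but the incremental one does a constant amount of work per step regardless of the window size.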

ClimbsRocks commented 7 years ago

This is pretty critical for all the problems I'm working on as well.

MaxBenChrist commented 7 years ago

For most use cases that involve forecasting time series, this can reduce the time needed to calculate the features.

But as stated above, for a class of features it is mathematically impossible to use auxiliary results from the last window.
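
One way to read this limitation, sketched for the median: the previous median together with the entering and leaving values is not enough to determine the new median. The two windows below are made-up data and `slide` is a hypothetical helper. This does not rule out approaches that keep the whole window in a sorted structure; it only shows that a single carried-over number is insufficient.

```python
from statistics import median


def slide(window, entering):
    """Drop the oldest value and append the entering one."""
    return window[1:] + [entering]


# Two windows with the same median (5) and the same leaving value (1)
window_a = [1, 5, 9]
window_b = [1, 5, 6]
same_start = median(window_a) == median(window_b) == 5

# The same value (10) enters both windows
next_a = median(slide(window_a, 10))  # window becomes [5, 9, 10]
next_b = median(slide(window_b, 10))  # window becomes [5, 6, 10]
```

Here `next_a` and `next_b` differ even though the old median, the leaving value, and the entering value are identical.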

MaxBenChrist commented 7 years ago

If one of you wants to implement this, I would be glad to help with the design decisions. I will probably not have time for it during the next few months.

MaxBenChrist commented 7 years ago

Maybe we should provide a wrapper for the translation of timestamp, value combinations into rolling window time series.

It is pretty straightforward to implement with http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shift.html applied to a groupby. (http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.core.groupby.DataFrameGroupBy.shift.html)

Afterwards one has to drop NaNs. I already did this a few times. I have to check if I can dig up some snippets.
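
A rough sketch of the shift-then-drop-NaNs idea, assuming a toy frame with `id`, `sort` and `val` columns and a window of 3. The `val_lag*` column names are made up for illustration, and this wide layout is only one possible variant.

```python
import pandas as pd

df = pd.DataFrame({
    "id":   [10, 10, 10, 10, 500, 500, 500, 500],
    "sort": [1, 2, 3, 4, 1, 2, 3, 4],
    "val":  [11, 9, 67, 45, 11, 9, 67, 45],
})

window = 3
for lag in range(1, window):
    # Shift within each id so values never leak from one series into another
    df["val_lag{}".format(lag)] = df.groupby("id")["val"].shift(lag)

# Rows without a full history contain NaN and are dropped
windows = df.dropna().reset_index(drop=True)
```

Each remaining row holds one complete window: the current value plus its two predecessors from the same `id`.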

MaxBenChrist commented 7 years ago

Let's tackle this. The following code will do the transformation:

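import numpy as np
import pandas as pd

# Toy data: two ids (10, 500) and two kinds ("a", "b"), four time steps each
cid = np.repeat([10, 500] * 2, 4)
csort = [1, 2, 3, 4] * 4
cval = [11, 9, 67, 45] * 4
ckind = np.repeat(["a", "b"], 8)
df = pd.DataFrame({"id": cid, "sort": csort, "val": cval, "kind": ckind})

# Shifting by more than the longest series only yields empty frames,
# so the number of shifts is bounded by the longest (id, kind) group
n = df.groupby(["id", "kind"]).size().max()
lst_df = []

for i in range(n):
    # Shift each (id, kind) series up by i steps; the tail becomes NaN
    df_temp = df.groupby(["id", "kind"]).shift(-i)
    # Label every shifted copy as its own time series
    df_temp["id"] = "id=" + df.id.map(str) + ", shift={}".format(i)
    df_temp["kind"] = df.kind
    df_temp.dropna(inplace=True)
    lst_df.append(df_temp)

df_ready = pd.concat(lst_df).reset_index()

MaxBenChrist commented 7 years ago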

Before

In [21]: df
Out[21]:
     id kind  sort  val
0    10    a     1   11
1    10    a     2    9
2    10    a     3   67
3    10    a     4   45
4   500    a     1   11
5   500    a     2    9
6   500    a     3   67
7   500    a     4   45
8    10    b     1   11
9    10    b     2    9
10   10    b     3   67
11   10    b     4   45
12  500    b     1   11
13  500    b     2    9
14  500    b     3   67
15  500    b     4   45

and afterwards

    index  sort   val               id kind
0       0   1.0  11.0   id=10, shift=0    a
1       1   2.0   9.0   id=10, shift=0    a
2       2   3.0  67.0   id=10, shift=0    a
3       3   4.0  45.0   id=10, shift=0    a
4       4   1.0  11.0  id=500, shift=0    a
5       5   2.0   9.0  id=500, shift=0    a
6       6   3.0  67.0  id=500, shift=0    a
7       7   4.0  45.0  id=500, shift=0    a
8       8   1.0  11.0   id=10, shift=0    b
9       9   2.0   9.0   id=10, shift=0    b
10     10   3.0  67.0   id=10, shift=0    b
11     11   4.0  45.0   id=10, shift=0    b
12     12   1.0  11.0  id=500, shift=0    b
13     13   2.0   9.0  id=500, shift=0    b
14     14   3.0  67.0  id=500, shift=0    b
15     15   4.0  45.0  id=500, shift=0    b
16      0   2.0   9.0   id=10, shift=1    a
17      1   3.0  67.0   id=10, shift=1    a
18      2   4.0  45.0   id=10, shift=1    a
19      4   2.0   9.0  id=500, shift=1    a
20      5   3.0  67.0  id=500, shift=1    a
21      6   4.0  45.0  id=500, shift=1    a
22      8   2.0   9.0   id=10, shift=1    b
23      9   3.0  67.0   id=10, shift=1    b
24     10   4.0  45.0   id=10, shift=1    b
25     12   2.0   9.0  id=500, shift=1    b
26     13   3.0  67.0  id=500, shift=1    b
27     14   4.0  45.0  id=500, shift=1    b
28      0   3.0  67.0   id=10, shift=2    a
29      1   4.0  45.0   id=10, shift=2    a
30      4   3.0  67.0  id=500, shift=2    a
31      5   4.0  45.0  id=500, shift=2    a
32      8   3.0  67.0   id=10, shift=2    b
33      9   4.0  45.0   id=10, shift=2    b
34     12   3.0  67.0  id=500, shift=2    b
35     13   4.0  45.0  id=500, shift=2    b
36      0   4.0  45.0   id=10, shift=3    a
37      4   4.0  45.0  id=500, shift=3    a
38      8   4.0  45.0   id=10, shift=3    b
39     12   4.0  45.0  id=500, shift=3    b
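
To illustrate how such a rolled frame might be consumed: every `id` value now names one shifted sub-series, so per-window features can be computed by grouping on it. The sketch below uses a plain pandas aggregation as a stand-in for tsfresh's feature extraction, so that it has no dependency on tsfresh. It rebuilds a smaller frame (one kind only), and the chosen aggregates are just examples.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": np.repeat([10, 500], 4),
    "sort": [1, 2, 3, 4] * 2,
    "val": [11, 9, 67, 45] * 2,
})

pieces = []
for i in range(4):
    # Same transformation as above, without the kind column
    shifted = df.groupby("id").shift(-i)
    shifted["id"] = "id=" + df["id"].map(str) + ", shift={}".format(i)
    pieces.append(shifted.dropna())
df_ready = pd.concat(pieces).reset_index(drop=True)

# One row of features per shifted sub-series
features = df_ready.groupby("id")["val"].agg(["max", "mean", "median"])
```

With tsfresh itself, the rolled frame would presumably be passed to the feature extraction with `id` as the id column and `sort` as the sort column.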

MaxBenChrist commented 7 years ago

I don't have the time to write unit tests and think about where to put this. @jneuff, @moritzgelb or @nils-braun, can one of you add the snippet to a PR?

nils-braun commented 7 years ago

I can tackle this tomorrow or on Saturday :-) If someone is faster, no problem

nils-braun commented 7 years ago

OK, I have started a branch and am working on this. It still needs some documentation, but it will be ready to go this weekend!

nils-braun commented 7 years ago

This should now be possible in the HEAD version :-) Still needs an example notebook, but you can already read it here: http://tsfresh.readthedocs.io/en/latest/text/rolling.html

nils-braun commented 7 years ago

I will leave this issue open, as we may implement a more efficient solution later.

ClimbsRocks commented 7 years ago

This is awesome! Thanks for the great work on this, team. You've now allowed me to use tsfresh with an entirely new class of projects.