Open xgdgsc opened 7 years ago
We have not yet implemented a rolling window API to do that efficiently.
To implement such an API, one would have to decide for every feature calculator whether it can use the result of the last window for the current window. For some features this can be done easily (maximum, mean, ...), but for others it is not trivially possible (median, wavelet coefficients, ...).
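To make the distinction concrete, here is a minimal sketch in plain Python. It is not tsfresh code, and the function names `rolling_mean_incremental` and `rolling_median_naive` are made up for illustration. The mean can be updated from the previous window's running sum in constant time per step, while the naive median below simply recomputes each window from scratch.

```python
from collections import deque
from statistics import median


def rolling_mean_incremental(values, window):
    """Yield the mean of every full window, reusing the previous window's sum."""
    buf = deque()
    total = 0.0
    for v in values:
        buf.append(v)
        total += v
        if len(buf) > window:
            # Drop the value that just left the window instead of re-summing
            total -= buf.popleft()
        if len(buf) == window:
            yield total / window


def rolling_median_naive(values, window):
    """Yield the median of every full window, recomputed from scratch each time."""
    values = list(values)
    for end in range(window, len(values) + 1):
        yield median(values[end - window:end])


vals = [11, 9, 67, 45]
means = list(rolling_mean_incremental(vals, 2))
medians = list(rolling_median_naive(vals, 3))
```

A per-calculator flag stating whether such an update rule exists would be one possible way to let a rolling API choose between the two paths; that is only a design idea, not something tsfresh provides.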
This is pretty critical for all the problems I'm working on as well.
For most use cases that involve forecasting time series, this can reduce the time needed to calculate the features.
But as stated above, for a class of features it is mathematically impossible to use auxiliary results from the last window.
If any of you wants to implement this, I would be glad to help with the design decisions. I will probably not have time for this during the next months.
Maybe we should provide a wrapper for the translation of timestamp, value combinations into rolling window time series.
It is pretty straightforward to implement with http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.shift.html applied to a groupby. (http://pandas.pydata.org/pandas-docs/version/0.17.0/generated/pandas.core.groupby.DataFrameGroupBy.shift.html)
Afterwards one has to drop NaNs. I already did this a few times. I have to check if I can dig up some snippets.
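Let's tackle this. The following code will do the transformation:

    import numpy as np
    import pandas as pd

    # Toy data: two ids (10, 500), two kinds ("a", "b"), four time steps each
    cid = np.repeat([10, 500] * 2, 4)
    csort = [1, 2, 3, 4] * 4
    cval = [11, 9, 67, 45] * 4
    ckind = np.repeat(["a", "b"], 8)
    df = pd.DataFrame({"id": cid, "sort": csort, "val": cval, "kind": ckind})

    # Longest series length; larger shifts would only yield empty frames
    n = df.groupby(["id", "kind"]).size().max()

    lst_df = []
    for i in range(n):
        # Shift each (id, kind) series up by i rows; the tail becomes NaN
        df_temp = df.groupby(["id", "kind"]).shift(-i)
        # shift() drops the grouping columns, so restore them with a new id per shift
        df_temp["id"] = "id=" + df.id.map(str) + ", shift={}".format(i)
        df_temp["kind"] = df.kind
        df_temp.dropna(inplace=True)
        lst_df.append(df_temp)

    df_ready = pd.concat(lst_df).reset_index()
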
Before

    In [21]: df
    Out[21]:
         id kind  sort  val
    0    10    a     1   11
    1    10    a     2    9
    2    10    a     3   67
    3    10    a     4   45
    4   500    a     1   11
    5   500    a     2    9
    6   500    a     3   67
    7   500    a     4   45
    8    10    b     1   11
    9    10    b     2    9
    10   10    b     3   67
    11   10    b     4   45
    12  500    b     1   11
    13  500    b     2    9
    14  500    b     3   67
    15  500    b     4   45

and afterwards

        index  sort   val               id kind
    0       0   1.0  11.0   id=10, shift=0    a
    1       1   2.0   9.0   id=10, shift=0    a
    2       2   3.0  67.0   id=10, shift=0    a
    3       3   4.0  45.0   id=10, shift=0    a
    4       4   1.0  11.0  id=500, shift=0    a
    5       5   2.0   9.0  id=500, shift=0    a
    6       6   3.0  67.0  id=500, shift=0    a
    7       7   4.0  45.0  id=500, shift=0    a
    8       8   1.0  11.0   id=10, shift=0    b
    9       9   2.0   9.0   id=10, shift=0    b
    10     10   3.0  67.0   id=10, shift=0    b
    11     11   4.0  45.0   id=10, shift=0    b
    12     12   1.0  11.0  id=500, shift=0    b
    13     13   2.0   9.0  id=500, shift=0    b
    14     14   3.0  67.0  id=500, shift=0    b
    15     15   4.0  45.0  id=500, shift=0    b
    16      0   2.0   9.0   id=10, shift=1    a
    17      1   3.0  67.0   id=10, shift=1    a
    18      2   4.0  45.0   id=10, shift=1    a
    19      4   2.0   9.0  id=500, shift=1    a
    20      5   3.0  67.0  id=500, shift=1    a
    21      6   4.0  45.0  id=500, shift=1    a
    22      8   2.0   9.0   id=10, shift=1    b
    23      9   3.0  67.0   id=10, shift=1    b
    24     10   4.0  45.0   id=10, shift=1    b
    25     12   2.0   9.0  id=500, shift=1    b
    26     13   3.0  67.0  id=500, shift=1    b
    27     14   4.0  45.0  id=500, shift=1    b
    28      0   3.0  67.0   id=10, shift=2    a
    29      1   4.0  45.0   id=10, shift=2    a
    30      4   3.0  67.0  id=500, shift=2    a
    31      5   4.0  45.0  id=500, shift=2    a
    32      8   3.0  67.0   id=10, shift=2    b
    33      9   4.0  45.0   id=10, shift=2    b
    34     12   3.0  67.0  id=500, shift=2    b
    35     13   4.0  45.0  id=500, shift=2    b
    36      0   4.0  45.0   id=10, shift=3    a
    37      4   4.0  45.0  id=500, shift=3    a
    38      8   4.0  45.0   id=10, shift=3    b
    39     12   4.0  45.0  id=500, shift=3    b

I don't have the time to write unit tests and think about where to put this. @jneuff, @moritzgelb or @nils-braun, can one of you add the snippet to a PR?
I can tackle this tomorrow or on Saturday :-) If someone is faster, no problem
OK, I have started a branch and am working on this. It still needs some documentation, but it will be ready to go this weekend!
This should now be possible in the HEAD version :-) Still needs an example notebook, but you can already read it here: http://tsfresh.readthedocs.io/en/latest/text/rolling.html
I will leave this issue open, as we may implement a more efficient solution later.
This is awesome! Thanks for the great work on this, team. You've now allowed me to use tsfresh with an entire new class of projects.
I want to extract features from a rolling window over a table whose columns are several time series, and then do some prediction based on the time series in that window. Currently, as far as I understand the docs, I have to extract the time series and tile them as in the example, which creates a lot of duplicated data because the rolling windows overlap, so it doesn't seem memory efficient. Is there a rolling window API or a better way to do it?
Thanks!
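On the memory concern: outside of tsfresh, one possible way to get overlapping windows without copying the data is NumPy's `sliding_window_view`, which returns a read-only strided view onto the original array. The sketch below is only an illustration with made-up toy values and window size; it computes two simple per-window statistics directly and does not plug into tsfresh's feature extraction.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Toy series and window length, chosen only for illustration
values = np.array([11, 9, 67, 45, 30, 12], dtype=float)
window = 3

# Each row is one window; no data is duplicated, the rows are views into `values`
windows = sliding_window_view(values, window_shape=window)
shares_buffer = np.shares_memory(windows, values)

# Simple per-window statistics computed along the window axis
window_max = windows.max(axis=1)
window_mean = windows.mean(axis=1)
```

Operations that need a contiguous copy (or writing into the windows) will still materialize the data, so the saving only holds while you work on the view itself.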