Open marketneutral opened 4 years ago
Hi, you're correct that it would take a decent amount of effort to modify skits to handle multiple time series. I believe that you would end up wanting to model the problem similarly to a Vector Autoregression.
One big question that determines the complexity of this is whether or not your multiple time series are sampled at the same time. If they are not, then this problem gets pretty difficult because you have to be very careful in how everything gets merged such that future data is not leaked into the past.
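As a rough illustration of merging series sampled at different times without leaking future data, here is a minimal sketch using pandas' merge_asof. The two frames and their column names are made up for the example and are not part of skits.

```python
import pandas as pd

# Hypothetical data: series "a" is sampled daily, series "b" at irregular times.
a = pd.DataFrame(
    {
        "time": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"]),
        "a": [1.0, 2.0, 3.0],
    }
)
b = pd.DataFrame(
    {
        "time": pd.to_datetime(["2019-12-31 18:00", "2020-01-02 12:00"]),
        "b": [10.0, 20.0],
    }
)

# direction="backward" pairs each row of `a` with the most recent row of `b`
# at or before that timestamp, so no value of `b` from the future is used.
merged = pd.merge_asof(a, b, on="time", direction="backward")
print(merged)
```

If a value stamped at exactly the same time would not really have been known yet, passing allow_exact_matches=False makes the match strictly earlier.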
In the example that you've shown, it appears that each time series is sampled at the same time (day). Off the top of my head, I'm not entirely sure of the best way to extend skits to handle this (and I'm in general still not too happy with the skits API). The simplest way to handle this might be to build something analogous to the scikit-learn FeatureUnion which would take in multiple skits ForecasterPipelines and concatenate both their transformed X matrices and their transformed y arrays.
Actually, now that I'm looking at it, I think you could do something like the following. This would allow you to at least fit an estimator that takes in all time series and predicts for all time series. You would then have to write your own forecasting function to generate future forecasts. Hopefully, this can serve as a starting point:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import FeatureUnion

from skits.feature_extraction import AutoregressiveTransformer
from skits.pipeline import ForecasterPipeline
from skits.preprocessing import ReversibleImputer

df = pd.DataFrame(
    {
        "date": [1, 1, 1, 2, 2, 2],
        "ts_name": ["A", "B", "C", "A", "B", "C"],
        "Y_t": [1.1, 2.3, 3.1, 1.2, 2.2, 3.3],
    }
)

Xts = []
yts = []
for ts_name, group in df.groupby("ts_name"):
    pipeline = ForecasterPipeline(
        [
            (
                "features",
                FeatureUnion(
                    [("ar_transformer", AutoregressiveTransformer(num_lags=1))]
                ),
            ),
            ("post_lag_imputer", ReversibleImputer()),
        ]
    )
    y = group["Y_t"].values
    X = y[:, np.newaxis].copy()
    Xt = pipeline.fit_transform(X, y)
    yt = pipeline.transform_y(y)
    if yt.ndim == 1:
        yt = yt[:, np.newaxis]
    Xts.append(Xt)
    yts.append(yt)

Xt = np.hstack(Xts)
yt = np.hstack(yts)

estimator = LinearRegression()
estimator.fit(Xt, yt)
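For the "write your own forecasting function" step, one possible shape is a recursive forecast that feeds each prediction back in as the newest lag. The sketch below is only an illustration under assumptions: it does not use skits at all, make_lagged and forecast are hypothetical helpers, the data is random, and the feature columns are ordered date by date rather than series by series as in the hstack above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def make_lagged(Y, num_lags):
    """Build (X, y) from a (num_dates, num_ts) array (hypothetical helper).

    Each row of X holds the num_lags observations preceding one date,
    flattened oldest date first; the matching row of y holds that date's values.
    """
    n = Y.shape[0]
    X = np.stack([Y[t - num_lags:t].ravel() for t in range(num_lags, n)])
    return X, Y[num_lags:]


def forecast(estimator, history, num_lags, steps):
    """Recursively forecast `steps` dates, reusing predictions as lags."""
    window = history[-num_lags:].copy()
    preds = []
    for _ in range(steps):
        next_row = estimator.predict(window.ravel()[np.newaxis, :])[0]
        preds.append(next_row)
        # Drop the oldest date and append the new prediction.
        window = np.vstack([window[1:], next_row])
    return np.array(preds)


# Made-up toy data: 3 series observed on the same 30 dates.
rng = np.random.default_rng(0)
Y = np.cumsum(rng.normal(size=(30, 3)), axis=0)

num_lags = 2
X, y = make_lagged(Y, num_lags)
estimator = LinearRegression().fit(X, y)
future = forecast(estimator, Y, num_lags, steps=5)
print(future.shape)  # (5, 3)
```

Fitting every series on the lags of all series with least squares is essentially the Vector Autoregression setup mentioned earlier; errors compound as predictions are fed back in, so long horizons should be treated with care.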
Thanks so much for your thoughtful reply! That's a really nice pattern. I'll work with that!!
There is a small bug in your example. You need to use vstack, not hstack, as follows:
df = pd.DataFrame(
    {
        "date": [1, 1, 1, 2, 2, 2, 3, 3, 3],
        "ts_name": ["A", "B", "C", "A", "B", "C", "A", "B", "C"],
        "Y_t": [1.1, 2.3, 3.1, 1.2, 2.2, 3.3, 1.05, 2.25, 3.35],
    }
)

Xts = []
yts = []
for ts_name, group in df.groupby("ts_name"):
    pipeline = ForecasterPipeline(
        [
            (
                "features",
                FeatureUnion(
                    [("ar_transformer", AutoregressiveTransformer(num_lags=2))]
                ),
            ),
            ("post_lag_imputer", ReversibleImputer()),
        ]
    )
    y = group["Y_t"].values
    X = y[:, np.newaxis].copy()
    Xt = pipeline.fit_transform(X, y)
    yt = pipeline.transform_y(y)
    if yt.ndim == 1:
        yt = yt[:, np.newaxis]
    Xts.append(Xt)
    yts.append(yt)

Xt = np.vstack(Xts)
yt = np.vstack(yts)

estimator = LinearRegression()

   date ts_name   Y_t
0     1       A  1.10
1     1       B  2.30
2     1       C  3.10
3     2       A  1.20
4     2       B  2.20
5     2       C  3.30
6     3       A  1.05
7     3       B  2.25
8     3       C  3.35

And Xt should have shape (num_dates * num_ts, num_lags):

Xt
array([[1.15, 1.15],
       [1.15, 1.15],
       [1.1 , 1.2 ],
       [2.25, 2.25],
       [2.25, 2.25],
       [2.3 , 2.2 ],
       [3.2 , 3.2 ],
       [3.2 , 3.2 ],
       [3.1 , 3.3 ]])

and you can see that the time series are stacked with lags across the columns.
It is odd, though. Take row 2, for example: this is date 2 for ts_name A. There is no lag 2 (it would be before the time series existed), and in that case I see you are filling with the average of date 1 and date 2. However, there is a lag 1 in the data: the lag 1 at date 2 for ts A is 1.10. Yet the feature matrix contains 1.15, which is again the average of date 1 and date 2. Is this a bug in the AutoregressiveTransformer?
I actually don't think that there is a bug in the code that I provided, unless I'm totally misunderstanding something. I believe the confusion comes from what we would expect Xt to look like.
As I understood the original request, you have multiple time series that each have values at the same point in time. You would like to use all of the time series at once to predict each of the time series' values. If that's the case, then I would expect Xt to have a number of rows equal to the number of time stamps, or number of dates in your example. The number of columns corresponds to the number of "features", which in this case would be the number of lags times the number of different time series. All told, Xt ~ (num_dates, num_ts * num_lags).
In the example that you provide, there are 3 different time series and 3 dates.
df = pd.DataFrame(
    {
        "date": [1, 1, 1, 2, 2, 2, 3, 3, 3],
        "ts_name": ["A", "B", "C", "A", "B", "C", "A", "B", "C"],
        "Y_t": [1.1, 2.3, 3.1, 1.2, 2.2, 3.3, 1.05, 2.25, 3.35],
    }
)
print(df.pivot(index="date", columns="ts_name", values="Y_t"))
# ts_name     A     B     C
# date
# 1        1.10  2.30  3.10
# 2        1.20  2.20  3.30
# 3        1.05  2.25  3.35

You also have num_lags=2. Xt should have shape (3, 2*3), and it does.
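The two shapes discussed in this thread come down to how the per-series blocks are stacked. The sketch below uses placeholder arrays (not real skits output) only to compare the resulting shapes.

```python
import numpy as np

num_dates, num_ts, num_lags = 3, 3, 2

# One placeholder (num_dates, num_lags) block per time series.
blocks = [np.full((num_dates, num_lags), float(i)) for i in range(num_ts)]

wide = np.hstack(blocks)  # one row per date, lags of every series side by side
long = np.vstack(blocks)  # one row per (series, date) pair, lags of one series

print(wide.shape)  # (3, 6)
print(long.shape)  # (9, 2)
```

With hstack, each row carries the lags of all series, so a single estimator can predict every series from every series. With vstack, each row only carries the lags of its own series, which amounts to one pooled model shared across series.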
One thing that might be helpful for seeing what's going on with the AutoregressiveTransformer would be to remove the ReversibleImputer after it so that you can see exactly where the null values are. If you do that, you'll see Xt is:
[[nan nan nan nan nan nan]
[nan nan nan nan nan nan]
[1.1 1.2 2.3 2.2 3.1 3.3]]
The first two rows are nan because the AutoregressiveTransformer fills in nan unless all lag values are available (this probably ought to be changed to only fill nan for unavailable lag values!). The last row contains the lags for the third date. The first two columns correspond to the A time series, the third and fourth correspond to B, and so on.
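To show what "only fill nan for unavailable lag values" could look like, here is a minimal sketch in plain pandas. It is not how AutoregressiveTransformer is implemented; the column names such as A_lag2 are made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "date": [1, 1, 1, 2, 2, 2, 3, 3, 3],
        "ts_name": ["A", "B", "C", "A", "B", "C", "A", "B", "C"],
        "Y_t": [1.1, 2.3, 3.1, 1.2, 2.2, 3.3, 1.05, 2.25, 3.35],
    }
)
wide = df.pivot(index="date", columns="ts_name", values="Y_t")

num_lags = 2
# For each series, oldest lag first, matching the column order described above.
# shift() leaves NaN only where that particular lag does not exist yet.
lagged = pd.concat(
    {
        f"{name}_lag{lag}": wide[name].shift(lag)
        for name in wide.columns
        for lag in range(num_lags, 0, -1)
    },
    axis=1,
)
print(lagged)
```

Here the row for date 2 keeps its lag-1 values (1.1, 2.3, 3.1) and has NaN only in the lag-2 columns, while the row for date 3 matches the last row shown above.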
Hi Ethan, I attended your talk at PyData NYC -- it was great! I am interested in using skits, but I have multiple time series of the form:

skits is set up to handle a single time series (the X matrix must in fact be a vector). Obviously it is a huge effort to modify it to handle multiple time series, but I might be able to try if you could give some advice on how you might go about that. Thanks.