EthanRosenthal / skits

scikit-learn-inspired time series
https://www.ethanrosenthal.com/2018/03/22/time-series-for-scikit-learn-people-part2/
MIT License

multiple time series #22

Open marketneutral opened 4 years ago

marketneutral commented 4 years ago

Hi Ethan, I attended your talk at PyData NYC -- it was great! I am interested in using skits, but I have multiple time series of the form:

date  ts_name  Y_t
day1  A        1.1
day1  B        2.3
day1  C        3.1
day2  A        1.2
day2  B        2.2
day2  C        3.3

skits is set up to handle a single time series (the X matrix must in fact be a vector). Obviously it would be a huge effort to modify it for multiple time series, but I might be able to try if you could give some advice on how you would go about it. Thanks.

EthanRosenthal commented 4 years ago

Hi, you're correct that it's a decent effort to modify skits to handle multiple time series. I believe that you would end up wanting to model the problem similarly to a Vector Autoregression.

One big question that determines the complexity of this is whether or not your multiple time series are sampled at the same time. If they are not, then this problem gets pretty difficult because you have to be very careful in how everything gets merged such that future data is not leaked into the past.
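For what it's worth, the irregularly-sampled case usually comes down to an as-of join. A minimal pandas sketch (made-up data, not part of skits) that aligns two series without leaking future values:

```python
import pandas as pd

# Hypothetical irregularly sampled series (made-up dates and values)
a = pd.DataFrame(
    {"date": pd.to_datetime(["2020-01-01", "2020-01-03", "2020-01-05"]),
     "a": [1.0, 2.0, 3.0]}
)
b = pd.DataFrame(
    {"date": pd.to_datetime(["2020-01-02", "2020-01-04"]),
     "b": [10.0, 20.0]}
)

# direction="backward" attaches, to each row of `a`, the most recent
# observation of `b` at or before that date -- never a future value.
merged = pd.merge_asof(a, b, on="date", direction="backward")
print(merged)
#         date    a     b
# 0 2020-01-01  1.0   NaN
# 1 2020-01-03  2.0  10.0
# 2 2020-01-05  3.0  20.0
```

Because the join only ever looks backward in time, the merged frame can never contain a `b` value from after the `a` timestamp, which is exactly the leakage guarantee you need.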

In the example that you've shown, it appears that each time series is sampled at the same times (each day). Off the top of my head, I'm not entirely sure of the best way to extend skits to handle this (and, in general, I'm still not too happy with the skits API). The simplest way might be to build something analogous to scikit-learn's FeatureUnion which would take in multiple skits ForecasterPipelines and concatenate both their transformed X matrices and their transformed y arrays.

Actually, now that I'm looking at it, I think you could do something like the following. This would allow you to at least fit an estimator that takes in all time series and predicts for all time series. You would then have to write your own forecasting function to generate future forecasts. Hopefully, this can serve as a starting point:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import FeatureUnion

from skits.feature_extraction import AutoregressiveTransformer
from skits.pipeline import ForecasterPipeline
from skits.preprocessing import ReversibleImputer

df = pd.DataFrame(
    {
        "date": [1, 1, 1, 2, 2, 2],
        "ts_name": ["A", "B", "C", "A", "B", "C"],
        "Y_t": [1.1, 2.3, 3.1, 1.2, 2.2, 3.3],
    }
)

Xts = []
yts = []
for ts_name, group in df.groupby("ts_name"):
    pipeline = ForecasterPipeline(
        [
            (
                "features",
                FeatureUnion(
                    [("ar_transformer", AutoregressiveTransformer(num_lags=1))]
                ),
            ),
            ("post_lag_imputer", ReversibleImputer()),
        ]
    )
    y = group["Y_t"].values
    X = y[:, np.newaxis].copy()

    Xt = pipeline.fit_transform(X, y)
    yt = pipeline.transform_y(y)

    if yt.ndim == 1:
        yt = yt[:, np.newaxis]

    Xts.append(Xt)
    yts.append(yt)

Xt = np.hstack(Xts)
yt = np.hstack(yts)

estimator = LinearRegression()

estimator.fit(Xt, yt)
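To make the intended shapes concrete: each per-series pipeline emits a (num_dates, num_lags) feature matrix, and np.hstack concatenates those column-wise, so the estimator sees one row per date. A numpy-only sketch with stand-in values (not using skits itself):

```python
import numpy as np

num_dates, num_lags, num_ts = 3, 2, 3

# Stand-ins for each pipeline's transformed output (random placeholder values)
Xts = [np.random.rand(num_dates, num_lags) for _ in range(num_ts)]
yts = [np.random.rand(num_dates, 1) for _ in range(num_ts)]

Xt = np.hstack(Xts)  # one row per date, num_ts * num_lags feature columns
yt = np.hstack(yts)  # one target column per time series

print(Xt.shape, yt.shape)  # (3, 6) (3, 3)
```

With that layout, a multi-output regressor like LinearRegression predicts all three series for a given date from all three series' lags at once.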

marketneutral commented 4 years ago

Thanks so much for your thoughtful reply! That's a really nice pattern. I'll work with that!!

marketneutral commented 4 years ago

There is a small bug in your example: you need to use vstack, not hstack. For example:

df = pd.DataFrame(
    {
        "date": [1, 1, 1, 2, 2, 2, 3, 3 ,3],
        "ts_name": ["A", "B", "C", "A", "B", "C","A", "B", "C"],
        "Y_t": [1.1, 2.3, 3.1, 1.2, 2.2, 3.3, 1.05, 2.25, 3.35],
    }
)

Xts = []
yts = []
for ts_name, group in df.groupby("ts_name"):
    pipeline = ForecasterPipeline(
        [
            (
                "features",
                FeatureUnion(
                    [("ar_transformer", AutoregressiveTransformer(num_lags=2))]
                ),
            ),
            ("post_lag_imputer", ReversibleImputer()),
        ]
    )
    y = group["Y_t"].values
    X = y[:, np.newaxis].copy()

    Xt = pipeline.fit_transform(X, y)
    yt = pipeline.transform_y(y)

    if yt.ndim == 1:
        yt = yt[:, np.newaxis]

    Xts.append(Xt)
    yts.append(yt)

Xt = np.vstack(Xts)
yt = np.vstack(yts)

estimator = LinearRegression()

And df is:

   date ts_name   Y_t
0     1       A  1.10
1     1       B  2.30
2     1       C  3.10
3     2       A  1.20
4     2       B  2.20
5     2       C  3.30
6     3       A  1.05
7     3       B  2.25
8     3       C  3.35

And Xt should have shape (num_dates * num_ts, num_lags):

Xt

array([[1.15, 1.15],
       [1.15, 1.15],
       [1.1 , 1.2 ],
       [2.25, 2.25],
       [2.25, 2.25],
       [2.3 , 2.2 ],
       [3.2 , 3.2 ],
       [3.2 , 3.2 ],
       [3.1 , 3.3 ]])

and you can see that the time series are stacked with lags across the columns.

It is odd, though, that in row 2, for example (date 2 for ts_name A), there is no lag 2 (it would fall before the series existed), and in that case I see you are filling with the average of date 1 and date 2. However, lag 1 does exist in the data: the lag-1 value at date 2 for ts A is 1.10, yet the feature matrix contains 1.15, which is again the average of date 1 and date 2. Is this a bug in the AutoregressiveTransformer?

EthanRosenthal commented 4 years ago

I actually don't think that there is a bug in the code that I provided, unless I'm totally misunderstanding something. I believe the confusion comes from what we would expect Xt to look like.

As I understood the original request, you have multiple time series that each have values at the same point in time. You would like to use all of the time series at once to predict each of the time series' values. If that's the case, then I would expect Xt to have a number of rows equal to the number of time stamps, or number of dates in your example. The number of columns corresponds to the number of "features" which in this case would be the number of lags times the number of different time series. All told, Xt ~ (num_dates, num_ts * num_lags).

In the example you provide, there are 3 different time series and 3 dates.

df = pd.DataFrame(
    {
        "date": [1, 1, 1, 2, 2, 2, 3, 3 ,3],
        "ts_name": ["A", "B", "C", "A", "B", "C","A", "B", "C"],
        "Y_t": [1.1, 2.3, 3.1, 1.2, 2.2, 3.3, 1.05, 2.25, 3.35],
    }
)

print(df.pivot(index="date", columns="ts_name", values="Y_t"))

#  ts_name     A     B     C
#  date                     
#  1        1.10  2.30  3.10
#  2        1.20  2.20  3.30
#  3        1.05  2.25  3.35

You also have num_lags=2. Xt should have shape (3, 2*3), and it does.

One thing that might be helpful to see what's going on with the AutoregressiveTransformer would be to remove the ReversibleImputer after it so that you can see exactly where the null values are. If you do that, you'll see Xt is:

[[nan nan nan nan nan nan]
 [nan nan nan nan nan nan]
 [1.1 1.2 2.3 2.2 3.1 3.3]]

The first two rows are nan because the AutoregressiveTransformer fills a row with nan unless all of its lag values are available (this probably ought to be changed so that only the unavailable lag values are filled with nan!). The last row contains the lags for the third date: the first two columns correspond to the A time series, the third and fourth correspond to B, and so on.
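That all-or-nothing nan behavior can be mimicked in plain numpy (a sketch of the idea, not skits' actual implementation):

```python
import numpy as np

def lag_features(y, num_lags):
    """Lag matrix where a row stays entirely nan unless every lag exists."""
    out = np.full((len(y), num_lags), np.nan)
    for t in range(num_lags, len(y)):
        out[t] = y[t - num_lags:t]  # [y[t-num_lags], ..., y[t-1]]
    return out

y = np.array([1.1, 1.2, 1.05])  # the A series from the example above
M = lag_features(y, num_lags=2)
# rows 0 and 1 are all-nan; row 2 holds the lags [1.1, 1.2]
print(M)
```

Changing the loop to start at t=1 and fill only the lags that do exist would give the per-value nan behavior suggested in the parenthetical above.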