dmbee / seglearn

Python module for machine learning time series:
https://dmbee.github.io/seglearn/
BSD 3-Clause "New" or "Revised" License
571 stars · 63 forks

Passing data to temporal_split and other functions #45

Closed adalseno closed 4 years ago

adalseno commented 4 years ago

Hi, I was following your example code (simple regression), but I'm stuck. I have a DataFrame of shape (1017, 15); the last column is the target, so I created two DataFrames, one for X (1017, 14) and one for y (1017,). I tried to pass those values to temporal_split, but I always get an error no matter what I do (passing the DataFrames, or passing them as lists). For example, passing them as lists gives:

KeyError: "None of [Int64Index([   0,    1,    2,    3,    4,    5,    6,    7,    8,    9,
            ...
            991,  992,  993,  994,  995,  996,  997,  998,  999, 1000],
           dtype='int64', length=1001)] are in the [columns]"

If, on the other hand, I pass them as DataFrames, I get:

AttributeError: 'DataFrame' object has no attribute 'ts_data'

The same holds true if I manually split the DataFrames and pass them to seg.fit_transform(X_train, y_train). I tried putting the date column in the df as well as in the index, but the error is still there. What's wrong?

Info of the DataFrame:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1017 entries, 896 to 1912
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   date          1017 non-null   datetime64[ns]
 1   id            1017 non-null   object        
 2   price         1017 non-null   float64       
 3   month         1017 non-null   int64         
 4   year          1017 non-null   int64         
 5   event_name_1  1017 non-null   int64         
 6   event_type_1  1017 non-null   int64         
 7   event_name_2  1017 non-null   int64         
 8   event_type_2  1017 non-null   int64         
 9   snap_CA       1017 non-null   int64         
 10  dow           1017 non-null   int64         
 11  is_weekend    1017 non-null   int64         
 12  is_holiday    1017 non-null   int64         
dtypes: datetime64[ns](1), float64(1), int64(10), object(1)
memory usage: 111.2+ KB

I tried passing X with the date as a column, as the index, or as a plain list. The same for y: I tried a Series, a DataFrame (date as column or as index), and a list, both with and without the date column. As you can see, there are no NaN values.

dmbee commented 4 years ago

The length of the dataframe or array-like object must correspond to the number of time series in the data set (not the number of samples in a single time series). So if you are working with a single time series, you can just put [ ] around it to make it a length-1 list.

Most data sets have many time series, which is the reason for this convention. These details are explained in the user guide:

https://dmbee.github.io/seglearn/user_guide.html
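A minimal numpy-only sketch of that convention, using shapes like the ones in this issue (split_one_series is a hypothetical stand-in for seglearn's temporal_split, shown only to illustrate the expected input shape):

```python
import numpy as np

# One multivariate time series: 1001 samples, 14 features.
x_train = np.random.rand(1001, 14)
y_train = np.random.rand(1001)

# seglearn functions expect a *collection* of series, so wrap each in a list:
X = [x_train]   # one time series -> length-1 list
y = [y_train]

# Hand-rolled illustration of what a temporal split does with these inputs:
# the single series is split along its time axis, train first, test last.
def split_one_series(X, y, test_size=0.25):
    cut = int(len(X[0]) * (1 - test_size))
    return [X[0][:cut]], [X[0][cut:]], [y[0][:cut]], [y[0][cut:]]

X_tr, X_te, y_tr, y_te = split_one_series(X, y)
print(X_tr[0].shape, X_te[0].shape)  # (750, 14) (251, 14)
```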

Let me know if this fixes your problems.

D

dmbee commented 4 years ago

e.g. a typical X_train with three time series would be shaped like this: [(100, 5), (150, 5), (200, 5)]

an X_train with one time series would look like this: [(100, 5)]

I usually use lists or numpy object arrays. If using pandas, again, you'll want each sample (row) to correspond to a time series.
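Those two containers can be built like this (a sketch with random data; the shapes match the example above):

```python
import numpy as np

# Three time series of different lengths, each with 5 features:
series = [np.random.rand(n, 5) for n in (100, 150, 200)]

# As a plain Python list (the usual choice):
X_train = series

# Or as a numpy object array, one series per element; object dtype is
# needed because the series have different lengths:
X_obj = np.empty(len(series), dtype=object)
for i, s in enumerate(series):
    X_obj[i] = s

print([s.shape for s in X_train])  # [(100, 5), (150, 5), (200, 5)]
```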

adalseno commented 4 years ago

Thank you very much for your kind and prompt reply. Doing X_train, X_test, y_train, y_test = temporal_split([x_train.values], [y_train.values], test_size=0.02) did the trick (simply using [x_train] does not work). The pipe went fine, and fit too, but now I get a new error. If I try score = pipe.score(X_test, y_test) I get:

ValueError                                Traceback (most recent call last)
<ipython-input> in <module>
----> 1 score = pipe.score(X_test, y_test)

~/opt/anaconda3/envs/joseml/lib/python3.7/site-packages/seglearn/pipe.py in score(self, X, y, sample_weight)
    279         """
    280
--> 281         Xt, yt, swt = self._transform(X, y, sample_weight)
    282
    283         self.N_test = len(yt)

~/opt/anaconda3/envs/joseml/lib/python3.7/site-packages/seglearn/pipe.py in _transform(self, X, y, sample_weight)
    139                 Xt, yt, swt = transformer.transform(Xt, yt, swt)
    140             else:
--> 141                 Xt = transformer.transform(Xt)
    142
    143         return Xt, yt, swt

~/opt/anaconda3/envs/joseml/lib/python3.7/site-packages/seglearn/transform.py in transform(self, X)
   1072         self._check_if_fitted()
   1073         Xt, Xc = get_ts_data_parts(X)
-> 1074         check_array(Xt, dtype='numeric', ensure_2d=False, allow_nd=True)
   1075
   1076         fts = np.column_stack([self.features[f](Xt) for f in self.features])

~/opt/anaconda3/envs/joseml/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    584                              " minimum of %d is required%s."
    585                              % (n_samples, array.shape, ensure_min_samples,
--> 586                                context))
    587
    588     if ensure_min_features > 0 and array.ndim == 2:

ValueError: Found array with 0 sample(s) (shape=(0, 200, 11)) while a minimum of 1 is required.

Pipe is:

from seglearn.pipe import Pype
from seglearn.transform import FeatureRep, Segment, last
from sklearn.linear_model import LinearRegression

pipe = Pype([('seg', Segment(width=200, overlap=0.5, y_func=last)),
             ('features', FeatureRep()),
             ('lin', LinearRegression())])

as in your example, and X_test and y_test come from temporal_split, so they are obviously not empty. For example, y_test is: [array([0., 4., 2., 3., 0., 1., 2., 0., 0., 0., 1., 1., 3., 0., 1., 1., 1., 3., 0., 1., 1.])]

dmbee commented 4 years ago

I assume your X_test, y_test are too small to segment with width 200. I should probably add a check in the transformer for that. I'll put it on the todo list.
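The arithmetic behind this is easy to check by hand. A rough sketch of the sliding-window count (the step formula approximates what a width/overlap segmenter does; it is shown here for intuition, not as seglearn's exact internals):

```python
def n_segments(n_samples, width, overlap):
    """Approximate number of sliding windows over one series."""
    step = max(1, int(width * (1 - overlap)))  # distance between window starts
    if n_samples < width:
        return 0  # series shorter than one window -> no segments at all
    return (n_samples - width) // step + 1

# With test_size=0.02 on ~1017 samples, the test split is ~21 samples long:
print(n_segments(21, width=200, overlap=0.5))    # 0 -> "Found array with 0 sample(s)"
print(n_segments(1001, width=200, overlap=0.5))  # 9
```

So the test split above cannot yield even a single width-200 window, which is exactly the shape=(0, 200, 11) in the traceback.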

dmbee commented 4 years ago

It may be that this package is not right for your application, though, based on the data you are looking at. Generally you should have hundreds or thousands of segments to train and test on. It's hard to know without more detail, but you may want to look at methods like ARIMA if you just have one series.

adalseno commented 4 years ago

Thanks again. Actually I have several thousand series to analyse, all with the same characteristics, which I can easily melt into a single df even though they are mostly independent. Since they are independent and all different, ARIMA would require calculating p and q for each one, which would take too much time; moreover, since they have seasonality, the model would have to be SARIMA, with even more calculations. Before training on the whole dataset, though, I wanted to test it with just one series. I reduced the segment size and now I don't get any more errors, but the prediction quality, at least for this one, is not exciting. I will test it on a bunch of series, but in the meantime, what could I do to improve?

dmbee commented 4 years ago

I wouldn't expect good results with one series. Sliding window segmentation doesn't make sense for every problem. It's great for things like earthquakes or activity recognition, where there is little or no time dependency outside the window. Generally:

- Make sure the window length is long enough to capture the dynamics needed for a sensible prediction.
- If the data is not regularly sampled, interpolate it to a fixed sampling rate so the window time is constant.
- Setting a high overlap is a good data augmentation strategy.
- Concatenating any available heuristics (e.g. season) to the calculated features is also very helpful.
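The interpolation step can be sketched with pandas (hypothetical toy data; the dates and values are made up just to show the resampling pattern):

```python
import pandas as pd

# Irregularly sampled series: gaps of 1, 3, and 4 days between observations.
idx = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05", "2020-01-09"])
s = pd.Series([1.0, 2.0, 5.0, 9.0], index=idx)

# Resample to a fixed daily rate, then time-interpolate the gaps,
# so every sliding window spans the same amount of real time.
regular = s.resample("D").asfreq().interpolate(method="time")
print(len(regular))  # 9 daily samples from Jan 1 through Jan 9
```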

Just a few thoughts. Good luck.