The length of the dataframe or array-like object must correspond to the number of time series in the data set (not the number of samples in a single time series). So if you are working with a single time series, you can just put [ ] around it to make it into a length 1 list.
Most data sets have many time series, which is the reason for the convention. These details are explained in the user guide:
https://dmbee.github.io/seglearn/user_guide.html
Let me know if this fixes your problems.
D
e.g. a typical X_train with three time series would be shaped like this: [(100, 5), (150, 5), (200, 5)]
an X_train with one time series would be shaped like this: [(100, 5)]
I usually use lists or numpy object arrays. If using pandas, again, you'll want each sample to correspond to a time series.
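For concreteness, here's a minimal sketch of the convention (shapes and variable names are just made up for illustration):

import numpy as np
from seglearn.split import temporal_split

# three independent time series, each with 5 variables but different lengths
X = [np.random.rand(100, 5), np.random.rand(150, 5), np.random.rand(200, 5)]
y = [np.random.rand(100), np.random.rand(150), np.random.rand(200)]

# a single series still needs to be wrapped in a length-1 list
X_single = [np.random.rand(100, 5)]
y_single = [np.random.rand(100)]

# temporal_split splits each series along the time axis
X_train, X_test, y_train, y_test = temporal_split(X_single, y_single, test_size=0.25)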
Thank you very much for your kind and prompt reply.
Doing:
X_train, X_test, y_train, y_test = temporal_split([x_train.values], [y_train.values], test_size=0.02)
did the trick (simply using [x_train] instead does not work).
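In other words, something like this (shapes made up to match my data):

import numpy as np
import pandas as pd
from seglearn.split import temporal_split

x_train = pd.DataFrame(np.random.rand(1017, 14))
y_train = pd.Series(np.random.rand(1017))

# wrap the underlying numpy arrays in a length-1 list, not the pandas objects
X_tr, X_te, y_tr, y_te = temporal_split([x_train.values], [y_train.values], test_size=0.02)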
Building the pipe went fine, and fitting did too, but now I get a new error. If I try:
score = pipe.score(X_test, y_test)
I get:
ValueError                                Traceback (most recent call last)
<ipython-input> in <module>
----> 1 score = pipe.score(X_test, y_test)

~/opt/anaconda3/envs/joseml/lib/python3.7/site-packages/seglearn/pipe.py in score(self, X, y, sample_weight)
    279         """
    280
--> 281         Xt, yt, swt = self._transform(X, y, sample_weight)
    282
    283         self.N_test = len(yt)

~/opt/anaconda3/envs/joseml/lib/python3.7/site-packages/seglearn/pipe.py in _transform(self, X, y, sample_weight)
    139             Xt, yt, swt = transformer.transform(Xt, yt, swt)
    140         else:
--> 141             Xt = transformer.transform(Xt)
    142
    143         return Xt, yt, swt

~/opt/anaconda3/envs/joseml/lib/python3.7/site-packages/seglearn/transform.py in transform(self, X)
   1072         self._check_if_fitted()
   1073         Xt, Xc = get_ts_data_parts(X)
-> 1074         check_array(Xt, dtype='numeric', ensure_2d=False, allow_nd=True)
   1075
   1076         fts = np.column_stack([self.features[f](Xt) for f in self.features])

~/opt/anaconda3/envs/joseml/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    584                              " minimum of %d is required%s."
    585                              % (n_samples, array.shape, ensure_min_samples,
--> 586                                context))
    587
    588     if ensure_min_features > 0 and array.ndim == 2:

ValueError: Found array with 0 sample(s) (shape=(0, 200, 11)) while a minimum of 1 is required.
Pipe is:
pipe = Pype([('seg', Segment(width=200, overlap=0.5, y_func=last)),
('features', FeatureRep()),
('lin', LinearRegression())])
as in your example, and X_test and y_test come from temporal_split; they are obviously not empty! For example, y_test is:
[array([0., 4., 2., 3., 0., 1., 2., 0., 0., 0., 1., 1., 3., 0., 1., 1., 1., 3., 0., 1., 1.])]
I assume your X_test, y_test are too small to segment with width 200. I should probably add a check in the transformer for that; I'll put it on the todo list.
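To illustrate, here's roughly what happens (sizes guessed from your traceback, which shows segmented test data of shape (0, 200, 11)):

import numpy as np
from seglearn.transform import Segment

# with test_size=0.02 on ~1000 rows, the test series has only ~20 samples
X_test = [np.random.rand(20, 11)]
y_test = [np.random.rand(20)]

seg = Segment(width=200, overlap=0.5)  # window is longer than the series
seg.fit(X_test, y_test)
Xs, ys, _ = seg.transform(X_test, y_test)
print(Xs.shape)  # (0, 200, 11): no window fits, hence the check_array error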
Based on the data you are looking at, though, this package may not be right for your application. Generally you should have hundreds or thousands of segments to train and test on. It's hard to know without more detail, but you may want to look at methods like ARIMA if you just have one series.
Thanks again. Actually, I have several thousand series to analyse, all with the same characteristics, so I can easily melt them into a single df even though they are mostly independent. And since they are independent and all different, ARIMA would require calculating p and q for each one, which would take too much time; moreover, since they have seasonality, the model would be a SARIMA, with even more calculations. But before training on the whole dataset I wanted to test it with just one series. I reduced the segment size and now I don't get any more errors, but the prediction quality, at least for this series, is not exciting. I will test it on a bunch of series, but in case the results stay poor, what can I do to improve them?
I wouldn't expect good results with one series. Sliding window segmentation doesn't make sense for every problem. It's great for things like earthquakes or activity recognition, where there is little or no time dependency outside the window. Generally, you need to make sure the window length is long enough to incorporate enough of the dynamics for a sensible prediction. It's important to interpolate the samples (if not regularly sampled) to a fixed sampling rate so the window time is constant. Setting a high overlap is a good data augmentation strategy. Concatenating any available heuristics (e.g. season) to the calculated features is also very helpful.
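For instance, resampling to a fixed rate can be as simple as this (timestamps and sampling period made up):

import numpy as np

# hypothetical irregularly sampled series: t = timestamps, x = values
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 100, 500))
x = np.sin(t) + 0.1 * rng.standard_normal(500)

# resample to a fixed 0.2-unit period so every window spans the same time
t_fixed = np.arange(t.min(), t.max(), 0.2)
x_fixed = np.interp(t_fixed, t, x)

# then a higher overlap, e.g. Segment(width=200, overlap=0.75) instead of 0.5,
# yields more (correlated) windows from the same data -- a cheap augmentation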
Just a few thoughts. Good luck.
Hi, I was following your example code (simple regression), but I'm stuck. I have a DataFrame of shape (1017, 15). The last column is the target, so I created two dfs, one for X (1017, 14) and one for y (1017,). I tried to pass those values to temporal_split but I always get an error no matter what I do (passing the dfs, passing them as lists). Passing them as lists gives one error; if, on the other hand, I pass them as dfs I get a different one.
The same holds true if I manually split the DataFrames and pass them to seg.fit_transform(X_train, y_train). I tried to put the date column in the df as well as in the index, but the error is still there. What's wrong? Info of the DataFrame:
For X I tried a DataFrame with the date as a column or as the index, and a plain list. The same for y: I tried a Series, a DataFrame with a date column or date index, and a list, both with and without the date column. As you can see, there are no NaN values.