awslabs / gluonts

Probabilistic time series modeling in Python
https://ts.gluon.ai
Apache License 2.0

Handling multiple multi-variate time-series in a Dataset #1342

Open hashbangCoder opened 3 years ago

hashbangCoder commented 3 years ago

Apologies if this capability already exists; I went through the examples, issues (like #1095), and discussions but couldn't figure out an existing way of doing this satisfactorily.

Description

This is related to #695: a requirement to handle multiple multivariate time series wrapped in a ListDataset. As an example, consider dummy data of 10 multivariate time series, each of dimension (19, 300), i.e. length=300 and num_feat=19:

data = [{'start': ts, 'target': np.random.randn(19, 300), 'freq': freq} for _ in range(10)]
dataset = ListDataset(data, freq=freq, one_dim_target=False)

I cannot concatenate all 10 chunks into one because of irregular sampling. Example: data is sampled for a month, then offline for 2 months, then sampled again for a month, and so on. I have cleaned the dataset such that all 10 time-series chunks have the same freq.

Padding or interpolation isn't possible/ideal because of the long sampling breaks in the data. And each individual time series is sufficiently longer than prediction_len + context_len, so intra-time-series batch sampling is not an issue.

Is such a Dataset formulation currently possible? FWIW, I tried it as shown above and got this:

Exception: Reached maximum number of idle transformation calls.
This means the transformation looped over GLUONTS_MAX_IDLE_TRANSFORMS=100 inputs without returning any output.
This occurred in the following transformation: 
gluonts.transform.split.InstanceSplitter(dummy_value=0.0, forecast_start_field="forecast_start", future_length=10, instance_sampler=gluonts.transform.sampler.ExpectedNumInstanceSampler(axis=-1, min_past=338, min_future=10, num_instances=1.0, total_length=0, n=0), is_pad_field="is_pad", lead_time=0, output_NTC=True, past_length=338, start_field="start", target_field="target", time_series_fields=["time_feat", "observed_values"])
mbohlkeschneider commented 3 years ago

Hi @hashbangCoder,

Can you illustrate the problem with a data snippet? I'm not sure I understand it.

StatMixedML commented 3 years ago

@hashbangCoder

The error

Exception: Reached maximum number of idle transformation calls.

usually occurs when many of your time series are shorter than your forecasting horizon. You might want to try reducing the forecasting horizon and check whether the error still appears.
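
A quick sanity check along these lines can confirm whether that is the cause (rough sketch, reusing the data list from your original post; the exact minimum also depends on lags, so treat context_length + prediction_length as an approximation):

import numpy as np

# rough sketch: each target needs roughly context_length + prediction_length
# timestamps for the default sampler to draw a complete training window
context_length = 250      # whatever you pass to the estimator
prediction_length = 50

for i, entry in enumerate(data):   # `data` from the original post
    length = np.asarray(entry["target"]).shape[-1]
    if length < context_length + prediction_length:
        print(f"series {i} too short: {length} < {context_length + prediction_length}")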

hashbangCoder commented 3 years ago

Thanks for the response @mbohlkeschneider

Apologies for the delay in responding. My data is a multivariate time series that is irregularly sampled. For this example, I'm using a toy dataset. I've created a Python code snippet with comments that hopefully explains my issue better; it takes about 2-3 minutes to run.

I was working with pytorch-ts (which is built on gluonts), but while debugging I discovered the issue is with gluonts itself, so I managed to recreate it with the DeepVAR model.

I'm running this on CPU and Windows with Python 3.7.

import pandas as pd
import numpy as np
from gluonts.dataset.common import ListDataset
from gluonts.model.deepvar import DeepVAREstimator
from gluonts.mx.trainer import Trainer as GluonTrainer

# 1M timestamped dataset of 19 sensor values, but irregularly sampled
sensor_data = pd.DataFrame(np.random.randn(1000000, 19))
# `sample_inds` is list of tuples `[(start_ind1, end_ind1), (start_ind2, end_ind2), etc]` where each tuple is start and end index for subsampling by slicing from `sensor_data`
# such that within each subsequence (eg: `sensor_data.iloc[start_ind1: end_ind1, :]`) timestamps are evenly spaced
# all sub-sequences are of same length (assume 300)
# for this example, sample_inds is randomly generated
sample_inds = [(0, 300)]
# generate 100 more sub-sequences by randomly sampling indices such that every sub-sequence length == 300
for _ in range(100):
    low = np.random.randint(sample_inds[-1][1], sample_inds[-1][1] + 1000)
    high = low + 300
    sample_inds.append((low, high))

# assume freq is 1min, prediction/forecast length = 50, context length = 250
# so the model looks at 250mins of data (250 samples) and predicts 50 mins; total len = 300
freq = '1min'
prediction_len = 50

# create train dataset
forecast_train = []
ts = pd.Timestamp('24th Aug 2009')
for start_ind, end_ind in sample_inds:
    # randomly increasing timestamp, irrelevant for example
    ts = ts + pd.Timedelta(f'{np.random.randint(5, 10)}D')
    forecast_train.append(
        {'start': ts, 'target': sensor_data.iloc[start_ind: end_ind - prediction_len, :].values.transpose(),
         'freq': freq})

estimator = DeepVAREstimator(target_dim=19,
                             prediction_length=50,
                             context_length=250,
                             freq="1min",
                             trainer=GluonTrainer(epochs=10))
print('start training')
forecast_train = ListDataset(forecast_train, freq=freq, one_dim_target=False)
predictor = estimator.train(forecast_train)

I get this error:

  File "C:\Users\hashbangcoder\AppData\Roaming\Python\Python37\site-packages\gluonts\transform\_base.py", line 142, in __call__
    f"Reached maximum number of idle transformation calls.\n"
Exception: Reached maximum number of idle transformation calls.
This means the transformation looped over GLUONTS_MAX_IDLE_TRANSFORMS=100 inputs without returning any output.
This occurred in the following transformation:
gluonts.transform.split.InstanceSplitter(dummy_value=0.0, forecast_start_field="forecast_start", future_length=50, instance_sampler=gluonts.transform.sampler.ExpectedNumInstanceSampler(axis=-1, min_past=251, min_future=50, num_instances=1.0, total_length=0, n=0), is_pad_field="is_pad", lead_time=0, output_NTC=True, past_length=251, start_field="start", target_field="target", time_series_fields=["time_feat", "observed_values"])

I hope it's clear why I cannot concatenate all the sub-sequences together or interpolate values between sub-sequences, due to the irregular sampling of sensor_data. So my question boils down to: is there any way I can combine multiple multivariate time series in a single ListDataset?

hashbangCoder commented 3 years ago

Also @StatMixedML, my forecasting horizon is 50 and my time series are of length 250. Even with forecast horizon = 20, it's the same error. I suspect it's due to the way the sampler works with multiple time series in a ListDataset.

I'm thinking that if I concatenate all my sub-sequences into a single 2D target and use a custom sampler (subclassing InstanceSampler) that samples only at the indices corresponding to the starts of my sub-sequences, with exactly the same lengths, it should work? Something along the lines of the sketch below.
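
A rough, untested sketch of the idea, assuming the InstanceSampler interface takes the target array and returns an array of split indices, and where subseq_bounds is a hypothetical field holding my (start, end) index pairs into the concatenated target:

import numpy as np
from typing import List, Tuple
from gluonts.transform import InstanceSampler

class SubSequenceSampler(InstanceSampler):
    """Rough sketch: only emit split points that keep the whole window
    (min_past history + min_future forecast) inside one sub-sequence,
    so no training window straddles a sampling gap."""

    subseq_bounds: List[Tuple[int, int]] = []  # hypothetical field with my (start, end) pairs

    def __call__(self, ts: np.ndarray) -> np.ndarray:
        a, b = self._get_bounds(ts)  # overall valid range, as in ExpectedNumInstanceSampler
        indices = [
            end - self.min_future
            for start, end in self.subseq_bounds
            if start + self.min_past <= end - self.min_future  # window fits in this chunk
            and a <= end - self.min_future <= b
        ]
        return np.array(indices, dtype=int)

If the estimator exposes a train_sampler argument, something like this could be passed there; otherwise the transformation would presumably need to be overridden.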

StatMixedML commented 3 years ago

@hashbangCoder Are you using MultivariateGrouper and grouper_train to group the data? For an example of how to use it, see here, or the sketch below.
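
Roughly like this; a sketch based on the multivariate example notebook, where univariate_ds is assumed to be a ListDataset of 1-D targets and 19 is just the target dimension from your example:

from gluonts.dataset.multivariate_grouper import MultivariateGrouper

# sketch: stack several aligned univariate series into one multivariate target
grouper_train = MultivariateGrouper(max_target_dim=19)
multivariate_ds = grouper_train(univariate_ds)  # univariate_ds: ListDataset with one_dim_target=True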

mbohlkeschneider commented 3 years ago

Hi @hashbangCoder,

Thank you for the snippet. The transformation fails because the InstanceSplitter cannot draw long enough samples from your data. Your data has 250 timestamps, and with context_length=250 and prediction_length=50 you are asking it to draw samples of length 300. Thus, the transformation fails. You can set the parameter pick_incomplete=True in the DeepVAREstimator; then the samples will be padded. However, I would suggest reducing the context_length. 250 is quite high, and a model with a much lower context_length will probably do better and run a lot faster.
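
For example, something like this (sketch; pick whatever context_length suits your data):

estimator = DeepVAREstimator(target_dim=19,
                             prediction_length=50,
                             context_length=100,     # much lower than 250
                             freq="1min",
                             pick_incomplete=True,   # pad samples shorter than the full window
                             trainer=GluonTrainer(epochs=10))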

hashbangcoder-v2 commented 3 years ago

Hello @mbohlkeschneider, unfortunately my @hashbangCoder account has issues with its 2FA device and I'm unable to log in from there, so I'm using an older account.

I understand the issue is in ExpectedNumInstanceSampler, and I can lower my context_len and prediction_len and test it out.

        a, b = self._get_bounds(ts)
        window_size = b - a + 1

I do have a few follow-up questions on ExpectedNumInstanceSampler and the history_length used in DeepVAR (if you don't mind).

self.history_length = self.context_length + max(self.lags_seq)
mbohlkeschneider commented 3 years ago

You can set lags_seq=1 to use just the previous value as a lag. However, I would advise reducing the context length and keeping the lags.
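
Concretely, something like this (sketch; lags_seq is a list of lag indices, so the single-lag case is presumably [1]):

# sketch: with lags_seq=[1], history_length = context_length + max([1]) = context_length + 1
estimator = DeepVAREstimator(target_dim=19,
                             prediction_length=50,
                             context_length=100,
                             freq="1min",
                             lags_seq=[1],
                             trainer=GluonTrainer(epochs=10))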

hashbangcoder-v2 commented 3 years ago

Thanks a lot for your help. I will try this and see how well it works.

yash4gandhi commented 1 year ago

Hey! I am facing the same issue as well. Has it been resolved?