awslabs / gluonts

Probabilistic time series modeling in Python
https://ts.gluon.ai
Apache License 2.0

Pandas-based datasets #1930

Closed rsnirwan closed 2 years ago

rsnirwan commented 2 years ago

Many users have tabular datasets in CSV, Parquet, Excel, etc. A Dataset class that works directly on tabular data would spare users the conversion to the GluonTS DataEntry format. That would make it much easier for new users to get started with GluonTS, and existing users could skip the conversion step and start modeling directly.

User-facing API (Example)

Here is how a user can interact with gluonts:

import pandas as pd
from gluonts.model.deepar import DeepAREstimator
from gluonts.dataset.common import LongDataset, WideDataset # see long/wide-table definition below

df = pd.read_csv("long_df.csv")
dataset = LongDataset(data=df, target="target", timestamp="time", item_id="item", freq="1h")
# dataset = WideDataset(data=df, freq="1h")

model = DeepAREstimator(freq="1h", prediction_length=20)
predictor = model.train(dataset)
predictions = list(predictor.predict(dataset))

Dataframe examples

The tabular data a user has can come in many different formats. In the following I'll discuss two of them, the long and the wide format, which should cover the vast majority of cases.

Long-table

A DataFrame with timestamp, target and item_id columns.

time                 target  item  stat_cat_1  dyn_real_1
1750-01-01 00:00:00  -0.21   A     0            0.79
1750-01-01 01:00:00  -0.33   A     0            0.59
1750-01-01 02:00:00  -0.33   A     0            0.39
1750-01-01 01:00:00  -1.24   B     1           -0.60
1750-01-01 02:00:00  -1.37   B     1           -0.91

Wide-table

A DataFrame with timestamps as index and each column corresponds to a time series. item_id is the column name.

Target

                     A      B
1750-01-01 00:00:00  -0.21  NaN
1750-01-01 01:00:00  -0.33  1.94
1750-01-01 02:00:00  -0.33  2.28

Dynamic features (cat and real)

Dynamic features are provided in separate DataFrames of the same size as the target DataFrame. For multiple features, multiple DataFrames are provided.

                     A     B
1750-01-01 00:00:00  0.79  NaN
1750-01-01 01:00:00  0.59  -0.60
1750-01-01 02:00:00  0.39  -0.91

Static features (cat and real)

Static features are also provided in a separate DataFrame. For multiple features we have multiple rows in the same DataFrame.

              A  B
static_cat_1  0  1

API

As in the example above we can provide a LongDataset and a WideDataset.

For the LongDataset we have everything in one DataFrame, and the user only needs to provide information about what the values in each column represent (target, timestamp, item_id, feature):

dataset = LongDataset(
  data: pd.DataFrame,
  target: str,
  timestamp: str,
  item_id: str,
  freq: str,
  feat_dynamic_real: List[str] = None,
  feat_dynamic_cat: List[str] = None,
  feat_static_real: List[str] = None,
  feat_static_cat: List[str] = None,
  transform_data: bool = False,  # see the section 'handling prediction-data' below
)

For the WideDataset we split targets and features into different dataframes and provide the appropriate dataframe directly as input (remember that timestamps are the index and each column corresponds to a time series):

dataset = WideDataset(
  target: pd.DataFrame,
  freq: str,
  feat_dynamic_real: List[pd.DataFrame] = None,
  feat_dynamic_cat: List[pd.DataFrame] = None,
  feat_static_real: List[pd.DataFrame] = None,  # List[pd.Series] is also fine
  feat_static_cat: List[pd.DataFrame] = None,  # List[pd.Series] is also fine
  transform_data: bool = False, # see the section 'handling prediction-data' below
)
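
For illustration, here is how the wide-format inputs from the tables above might be assembled and passed in. This is a sketch against the proposed API; neither WideDataset nor its arguments exist in gluonts yet:

import pandas as pd
from gluonts.dataset.common import WideDataset  # proposed above, hypothetical

index = pd.date_range("1750-01-01", periods=3, freq="1h")
target = pd.DataFrame(
    {"A": [-0.21, -0.33, -0.33], "B": [None, 1.94, 2.28]}, index=index
)
dyn_real_1 = pd.DataFrame(
    {"A": [0.79, 0.59, 0.39], "B": [None, -0.60, -0.91]}, index=index
)
static_cat_1 = pd.DataFrame(
    [[0, 1]], columns=["A", "B"], index=["static_cat_1"]
)

dataset = WideDataset(
    target=target,
    freq="1h",
    feat_dynamic_real=[dyn_real_1],
    feat_static_cat=[static_cat_1],
)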

Gluonts-internal representation

We can implement one Dataset class for the long format and one for the wide format. Inheriting from gluonts.dataset.common.Dataset and implementing __iter__ and __len__ will allow direct interaction with models and predictors. There is no need for the user to know about DataEntry. All conversions are taken care of by the new Dataset classes.

For the long format, __iter__ is more or less map(long_to_dataentry, df.groupby(item_id)). For the wide format, __iter__ is more or less map(wide_to_dataentry, target.iteritems(), dynfeat1.iteritems(), statfeat1.iteritems(), ...).

long_to_dataentry and wide_to_dataentry will contain all the logic needed for the conversion, handling missing values, and other types of pre- and post-processing.
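
A rough sketch of what the long-format class could look like (simplified to a univariate target with no features and no missing-value handling; the exact DataEntry field conventions are an assumption and depend on the gluonts version):

import pandas as pd
from gluonts.dataset.common import Dataset

class LongDataset(Dataset):
    def __init__(self, data, target, timestamp, item_id, freq):
        self.data = data
        self.target = target
        self.timestamp = timestamp
        self.item_id = item_id
        self.freq = freq

    def __iter__(self):
        # One DataEntry per item, built from the corresponding group.
        for item, group in self.data.groupby(self.item_id):
            group = group.sort_values(self.timestamp)
            yield {
                # pd.Period vs. pd.Timestamp depends on the gluonts version
                "start": pd.Period(group[self.timestamp].iloc[0], freq=self.freq),
                "target": group[self.target].to_numpy(),
                "item_id": item,
            }

    def __len__(self):
        return self.data[self.item_id].nunique()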

Handling prediction-data (transform-data)

Prediction data here is data where the target length is not the same as the dynamic feature length.

If there are no dynamic features, we don't have to do anything in addition to the procedure described above. If dynamic features are present, the processing is slightly different for prediction data. The flag transform_data (is there any standardized name for this?) indicates whether this slightly different processing is triggered.
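
To make the mismatch concrete, here is a sketch of what a single entry looks like at training time versus prediction time (made-up numbers, prediction_length = 2):

import numpy as np

# Training entry: target and dynamic features cover the same T steps.
train_entry = {
    "target": np.array([-0.21, -0.33, -0.33]),            # length T = 3
    "feat_dynamic_real": np.array([[0.79, 0.59, 0.39]]),  # shape (1, T)
}

# Prediction entry: the features must extend prediction_length steps
# beyond the last known target value, so the lengths no longer match.
pred_entry = {
    "target": np.array([-0.21, -0.33, -0.33]),                        # length T
    "feat_dynamic_real": np.array([[0.79, 0.59, 0.39, 0.41, 0.44]]),  # shape (1, T + 2)
}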

Todos:

jaheba commented 2 years ago

Thanks a lot for the proposal!

I'm a bit confused by the names, are long and wide common terms? If I understand it correctly, these are many-table or single-table approaches?

How would your approach work for models with custom columns? For example, some models use past_feat_dynamic_real for features that are only available for past data, but not for making predictions.

Maybe we can implement the initial versions in the nursery and migrate it once we feel we are happy with the design.

lostella commented 2 years ago

In the WideDataset example, maybe you could make freq optional, in case this can be taken from the DataFrame index itself.
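
For instance (a sketch; pd.infer_freq returns None when the index is irregular, so an explicit freq would still be needed in that case):

import pandas as pd

def resolve_freq(target: pd.DataFrame, freq: str = None) -> str:
    # Fall back to the DatetimeIndex when freq is not given explicitly.
    return freq if freq is not None else pd.infer_freq(target.index)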

Also one could think of a third format where target and features are coupled in the same DataFrame, and multiple such DataFrame objects are given (one for each entry in the dataset): this is exactly the situation that the long format will end up in, since a .groupby will be necessary there (yielding multiple DataFrame objects, one per group). Since it's necessary for the long-format case, why not expose it first-class and have the long format rely on it?
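
For illustration, such a third format would simply be an iterable of per-item DataFrames, each coupling target and features (hypothetical column names, values from the examples above):

import pandas as pd

df_a = pd.DataFrame(
    {"target": [-0.21, -0.33, -0.33], "dyn_real_1": [0.79, 0.59, 0.39]},
    index=pd.date_range("1750-01-01", periods=3, freq="1h"),
)
df_b = pd.DataFrame(
    {"target": [-1.24, -1.37], "dyn_real_1": [-0.60, -0.91]},
    index=pd.date_range("1750-01-01 01:00", periods=2, freq="1h"),
)

dataset = [df_a, df_b]  # one DataFrame per entry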

Also, we should be careful in the wide-format case not to trick the user into thinking that all columns are modeled jointly (i.e. a multivariate model).

My suggestion is to further split the effort, maybe leaving the WideDataset case aside at first.

rsnirwan commented 2 years ago

Thanks for the comments @jaheba .

I'm a bit confused by the names, are long and wide common terms?

Yes, these are common terms. They are also called stacked (long) and unstacked (wide) data. Thinking only in terms of target values, we can either stack them on top of each other (long) or place them side by side (wide). See: wikipedia and the pandas functions melt, pivot, and wide_to_long.
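
As a quick illustration with the running example, pandas converts between the two formats directly:

import pandas as pd

wide = pd.DataFrame(
    {"A": [-0.21, -0.33, -0.33], "B": [None, 1.94, 2.28]},
    index=pd.date_range("1750-01-01", periods=3, freq="1h"),
)

# wide -> long: stack the per-item columns on top of each other
long = wide.reset_index().melt(id_vars="index", var_name="item", value_name="target")

# long -> wide: one column per item again
wide_again = long.pivot(index="index", columns="item", values="target")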

If I understand it correctly, these are many-table or single-table approaches?

The long DataFrame is a single DataFrame containing all the data. In the wide format the data is split into target and features, so here we have multiple tables. As @lostella proposed, we can also think of a split by 'dataentry', which is essentially a mix of both.

How would your approach work for models with custom columns? For example, some models use past_feat_dynamic_real for features that are only available for past data, but not for making predictions.

This is the same situation as the 'target' at prediction time: there, the 'target' is also shorter than, e.g., 'feat_dynamic_real'. This only becomes relevant when transform_data is set to True.

For long-table:

For wide-table:

Maybe we can implement the initial versions in the nursery and migrate it once we feel we are happy with the design.

Yes. Sounds good!

rsnirwan commented 2 years ago

Thanks for the comments @lostella

In the WideDataset example, maybe you could make freq optional, in case this can be taken from the DataFrame index itself.

Yes, will do.

Also one could think of a third format where target and features are coupled in the same DataFrame, and multiple such DataFrame objects are given (one for each entry in the dataset): this is exactly the situation that the long format will end up in, since a .groupby will be necessary there (yielding multiple DataFrame objects, one per group). Since it's necessary for the long-format case, why not expose it first-class and have the long format rely on it?

The long_to_dataentry function will work on groupby objects. This is the common basis of what you propose as the 'third dataset' and the LongDataset. So the LongDataset can inherit from the other, or we can have a standalone long_to_dataentry function that is used by both classes. I'll think about it.

My suggestion is to further split the effort, maybe leaving the WideDataset case aside at first.

The data I am working on is actually in the wide format, so I am keen to work on that too :) . But I'll split the implementation into two branches, one for wide and one for long.

lostella commented 2 years ago

Related to #418

huibinshen commented 2 years ago

Thanks for the effort Raj!

The long format seems to contain everything we need, and it looks simpler to grasp from a user's perspective than the wide format. What are the main motivations for supporting a wide format?

rsnirwan commented 2 years ago

Thanks for the question @huibinshen

There are several reasons to split the data in the wide format, some of which become more relevant when scaling to many thousands of time series or very long time series. From my perspective:

* Major one: it splits the data into different tables for target and features, so you can add, update, or remove single features without touching the others. However, you have to handle multiple tables now.
  * If you want to add, e.g., local weather as a dynamic feature, you just create a new wide DataFrame. No need to think about or touch the rest.

* Minor: it is more compact and has fewer redundancies.

  * The timestamp is duplicated once per dynamic feature rather than once per time series.
  * The same is true for item_id and static features. Static features are even O(1) in the length of the time series.

* When getting data from different data sources, it's quite natural for me to keep them separate.

huibinshen commented 2 years ago

Thanks for the question @huibinshen

There are several reasons to split the data in the wide format, some of which become more relevant when scaling to many thousands of time series or very long time series. From my perspective:

* Major one: it splits the data into different tables for target and features, so you can add, update, or remove single features without touching the others. However, you have to handle multiple tables now.
  * If you want to add, e.g., local weather as a dynamic feature, you just create a new wide DataFrame. No need to think about or touch the rest.

* Minor: it is more compact and has fewer redundancies.

  * The timestamp is duplicated once per dynamic feature rather than once per time series.
  * The same is true for item_id and static features. Static features are even O(1) in the length of the time series.

* When getting data from different data sources, it's quite natural for me to keep them separate.

Thanks for sharing it @RSNirwan! We need to weigh this against the downsides. I can currently think of the following:

  1. It is more error-prone, since extra effort is needed to make sure the different data frames are aligned correctly.
  2. A bit more complexity for users, who have to manage multiple data frames instead of just one.

Thanks again for the great work!

rsnirwan commented 2 years ago

There are some issues with the WideDataset as proposed here. Thanks @lostella for pointing this out.

The case of multiple multivariate time series can easily be handled by the LongDataset by having multiple columns for the target and changing target: str to target: Union[str, List[str]]. This way, targets split over multiple columns in the long case can be considered multivariate.

Since in the wide case the data in multiple columns is considered univariate (as proposed above), it is hard to map multiple multivariate time series to the wide format.
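
For illustration, a multivariate series in the long format would just use one target column per dimension, passed via the extended target argument of the proposed LongDataset (hypothetical column names, made-up values):

import pandas as pd

df = pd.DataFrame({
    "time": pd.date_range("1750-01-01", periods=3, freq="1h"),
    "target_1": [-0.21, -0.33, -0.33],  # first target dimension
    "target_2": [1.10, 1.20, 1.15],     # second target dimension
    "item": ["A", "A", "A"],
})

# Each group would then map to a DataEntry whose "target" is a (2, T) array.
dataset = LongDataset(
    data=df, target=["target_1", "target_2"], timestamp="time", item_id="item", freq="1h"
)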

Therefore, I would propose making the LongDataset the default DataFrame format. We can call it DataFrameDataset and merge it into the main package. I will work on the WideDataset anyway, because I need it. If we want to make it available to a broader audience, we can put it in the nursery first and merge it into the main package later, if needed.