Thanks a lot for the proposal!
I'm a bit confused by the names, are `long` and `wide` common terms? If I understand it correctly, these are many-table or single-table approaches?
How would your approach work for models with custom columns? For example, some models use `past_feat_dynamic_real` for features that are only available for past data, but not for making predictions.
Maybe we can implement the initial versions in the nursery and migrate them once we feel we are happy with the design.
In the `WideDataset` example, maybe you could make `freq` optional, in case this can be taken from the DataFrame index itself.
Also one could think of a third format where target and features are coupled in the same DataFrame, and multiple such DataFrame objects are given (one for each entry in the dataset): this is exactly the situation that the long format will end up in, since a `.groupby` will be necessary there (yielding multiple DataFrame objects, one per group). Since it's necessary for the long-format case, why not expose it first-class and have the long format rely on it?
Also, we should be careful in the wide-format case not to trick the user into thinking that all columns are modeled jointly (i.e. a multivariate model).
My suggestion is to further split the effort, maybe leaving the `WideDataset` case aside at first.
Thanks for the comments @jaheba.
I'm a bit confused by the names, are long and wide common terms?
Yes, these are common terms. They are also called stacked (long) and unstacked (wide) data. If thinking only in terms of target values, we can either stack them on top of each other (long) or have them side by side (wide). See Wikipedia and the pandas functions `melt`, `pivot`, and `wide_to_long`.
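For illustration, here is a toy round-trip between the two shapes using those pandas functions (column names are just examples):

```python
import pandas as pd

# Wide: timestamps as index, one column per time series.
wide = pd.DataFrame(
    {"A": [1.0, 2.0], "B": [3.0, 4.0]},
    index=pd.date_range("2021-01-01", periods=2, freq="D"),
)

# Wide -> long: stack the columns into (timestamp, item_id, target) rows.
long = (
    wide.reset_index()
    .rename(columns={"index": "timestamp"})
    .melt(id_vars="timestamp", var_name="item_id", value_name="target")
)

# Long -> wide: pivot back to one column per series.
wide_again = long.pivot(index="timestamp", columns="item_id", values="target")
```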
If I understand it correctly, these are many-table or single-table approaches?
The long DataFrame is a single DataFrame containing all the data. In the wide case, the data is split into target and features, so we have multiple tables. As @lostella proposed, we can also think of a split by 'dataentry', which is essentially a mix of both.
How would your approach work for models with custom columns? For example, some models use past_feat_dynamic_real for features that are only available for past data, but not for making predictions.
This is handled the same way as the 'target' for prediction: there, the 'target' is also shorter than, e.g., 'dynamic_feat_real'. This only becomes relevant when `transform_data` is set to `True`.
* For the long-table: `past_feat_dynamic_real: List[str]`, and we remove future values when iterating through the dataset.
* For the wide-table: `past_feat_dynamic_real: List[pd.DataFrame]`, which has the same restrictions as the target, i.e. if `transform_data` is `True` we ignore future values.

Maybe we can implement the initial versions in the nursery and migrate them once we feel we are happy with the design.
Yes, sounds good!
Thanks for the comments @lostella.
In the WideDataset example, maybe you could make freq optional, in case this can be taken from the DataFrame index itself.
Yes, will do.
Also one could think of a third format where target and features are coupled in the same DataFrame, and multiple such DataFrame objects are given (one for each entry in the dataset): this is exactly the situation that the long format will end up in, since a .groupby will be necessary there (yielding multiple DataFrame objects, one per group). Since it's necessary for the long-format case, why not expose it first-class and have the long format rely on it?
The `long_to_dataentry` function will work on groupby objects. This is the common basis of what you propose as the 'third dataset' and the LongDataset. So, the LongDataset can inherit from the other, or we can have a standalone `long_to_dataentry` function that is used by both classes. I'll think about it.
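A minimal sketch of that shared-helper idea, assuming a long frame with illustrative column names `timestamp`, `item_id`, and `target` (the helper name is hypothetical):

```python
import pandas as pd

long_df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-01-01", "2021-01-02"] * 2),
    "item_id": ["A", "A", "B", "B"],
    "target": [1.0, 2.0, 3.0, 4.0],
})

def frame_to_dataentry(df: pd.DataFrame) -> dict:
    # Hypothetical helper: one per-item DataFrame -> one data entry.
    return {"start": df["timestamp"].iloc[0], "target": df["target"].to_numpy()}

# LongDataset case: .groupby yields one DataFrame per item ...
entries = [frame_to_dataentry(g) for _, g in long_df.groupby("item_id")]
# ... while the 'third format' would pass such per-item DataFrames directly.
```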
My suggestion is to further split the effort, maybe leaving the WideDataset case aside at first.
The data I am working on is actually in the wide format, so I am keen to work on that too :). But I'll split the implementation into two branches, one for wide and one for long.
Related to #418
Thanks for the effort Raj!
The long format seems to contain everything we need, and to me it looks simpler to grasp from a user's perspective than the wide format. What are the main motivations to support a wide format?
Thanks for the question @huibinshen
There are several reasons to split data into the wide format, some of which might become more interesting when you scale to many thousands of time series or very long time series. From my perspective:
* Major one: it splits data into different tables for target and features, so you can add, update, or remove single features without touching the others. However, you have to handle multiple tables now.
  * If you want to add, e.g., local weather as a dynamic feature, you just create a new wide DataFrame. No need to think about or touch the rest.
* Minor: it is more compact and has fewer redundancies.
  * The timestamp is not duplicated as many times as the number of time series, but only as many times as the number of dynamic features.
  * The same is true for item_id and static features. Static features are even O(1) in the length of the time series.
* When getting data from different data sources, it is quite natural for me to keep them separate.
Thanks for sharing it @RSNirwan! We need to weigh it against the downsides. I could currently think of the following:
Thanks again for the great work!
There are some issues with the WideDataset as proposed here. Thanks @lostella for pointing this out.
The case of multiple multivariate time series can be easily handled by the LongDataset by having multiple columns for the target and changing `target: str` to `target: Union[str, List[str]]`. This way, targets split over multiple columns in the long case can be considered multivariate.
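For illustration, a long frame with a two-dimensional target might then look like this (hypothetical column names):

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-01-01", "2021-01-02"]),
    "item_id": ["A", "A"],
    "target_dim0": [1.0, 2.0],
    "target_dim1": [10.0, 20.0],
})
# target=["target_dim0", "target_dim1"] would mark the entry as multivariate.
```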
Since in the wide case the data in multiple columns is considered univariate (as proposed above), it is hard to map multiple multivariate time series to the wide format.
Therefore, I would propose to have the LongDataset as the default DataFrame format. We can call it `DataFrameDataset`, which can be merged into the main package. I will work on the WideDataset anyway, because I need it. If we want to make it available to a broader audience, we can put it in the nursery first and merge it into the main package later, if needed.
Many users have tabular datasets in CSV, Parquet, Excel, ... . By having a Dataset class that works directly on tabular data, users do not need to convert their data to GluonTS data entries. That would make it much easier for new users to use GluonTS. Existing users can also skip the conversion part and directly start modeling.
User-facing API (Example)
Here is how a user can interact with GluonTS:
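For example, the flow could look roughly like this (a sketch only: `LongDataset` and its arguments are the names proposed here, and `estimator` stands in for any GluonTS estimator):

```python
import pandas as pd

df = pd.read_csv("data.csv")  # assumed columns: timestamp, item_id, target

dataset = LongDataset(
    df, target="target", timestamp="timestamp", item_id="item_id", freq="D"
)

predictor = estimator.train(dataset)          # any GluonTS estimator
forecasts = list(predictor.predict(dataset))  # works like any other Dataset
```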
Dataframe examples
The tabular data the user has can have many different formats. In the following I'll discuss two of them, the long and the wide format, which I guess will cover 99% of the cases.
Long-table
A DataFrame with timestamp, target and item_id columns.
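A minimal sketch of such a frame (two series A and B, values are illustrative):

```python
import pandas as pd

long_df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-01-01", "2021-01-02"]
    ),
    "item_id": ["A", "A", "B", "B"],
    "target": [1.0, 2.0, 3.0, 4.0],
})
```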
Wide-table
A DataFrame with timestamps as the index, where each column corresponds to a time series and the item_id is the column name.
Target
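For example (illustrative values):

```python
import pandas as pd

target = pd.DataFrame(
    {"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0]},
    index=pd.date_range("2021-01-01", periods=3, freq="D"),
)
```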
Dynamic features (cat and real)
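Each dynamic feature could be a wide DataFrame of its own, sharing the target's index and columns (a sketch; `price` is just an example feature name):

```python
import pandas as pd

# Real-valued dynamic feature; a categorical one would hold category codes.
price = pd.DataFrame(
    {"A": [0.1, 0.2, 0.3], "B": [0.4, 0.5, 0.6]},
    index=pd.date_range("2021-01-01", periods=3, freq="D"),
)
```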
Static features (cat and real)
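Static features need only one row per series, e.g. (hypothetical feature names):

```python
import pandas as pd

# One row per time series; O(1) in the series length.
static = pd.DataFrame(
    {"store_type": [0, 1], "size": [12.5, 7.0]},
    index=["A", "B"],  # item_id as index
)
```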
API
As in the example above, we can provide a LongDataset and a WideDataset.
For the LongDataset we have everything in one DataFrame, and the user only needs to provide information about what the values in a column represent (target, timestamp, item_id, feature):
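A sketch of what this could look like (the argument names are illustrative, not a fixed API):

```python
dataset = LongDataset(
    long_df,                      # the single long DataFrame
    target="target",              # column holding the values to forecast
    timestamp="timestamp",        # column holding the timestamps
    item_id="item_id",            # column identifying each series
    feat_dynamic_real=["price"],  # optional: feature columns
    freq="D",
)
```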
For the WideDataset we split targets and features into different DataFrames and provide the appropriate DataFrame directly as input (remember that timestamps are the index and each column corresponds to a time series):
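Correspondingly, a sketch for the wide case, reusing the frames sketched above (again with illustrative argument names):

```python
dataset = WideDataset(
    target=target,              # wide DataFrame of targets
    feat_dynamic_real=[price],  # list of wide feature DataFrames
    feat_static_cat=static,     # one row per series
    freq="D",                   # could become optional (taken from the index)
)
```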
Gluonts-internal representation
We can implement one Dataset class for the long format and one for the wide format. Inheriting from `gluonts.dataset.common.Dataset` and implementing `__iter__` and `__len__` will allow direct interaction with models and predictors. There is no need for the user to know about `DataEntry`. All conversions are taken care of by the new Dataset classes.

For the long format, `__iter__` is more or less `map(long_to_dataentry, df.groupby(item_id))`. For the wide format, `__iter__` is more or less `map(wide_to_dataentry, target.iteritems(), dynfeat1.iteritems(), statfeat1.iteritems(), ...)`. `long_to_dataentry` and `wide_to_dataentry` will contain all the logic needed for the conversion, handling missing values, and other kinds of pre/post processing.
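A minimal sketch of the long-format class under these assumptions (the real class would inherit from `gluonts.dataset.common.Dataset`; helper and field names are illustrative):

```python
from typing import Iterator

import pandas as pd

class LongDataset:
    def __init__(self, df: pd.DataFrame, target: str = "target",
                 timestamp: str = "timestamp", item_id: str = "item_id",
                 freq: str = "D"):
        self.df, self.freq = df, freq
        self.target, self.timestamp, self.item_id = target, timestamp, item_id

    def _long_to_dataentry(self, df: pd.DataFrame) -> dict:
        # Missing-value handling and other pre/post processing would go here.
        return {
            "start": pd.Period(df[self.timestamp].iloc[0], freq=self.freq),
            "target": df[self.target].to_numpy(),
        }

    def __iter__(self) -> Iterator[dict]:
        # More or less: map(long_to_dataentry, df.groupby(item_id)).
        for _, group in self.df.groupby(self.item_id):
            yield self._long_to_dataentry(group)

    def __len__(self) -> int:
        return self.df[self.item_id].nunique()
```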
Handling prediction-data (transform-data)
Prediction-data here is data where the target length is not the same as the dynamic feature length.
In case there are no dynamic features, we don't have to do anything in addition to the above-mentioned procedure. If dynamic features are present, the processing is slightly different for prediction-data. The flag `transform_data` (is there any standardized name for this?) will indicate whether the 'slightly different' processing is triggered.

Todos: