awslabs / gluonts

Probabilistic time series modeling in Python
https://ts.gluon.ai
Apache License 2.0

Provide better ways to integrate timeseries data (categorical/real etc) for wide dataframes into Dataset #2151

Open strakehyr opened 2 years ago

strakehyr commented 2 years ago

As mentioned in #2140, it would drastically improve accessibility to provide a way to pass a wide DF together with a dict or lists specifying feat_dynamic_cat, feat_static_cat, etc. It should reliably be faster than what I (and probably others) am using right now, which is:

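# df is the wide frame (indexed by time), df_long its long-format counterpart
for i in df_long.index:
    col = df_long.at[i, 'variable']
    # 1 if this column is listed as a dynamic categorical feature, else 0
    cat = 1 if col in feat_dynamic_cat else 0
    # position of the column in the wide frame, used to look up its static category
    pos = df.columns.get_loc(col)
    # target value at this row's timestamp
    trgt = df.at[df_long.at[i, 'Time'], outputs[0]]
    df_long.loc[i, 'target'] = trgt
    df_long.loc[i, 'feat_dynamic_cat'] = cat
    df_long.loc[i, 'feat_static_cat'] = feat_static_cats[pos]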

maybe .from_wide_dataframe could be the method?
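
For comparison, the row-by-row loop above can probably be replaced by column-wise pandas operations. The following is only a sketch under assumptions: the toy data, the use of `melt` to build `df_long`, and the name `static_by_column` are illustrative and not from the original post; the other names (`df`, `df_long`, `outputs`, `feat_dynamic_cat`, `feat_static_cats`) mirror the snippet.

```python
import pandas as pd

# Toy wide frame (assumed layout): one column per series, indexed by time.
df = pd.DataFrame(
    {
        "load": [1.0, 2.0, 3.0],
        "temp": [10.0, 11.0, 12.0],
        "holiday": [0.0, 1.0, 0.0],
    },
    index=pd.date_range("2022-01-01", periods=3, freq="D", name="Time"),
)
outputs = ["load"]              # first entry is the target column
feat_dynamic_cat = ["holiday"]  # columns treated as dynamic categorical
feat_static_cats = [0, 1, 2]    # one static category per column of df, by position

# Long format with columns Time, variable, value (assumed to match df_long).
df_long = df.reset_index().melt(id_vars="Time")

# Target value at each row's timestamp, looked up in the wide frame.
df_long["target"] = df_long["Time"].map(df[outputs[0]])

# 1 where the row's column is a dynamic categorical feature, else 0.
df_long["feat_dynamic_cat"] = df_long["variable"].isin(feat_dynamic_cat).astype(int)

# Static category of the row's column, via a name -> category lookup.
static_by_column = dict(zip(df.columns, feat_static_cats))
df_long["feat_static_cat"] = df_long["variable"].map(static_by_column)
```

This avoids one `.loc` assignment per row, which is usually the slow part on large frames.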

lostella commented 2 years ago

@strakehyr could you expand on this idea? Where would you think it is best to specify all the additional features? In separate dataframes?

strakehyr commented 2 years ago

So, if the dimensions of the wide DF are [num_timeseries, length], you should be able to provide feat_dynamic_cat, or feat_static_cat_1 (feat_static_cat_2, ...), in a DF of dimensions [num_timeseries x number of categorical qualities]. That way you mark every time series. The columns of this new DF should match those of the wide DF, and there should be a specific index name to differentiate feat_static_cat_N from feat_dynamic_cat.

strakehyr commented 2 years ago

Also maybe an index for the target, which would mark the target time series with 1 and the covariates with 0.
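
To make the idea concrete, here is a rough sketch of what such a marker frame might look like and how it could be read back. Everything here is an assumption for illustration: the frame name `roles`, the toy data, and the exact row labels are hypothetical, and nothing like this exists in GluonTS as far as this thread shows.

```python
import pandas as pd

# Toy wide frame: one column per series, indexed by time.
wide = pd.DataFrame(
    {
        "load": [1.0, 2.0, 3.0],
        "temp": [10.0, 11.0, 12.0],
        "holiday": [0.0, 1.0, 0.0],
    },
    index=pd.date_range("2022-01-01", periods=3, freq="D"),
)

# Hypothetical marker frame: same columns as the wide frame, one row per quality.
roles = pd.DataFrame(
    [
        [1, 0, 0],  # target: 1 marks the target series, 0 the covariates
        [0, 0, 1],  # feat_dynamic_cat: 1 marks dynamic categorical series
        [0, 1, 2],  # feat_static_cat_1: static category assigned to each series
    ],
    index=["target", "feat_dynamic_cat", "feat_static_cat_1"],
    columns=wide.columns,
)

# Read the markers back into plain Python structures.
is_target = roles.loc["target"] == 1
target_cols = is_target[is_target].index.tolist()

is_dynamic_cat = roles.loc["feat_dynamic_cat"] == 1
dynamic_cat_cols = is_dynamic_cat[is_dynamic_cat].index.tolist()

static_cat_1 = roles.loc["feat_static_cat_1"].to_dict()
```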

lostella commented 2 years ago

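So you would provide multiple dataframes, specifically:

  • a “target” one, shape time_length x num_series
  • one for static features, shape num_series x num_features (each column with its own dtype, so that for example we can distinguish easily between numerical vs categorical columns)
  • one for each dynamic feature, shape time_length x num_series

Is this what you mean?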

rsnirwan commented 2 years ago

I wrote down my thoughts on this in another issue, which I'll link below.

strakehyr commented 2 years ago

So you would provide multiple dataframes, specifically:

  • a “target” one, shape time_length x num_series
  • one for static features, shape num_series x num_features (each column with its own dtype, so that for example we can distinguish easily between numerical vs categorical columns)
  • one for each dynamic feature, shape time_length x num_series

Is this what you mean?

Indeed. Just a way for the Dataset to assign each time-series its own qualities starting from a wide DF.
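
A minimal sketch of the multi-dataframe layout discussed above, assuming a hypothetical helper `entries_from_wide` (not part of GluonTS). It turns a target frame, a static-feature frame, and a dict of dynamic-feature frames into one plain dictionary per series, using the field names GluonTS datasets use (`start`, `target`, `feat_static_cat`, `feat_static_real`, `feat_dynamic_real`). As a simplification, all dynamic features are treated as real-valued here.

```python
import numpy as np
import pandas as pd


def entries_from_wide(target, static=None, dynamic=None):
    """Hypothetical helper: build one dict per series from wide frames.

    target:  time_length x num_series
    static:  num_series x num_features, indexed by series name
    dynamic: dict of feature name -> frame of shape time_length x num_series
    """
    dynamic = dynamic or {}
    if static is not None:
        # Use each column's dtype to separate categorical from numerical features.
        cat_cols = [
            c for c in static.columns
            if isinstance(static[c].dtype, pd.CategoricalDtype)
        ]
        static_cat = pd.DataFrame(
            {c: static[c].cat.codes for c in cat_cols}, index=static.index
        )
        static_real = static.drop(columns=cat_cols).astype(float)

    entries = []
    for item_id in target.columns:
        entry = {
            "item_id": item_id,
            "start": target.index[0],
            "target": target[item_id].to_numpy(),
        }
        if static is not None:
            entry["feat_static_cat"] = static_cat.loc[item_id].tolist()
            entry["feat_static_real"] = static_real.loc[item_id].tolist()
        if dynamic:
            # Shape: num_dynamic_features x time_length
            entry["feat_dynamic_real"] = np.stack(
                [frame[item_id].to_numpy() for frame in dynamic.values()]
            )
        entries.append(entry)
    return entries


index = pd.date_range("2022-01-01", periods=4, freq="D")
target = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0, 4.0], "b": [5.0, 6.0, 7.0, 8.0]}, index=index
)
static = pd.DataFrame(
    {"region": pd.Categorical(["north", "south"]), "capacity": [10.0, 20.0]},
    index=["a", "b"],
)
dynamic = {
    "temperature": pd.DataFrame(
        {"a": [0.1, 0.2, 0.3, 0.4], "b": [0.5, 0.6, 0.7, 0.8]}, index=index
    )
}
entries = entries_from_wide(target, static, dynamic)
```

Keeping the static features in a single frame with per-column dtypes means the categorical/numerical split needs no extra arguments, which seems to be the point of the dtype remark in the quoted proposal.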