awslabs / gluonts

Probabilistic time series modeling in Python
https://ts.gluon.ai
Apache License 2.0

Input Data-Format Definitions #384

Open jaheba opened 4 years ago

jaheba commented 4 years ago

Goal

At the moment we have no way to define what data-format a model supports.

We also assume that all algorithms use similar fields and that their datatypes are identical. This has the advantage that a single dataset can be used across multiple algorithms.

Still, to achieve more flexibility, it is desirable to have an explicit way to define what data-format a given model has. In other words, we want to decouple the user's input data-format from the model's input data-format.

Considerations

Field names

In GluonTS we use generic names to describe the function of a given field. For example, we use target to describe the time-series which should be predicted into the future.

However, in concrete cases it makes sense to have more descriptive naming. E.g., when dealing with sales-data, one wants to predict the sales for a product and thus use such a name for the target-field. Similarly, the price could be encoded as a dynamic feature:

{"sales": [10, 12, 11, "..."], "price": [5.99, 5.49, 5.99, "..."]}

TC vs CT layout

Two-dimensional data can be laid out in two ways. With respect to time-series, we can either use a TC-layout, where for every time-point we have multiple values:

xs = [(1a, 1b), (2a, 2b), ...]

Or use a CT-layout where for every "channel" we have independent time-series:

xs = ([1a, 2a, ...], [1b, 2b, ...])

In some contexts one layout might feel more natural than the other. Still, they are interchangeable.
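
For illustration (numpy arrays are an assumption here; the thread does not fix a backing type), the two layouts are transposes of each other:

import numpy as np

# TC-layout: one row per time-point, one column per channel
tc = np.array([[1.0, 5.99],
               [2.0, 5.49],
               [3.0, 5.99]])

# CT-layout: one row per channel, one column per time-point
ct = tc.transpose()

assert ct.shape == (2, 3)
assert (ct.transpose() == tc).all()  # the round-trip is lossless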

Proposal

Each algorithm defines a data-model it requires. We offer type-specifiers to describe what kind of data-layout the algorithm expects.

For example, DeepAR could define such a model like this:

class DeepARModel:
    target: T[float]
    feat_static_cat: Optional[C]
    feat_dynamic_real: Optional[CT[float]]

Then, given a dataset, we use a second model to describe the layout of the dataset:

class MyDataModel:
    sales: T[int]
    price: T[float]

    name_mapping = {
        "sales": "target",
        "price": "dynamic_feat"
    }
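
One hypothetical way to make specifiers like T[float] and CT[float] valid Python (purely illustrative; the names T, C, TC, CT come from the proposal, the typing.Generic plumbing is an assumption):

from typing import Generic, TypeVar

X = TypeVar("X")

class T(Generic[X]):
    """One value per time-point (1D, time axis only)."""

class C(Generic[X]):
    """One value per channel, constant over time (1D, channel axis only)."""

class TC(Generic[X]):
    """Time-major 2D layout, shape (time, channels)."""

class CT(Generic[X]):
    """Channel-major 2D layout, shape (channels, time)."""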

We define transformations for T -> {TC, CT}, TC -> {CT} and CT -> {TC}:

def ct_to_tc(xs):
    # (channels, time) -> (time, channels)
    return xs.transpose()

def tc_to_ct(xs):
    # (time, channels) -> (channels, time)
    return xs.transpose()

def t_to_tc(xs):
    # (time,) -> (time, 1)
    return xs.reshape((-1, 1))

def t_to_ct(xs):
    # (time,) -> (1, time)
    return xs.reshape((1, -1))

conv_map = {
    (TC, CT): tc_to_ct,
    (CT, TC): ct_to_tc,
    (T, CT): t_to_ct,
    (T, TC): t_to_tc,
}

And a conversion function which does the mapping of the types:

def convert(data, given, wanted):
    # input and output types are the same
    # thus, we can just return the input
    if given == wanted:
        return data

    fn = conv_map[given, wanted]
    return fn(data)

Given the input, we can then translate sales: T -> target: T and price: T -> feat_dynamic_real: CT.
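
Putting this together on the sample record from above (a sketch that assumes numpy arrays plus the convert function and specifier classes sketched earlier):

import numpy as np

record = {"sales": [10, 12, 11], "price": [5.99, 5.49, 5.99]}

# sales: T -> target: T (identical types, no conversion)
target = convert(np.asarray(record["sales"]), given=T, wanted=T)

# price: T -> feat_dynamic_real: CT (reshaped to a single channel-row)
feat_dynamic_real = convert(np.asarray(record["price"]), given=T, wanted=CT)
assert feat_dynamic_real.shape == (1, 3)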

lostella commented 4 years ago

I like this! Further things that could be done with data models include:

  1. Validating the provided model against the required one (i.e. the provided fields can be mapped and converted to the required ones)
  2. Validating the actual dataset (say, that a JSON file conforms to a specified model; see the sketch after this list)
  3. The name_mapping could be assumed to be the identity in case it’s not specified
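
Point 2 could look roughly like this (a hypothetical helper, assuming the data-model classes expose their fields via __annotations__ as in the proposal above; Optional handling is omitted for brevity):

def validate_entry(entry, model):
    # compare one dataset entry against a data-model class
    expected = model.__annotations__  # field name -> type-specifier
    problems = []
    for field in expected:
        if field not in entry:
            problems.append(f"missing field: {field!r}")
    for field in entry:
        if field not in expected:
            problems.append(f"unexpected field: {field!r}")
    return problems

validate_entry({"sales": [10, 12, 11]}, MyDataModel)
# -> ["missing field: 'price'"]
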
jaheba commented 4 years ago

Thanks, 1. 2. and 3. are good points.

Another thing we might think about is how we want to store the schema-definition with the dataset. Using a json-file or a python-file?

Also, should we allow mapping multiple T into CT/TC? E.g. if I have multiple features and I want to add them to feat_dynamic_real, how would I do that?

lostella commented 4 years ago

Also, should we allow mapping multiple T into CT/TC? E.g. if I have multiple features and I want to add them to feat_dynamic_real, how would I do that?

I guess then we should store the reverse map:

name_map = {
    "feat_dynamic_real": "price"
}

which allows you to do

name_map = {
    "feat_dynamic_real": ["price", "other_feature"]
}

The way fields are combined I guess would be by stacking along the C axis anyway.
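
The stacking itself could then be as simple as (a numpy sketch, reusing the field names from the example):

import numpy as np

entry = {"price": [5.99, 5.49, 5.99], "other_feature": [0.1, 0.2, 0.3]}
name_map = {"feat_dynamic_real": ["price", "other_feature"]}

# stack the listed T fields as rows -> CT-layout, shape (channels, time)
feat_dynamic_real = np.stack(
    [np.asarray(entry[name]) for name in name_map["feat_dynamic_real"]]
)
assert feat_dynamic_real.shape == (2, 3)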

lostella commented 4 years ago
  3. The name_mapping could be assumed to be the identity in case it’s not specified

Or maybe, the default name_map could be inferred from the field types: fields having a T axis are dynamic, fields only having a C axis are static. If they are also described in terms of float vs int, then real vs categorical can be inferred as well.

Edit. Almost: we still need to be able to distinguish between target and non-target tensors, for which a minimal name_map is required

Could we handle cardinalities as well with data models?

jaheba commented 4 years ago

I think I’m against making things too implicit. Having the identity function as the default sounds like a good compromise.

And, shouldn’t C imply fixed size across all entries?

lostella commented 4 years ago

And, shouldn’t C imply fixed size across all entries?

Yes, the length of the C axis is fixed across entries, but not across fields. Conversely, the length of the T axis is common across fields, but not across entries.
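
Concretely (an illustrative example, not from the thread; the two dynamic channels are made up):

# two entries of the same dataset
entry_1 = {
    "target": [10, 12, 11],                  # T-length 3, shared by this entry's fields
    "feat_dynamic_real": [[5.99, 5.49, 5.99],
                          [1.0, 1.0, 0.0]],  # C-length 2
}
entry_2 = {
    "target": [7, 8],                        # T-length 2: may differ across entries
    "feat_dynamic_real": [[6.49, 6.49],
                          [0.0, 1.0]],       # C-length 2: fixed across entries
}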

jaheba commented 4 years ago

Are T and C good names, or should we maybe think about more descriptive ones?

benidis commented 4 years ago

Also, should we allow mapping multiple T into CT/TC? E.g. if I have multiple features and I want to add them to feat_dynamic_real, how would I do that?

I guess then we should store the reverse map:

name_map = {
    "feat_dynamic_real": "price"
}

which allows you to do

name_map = {
    "feat_dynamic_real": ["price", "other_feature"]
}

The way fields are combined I guess would be by stacking along the C axis anyway.

I think the way the fields are combined should be model-dependent, i.e., determined by the schema that each model expects. From a user's perspective I guess something like

class MyDataModel:
    sales: T[int]
    price: T[float]
    other_feature: T[float]

    name_mapping = {
        "sales": "target",
        "price": "dynamic_feat",
        "other_feature": "dynamic_feat"
    }

or

stacked_feats = [price, other_feature]  # or whatever type/format we want, e.g. np.concat()

class MyDataModel:
    sales: T[int]
    stacked_feats: CT[float]

    name_mapping = {
        "sales": "target",
        "stacked_feats": "dynamic_feat"
    }

should be equivalent. Then, given this input schema, each model should concatenate the same fields in whatever way is appropriate for it (e.g. concatenate the "feat_dynamic_real" fields in TC- or CT-layout), and apply the appropriate conversion mappings.

Apart from this, I believe that various validations (shapes of the fields, minimum required field names (e.g. target and start)), and maybe informative messages about the provided fields and the fields that a model uses (e.g. informing the user when a field is provided but the model ignores it - this goes along the lines of #329), are necessary to have and really easy to include, given that with this approach the dataset schema and the model input schema are clearly defined.
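
The "field provided but ignored" message could be a simple set comparison (a hypothetical helper; the field names below are taken from the examples above):

def report_field_usage(provided, used):
    # warn about provided-but-ignored fields, flag expected-but-missing ones
    for name in sorted(provided - used):
        print(f"note: field {name!r} was provided but the model ignores it")
    for name in sorted(used - provided):
        print(f"warning: the model expects field {name!r} but it was not provided")

report_field_usage(
    provided={"target", "feat_dynamic_real", "item_id"},
    used={"target", "feat_dynamic_real", "feat_static_cat"},
)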

jaheba commented 4 years ago

A problem I see is that there is no explicit order when defining multiple instances for the same field using the dict approach.

Also, when we have different algorithms with different input-schemas, how should we support that? For example SageMaker DeepAR uses cat, while the gluon-ts version uses feat_static_cat. Would we want to have algorithm-dependent schemas for the input-data?

benidis commented 4 years ago

You can get "an" order from the first entry and stick to that.

At the moment all gluon-ts models follow the FieldName approach, so the name mapping is consistent among all algorithms (which is probably the best from a user's perspective). If we want to extend this to SageMaker DeepAR - which is another story - then we need to be careful with backwards compatibility, so it would probably be best to keep cat. However, we can always have an internal mapping that maps cat to feat_static_cat and use exactly the same logic as in gluon-ts. This may need some thought on what the best approach is, but I think the overhead would be minimal.

jaheba commented 4 years ago

You can get "an" order from the first entry and stick to that.

The problem I see is that it is not explicit to the user. Thus, I would prefer an approach where it is truly explicit.


I don't like the FieldName approach. I think each algorithm should essentially be free to use its own field names, although it still makes sense to be mostly consistent. Is there any benefit, aside from consistency, to enforcing the same fields across algorithms?

lostella commented 4 years ago

Is there any benefit, aside from consistency, to enforcing the same fields across algorithms?

Keeping a "standard" set of field names allows you to re-use the same data model for potentially different algorithms.

This said: in the reverse mapping I proposed above the order is specified. The following

name_map = {
    "feat_dynamic_real": ["price", "other_feature"]
}

means that "feat_dynamic_real" is obtained by stacking "price" and "other_feature" in that order.

jaheba commented 4 years ago

Keeping a "standard" set of field names allows you to re-use the same data model for potentially different algorithms.

Right, but that would count towards consistency for me.

I also like the reverse mapping because it is unambiguous.

vafl commented 4 years ago

I like this. A few comments:

lostella commented 4 years ago

In my view the dataset schema -> model schema mapping should perhaps just be a utility that the user uses to transform his dataset (on the fly) before calling the model.

I like that, it makes things explicit: I think we should avoid hiding too many mechanisms behind the components' interfaces.

jaheba commented 4 years ago

How and where do you define those transformations?

parimuns commented 4 years ago

Hello, I have multiple (22) univariate time series and I want to use the DeepAR and DeepState estimators for forecasting. How can I convert them to a GluonTS-friendly data format?

Date Time A_1 A_2 A_3 A_4 A_5 A_6 A_7 A_8 A_9 A_10 A_11 A_12 A_13 A_14 A_15 A_16 A_17 A_18 A_19 A_20 A_21 A_22
01-01-2019 04:05:00 0 0.651061 0 0.440445 0.409166 0.376522 0.263646 0 0.264463 0.22 0.007619 0.86442 0 0.161058 0.305602 0.416569 0.213269 0.511716 0 0.73544 0.316216 0
01-01-2019 04:10:00 0 0.653333 0 0.408271 0.365794 0.35942 0.270151 0 0.264463 0.23 0.001905 0.850519 0 0.153016 0.279832 0.382094 0.249052 0.493575 0 0.761421 0.333333 0
01-01-2019 04:15:00 0 0.653333 0 0.381645 0.384822 0.344928 0.268698 0 0.264463 0.27 0.001905 0.855574 0 0.157354 0.254902 0.359414 0.276247 0.47997 0 0.720284 0.330631 0
01-01-2019 04:20:00 0 0.63697 0 0.346143 0.445309 0.393913 0.275479 0 0.286501 0.246667 0.001905 0.842936 0 0.169735 0.192997 0.329153 0.297717 0.477702 0 0.722447 0.253153 0
01-01-2019 04:25:00 0 0.649242 0 0.311751 0.465614 0.391014 0.286115 0 0.30854 0.25 0.007619 0.846727 0 0.16582 0.201261 0.361624 0.340657 0.441421 0 0.716674 0.263063 0

@jaheba @lostella
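
The usual conversion is one dataset entry per series (a minimal sketch, assuming the table above is loaded as string columns in a pandas DataFrame df; ListDataset is GluonTS's in-memory dataset type, and the timestamps are at a 5-minute frequency):

import pandas as pd
from gluonts.dataset.common import ListDataset

# build a datetime index from the Date and Time columns
index = pd.to_datetime(df["Date"] + " " + df["Time"], dayfirst=True)

dataset = ListDataset(
    [
        {"start": index[0], "target": df[col].values}
        for col in df.columns
        if col.startswith("A_")  # one entry per series A_1 .. A_22
    ],
    freq="5min",
)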