Open jaheba opened 4 years ago
I like this! Further things that could be done with data models include:
Thanks, 1. 2. and 3. are good points.
Another thing we might think about is how we want to store the schema-definition with the dataset. Using a json-file or a python-file?
Also, should we allow to map multiple `T` into `CT`/`TC`? E.g. if I have multiple features and I want to add them to `feat_dynamic_real`, how would I do that?
> Also, should we allow to map multiple `T` into `CT`/`TC`? E.g. if I have multiple features and I want to add them to `feat_dynamic_real`, how would I do that?
I guess then we should store the reverse map:
name_map = {
    "feat_dynamic_real": "price"
}

which allows you to do

name_map = {
    "feat_dynamic_real": ["price", "other_feature"]
}
The way fields are combined I guess would be by stacking along the C axis anyway.
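A minimal sketch of what "stacking along the C axis" could look like; the `name_map` values and the `apply_name_map` helper are hypothetical, not part of gluon-ts:

```python
import numpy as np

# Hypothetical reverse map: each model field is built by stacking one or
# more dataset fields along the C (channel) axis, in the listed order.
name_map = {
    "feat_dynamic_real": ["price", "other_feature"],
}

entry = {
    "price": np.array([1.0, 1.1, 1.2]),          # shape (T,)
    "other_feature": np.array([0.5, 0.4, 0.3]),  # shape (T,)
}

def apply_name_map(entry, name_map):
    out = {}
    for model_field, source_fields in name_map.items():
        # Stack the 1-D (T,) series into a (C, T) array, preserving order.
        out[model_field] = np.stack([entry[f] for f in source_fields], axis=0)
    return out

converted = apply_name_map(entry, name_map)
print(converted["feat_dynamic_real"].shape)  # (2, 3)
```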
- The name_mapping could be assumed to be the identity in case it’s not specified
Or maybe, the default name_map could be inferred by the field types: fields having a T axis are dynamic, fields only having a C axis are static. If they are also described in terms of float vs int, then also real vs categorical can be inferred.
Edit. Almost: we still need to be able to distinguish between target and non-target tensors, for which a minimal name_map is required
Could we handle cardinalities as well with data models?
I think I’m against making things too implicit. Having the identity function as the default sounds like a good compromise.
And, shouldn't `C` imply a fixed size across all entries?
> And, shouldn't `C` imply a fixed size across all entries?
Yes, the length of the C axis is fixed across entries, but not across fields. Conversely, the length of the T axis is common across fields, but not across entries.
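The invariant can be illustrated with hypothetical shapes (field names and sizes are just examples):

```python
import numpy as np

# Two entries of a dataset. For each field, the C length is fixed across
# entries; within one entry, all dynamic fields share the same T length.
entry_1 = {
    "feat_dynamic_real": np.zeros((2, 100)),  # C=2, T=100
    "feat_static_cat": np.zeros((3,)),        # C=3
}
entry_2 = {
    "feat_dynamic_real": np.zeros((2, 150)),  # same C=2, different T=150
    "feat_static_cat": np.zeros((3,)),        # same C=3
}

# C is common across entries (per field) ...
assert entry_1["feat_dynamic_real"].shape[0] == entry_2["feat_dynamic_real"].shape[0]
# ... while T may differ between entries.
assert entry_1["feat_dynamic_real"].shape[1] != entry_2["feat_dynamic_real"].shape[1]
```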
Are `T` and `C` good names, or should we maybe think about more descriptive ones?
> Also, should we allow to map multiple `T` into `CT`/`TC`? E.g. if I have multiple features and I want to add them to `feat_dynamic_real`, how would I do that? I guess then we should store the reverse map:
>
> name_map = { "feat_dynamic_real": "price" }
>
> which allows you to do
>
> name_map = { "feat_dynamic_real": ["price", "other_feature"] }
>
> The way fields are combined I guess would be by stacking along the C axis anyway.
I think the way the fields are combined should be model dependent, i.e., what is the schema that each model expects. From a user's perspective I guess something like
class MyDataModel:
    sales: T[int]
    price: T[float]
    other_feature: T[float]

    name_mapping = {
        "sales": "target",
        "price": "dynamic_feat",
        "other_feature": "dynamic_feat"
    }
or
stacked_feats = [price, other_feature]  # or whatever type/format we want, e.g. np.concat()

class MyDataModel:
    sales: T[int]
    stacked_feats: CT[float]

    name_mapping = {
        "sales": "target",
        "stacked_feats": "dynamic_feat"
    }
should be equivalent. Then, given this input schema, each model should concatenate the same fields in whatever way is appropriate for the model (e.g. concatenate the `dynamic_feat` fields in `TC` or `CT` layout), and apply the appropriate conversion mappings.
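One way the two spellings could be made equivalent is a small grouping step; this is a hypothetical sketch (the `to_model_input` helper and a `CT` output layout are assumptions), not an existing gluon-ts API:

```python
import numpy as np

# Hypothetical: group dataset fields by the model field they map to and
# stack them along C, so both data-model spellings yield the same input.
def to_model_input(entry, name_mapping):
    grouped = {}
    for field, model_field in name_mapping.items():
        # dicts preserve insertion order (Python 3.7+), which fixes the
        # stacking order of fields mapped to the same model field.
        grouped.setdefault(model_field, []).append(entry[field])
    return {
        model_field: np.vstack(arrays)  # CT layout: one row per channel
        for model_field, arrays in grouped.items()
    }

entry = {
    "sales": np.arange(5.0),
    "price": np.ones(5),
    "other_feature": np.zeros(5),
}
mapping = {"sales": "target", "price": "dynamic_feat", "other_feature": "dynamic_feat"}
model_input = to_model_input(entry, mapping)
print(model_input["dynamic_feat"].shape)  # (2, 5)
```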
Apart from this, I believe that various validations (shapes of the fields, minimum required field names (e.g. `target` and `start`)), and maybe informative messages about the provided fields and the fields that a model uses (e.g. informing when a field is provided but the model ignores it - this goes along the lines of #329), are necessary to have and really easy to include, given that with this approach the dataset schema and the model input schema are clearly defined.
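A sketch of what such validation could look like; the required and model-used field sets here are illustrative assumptions, not the actual gluon-ts schema:

```python
import numpy as np

# Hypothetical validation: check required fields and point out fields
# that are provided but ignored by the model (cf. #329).
REQUIRED = {"target", "start"}
MODEL_USES = {"target", "start", "feat_dynamic_real"}

def validate(entry):
    missing = REQUIRED - entry.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    for field in sorted(entry.keys() - MODEL_USES):
        print(f"note: field {field!r} is provided but ignored by the model")

# Prints a note about the unused "extra" field.
validate({"target": np.zeros(10), "start": "2019-01-01", "extra": 1})
```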
A problem I see is that there is no explicit order when defining multiple instances for the same field using the dict approach.
Also, when we have different algorithms with different input-schemas, how should we support that? For example SageMaker DeepAR uses `cat`, while the gluon-ts version uses `feat_static_cat`. Would we want to have algorithm-dependent schemas for the input-data?
You can get "an" order from the first entry and stick to that.
At the moment all gluon-ts models follow the `FieldName` approach, so the name mapping is consistent among all algorithms (which is probably the best from a user's perspective). If we want to extend this to SageMaker DeepAR - which is another story - then we need to be careful with backwards compatibility, so probably the best would be to keep `cat`. However, we can always have an internal mapping that maps `cat` to `feat_static_cat` and use exactly the same logic as in gluon-ts. This may need some thought on what is the best approach but I think the overhead would be minimal.
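Such an internal mapping could be as simple as a field-rename pass; this is a sketch, and the exact set of SageMaker field names to cover is an assumption:

```python
# Hypothetical rename map bridging SageMaker DeepAR's field names to the
# gluon-ts FieldName convention.
SAGEMAKER_TO_GLUONTS = {
    "cat": "feat_static_cat",
    "dynamic_feat": "feat_dynamic_real",
}

def to_gluonts_names(entry):
    # Rename known fields; leave everything else untouched.
    return {SAGEMAKER_TO_GLUONTS.get(k, k): v for k, v in entry.items()}

print(to_gluonts_names({"target": [1, 2], "cat": [0]}))
# {'target': [1, 2], 'feat_static_cat': [0]}
```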
> You can get "an" order from the first entry and stick to that.
The problem I see is that it is not explicit to the user. Thus, I would prefer an approach where it is truly explicit.
I don't like that `FieldName` approach. I think each algorithm should essentially be free to use its own field names, although it still makes sense to be mostly consistent. Is there any benefit aside from consistency to enforce the same fields across algorithms?
> Is there any benefit aside from consistency to enforce the same fields across algorithms?
Keeping a "standard" set of field names allows you to re-use the same data model for potentially different algorithms.
This said: in the reverse mapping I proposed above the order is specified. The following
name_map = {
    "feat_dynamic_real": ["price", "other_feature"]
}
means that "feat_dynamic_real" is obtained by stacking "price" and "other_feature" in that order.
> Keeping a "standard" set of field names allows you to re-use the same data model for potentially different algorithms.
Right, but that would count towards consistency for me.
I also like the reverse mapping because it is unambiguous.
I like this. A few comments: how to handle the `start` field or other types, for instance. In my view the dataset schema -> model schema mapping should perhaps just be a utility that the user uses to transform their dataset (on the fly) before calling the model.
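Such a utility could be a lazy wrapper over the dataset; this is a sketch under the assumption that a dataset is an iterable of dict entries, and `rename_fields` is a hypothetical name:

```python
# Hypothetical on-the-fly transform: rename the fields of each entry
# according to name_mapping before the dataset reaches the model.
def rename_fields(dataset, name_mapping):
    for entry in dataset:
        yield {name_mapping.get(k, k): v for k, v in entry.items()}

dataset = [{"sales": [1, 2, 3], "start": "2019-01-01"}]
mapped = list(rename_fields(dataset, {"sales": "target"}))
print(sorted(mapped[0]))  # ['start', 'target']
```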
I like that, it makes things explicit: I think we should minimize hiding too many mechanisms behind the components' interfaces.
How and where do you define those transformations?
Hello, I have multiple (22) univariate time series and I want to use the DeepAR and DeepState estimators for forecasting. How can I convert them to a GluonTS-friendly data format?
Date | Time | A_1 | A_2 | A_3 | A_4 | A_5 | A_6 | A_7 | A_8 | A_9 | A_10 | A_11 | A_12 | A_13 | A_14 | A_15 | A_16 | A_17 | A_18 | A_19 | A_20 | A_21 | A_22 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
01-01-2019 | 04:05:00 | 0 | 0.651061 | 0 | 0.440445 | 0.409166 | 0.376522 | 0.263646 | 0 | 0.264463 | 0.22 | 0.007619 | 0.86442 | 0 | 0.161058 | 0.305602 | 0.416569 | 0.213269 | 0.511716 | 0 | 0.73544 | 0.316216 | 0 |
01-01-2019 | 04:10:00 | 0 | 0.653333 | 0 | 0.408271 | 0.365794 | 0.35942 | 0.270151 | 0 | 0.264463 | 0.23 | 0.001905 | 0.850519 | 0 | 0.153016 | 0.279832 | 0.382094 | 0.249052 | 0.493575 | 0 | 0.761421 | 0.333333 | 0 |
01-01-2019 | 04:15:00 | 0 | 0.653333 | 0 | 0.381645 | 0.384822 | 0.344928 | 0.268698 | 0 | 0.264463 | 0.27 | 0.001905 | 0.855574 | 0 | 0.157354 | 0.254902 | 0.359414 | 0.276247 | 0.47997 | 0 | 0.720284 | 0.330631 | 0 |
01-01-2019 | 04:20:00 | 0 | 0.63697 | 0 | 0.346143 | 0.445309 | 0.393913 | 0.275479 | 0 | 0.286501 | 0.246667 | 0.001905 | 0.842936 | 0 | 0.169735 | 0.192997 | 0.329153 | 0.297717 | 0.477702 | 0 | 0.722447 | 0.253153 | 0 |
01-01-2019 | 04:25:00 | 0 | 0.649242 | 0 | 0.311751 | 0.465614 | 0.391014 | 0.286115 | 0 | 0.30854 | 0.25 | 0.007619 | 0.846727 | 0 | 0.16582 | 0.201261 | 0.361624 | 0.340657 | 0.441421 | 0 | 0.716674 | 0.263063 | 0 |
@jaheba @lostella
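A hedged sketch of one way to answer this: turn the wide table above into one `{"target", "start"}` entry per series. The column names, date format, and 5-minute frequency are read off the example data; the resulting list can then be wrapped in `gluonts.dataset.common.ListDataset(entries, freq="5min")` (not imported here):

```python
import pandas as pd

# A tiny slice of the table above, for illustration.
df = pd.DataFrame({
    "Date": ["01-01-2019"] * 3,
    "Time": ["04:05:00", "04:10:00", "04:15:00"],
    "A_1": [0.0, 0.0, 0.0],
    "A_2": [0.651061, 0.653333, 0.653333],
})

# Combine Date and Time into a timestamp index; the data looks day-first.
index = pd.to_datetime(df["Date"] + " " + df["Time"], format="%d-%m-%Y %H:%M:%S")
start = index.iloc[0]

# One dict per univariate series (columns A_1 ... A_22 in the full table).
entries = [
    {"target": df[col].to_numpy(), "start": start}
    for col in df.columns
    if col.startswith("A_")
]
print(len(entries))  # 2
```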
Goal
At the moment we have no way to define what data-format a model supports.
We also assume that all algorithms use similar fields and that their datatypes are identical. This has the advantage that a single dataset can be used across multiple algorithms.
Still, to achieve more flexibility, it is desirable to have an explicit way to define what data-format a given model has. In other words, we want to decouple the user's input data-format from the model's input data-format.
Considerations
Field names
In GluonTS we use generic names to describe the function of a given field. For example, we use `target` to describe the time-series which should be predicted into the future.
However, in concrete cases it makes sense to have more descriptive naming. E.g., when dealing with sales-data, one wants to predict the `sales` for a product and would thus use such a name for the `target`-field. Similarly, the price could be encoded as a dynamic feature.
TC vs CT layout
Two-dimensional data can be laid out in two ways. With respect to time-series, we can either use a `TC`-layout, where for every time-point we have multiple values, or a `CT`-layout, where for every "channel" we have an independent time-series. In some contexts one layout might feel more natural than the other. Still, they are interchangeable.
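A toy example of the two layouts for the same data (three time-points, two channels):

```python
import numpy as np

# TC layout: one row per time-point, one column per channel.
tc = np.array([[1.0, 10.0],
               [2.0, 20.0],
               [3.0, 30.0]])  # shape (T, C) = (3, 2)

# CT layout: one row per channel; a transpose converts between the two.
ct = tc.T                     # shape (C, T) = (2, 3)

print(ct.shape)  # (2, 3)
```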
Proposal
Each algorithm defines a data-model it requires. We offer type-specifiers to describe what kind of data-layout the algorithm expects:
- `C`: a static (time-independent) array
- `T`: a single time-series
- `TC`: a time-series, containing multiple values at each time-point
- `CT`: a time-series for each channel
For example, DeepAR could define such a model like this:
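The code block referenced here appears to have been lost in extraction; a hypothetical reconstruction in the proposal's own notation might look like the following (the `T`/`C`/`CT` specifiers are part of the proposal, not an existing library, so the annotations are left unevaluated):

```python
from __future__ import annotations

# Hypothetical reconstruction of the lost snippet: the data-model that
# DeepAR could declare, using the proposed type-specifiers.
class DeepARDataModel:
    target: T[float]              # the series to be predicted
    feat_static_cat: C[int]       # static categorical features
    feat_dynamic_real: CT[float]  # one time-series per dynamic channel

print(sorted(DeepARDataModel.__annotations__))
# ['feat_dynamic_real', 'feat_static_cat', 'target']
```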
Then, given a dataset, we use a second model to describe the layout of the dataset:
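The dataset-side code block also seems to have been lost; following the sales example used throughout the thread, it might have looked roughly like this (again hypothetical, with unevaluated annotations):

```python
from __future__ import annotations

# Hypothetical dataset-side model: descriptive field names, plus the
# mapping onto the model's generic field names.
class SalesDataModel:
    sales: T[int]    # maps to "target"
    price: T[float]  # maps to "feat_dynamic_real"

    name_mapping = {
        "sales": "target",
        "price": "feat_dynamic_real",
    }

print(SalesDataModel.name_mapping["sales"])  # target
```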
We define transformations for `T -> {TC, CT}`, `TC -> {CT}` and `CT -> {TC}`, and a conversion function which does the mapping of the types:
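The transformation code itself appears to be missing from this excerpt; in terms of array shapes, the four transformations could be sketched as follows (function names are hypothetical):

```python
import numpy as np

# T -> TC / T -> CT: add a channel axis; TC <-> CT: transpose.
def t_to_tc(x):   # (T,) -> (T, 1)
    return x[:, np.newaxis]

def t_to_ct(x):   # (T,) -> (1, T)
    return x[np.newaxis, :]

def tc_to_ct(x):  # (T, C) -> (C, T)
    return x.T

def ct_to_tc(x):  # (C, T) -> (T, C)
    return x.T

series = np.array([1.0, 2.0, 3.0])
print(t_to_ct(series).shape)  # (1, 3)
```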
Given the input, we can then translate `sales: T -> target: T` and `price: T -> feat_dynamic_real: CT`.