Data ranges
In numerical feature types like REAL, we could store descriptive statistics such as min/max/avg/std to increase the expressiveness of the schema. This way, we can:
- Use them for data validation at inference time. For example, a transformer can validate the feature data of received data points: when a feature value is not within the range defined by the min/max values, it can log the error accordingly, e.g. by incrementing an outlier counter/metric.
- Compare the stored training data distribution against distributions calculated over batches of inference requests, for example with a KL-divergence-based distance, and increment a skew/drift detection counter/metric when they diverge (see the sketch after this list).
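A minimal sketch of both ideas inside a transformer, assuming a hypothetical RealFeatureStats schema entry that carries min/max/avg/std plus a normalized training-time histogram (all names here are illustrative, not an existing API):

```python
import math
from dataclasses import dataclass
from typing import List

@dataclass
class RealFeatureStats:
    min: float
    max: float
    avg: float
    std: float
    histogram: List[float]  # normalized bin frequencies over [min, max] from training

def in_range(stats: RealFeatureStats, value: float) -> bool:
    """Training-time min/max defines the valid range at inference time."""
    return stats.min <= value <= stats.max

def kl_divergence(p: List[float], q: List[float], eps: float = 1e-9) -> float:
    """KL(p || q) between two histograms with identical binning."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def batch_drifted(stats: RealFeatureStats, batch: List[float],
                  threshold: float = 0.1) -> bool:
    """Histogram a batch of inference values and compare it to training."""
    bins = len(stats.histogram)
    width = (stats.max - stats.min) / bins or 1.0  # guard against min == max
    counts = [0] * bins
    for v in batch:
        counts[min(max(int((v - stats.min) / width), 0), bins - 1)] += 1
    total = sum(counts) or 1
    observed = [c / total for c in counts]
    return kl_divergence(observed, stats.histogram) > threshold

stats = RealFeatureStats(min=0.0, max=100.0, avg=42.0, std=12.0,
                         histogram=[0.1, 0.2, 0.4, 0.2, 0.1])
batch = [12.0, 55.0, 140.0, 33.0]                      # 140.0 is out of range
outliers = sum(not in_range(stats, v) for v in batch)  # -> outlier counter/metric
drifted = batch_drifted(stats, batch)                  # -> skew/drift counter/metric
```

Rather than rejecting requests outright, the transformer only counts violations, so monitoring can alert on the metrics without affecting serving.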
Similarly to the numerical features, store the distribution of the category_map, i.e. how frequently each category was observed during training, and reuse the same comparison for skew/drift detection.
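A hypothetical illustration, with category_freq standing in for however the training distribution ends up being stored next to the category_map:

```python
import math
from collections import Counter
from typing import List

category_map = {0: "red", 1: "green", 2: "blue"}         # encoded value -> label
category_freq = {"red": 0.5, "green": 0.3, "blue": 0.2}  # training-time frequencies

def category_drifted(batch_labels: List[str], threshold: float = 0.1,
                     eps: float = 1e-9) -> bool:
    """Compare per-category frequencies of an inference batch with training."""
    counts = Counter(batch_labels)
    total = sum(counts.values()) or 1
    kl = 0.0
    for category, trained in category_freq.items():
        observed = counts[category] / total
        kl += (observed + eps) * math.log((observed + eps) / (trained + eps))
    return kl > threshold

category_drifted(["blue", "blue", "blue", "red"])  # -> skew/drift counter/metric
```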
Data presence
In all feature types, define an attribute that specifies whether a feature is mandatory for inference. For example, if a particular feature has no missing values during training, we would most likely require it in the inference request. A transformer performing the data validation task can then treat a missing mandatory feature as an error and increment an anomaly detection counter/metric.
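A minimal sketch, assuming a hypothetical optional flag on the feature spec and an illustrative two-feature schema:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class FeatureSpec:
    name: str
    optional: bool  # False when the feature had no missing values at training time

SCHEMA = [FeatureSpec("age", optional=False), FeatureSpec("income", optional=True)]

def missing_required(request: Dict[str, float]) -> List[str]:
    """Names of mandatory features absent from an inference request."""
    return [f.name for f in SCHEMA if not f.optional and f.name not in request]

missing = missing_required({"income": 54000.0})
if missing:  # ["age"] here -> log it and increment the anomaly counter/metric
    anomaly_count = len(missing)
```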