SeldonIO / ml-prediction-schema

Generic schema structure for machine learning model predictions
Apache License 2.0

Data ranges and presence #4

Open theofpa opened 3 years ago

theofpa commented 3 years ago

Data ranges

For numerical feature types such as REAL, the schema could carry descriptive statistics (min, max, mean, std) to make it more expressive. This would let us:

  1. Use them for data validation at inference time. For example, a transformer can validate the features of each received data point. When a feature falls outside the range defined by its min/max values, the transformer can log the error, for example by incrementing an outlier counter/metric.
  2. Use the distribution of the training data to compare against distributions computed over batches of inference requests, for example with a KL-divergence-based distance that feeds a skew/drift detection counter/metric.
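To make the first point concrete, here is a minimal sketch of a range check against min/max statistics stored in the schema. The `RealFeature` class, its field names, and the `count_outliers` helper are hypothetical and only illustrate the idea; they are not part of the current schema.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RealFeature:
    """Hypothetical schema entry for a REAL feature with training-time statistics."""
    name: str
    min: float
    max: float
    mean: float
    std: float


def count_outliers(features: List[RealFeature], row: Dict[str, float]) -> Dict[str, int]:
    """Return 1 for each feature whose value lies outside [min, max], else 0."""
    counts: Dict[str, int] = {}
    for feature in features:
        value = row.get(feature.name)
        if value is None:
            # Missing values are a presence problem, not a range problem.
            continue
        counts[feature.name] = int(not (feature.min <= value <= feature.max))
    return counts


# Example statistics are made up for illustration.
schema = [RealFeature(name="age", min=18.0, max=90.0, mean=41.5, std=12.3)]
print(count_outliers(schema, {"age": 120.0}))
```

A transformer could add these per-feature counts to an outlier counter/metric instead of rejecting the request outright.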

Similarly, for categorical features, store the distribution of the values in the category_map.
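As a rough sketch of the drift check for a categorical feature, the snippet below compares the category frequencies of an inference batch against a stored training distribution using KL divergence. The function names, the smoothing constant, the example frequencies, and the 0.1 threshold are all assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def batch_distribution(values: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each category in a batch of inference requests."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def kl_divergence(p: Dict[str, float], q: Dict[str, float], eps: float = 1e-9) -> float:
    """KL(p || q) over the union of categories.

    Both distributions are smoothed with eps and renormalised so that
    categories missing from one side do not cause log(0) or division by zero.
    """
    keys = sorted(set(p) | set(q))
    p_smooth = [p.get(k, 0.0) + eps for k in keys]
    q_smooth = [q.get(k, 0.0) + eps for k in keys]
    p_total, q_total = sum(p_smooth), sum(q_smooth)
    return sum(
        (pk / p_total) * math.log((pk / p_total) / (qk / q_total))
        for pk, qk in zip(p_smooth, q_smooth)
    )


# Hypothetical training distribution, as it might be stored next to category_map.
training = {"red": 0.5, "green": 0.3, "blue": 0.2}
batch = ["red"] * 9 + ["blue"]

drift = kl_divergence(batch_distribution(batch), training)
DRIFT_THRESHOLD = 0.1  # assumed value; would need tuning per feature
print(drift > DRIFT_THRESHOLD)
```

The same comparison could apply to numerical features after binning them into a histogram, which would require storing the bin edges and frequencies in the schema as well.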

Data presence

For all feature types, define an attribute that specifies whether the feature is mandatory for inference. For example, if a feature has no missing values at training time, we would most probably want to require it in the inference request. A transformer performing the data validation task can handle a missing mandatory feature and increment an anomaly detection counter/metric.
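A minimal sketch of the presence check, assuming a hypothetical `required` attribute on each feature entry (the attribute name and the dictionary layout are illustrative, not an existing part of the schema):

```python
from typing import Any, Dict, List

# Hypothetical feature entries; "required" is the proposed presence attribute.
features: List[Dict[str, Any]] = [
    {"name": "age", "type": "REAL", "required": True},
    {"name": "income", "type": "REAL", "required": False},
]


def missing_required(features: List[Dict[str, Any]], request: Dict[str, Any]) -> List[str]:
    """Names of required features that are absent or None in the request."""
    return [
        feature["name"]
        for feature in features
        if feature.get("required", False) and request.get(feature["name"]) is None
    ]


print(missing_required(features, {"income": 52000.0}))
```

A transformer could increment the anomaly counter/metric by the length of the returned list.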