arangoml / arangopipe

ArangoML Pipeline is a common and extensible Metadata Layer for Machine Learning Pipelines based on ArangoDB.

Attribute drift detection #124

Open rajivsam opened 4 years ago

rajivsam commented 4 years ago

Implement a feature to detect attribute drift. We already have this at the dataset (joint distribution) level; it looks like Azure can do it at the attribute level. This is not difficult to do. It requires checking the nature of each attribute:

(1) If it is continuous (numeric), the numpy dtype should be float. Use the two-sample Kolmogorov-Smirnov test to check whether the attribute has the same distribution in the training data and in the data received in deployment: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ks_2samp.html

(2) If it is categorical, the numpy dtype is object. Use the chi-square test of independence: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html We need a contingency table to do this. We can build it with the group-by functionality in pandas: https://stackoverflow.com/questions/29901436/is-there-a-pythonic-way-to-do-a-contingency-table-in-pandas
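A minimal sketch of the two checks described above, assuming both datasets are pandas DataFrames with matching column names. The function name `attribute_drift`, the `alpha` threshold of 0.05, and the result layout are hypothetical and not part of arangopipe. The contingency table is built with `pd.crosstab` rather than an explicit group by, which is one of the options discussed in the linked Stack Overflow thread.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, ks_2samp


def attribute_drift(train, deployed, alpha=0.05):
    """Hypothetical helper: run one drift test per shared attribute."""
    rows = []
    for col in train.columns.intersection(deployed.columns):
        a, b = train[col].dropna(), deployed[col].dropna()
        if a.empty or b.empty:
            continue  # nothing to compare
        if pd.api.types.is_float_dtype(a):
            # Continuous attribute: two-sample Kolmogorov-Smirnov test.
            test = "ks_2samp"
            p_value = ks_2samp(a, b).pvalue
        elif pd.api.types.is_object_dtype(a) or pd.api.types.is_string_dtype(a):
            # Categorical attribute (object dtype; a dedicated string dtype is
            # also accepted here as an assumption): chi-square test on a
            # source-by-category contingency table.
            test = "chi2_contingency"
            values = pd.concat([a, b], ignore_index=True).to_numpy()
            source = np.repeat(["train", "deployed"], [len(a), len(b)])
            table = pd.crosstab(source, values)
            p_value = chi2_contingency(table)[1]
        else:
            continue  # other dtypes (for example integers) are not covered
        rows.append(
            {
                "attribute": col,
                "test": test,
                "p_value": p_value,
                "drift": bool(p_value < alpha),
            }
        )
    return pd.DataFrame(rows)
```

A small p-value suggests that the attribute is distributed differently in the two datasets. Since one test is run per attribute, a correction for multiple comparisons may be worth considering when there are many attributes.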

Note: Check https://towardsdatascience.com/how-to-compare-two-distributions-in-practice-8c676904a285 to see if a completely discrete non-parametric test makes sense.