arangoml / arangopipe

ArangoML Pipeline is a common and extensible Metadata Layer for Machine Learning Pipelines based on ArangoDB.

the validity of employing ml model for the covariate shift calculation in the example [Arangopipe_Feature_ext2_output.ipynb] #167

Open tomaszek0 opened 2 years ago

tomaszek0 commented 2 years ago

Hi, thank you for the informative description of the covariate shift detection problem and for your work. The approach in the example is a simple way to demonstrate covariate shift. However, I am a bit confused: what is the point of engaging a machine learning model to solve a problem that is already solved at the start by "human learning", i.e. a predetermined split of the data into two groups based on the generated histogram? That histogram already revealed the covariate shift in the dataset. To me, dividing the dataset into a reference group (rows with lat values less than -119) and a group representing the whole dataset makes more sense. I understand this is a deliberately simple example, but adding a dataset with a hidden covariate shift would be helpful (the breast cancer dataset from sklearn.datasets is a classic and very easy binary classification dataset).

rajivsam commented 2 years ago

Hi @tomaszek0 , thanks for the question. The idea is something like this: in real-world deployments, dataset shift and covariate shift occur because of changing business conditions; for example, customer tastes change, market forces change, and so on. In this example we are simulating two such datasets drawn from different conditions. In the real world, this would happen organically. Does that help? Your request for a real-world example is noted.
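The idea described above, training a classifier to tell two samples apart as a covariate shift test, can be sketched as follows. This is a minimal illustration with synthetic data, not the notebook's actual code; the feature dimensions, sample sizes, and the size of the injected shift are all assumptions.

```python
# Classifier-based covariate shift detection (two-sample test), sketched
# on synthetic data. If a classifier separates "reference" from "current"
# data well above chance (ROC AUC ~ 0.5), the covariate distribution shifted.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Simulate two datasets from different "business conditions":
# the current data's feature means have drifted relative to the reference.
reference = rng.normal(loc=0.0, scale=1.0, size=(500, 3))
current = rng.normal(loc=1.5, scale=1.0, size=(500, 3))

X = np.vstack([reference, current])
y = np.concatenate([np.zeros(500), np.ones(500)])  # 0 = reference, 1 = current

# Cross-validated AUC of the domain classifier is the shift score.
auc = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="roc_auc").mean()
print(f"mean CV AUC: {auc:.2f}")
```

An AUC near 0.5 means the two samples are indistinguishable (no detectable shift); an AUC near 1.0, as here, signals a strong covariate shift.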

tomaszek0 commented 2 years ago

Hi @rajivsam , simpler = better (https://towardsdatascience.com/the-limitations-of-machine-learning-a00e0c3040c6). Here is a similar approach to covariate shift detection, as shown by Du Phan (https://medium.com/data-from-the-trenches/a-primer-on-data-drift-18789ef252a6). I feel that in the case of "Arangopipe-feature..." some additional explanation of feature drift should be given by computing drift values as an equivalent of feature importance: we should be able to indicate the feature that discriminates the corrupted (shifted) samples from the reference (stationary) samples (see, for example, https://docs.seldon.io/projects/alibi-detect/en/latest/examples/cd_spot_the_diff_mnist_wine.html).

rajivsam commented 2 years ago

@tomaszek0 noted. In fact, if you choose to implement drift detection with logistic regression, then you get exactly what you are referring to. The choice of classifier is really a design and application preference. Interest in this feature is noted.
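The point above, that a logistic regression drift detector yields per-feature drift importances, can be sketched as follows. This is a hypothetical illustration on synthetic data (the feature count, sample sizes, and injected drift are assumptions), not code from the arangopipe notebook.

```python
# Logistic regression as the domain classifier: with standardized features,
# the magnitude of each coefficient ranks how strongly that feature
# discriminates the shifted sample from the reference sample.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Three features; drift is injected into feature 2 only.
reference = rng.normal(0.0, 1.0, size=(500, 3))
current = rng.normal(0.0, 1.0, size=(500, 3))
current[:, 2] += 2.0  # the hidden covariate shift

X = StandardScaler().fit_transform(np.vstack([reference, current]))
y = np.concatenate([np.zeros(500), np.ones(500)])  # 0 = reference, 1 = current

clf = LogisticRegression().fit(X, y)

# Absolute coefficients act as drift importances, analogous to
# "spotting the difference" between the two samples.
importance = np.abs(clf.coef_[0])
print("drift importance per feature:", importance)
print("most drifted feature:", importance.argmax())  # feature 2
```

A tree-based classifier's feature importances would serve the same purpose; logistic regression is just the most directly interpretable choice.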