Feature/dss52 model drift: backend v1

du-phan commented 5 years ago

Backend implementation with 3 components:

a Preprocessor object that mimics doctor's behaviour.
a DriftAnalyzer object that trains a drift model and returns a list of metrics:
- Original feature importance vs Drift feature importance.
- AUC score of the drift model.
- Original test set prediction proba vs New test set prediction proba.

Example:

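import dataiku

# Make the plugin's python-lib importable before importing from it
dataiku.use_plugin_libs('model-drift')
from dku_drifter import Drifter

model_id = '5HExUjQ1'
test_set = 'unlabeled_customers_within_segments_prepared'

# Train the drift model, then compute the drift metrics
drifter = Drifter(model_id, test_set)
drifter.train_drift_model()
drift_metrics = drifter.generate_drift_metrics()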
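For readers unfamiliar with the "AUC score of the drift model" metric: the drift model is a classifier trained to tell original test rows from new ones, and its AUC measures how well it succeeds. The sketch below is not the plugin's code. It is a minimal standard-library implementation of AUC via the rank-sum formulation, and the function name `auc_score` and the 0/1 labelling convention are assumptions made for illustration.

```python
def auc_score(labels, scores):
    """AUC via the rank-sum (Mann-Whitney U) formulation, with tie handling.

    labels: 0 for a row from the original test set, 1 for a row from the new one
            (this labelling convention is an assumption of the sketch).
    scores: the drift model's predicted probability that the row is "new".
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need rows from both data sets")

    pairs = sorted(zip(scores, labels))
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Group rows with identical scores so they share an average rank.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j

    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_statistic / (n_pos * n_neg)


# Perfectly separable data sets: the drift model always ranks new rows higher.
separable = auc_score([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
# Indistinguishable data sets: every row gets the same score.
indistinguishable = auc_score([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5])
```

An AUC close to 0.5 means the classifier cannot tell the two data sets apart (little evidence of drift), while an AUC close to 1 means they are easy to separate.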

dsleo commented 5 years ago

I think we are saying the same thing! We usually won't have a new_test_df as large as the original test_df, but to learn the drift model the two need to be of similar size. So we'd indeed need to downsample the original test_df... that's what I had in the original code, but maybe you've totally discarded it?

And for a v2, we could bootstrap sample to get a more robust estimation of drift score.
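The bootstrap idea for a v2 could look roughly like the sketch below. It is an illustration only: `bootstrap_drift_score` and `mean_gap` are hypothetical names, and `mean_gap` is a deliberately simple stand-in for "train the drift model and return its score". Each round draws the same number of rows from both sets, with replacement, so the two samples stay the same size.

```python
import random
import statistics


def bootstrap_drift_score(original_rows, new_rows, score_fn, n_rounds=200, seed=0):
    """Resample both data sets with replacement and score each resample.

    Returns the mean score and a rough 95% percentile interval.
    score_fn(original_sample, new_sample) is any drift score; in the plugin
    it would presumably train the drift model, which this sketch does not do.
    """
    rng = random.Random(seed)
    k = min(len(original_rows), len(new_rows))  # keep both samples the same size
    scores = []
    for _ in range(n_rounds):
        original_sample = rng.choices(original_rows, k=k)
        new_sample = rng.choices(new_rows, k=k)
        scores.append(score_fn(original_sample, new_sample))
    scores.sort()
    lower = scores[int(0.025 * (n_rounds - 1))]
    upper = scores[int(0.975 * (n_rounds - 1))]
    return statistics.mean(scores), (lower, upper)


def mean_gap(original_sample, new_sample):
    """Toy drift score (an assumption): absolute difference of the sample means."""
    return abs(statistics.mean(new_sample) - statistics.mean(original_sample))


original = [float(x % 10) for x in range(200)]
shifted = [float(x % 10) + 3.0 for x in range(50)]
mean_score, (low, high) = bootstrap_drift_score(original, shifted, mean_gap)
```

The width of the interval gives a sense of how much the score depends on the particular rows sampled, which a single train-and-score run cannot show.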

On 27 Jul 2019 at 12:18 +0200, Du Phan notifications@github.com wrote:

@du-phan commented on this pull request. In python-lib/dku_drifter/drifter.py:

        self.model_handler = self._get_model_handler()
        self.drift_clf = None
        self.train_X = None
        self.train_Y = None
        self.test_X = None
        self.test_Y = None

    def _get_model_handler(self):
        my_data_dir = os.environ['DIP_HOME']
        saved_model_version_id = get_saved_model_version_id(self.model_id)
        model_handler = get_model_info_handler(saved_model_version_id, my_data_dir)
        return model_handler

    def concatenate_new_and_original_data(self):

One option is to force sampling as many rows from original_test_df as there are in new_test_df.

This seems like a very strong constraint to me, and in practice I don't think we can have a new_test_df as big as original_test_df. We can implement a downsampling mechanism in which we downsample whichever test set has more rows.
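The downsampling mechanism discussed here can be sketched with the standard library alone. This is an assumption-laden illustration, not the code in `drifter.py`: `balance_by_downsampling` is a hypothetical helper and rows are plain Python lists rather than DataFrames.

```python
import random


def balance_by_downsampling(original_rows, new_rows, seed=0):
    """Shrink whichever data set has more rows to the size of the smaller one.

    Sampling is without replacement; the smaller set is returned unchanged.
    """
    rng = random.Random(seed)
    target = min(len(original_rows), len(new_rows))

    def shrink(rows):
        if len(rows) > target:
            return rng.sample(rows, target)
        return list(rows)

    return shrink(original_rows), shrink(new_rows)


original_rows = list(range(1000))
new_rows = list(range(5000, 5120))
balanced_original, balanced_new = balance_by_downsampling(original_rows, new_rows)
```

With pandas DataFrames the same idea would typically use `DataFrame.sample(n=target, random_state=seed)` on the larger frame.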

du-phan commented 5 years ago

Ah, I just saw that. I missed it the first time I read your code.

du-phan commented 5 years ago

New version that takes into account the feedback from Joachim and Léo. A new object, ModelAccessor, is added to decouple the dku model_handler logic from the DriftAnalyzer. The new API is as follows:

import dataiku

dataiku.use_plugin_libs('model-drift')
from dku_drifter import DriftAnalyzer, ModelAccessor
from commons import get_model_handler

model_id = '5HExUjQ1'
test_set = 'unlabeled_customers_within_segments_prepared'
new_test_df = dataiku.Dataset(test_set).get_dataframe()

# Wrap the DSS model handler so that DriftAnalyzer never touches it directly
model = dataiku.Model(model_id)
model_handler = get_model_handler(model)
model_accessor = ModelAccessor(model_handler)

drifter = DriftAnalyzer(model_accessor)
drift_features, drift_clf = drifter.train_drift_model(new_test_df)
drift_metrics = drifter.generate_drift_metrics(new_test_df, drift_features, drift_clf)
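The decoupling described in this comment can be pictured with a small stand-alone sketch. Every class, attribute and method name below is hypothetical (the `Sketch` and `Stub` names are there to make that obvious); none of it is the plugin's or DSS's real interface. The point is only that the analyzer talks to an accessor, and the accessor is the single place that knows how the handler stores things.

```python
class StubModelHandler:
    """Hypothetical stand-in for the DSS model handler (not the real API)."""

    def __init__(self, test_rows, feature_names):
        self.test_rows = test_rows
        self.feature_names = feature_names


class ModelAccessorSketch:
    """Only this class knows the handler's internals."""

    def __init__(self, model_handler):
        self._handler = model_handler

    def get_original_test_rows(self):
        return list(self._handler.test_rows)

    def get_feature_names(self):
        return list(self._handler.feature_names)


class DriftAnalyzerSketch:
    """Depends on the accessor's methods, never on the handler itself."""

    def __init__(self, model_accessor):
        self._accessor = model_accessor

    def build_drift_dataset(self, new_rows):
        """Label original rows 0 and new rows 1, the target a drift model learns."""
        original_rows = self._accessor.get_original_test_rows()
        rows = original_rows + list(new_rows)
        labels = [0] * len(original_rows) + [1] * len(new_rows)
        return rows, labels


handler = StubModelHandler(test_rows=[[1.0, 2.0], [3.0, 4.0]], feature_names=["a", "b"])
analyzer = DriftAnalyzerSketch(ModelAccessorSketch(handler))
rows, labels = analyzer.build_drift_dataset([[9.0, 9.0]])
```

One benefit of this shape is that the analyzer can be unit-tested with a stub handler, without a running DSS instance.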

dataiku / dss-plugin-model-drift

Feature/dss52 model drift: backend v1 #1