NannyML / nannyml

nannyml: post-deployment data science in python
https://www.nannyml.com/
Apache License 2.0

Reference Drift Metrics #426

Open emrynHofmannElephant opened 2 days ago

emrynHofmannElephant commented 2 days ago

When calculating univariate drift, you "fit" the drift calculator on the reference data. How are the drift metrics for the chunks within the reference data then calculated? Are they compared to the overall distribution of the reference data?

jakubnml commented 2 days ago

Yes, that's how it is done currently, and we are aware it is not the optimal way. Good job on spotting that though 👏

So the correct way is: when calculating a drift metric for a chunk that is a subset of the reference data, the observations belonging to that chunk should be "removed" from the reference data used for the comparison - just like in cross-validation. Otherwise some of the drift metrics are lower than they really should be, because one dataset (the reference chunk) is a subset of the other (the whole reference). As a result, in an extreme situation, one may have perfectly iid data, yet the drift metrics on the reference chunks will be lower than on the monitored (analysis) data - with iid data they shouldn't be.
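The bias is easy to demonstrate with a toy numpy sketch (this is not nannyml's implementation, just the principle): on perfectly iid data, comparing each chunk against the full reference that contains it yields systematically lower drift than comparing it against the held-out remainder.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the two empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return np.abs(cdf_a - cdf_b).max()

rng = np.random.default_rng(0)
reference = rng.normal(size=(10, 1000))  # 10 reference chunks, perfectly iid

in_sample, held_out = [], []
for i in range(10):
    chunk = reference[i]
    full = reference.ravel()                        # chunk is a subset of this
    rest = np.delete(reference, i, axis=0).ravel()  # chunk excluded, CV-style
    in_sample.append(ks_statistic(chunk, full))
    held_out.append(ks_statistic(chunk, rest))

# comparing against the full reference systematically understates drift
print(np.mean(in_sample) < np.mean(held_out))  # → True
```

In fact, because the full reference's empirical CDF is a 10%/90% mixture of the chunk's and the remainder's, the in-sample statistic here is exactly 0.9× the held-out one for every chunk.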

We plan to fix this, either by enforcing the new, correct way, or by making it the default while keeping the old way as an option - it can sometimes be beneficial because of its lower computational cost. I can't say exactly when, because our current focus is on research related to performance estimation methods.

Before we fix it, if you really want, you can hack it on your own by fitting the calculator multiple times, each time on the subset of the reference data that does not contain the reference chunk of interest.
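That leave-one-chunk-out loop might look like the sketch below. `MeanShiftCalculator` is a toy stand-in that only mirrors the fit/calculate shape of a drift calculator - it is not nannyml's API; in practice you would fit a real nannyml calculator on each reduced reference set instead.

```python
import numpy as np

class MeanShiftCalculator:
    """Toy stand-in for a drift calculator (hypothetical, not nannyml's API).
    fit() learns reference statistics; calculate() scores a chunk against them."""

    def fit(self, reference):
        self.mu = reference.mean()
        self.sigma = reference.std()
        return self

    def calculate(self, chunk):
        # standardized shift of the chunk mean vs the fitted reference
        return abs(chunk.mean() - self.mu) / self.sigma

rng = np.random.default_rng(1)
chunks = [rng.normal(size=500) for _ in range(10)]  # reference split into chunks

# leave-one-chunk-out: fit on all reference chunks except the one being scored
loo_scores = []
for i, chunk in enumerate(chunks):
    rest = np.concatenate([c for j, c in enumerate(chunks) if j != i])
    loo_scores.append(MeanShiftCalculator().fit(rest).calculate(chunk))
```

Note the cost: with k reference chunks this runs k fits instead of one, which is the "lower computational cost" trade-off of the current behaviour mentioned above.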