Open LifeBoey opened 1 year ago
Hi @LifeBoey, you might find this blog post useful: https://www.evidentlyai.com/blog/data-drift-detection-large-datasets
There, we generate artificial drift and then explore how each statistical test reacts to it.
There is also a notebook with all the code, including the code where we created artificial drift: https://colab.research.google.com/drive/1EFFcs0wDzToxSR6nw1umXDgPyeoP_Uk6
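For a rough idea of what such an experiment looks like, here is a minimal sketch. It is not the code from the notebook. It assumes a synthetic, normally distributed "Age"-like feature and uses a hand-written two-sample Kolmogorov-Smirnov statistic (the hypothetical helper `ks_statistic`) as a stand-in for a drift test. The shift sizes are arbitrary illustration values.

```python
import numpy as np


def ks_statistic(reference, current):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    reference = np.sort(np.asarray(reference, dtype=float))
    current = np.sort(np.asarray(current, dtype=float))
    # Evaluate both empirical CDFs at every observed value.
    grid = np.concatenate([reference, current])
    cdf_ref = np.searchsorted(reference, grid, side="right") / reference.size
    cdf_cur = np.searchsorted(current, grid, side="right") / current.size
    return float(np.max(np.abs(cdf_ref - cdf_cur)))


rng = np.random.default_rng(0)

# Assumption: a synthetic numeric feature standing in for a real column.
reference = rng.normal(loc=40.0, scale=10.0, size=5000)

# Apply increasingly large shifts and record how the statistic responds.
results = {}
for shift in [0.0, 5.0, 10.0]:
    current = rng.normal(loc=40.0, scale=10.0, size=5000) + shift
    results[shift] = ks_statistic(reference, current)
    print(f"shift={shift:5.1f}  KS statistic={results[shift]:.3f}")
```

The statistic should stay near zero when no shift is applied and grow as the shift grows. A library routine such as `scipy.stats.ks_2samp` additionally returns a p-value, which is what a drift test would normally threshold on.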
Hi there,
I've been exploring data drift detection and want to test how well evidently determines how much a given dataset has drifted. My main concern right now is how to generate drifted data in the first place, and how strongly to skew it, so that I can check whether evidently detects the amount of drift that was applied.
So let's say I have a tabular dataframe like this, where I want to drift just the feature of Age.
What are the ways to artificially create a drifted dataset from a given dataset?
What I've been doing is splitting it into two extreme ranges (e.g. one set with Age < 50 and one set with Age >= 50), and then mixing the two sets more and more to create "less" drift. But for tabular data, would something simpler do the trick, such as applying a uniform difference (a constant offset) to all the Ages of one dataset? Or adding random noise to all of the Ages, with the noise following some normal distribution? What other standard techniques could be used to apply drift in this manner, to a degree that can be varied?
Thank you!
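The three options raised above (mixing two Age ranges, a uniform offset, and Gaussian noise) can be sketched as follows. This is only an illustration with NumPy on a synthetic Age array. The function names `shift_feature`, `add_gaussian_noise`, and `mix_groups` are hypothetical and not part of evidently, and the parameter values are arbitrary.

```python
import numpy as np


def shift_feature(values, delta):
    """Add the same constant offset to every value (moves the mean)."""
    return np.asarray(values, dtype=float) + delta


def add_gaussian_noise(values, sigma, rng):
    """Add zero-mean normal noise (widens the distribution, mean stays put)."""
    values = np.asarray(values, dtype=float)
    return values + rng.normal(loc=0.0, scale=sigma, size=values.shape)


def mix_groups(group_a, group_b, frac_b, size, rng):
    """Resample `size` values, a fraction `frac_b` of them from group_b."""
    n_b = int(round(frac_b * size))
    part_a = rng.choice(group_a, size=size - n_b, replace=True)
    part_b = rng.choice(group_b, size=n_b, replace=True)
    return np.concatenate([part_a, part_b])


rng = np.random.default_rng(42)

# Assumption: synthetic ages standing in for the real Age column.
ages = rng.integers(18, 90, size=2000).astype(float)
young, old = ages[ages < 50], ages[ages >= 50]

shifted = shift_feature(ages, delta=5.0)                         # knob: delta
noisy = add_gaussian_noise(ages, sigma=3.0, rng=rng)             # knob: sigma
mixed = mix_groups(young, old, frac_b=0.8, size=2000, rng=rng)   # knob: frac_b
```

Each generator has a single knob (`delta`, `sigma`, or `frac_b`) that can be swept to vary the degree of drift. For a feature like Age, the results may also need rounding or clipping to a valid range, depending on the dataset.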