evidentlyai / evidently

Evaluate and monitor ML models from validation to production. Join our Discord: https://discord.com/invite/xZjKRaNp8b
Apache License 2.0

Drifted Tabular Data Generation #328

Open LifeBoey opened 1 year ago

LifeBoey commented 1 year ago

Hi there,

I've been exploring data drift detection and want to test how well Evidently measures how much a given dataset has drifted. My main concern right now is how to generate drifted data in the first place, and how much to skew it, so that I can check whether Evidently detects the amount of drift I applied.

So let's say I have a tabular dataframe like this, where I want to drift just the Age feature.

[Image: adult df]

What are the ways to artificially create a drifted dataset from a given dataset?

What I've been doing is splitting the data into two extreme ranges (e.g. one set with Age < 50 and one set with Age >= 50), and then mixing the two sets more and more to create "less" drift. But for tabular data, would something simpler do the trick, such as adding a uniform offset to all the Ages in one dataset, or adding random noise drawn from a normal distribution to all of the Ages? What other standard techniques could be used to apply drift in this way, to a degree that can be varied?

Thank you!

elenasamuylova commented 1 year ago

Hi @LifeBoey, you might find this blog post useful: https://www.evidentlyai.com/blog/data-drift-detection-large-datasets

There, we generate artificial drift and then explore how each statistical test reacts to it.
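For a rough idea of what generating artificial drift can look like, here is a minimal sketch of the three approaches mentioned in the question: a uniform shift, Gaussian noise, and mixing two age ranges in a controlled proportion. It is not the code from the blog post. The function names, the synthetic `ages` list, and the parameter values are all made up for illustration, and it uses only the Python standard library, so you would apply the same operations to the relevant DataFrame column in practice.

```python
import random


def shift_values(values, delta):
    """Uniform shift: add the same constant to every value."""
    return [v + delta for v in values]


def add_gaussian_noise(values, sigma, rng):
    """Add zero-mean normal noise with standard deviation `sigma` to every value."""
    return [v + rng.gauss(0.0, sigma) for v in values]


def mix_samples(low, high, high_fraction, size, rng):
    """Draw `size` values with replacement, a `high_fraction` share of them from `high`."""
    n_high = round(size * high_fraction)
    n_low = size - n_high
    mixed = rng.choices(high, k=n_high) + rng.choices(low, k=n_low)
    rng.shuffle(mixed)
    return mixed


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical stand-in for the Age column (not the real adult dataset)
ages = [rng.randint(17, 90) for _ in range(1000)]

# Degree of drift is controlled by delta, sigma, and high_fraction respectively
shifted = shift_values(ages, 5)
noisy = add_gaussian_noise(ages, 3.0, rng)

young = [a for a in ages if a < 50]
old = [a for a in ages if a >= 50]
mixed = mix_samples(young, old, 0.8, 1000, rng)
```

Each approach has one knob (`delta`, `sigma`, or `high_fraction`) that you can sweep to vary the degree of drift. Note that zero-mean noise mostly changes the spread of the column rather than its mean, so it may be harder to detect than a shift of similar size.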

There is also a notebook with all the code, including the code where we created artificial drift: https://colab.research.google.com/drive/1EFFcs0wDzToxSR6nw1umXDgPyeoP_Uk6
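To illustrate the "apply a known amount of drift, then see how a test reacts" idea in a self-contained way, here is a minimal sketch that is separate from the notebook above. It assumes a normally distributed feature and hand-rolls the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs) with the standard library, rather than calling Evidently. The name `ks_statistic`, the distribution parameters, and the shift sizes are illustrative assumptions.

```python
import random


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    ref = sorted(reference)
    cur = sorted(current)
    i = j = 0
    d = 0.0
    while i < len(ref) and j < len(cur):
        # Advance both samples past the next smallest value (handles ties)
        x = min(ref[i], cur[j])
        while i < len(ref) and ref[i] <= x:
            i += 1
        while j < len(cur) and cur[j] <= x:
            j += 1
        d = max(d, abs(i / len(ref) - j / len(cur)))
    return d


rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical reference feature: mean 40, standard deviation 12
reference = [rng.gauss(40, 12) for _ in range(2000)]

# Shift the mean by increasing amounts and record the statistic for each
results = {}
for delta in (0, 2, 5, 10):
    current = [rng.gauss(40 + delta, 12) for _ in range(2000)]
    results[delta] = ks_statistic(reference, current)

for delta, stat in results.items():
    print(f"shift={delta:>2}  KS statistic={stat:.3f}")
```

The statistic should grow as the shift grows, which gives a simple way to check whether a detector's output tracks the amount of drift you injected. Swapping the shift for noise or mixing lets you compare how sensitive a test is to each kind of change.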