Closed eyalcha closed 1 year ago
Hi @eyalcha ,
Thank you for sharing the results you got and raising questions. Let me try to comment on that.
In both cases (with Evidently and with the external tool), drift is detected using a K-S test at the 0.05 significance level, so mathematically everything appears to work correctly. Practically, the first dataset has only a small number of observations with just two unique values, and you compare it against a dataset with a much larger range and 22 unique values. That warrants an alert, since the two look quite different.
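To make that concrete, here is a minimal sketch of the two-sample K-S statistic, which is the largest gap between the two empirical distribution functions. The helper name `ks_statistic` and both datasets are made up for illustration. They only mimic the shape described above (8 observations with two unique values against a wider reference with 22 unique values), and this is not Evidently's implementation.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)  # share of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # share of b that is <= x
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Made-up data, NOT the values from this issue.
current = [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.4]  # 8 observations, 2 unique values
reference = [k / 10 for k in range(1, 23)] * 5       # 110 observations, 22 unique values

d = ks_statistic(current, reference)
print(f"K-S statistic: {d:.3f}")
```

In this toy setup the current sample reaches 100% of its mass by 0.4, while the reference has covered only a small share of its range at that point, so the statistic comes out large.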
Generally, there is a standard recommendation to have at least 30 observations in your datasets to apply statistical tests, although it is just a heuristic.
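One way to see why sample size matters is the usual large-sample approximation for the two-sample K-S rejection threshold, D > c(α) · sqrt((n + m) / (n · m)) with c(α) = sqrt(-ln(α / 2) / 2). The sketch below evaluates it for a few sample sizes. The function name `ks_critical_value` is hypothetical, and the approximation is rough for very small samples, so treat the numbers as indicative only.

```python
import math


def ks_critical_value(n, m, alpha=0.05):
    """Approximate two-sample K-S rejection threshold (large-sample formula)."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return c_alpha * math.sqrt((n + m) / (n * m))


# Reference size of 150 rows (the size of the Iris dataset); current sizes vary.
for n in (8, 30, 150):
    print(n, round(ks_critical_value(n, 150), 3))
```

With 150 reference rows, a current sample of 8 needs a K-S statistic of roughly 0.49 to be flagged at the 0.05 level, compared with roughly 0.27 for a sample of 30.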
In your example, the key factor is how you created the sample. I believe that with a random sampling strategy you would get the opposite result.
I ran such an experiment, and drift was not detected. Here is a code sample:
# Imports
import pandas as pd
from sklearn import datasets
from evidently.report import Report
from evidently.metrics import ColumnDriftMetric

# Iris dataset: the full frame is the reference, a random sample of 8 rows is the current data
iris_data = datasets.load_iris(as_frame=True)
iris = iris_data.frame
iris_ref = iris
iris_cur = iris.sample(n=8, replace=False, random_state=42)

# Drift detection with a K-S test on a single column
data_drift_report = Report(metrics=[
    ColumnDriftMetric(stattest='ks', column_name='petal width (cm)'),
])
data_drift_report.run(reference_data=iris_ref, current_data=iris_cur)

# Display the report (renders inline in a notebook)
data_drift_report
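To illustrate why the sampling strategy matters, here is a small standard-library sketch that contrasts taking the first rows with drawing a random sample from an ordered column. The data is a hypothetical stand-in (rows in scikit-learn's Iris frame are grouped by species, so low petal widths come first), `describe` is a made-up helper, and the idea that the original sample came from the first rows is an assumption, since the thread does not say how it was built.

```python
import random

# Hypothetical column stored in class order: low values first, high values last.
reference = [0.2] * 40 + [0.4] * 10 + [1.3] * 50 + [2.0] * 50


def describe(values):
    """Summarise how much of the value range a sample covers."""
    return {"unique": len(set(values)), "min": min(values), "max": max(values)}


head_sample = reference[:8]               # first rows only: a single class
rng = random.Random(42)                   # fixed seed for repeatability
random_sample = rng.sample(reference, 8)  # drawn without replacement from all rows

print("reference:", describe(reference))
print("first rows:", describe(head_sample))
print("random:", describe(random_sample))
```

The first-rows sample can only ever see the lowest values, while a random draw can pick from every part of the column, which is why it tends to resemble the reference more closely.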
Hi,
I am running a very simple case of Iris data drift detection with a minimal number of observations as the current data. Drift is detected, although I am not expecting any.
My questions are:
Current data - petal width (cm)
Reference data - iris
This is the result when using this link https://www.aatbio.com/tools/kolmogorov-smirnov-k-s-test-calculator: