shorten the time it takes to calculate data drift

evidentlyai / evidently

Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.

https://www.evidentlyai.com/evidently-oss

Apache License 2.0

4.96k stars 558 forks

shorten the time it takes to calculate data drift #506

Open yairVanti opened 1 year ago

yairVanti commented 1 year ago

Currently, if I have large reference and test datasets to check for drift, the drift test takes a few minutes to complete. Each file is about 13MB, with shape (1400, 380). Since the drift check is calculated by comparing histograms of both datasets, why can't the reference data histograms be calculated beforehand (in the training phase of the ML model, for example) and kept, so that only the test data histograms need to be calculated when we want to check for drift (at inference time)? I believe it would shorten the time considerably.
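The idea above can be sketched in plain Python. This is only an illustration of the proposal, not Evidently's API: the function names (`fit_reference_histogram`, `bin_frequencies`, `psi`) are hypothetical, and the Population Stability Index is used here as one common histogram-based drift measure, which may differ from the test Evidently actually runs. The reference histogram is computed once and serialized; at check time only the current data is binned.

```python
import bisect
import json
import math
import random


def bin_frequencies(values, edges):
    """Bin values using fixed interior edges; out-of-range values land in the end bins."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_right(edges, v)] += 1
    total = len(values)
    return [c / total for c in counts]


def fit_reference_histogram(values, n_bins=10):
    """Compute equal-width bin edges and bin frequencies for a reference column once."""
    lo, hi = min(values), max(values)
    # A constant column would give zero-width bins, so fall back to width 1.0.
    width = (hi - lo) / n_bins or 1.0
    edges = [lo + i * width for i in range(1, n_bins)]  # interior edges only
    return {"edges": edges, "freqs": bin_frequencies(values, edges)}


def psi(ref_freqs, cur_freqs, eps=1e-6):
    """Population Stability Index between two histograms that share the same bins."""
    score = 0.0
    for r, c in zip(ref_freqs, cur_freqs):
        r, c = max(r, eps), max(c, eps)  # avoid log(0) on empty bins
        score += (c - r) * math.log(c / r)
    return score


# Synthetic data, for illustration only.
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(5000)]

# Done once (for example at training time): compute and persist the histogram.
saved = json.dumps(fit_reference_histogram(reference))

# Done at each drift check: load the saved histogram and bin only the current data.
ref_hist = json.loads(saved)
same = [rng.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [rng.gauss(1.5, 1.0) for _ in range(5000)]
psi_same = psi(ref_hist["freqs"], bin_frequencies(same, ref_hist["edges"]))
psi_shifted = psi(ref_hist["freqs"], bin_frequencies(shifted, ref_hist["edges"]))
print(f"PSI, same distribution: {psi_same:.4f}")
print(f"PSI, shifted distribution: {psi_shifted:.4f}")
```

The saved object holds only the bin edges and frequencies, so the reference rows themselves are not needed at check time. This works only for measures that can be computed from binned data.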

emeli-dral commented 1 year ago

Hi @yairVanti ,

Thank you for sharing this idea with us. We are also looking in this direction. I believe we won’t be able to calculate histograms for the reference data during training, because we do not have access to the model training phase. But we might be able to accept precomputed reference histograms as input for the drift calculation.

Generally, I agree that it makes sense to calculate reference-based statistics (not only histograms) beforehand and save them to use later for many reasons. At least it looks reasonable for monitoring purposes if the reference dataset is not changing that often, and one can reuse reference-based statistics for many fresh current datasets.

Unfortunately, we do not have the option to save reference-based statistics beforehand yet.

I also noticed that you have quite a lot of data. You could consider sampling to speed up the calculations a bit: sample both the reference and current data, and use those samples to estimate drift.
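A minimal sketch of the sampling suggestion, assuming the rows are held in a Python list. The helper name `sample_rows` and the row cap of 500 are hypothetical choices for illustration; with a pandas DataFrame the equivalent would be `DataFrame.sample(n=..., random_state=...)`.

```python
import random


def sample_rows(rows, max_rows, seed=42):
    """Return at most max_rows rows, chosen uniformly without replacement."""
    if len(rows) <= max_rows:
        return list(rows)
    # A fixed seed keeps the sample reproducible between runs.
    return random.Random(seed).sample(rows, max_rows)


# Stand-in datasets: row indices only, for illustration.
reference_rows = list(range(1400))
current_rows = list(range(1400, 2800))

reference_sample = sample_rows(reference_rows, 500)
current_sample = sample_rows(current_rows, 500)
print(len(reference_sample), len(current_sample))
```

The two samples are then passed to the drift check in place of the full datasets. Smaller samples run faster but give a noisier drift estimate, so the cap is a trade-off to tune.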

MainRo commented 1 year ago

Would you also consider using histograms for the current dataset? I have some monitoring data already aggregated with histograms, and it would be great to use it directly. If you have thought about how to do it, I am interested in your plans and I may contribute.

elenasamuylova commented 1 year ago

Hi @MainRo,

Apologies for the delay here! We are considering implementing this in the future by allowing users to work with config files. There is no confirmed implementation design yet, but we are exploring this.

hkristof03 commented 11 months ago

Hi @emeli-dral ,

I just started using this library, but I do not understand why the reference dataset statistics are not (or cannot be) saved by default. Even with small tables of 50-60 MB, it takes a lot of time to get the reports, even for just one column of text data.

elenasamuylova commented 11 months ago

Hi @hkristof03 - we plan to implement the ability to separate the reference dataset. However, since it requires a major library-wide change to 100+ metrics and tests, this is not a small task. It will be addressed in later versions.

However, specifically for model-based text data drift detection, the reference dataset is always required. In this case, Evidently trains a classifier model, so this metric cannot be computed without the reference dataset.
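To show why a classifier-based check needs the raw reference rows, here is a rough sketch of the general domain-classifier technique: train a model to tell reference texts from current texts, then score it on held-out texts. The thread does not say which classifier or threshold Evidently uses, so the naive Bayes model, the function names, and the synthetic documents below are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def tokenize(text):
    return text.lower().split()


def train_naive_bayes(ref_docs, cur_docs):
    """Fit per-class word counts; class 0 = reference, class 1 = current."""
    counts = [Counter(), Counter()]
    for label, docs in enumerate((ref_docs, cur_docs)):
        for doc in docs:
            counts[label].update(tokenize(doc))
    vocab_size = len(set(counts[0]) | set(counts[1]))
    totals = [sum(c.values()) for c in counts]
    prior = math.log(len(cur_docs) / len(ref_docs))
    return counts, totals, vocab_size, prior


def current_log_odds(model, doc):
    """Log-odds that doc comes from the current dataset (Laplace-smoothed)."""
    counts, totals, vocab_size, prior = model
    score = prior
    for tok in tokenize(doc):
        p_cur = (counts[1][tok] + 1) / (totals[1] + vocab_size)
        p_ref = (counts[0][tok] + 1) / (totals[0] + vocab_size)
        score += math.log(p_cur / p_ref)
    return score


def roc_auc(neg_scores, pos_scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(pos_scores) * len(neg_scores))


def domain_classifier_auc(reference, current, seed=0):
    """Train on half of each dataset and return ROC AUC on the held-out halves."""
    rng = random.Random(seed)
    ref, cur = list(reference), list(current)
    rng.shuffle(ref)
    rng.shuffle(cur)
    ref_train, ref_test = ref[: len(ref) // 2], ref[len(ref) // 2 :]
    cur_train, cur_test = cur[: len(cur) // 2], cur[len(cur) // 2 :]
    model = train_naive_bayes(ref_train, cur_train)
    return roc_auc(
        [current_log_odds(model, d) for d in ref_test],
        [current_log_odds(model, d) for d in cur_test],
    )


def make_docs(rng, words, n_docs, doc_len=8):
    """Build synthetic documents by drawing words at random (illustration only)."""
    return [" ".join(rng.choice(words) for _ in range(doc_len)) for _ in range(n_docs)]


rng = random.Random(1)
billing = ["order", "payment", "refund", "invoice", "shipping", "delivery"]
account = ["order", "payment", "login", "password", "crash", "error"]
reference = make_docs(rng, billing, 200)
auc_same = domain_classifier_auc(reference, make_docs(rng, billing, 200))
auc_drift = domain_classifier_auc(reference, make_docs(rng, account, 200))
print(f"AUC, same topic: {auc_same:.2f}")
print(f"AUC, new topic: {auc_drift:.2f}")
```

An AUC near 0.5 means the classifier cannot tell the two datasets apart, while an AUC near 1.0 indicates drift. Because the model is trained on the reference texts themselves, precomputed summary statistics cannot stand in for them here.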