evidentlyai / evidently

Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.
https://www.evidentlyai.com/evidently-oss
Apache License 2.0
5.2k stars 586 forks

Data Drift | Dataset volume #710

Open nagasaipureti opened 1 year ago

nagasaipureti commented 1 year ago

Hi Team, Good Day!

We are trying to implement data drift detection in our project. The latest Evidently documentation says that large datasets should be sampled before being passed to Evidently. Could you give an approximate dataset size above which we should apply sampling before passing the data to Evidently?

elenasamuylova commented 1 year ago

Hi @nagasaipureti,

I am afraid we cannot give a precise answer here.

The performance varies based on your infrastructure, the number of rows/columns, and the exact metrics used (e.g., some drift detection methods are faster than others).

Also, the need for sampling depends on whether you run reports ad hoc in a notebook (where waiting a long time for a report to appear is inconvenient) or in an automated pipeline (where a longer computation time is more acceptable).

I'd suggest running a few tests on your sample datasets to develop your heuristics here.
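As a rough illustration of that kind of test, here is a minimal sketch that times a drift check on random samples of increasing size and picks the largest size that fits a time budget. It uses only the Python standard library and does not call Evidently: `ks_statistic` is a stand-in metric, and `time_drift_check`, `largest_size_within_budget`, the sample sizes, and the 1-second budget are all hypothetical names and values chosen for the example. In practice you would pass your own function that builds the actual report as `drift_fn`.

```python
import random
import time
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic (stand-in for a real drift metric)."""
    ref = sorted(reference)
    cur = sorted(current)
    d = 0.0
    # The largest gap between the two empirical CDFs occurs at a data point.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        d = max(d, abs(cdf_ref - cdf_cur))
    return d


def time_drift_check(reference, current, sample_sizes, drift_fn=ks_statistic, seed=0):
    """Time drift_fn on random samples of increasing size.

    Returns a list of (sample_size, seconds, drift_value) tuples.
    """
    rng = random.Random(seed)
    results = []
    for n in sample_sizes:
        # Sample without replacement, capped at the available number of rows.
        ref_sample = rng.sample(reference, min(n, len(reference)))
        cur_sample = rng.sample(current, min(n, len(current)))
        start = time.perf_counter()
        value = drift_fn(ref_sample, cur_sample)
        elapsed = time.perf_counter() - start
        results.append((n, elapsed, value))
    return results


def largest_size_within_budget(results, budget_seconds):
    """Pick the largest tested sample size whose run time fits the budget."""
    fitting = [n for n, seconds, _ in results if seconds <= budget_seconds]
    return max(fitting) if fitting else None


if __name__ == "__main__":
    # Synthetic single-column data; replace with values from your own dataset.
    data_rng = random.Random(42)
    reference = [data_rng.gauss(0.0, 1.0) for _ in range(50_000)]
    current = [data_rng.gauss(0.3, 1.0) for _ in range(50_000)]  # shifted mean
    results = time_drift_check(reference, current, [1_000, 5_000, 20_000, 50_000])
    for n, seconds, value in results:
        print(f"n={n:>6}  time={seconds:.3f}s  ks={value:.3f}")
    print("largest size under 1s:", largest_size_within_budget(results, 1.0))
```

Comparing the drift value across sample sizes also shows whether the smaller samples give a result close to the full data, which helps when deciding how aggressively to sample. The timings will differ on your infrastructure and with other metrics, so treat the output only as input to your own heuristic.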