evidentlyai / evidently

Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.
https://www.evidentlyai.com/evidently-oss
Apache License 2.0

Data drift volume issues. Any workaround ? #763

Open YassineR opened 1 year ago

YassineR commented 1 year ago

Hello,

I'm currently using Evidently to perform a data drift analysis on my dataset. The dataset has a total shape of (356,251 rows, 797 columns) for both reference and current data.

When I execute the run() function in Evidently, it seems to run indefinitely. To give you an idea of the issue:

Analyzing 50 columns takes approximately 3 minutes, but increasing to 100 columns takes about 23 minutes. Is there a workaround for this? One idea I had is to break the analysis into smaller chunks, perhaps 50 columns at a time, and then merge the results into a single comprehensive report.

Additionally, having a progress bar feature would be extremely helpful to monitor the analysis's progress, especially in cases where it takes a significant amount of time.
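For readers who want to try the chunking idea, here is a minimal sketch with simple progress output. The helpers `chunked`, `run_in_chunks`, and `evidently_chunk_runner` are hypothetical names, not part of Evidently. The runner assumes the legacy `Report` / `ColumnDriftMetric` API (roughly the 0.4.x releases) and pandas DataFrames as input; newer releases may need different imports. It has not been benchmarked against the dataset described above.

```python
import time
from typing import Callable, Dict, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def run_in_chunks(
    columns: Sequence[str],
    run_chunk: Callable[[List[str]], Dict[str, object]],
    size: int = 50,
) -> Dict[str, object]:
    """Run `run_chunk` on each batch of columns and merge the per-column results."""
    results: Dict[str, object] = {}
    total = len(columns)
    done = 0
    for batch in chunked(columns, size):
        started = time.perf_counter()
        results.update(run_chunk(batch))
        done += len(batch)
        elapsed = time.perf_counter() - started
        # Plain-text progress line, printed once per finished batch.
        print(f"{done}/{total} columns done (last batch: {elapsed:.1f}s)")
    return results


def evidently_chunk_runner(reference, current):
    """Build a `run_chunk` callable backed by Evidently.

    Assumption: legacy Evidently API (Report, ColumnDriftMetric) and
    pandas DataFrames for `reference` and `current`.
    """
    from evidently.report import Report
    from evidently.metrics import ColumnDriftMetric

    def run_chunk(batch: List[str]) -> Dict[str, object]:
        report = Report(metrics=[ColumnDriftMetric(column_name=c) for c in batch])
        # Pass only the batch columns so each run sees a narrow frame.
        report.run(reference_data=reference[batch], current_data=current[batch])
        metrics = report.as_dict()["metrics"]
        # Metrics are assumed to come back in the order they were given.
        return {column: metric["result"] for column, metric in zip(batch, metrics)}

    return run_chunk
```

Usage would look like `results = run_in_chunks(list(reference.columns), evidently_chunk_runner(reference, current), size=50)`. The merged output is a plain dictionary keyed by column name rather than a single Evidently HTML report, which is a trade-off of this approach.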

Any guidance or suggestions would be greatly appreciated.

Thank you

ketangangal commented 1 year ago

True. Column-level metric tests take a long time when you have many columns. Try:

  1. feature selection
  2. PCA

feldlime commented 8 months ago

@YassineR Datadrift test time depends a lot on

So it's possible that your columns are just different. To check this, you can run the test separately for each column and measure the time.
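
A minimal sketch of that per-column timing check follows. The helper names `time_per_column` and `evidently_single_column` are hypothetical, and the Evidently part assumes the legacy `Report` / `ColumnDriftMetric` API with pandas DataFrames, so adjust the imports for your installed version.

```python
import time
from typing import Callable, Dict, Sequence


def time_per_column(
    columns: Sequence[str],
    run_for_column: Callable[[str], object],
) -> Dict[str, float]:
    """Time `run_for_column` for each column; return seconds, slowest first."""
    timings: Dict[str, float] = {}
    for column in columns:
        started = time.perf_counter()
        run_for_column(column)
        timings[column] = time.perf_counter() - started
    # Sort so the most expensive columns appear first.
    return dict(sorted(timings.items(), key=lambda kv: kv[1], reverse=True))


def evidently_single_column(reference, current):
    """Build a callable that runs a drift check on one column.

    Assumption: legacy Evidently API (Report, ColumnDriftMetric) and
    pandas DataFrames for `reference` and `current`.
    """
    from evidently.report import Report
    from evidently.metrics import ColumnDriftMetric

    def run_for_column(column: str):
        report = Report(metrics=[ColumnDriftMetric(column_name=column)])
        report.run(reference_data=reference[[column]], current_data=current[[column]])
        return report

    return run_for_column
```

Calling `time_per_column(list(reference.columns), evidently_single_column(reference, current))` and looking at the first few entries should show whether a handful of columns account for most of the run time.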