Data drift volume issues. Any workaround ?

YassineR commented 1 year ago

Hello,

I'm currently using Evidently to perform a data drift analysis on my dataset. The dataset has a total shape of (356,251 rows, 797 columns) for both reference and current data.

When I call the run() function in Evidently, it seems to run indefinitely. To give you an idea of the issue:

When I analyze 50 columns, it takes approximately 3 minutes. However, when I increase the number of columns to 100, it takes about 23 minutes, so the runtime grows much faster than the column count. I'm wondering if there's a workaround for this situation. One idea I had is to break the analysis into smaller chunks, perhaps 50 columns at a time, and then merge the results into a single comprehensive report.
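A minimal sketch of the chunking idea, in plain Python. The names `chunked`, `run_in_chunks` and `run_drift` are hypothetical and not part of Evidently: `run_drift` stands for any callable that takes a list of column names, runs the drift check on just those columns (for example on `reference[chunk]` and `current[chunk]`), and returns a dict keyed by column name. Printing after each chunk also gives a rough progress indicator.

```python
from typing import Callable, Dict, Iterator, List, Sequence


def chunked(columns: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `columns` with at most `size` names each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(columns), size):
        yield list(columns[start:start + size])


def run_in_chunks(
    columns: Sequence[str],
    run_drift: Callable[[List[str]], Dict[str, object]],  # hypothetical callable
    size: int = 50,
) -> Dict[str, object]:
    """Run `run_drift` on each chunk and merge the per-column results."""
    merged: Dict[str, object] = {}
    n_chunks = (len(columns) + size - 1) // size  # ceiling division
    for i, chunk in enumerate(chunked(columns, size), start=1):
        merged.update(run_drift(chunk))
        # Simple progress report after each finished chunk.
        print(f"chunk {i}/{n_chunks} done ({len(merged)}/{len(columns)} columns)")
    return merged
```

Note that only per-column results merge cleanly this way. Any dataset-level summary, such as the share of drifted columns, would have to be recomputed from the merged dict rather than taken from an individual chunk.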

Additionally, having a progress bar feature would be extremely helpful to monitor the analysis's progress, especially in cases where it takes a significant amount of time.

Any guidance or suggestions would be greatly appreciated.

Thank you

ketangangal commented 1 year ago

True. Column-level metric tests take a lot of time when you have many columns. Try reducing the number of columns first, for example with:

feature selection
PCA

feldlime commented 8 months ago

@YassineR Data drift test time depends a lot on:

which test you're using
the type of column (numerical or categorical)
the structure of the data, particularly the number of unique values in categorical columns

So it's possible that your columns are simply not alike, and some are much more expensive to test than others. To check this, you can run the test separately for each column and measure the time.
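A rough sketch of that per-column timing check, assuming a hypothetical `run_one` callable that runs the drift test for a single column name (it is not an Evidently function, just a placeholder for whatever you call per column):

```python
import time
from typing import Callable, List, Sequence, Tuple


def time_per_column(
    columns: Sequence[str],
    run_one: Callable[[str], object],  # hypothetical: runs the test on one column
) -> List[Tuple[str, float]]:
    """Time `run_one` for each column and return (name, seconds), slowest first."""
    timings: List[Tuple[str, float]] = []
    for name in columns:
        start = time.perf_counter()
        run_one(name)
        timings.append((name, time.perf_counter() - start))
    # Sort so the most expensive columns appear at the top.
    return sorted(timings, key=lambda item: item[1], reverse=True)
```

Looking at the first few entries of the result should show whether a handful of columns (for instance high-cardinality categorical ones) account for most of the runtime.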

evidentlyai / evidently

Data drift volume issues. Any workaround ? #763