evidentlyai / evidently

Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.
https://www.evidentlyai.com/evidently-oss
Apache License 2.0

DataDriftProfileSection Filtered Results #317

Open RJ2494 opened 2 years ago

RJ2494 commented 2 years ago

There should be an option to choose which fields are included in the data drift profile result. At present, we get the following fields for each feature -

Suppose I don't want current_small_hist and ref_small_hist in my output. I should have an option to skip the calculations for these fields, because they make the data drift output take longer to generate. Consider the following code snippet -

https://github.com/evidentlyai/evidently/blob/0f279d3d908b20d6df47b88bff8800bbdf3d516e/src/evidently/calculations/data_drift.py#L281

ref_counts = feature_ref_data.value_counts(sort=False)
cur_counts = feature_cur_data.value_counts(sort=False)
keys = set(ref_counts.keys()).union(set(cur_counts.keys()))
for key in keys:
    if key not in ref_counts:
        ref_counts.loc[key] = 0
    if key not in cur_counts:
        cur_counts.loc[key] = 0

For a high-cardinality categorical feature with thousands of categories (e.g. zipcode, IP address), this loop takes a long time to fill in the counts of all keys for the reference and current data, which delays the whole process. The result is only used to calculate current_small_hist and ref_small_hist in the data drift profile output. Therefore, we should have an option to skip this calculation.
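Separately from a skip option, the per-key loop itself could probably be avoided. The following is only a minimal sketch, not Evidently's code: `aligned_counts` is a hypothetical helper name, and it assumes `feature_ref_data` and `feature_cur_data` are pandas Series, as in the snippet above. It builds the union of categories once and uses `reindex` with `fill_value=0` in place of repeated `.loc` assignments.

```python
import pandas as pd


def aligned_counts(feature_ref_data: pd.Series, feature_cur_data: pd.Series):
    """Hypothetical helper: value counts for both series over the same set of keys.

    Categories missing from one series get a count of 0, matching what the
    original loop produces, but without growing the Series one key at a time.
    """
    ref_counts = feature_ref_data.value_counts(sort=False)
    cur_counts = feature_cur_data.value_counts(sort=False)

    # Union of all categories seen in either dataset.
    keys = ref_counts.index.union(cur_counts.index)

    # reindex adds the missing keys in one vectorized step, filled with 0.
    ref_counts = ref_counts.reindex(keys, fill_value=0)
    cur_counts = cur_counts.reindex(keys, fill_value=0)
    return ref_counts, cur_counts


# Small usage example with made-up data.
ref = pd.Series(["a", "b", "b"])
cur = pd.Series(["b", "c"])
ref_counts, cur_counts = aligned_counts(ref, cur)
```

Both returned Series share the same index, so they can be compared key by key. Whether this removes enough of the overhead for very large category sets would need to be measured on real data.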

RJ2494 commented 2 years ago

@emeli-dral