evidentlyai / evidently

Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.
https://www.evidentlyai.com/evidently-oss
Apache License 2.0

Error using TestSuites for numerical data #1122

Open jeric250 opened 6 months ago

jeric250 commented 6 months ago

Hi there, first time opening an issue so bear with me (and let me know if more info is needed).

Basic information:
Package version used: 0.4.20
Operating system and version: macOS (VSCode)
Programming language and version used: Python 3.12.2

Code snippet:

from evidently.test_suite import TestSuite
from evidently.tests import TestShareOfDriftedColumns

# Dataset-level drift test using PSI as the statistical test
data_drift_dataset_tests = TestSuite(tests=[
    TestShareOfDriftedColumns(stattest='psi'),
])

# ref_df: reference data as a pandas DataFrame (numerical features only)
# curr_df: current data as a pandas DataFrame (numerical features only)
data_drift_dataset_tests.run(reference_data=ref_df, current_data=curr_df)
data_drift_dataset_tests

The above code is based on Evidently documentation: https://github.com/evidentlyai/evidently/blob/main/examples/how_to_questions/how_to_specify_stattest_for_a_testsuite.ipynb

Error message: [screenshot not preserved]

The snippet above is run on a pandas DataFrame that contains only numerical columns (dtypes 'float64' and 'int64'). When I run the exact same code on only categorical columns (dtypes 'object' and 'category'), it works fine and a report is generated.

I checked whether the numerical data contains any weird values, and that doesn't seem to be the case. For example, to find records with non-numeric values, I ran: ref_df[~ref_df.applymap(np.isreal).all(1)]
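A dtype-based check is another way to confirm that every column is numeric. The following is a minimal sketch on a made-up DataFrame (the column names and values are hypothetical, not taken from the dataset in this issue). It uses only pandas and does not involve Evidently.

```python
import pandas as pd

# Hypothetical stand-in for ref_df; all values are made up for illustration.
ref_df = pd.DataFrame({
    "AGE": [32, 40, 51],
    "INCOME": [1200.5, 3400.0, 2100.25],
    "CITY": ["Oslo", "Rome", "Lima"],
})

# Columns whose dtype is not numeric (strings, categories, datetimes, ...).
non_numeric_cols = ref_df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)

# The index is a separate object from the columns, so inspect it as well.
print(ref_df.index.name, ref_df.index.dtype)
```

An empty list from the first check means every column has a numeric dtype; the second line is a reminder that the index is not covered by column-level checks.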

What am I missing? Any advice?

elenasamuylova commented 6 months ago

Hi @jeric250, could you try to run pd.to_numeric on your input columns?

jeric250 commented 6 months ago

Thanks @elenasamuylova for responding so quickly. I forgot to mention that I did try pd.to_numeric as well, something like:

ref_df = ref_df.apply(pd.to_numeric, errors='coerce')

However, the same error still occurred. There are also no null values in the dataset.

When I test on a single numerical column, I get the same error.

from evidently.metrics import ColumnDriftMetric, ColumnValuePlot
from evidently.report import Report

# Test on the AGE column, which holds people's ages (e.g. 32, 40)
data_drift_column_report = Report(metrics=[
    ColumnDriftMetric('AGE'),
    ColumnValuePlot('AGE'),
])

data_drift_column_report.run(reference_data=ref_df, current_data=curr_df)
data_drift_column_report

Error: UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U14'), dtype('float64')) -> None
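For context on reading this message: dtype('<U14') is NumPy's fixed-width Unicode string type (here, up to 14 characters), so the error says that a string value was multiplied by a float64 value. The sketch below triggers the same class of error with plain NumPy. The arrays are made up, and this is not a reproduction of what Evidently does internally.

```python
import numpy as np

labels = np.array(["AGE"])   # array of strings (Unicode dtype)
weights = np.array([0.5])    # array of float64 values

caught = None
try:
    # NumPy has no 'multiply' loop for (string, float64) inputs.
    np.multiply(labels, weights)
except TypeError as exc:     # UFuncTypeError is a subclass of TypeError
    caught = exc

print(type(caught).__name__)
print(caught)
```

So the traceback suggests that a string ended up in an arithmetic operation somewhere, even though the data columns themselves are numeric.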

I get the same error with DataDriftTable:

from evidently.metrics import DataDriftTable
from evidently.report import Report

data_drift_dataset_report = Report(metrics=[
    DataDriftTable(num_stattest='wasserstein', cat_stattest='psi'),
])

data_drift_dataset_report.run(reference_data=ref_df, current_data=curr_df)
data_drift_dataset_report

When I limit DataDriftTable to just categorical columns, it works fine with a report generated.

rezan21 commented 1 month ago

@jeric250

I found that this UFuncTypeError in Evidently is, oddly, related to the index of the DataFrames passed as reference_data or current_data. If a DataFrame has a named index, it causes the error: "UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U20'), dtype('float64')) -> None"

Solution: reset (drop) the index of the DataFrame before passing it in:

from evidently.metrics import ColumnDriftMetric
from evidently.report import Report

# Drop the named index; reset_index returns a new DataFrame with a default index
x = df.reset_index(drop=True)
# 'premium' is an arbitrary feature in my dataset
report = Report(metrics=[ColumnDriftMetric(column_name="premium")])
# Note: set reference_data and current_data to your own DataFrames
report.run(reference_data=x, current_data=x)
report

Hope this helps!
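To see what the index reset does on the pandas side, here is a minimal sketch on a made-up DataFrame. The column 'premium' is borrowed from the snippet above; the index name 'customer_id' and all values are hypothetical. It only shows that reset_index(drop=True) replaces a named index with an unnamed default one. It does not run Evidently, so it cannot confirm the fix by itself.

```python
import pandas as pd

# Hypothetical frame with a named, string-valued index.
df = pd.DataFrame(
    {"premium": [100.0, 250.5, 80.0]},
    index=pd.Index(["a1", "b2", "c3"], name="customer_id"),
)
print(df.index.name)

# drop=True discards the old index instead of moving it into a column.
cleaned = df.reset_index(drop=True)
print(cleaned.index.name)
print(list(cleaned.columns))
```

Omitting drop=True would instead keep the old index as a regular column named 'customer_id', which would add a non-numeric column to the data.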