evidentlyai / evidently

Evaluate and monitor ML models from validation to production. Join our Discord: https://discord.com/invite/xZjKRaNp8b
Apache License 2.0

Error using TestSuites for numerical data #1122

Open jeric250 opened 1 month ago

jeric250 commented 1 month ago

Hi there, first time opening an issue so bear with me (and let me know if more info is needed).

Basic information:
Package version used: 0.4.20
Operating system and version: macOS, VSCode
Programming language and version used: Python 3.12.2

Code snippet:

from evidently.calculations.stattests import StatTest
from evidently.test_suite import TestSuite
from evidently.tests import *

data_drift_dataset_tests = TestSuite(tests=[
    TestShareOfDriftedColumns(stattest='psi'),
])

# ref_df: represents reference pandas DataFrame data (only numerical features)
# curr_df: represents current pandas DataFrame data (only numerical features)
data_drift_dataset_tests.run(reference_data=ref_df, current_data=curr_df)
data_drift_dataset_tests

The above code is based on Evidently documentation: https://github.com/evidentlyai/evidently/blob/main/examples/how_to_questions/how_to_specify_stattest_for_a_testsuite.ipynb

Error message: (posted as a screenshot; image not available)

The code snippet above is run on a pandas DataFrame that contains only numerical data (dtypes 'float64' and 'int64'). When I run the exact same code on only categorical data (dtypes 'object' and 'category'), it works fine and a report is generated.
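For anyone trying to reproduce the numerical/categorical split described above, here is a minimal sketch using pandas `select_dtypes`. Only the AGE column comes from this thread; the other column names and all values are made up for illustration, and `split_columns_by_dtype` is a hypothetical helper, not part of Evidently.

```python
import pandas as pd


def split_columns_by_dtype(df):
    """Return (numeric column names, categorical column names) based on dtype."""
    numeric = df.select_dtypes(include="number").columns.tolist()
    categorical = df.select_dtypes(include=["object", "category"]).columns.tolist()
    return numeric, categorical


# Hypothetical example data; only AGE is mentioned in the thread.
example_df = pd.DataFrame({
    "AGE": [32, 40],
    "INCOME": [51000.0, 64000.5],
    "CITY": pd.Series(["A", "B"], dtype=object),
    "SEGMENT": pd.Series(["x", "y"], dtype="category"),
})

numeric_cols, categorical_cols = split_columns_by_dtype(example_df)
print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

Running the drift checks on `df[numeric_cols]` and `df[categorical_cols]` separately is one way to confirm which group triggers the error.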

I checked whether the numerical data contains any unexpected values, and that doesn't seem to be the case. For example, to find records with non-numeric values, I ran: ref_df[~ref_df.applymap(np.isreal).all(1)]
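A complementary check, sketched here under the assumption that the input is a plain pandas DataFrame with unique column names: list the dtype of each column, the Python types of the values it actually holds, and the type of the column name itself. `summarize_column_types` and the MIXED column are hypothetical names used only for this illustration.

```python
import pandas as pd


def summarize_column_types(df):
    """Summarize, per column, its dtype and the Python types of its values."""
    rows = []
    for name in df.columns:
        # Collect the distinct Python type names found in this column.
        value_types = sorted({type(v).__name__ for v in df[name]})
        rows.append({
            "column": name,
            "column_name_type": type(name).__name__,
            "dtype": str(df[name].dtype),
            "value_types": value_types,
        })
    return pd.DataFrame(rows)


# Hypothetical example data: MIXED holds an int and a string in an object column.
example_df = pd.DataFrame({
    "AGE": [32, 40],
    "MIXED": pd.Series([1, "2"], dtype=object),
})

summary = summarize_column_types(example_df)
print(summary)
```

A column whose `value_types` lists more than one type, or lists `str` where numbers are expected, is worth a closer look.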

What am I missing? Any advice?

elenasamuylova commented 1 month ago

Hi @jeric250, could you try to run pd.to_numeric on your input columns?

jeric250 commented 1 month ago

Thanks @elenasamuylova for responding so quickly. I forgot to mention that I did try pd.to_numeric as well, something like: ref_df = ref_df.apply(pd.to_numeric, errors='coerce') However, the same error still occurred. There are also no null values in the dataset.

When I tested on a single numerical column, I got the same error.
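One thing to keep in mind with errors='coerce': values that cannot be parsed are silently replaced with NaN, so it helps to count how many NaNs the conversion introduced. The sketch below does that; `coerce_and_report` is a hypothetical helper, and the INCOME column and all values are invented for the example.

```python
import pandas as pd


def coerce_and_report(df):
    """Convert every column with pd.to_numeric and count values that became NaN."""
    coerced = df.apply(pd.to_numeric, errors="coerce")
    # A value was "lost" if it is NaN after coercion but was present before.
    introduced_nan = (coerced.isna() & df.notna()).sum()
    return coerced, introduced_nan


# Hypothetical example data: "forty" cannot be parsed as a number.
example_df = pd.DataFrame({
    "AGE": ["32", "forty", "40"],
    "INCOME": [1.0, 2.5, 3.0],
})

coerced, introduced = coerce_and_report(example_df)
print(introduced)
```

If every count is zero, as reported in this thread, the conversion did not drop any values and the cause is likely elsewhere.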

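from evidently.report import Report
from evidently.metrics import ColumnDriftMetric, ColumnValuePlot

# Test on the AGE column, which holds people's ages (e.g. 32, 40)
data_drift_column_report = Report(metrics=[
    ColumnDriftMetric('AGE'),
    ColumnValuePlot('AGE'),
])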

data_drift_column_report.run(reference_data=ref_df, current_data=curr_df)
data_drift_column_report

Error: UFuncTypeError: ufunc 'multiply' did not contain a loop with signature matching types (dtype('<U14'), dtype('float64')) -> None
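For context on reading this message: dtype('<U14') is NumPy's type for an array of Unicode strings of up to 14 characters, so the error says that something tried to multiply a string array by a float64 value. The snippet below only reproduces that kind of error in plain NumPy; it makes no claim about where in Evidently the multiplication happens, and the variable names and values are invented.

```python
import numpy as np

# A string array whose longest element has 14 characters gets dtype '<U14'.
labels = np.array(["fourteen_chars"])
weights = np.array([0.5])  # float64

try:
    labels * weights  # NumPy has no 'multiply' loop for (str, float64)
except TypeError as err:  # UFuncTypeError is a subclass of TypeError
    message = str(err)
else:
    message = None

print(labels.dtype)
print(message)
```

So a useful next step is to look for string data reaching the numerical code path, for example string values in an object column or anything else that ends up as a NumPy string array.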

Same error when I tried DataDriftTable:

from evidently.report import Report
from evidently.metrics import DataDriftTable

data_drift_dataset_report = Report(metrics=[
    DataDriftTable(num_stattest='wasserstein', cat_stattest='psi'),
])

data_drift_dataset_report.run(reference_data=ref_df, current_data=curr_df)
data_drift_dataset_report

When I limit DataDriftTable to just categorical columns, it works fine with a report generated.