evidentlyai / evidently

Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.
https://www.evidentlyai.com/evidently-oss
Apache License 2.0
5.13k stars · 575 forks

KEY ERROR 1 while using classification preset #701

Closed · sameeryadav closed this issue 1 year ago

sameeryadav commented 1 year ago
(screenshot attached)
sameeryadav commented 1 year ago

It was working fine around one week ago:

(screenshot attached)
elenasamuylova commented 1 year ago

Hi @sameeryadav,

Could you share a bit more about the structure of your data: where are the target and prediction columns, how are they named, and what type are they?

You might need to pass the column_mapping object when you run the Report (line 3 on your screenshot). If you do not pass the column mapping, Evidently will try to parse the data automatically expecting a standard schema (e.g. target to be called "target").

Here are the details on column mapping: https://docs.evidentlyai.com/user-guide/input-data/column-mapping#prediction-column-s-in-classification

sameeryadav commented 1 year ago

@elenasamuylova, thank you for your response. Here are the details of the data and their types:

(screenshots attached)

I also dropped the extra columns from current_df before passing it to the classification preset. I also tried the older version 0.3.3, but the issue was not resolved.

elenasamuylova commented 1 year ago

Hi @sameeryadav,

Could you also check the following:

1. Evidently version

```python
import evidently
print(evidently.__version__)
```

2. Unique value counts in target and prediction

The error might happen if the unique values in target and prediction columns do not match.

```python
current_df.target.value_counts()
```

and

```python
current_ref.prediction.value_counts()
```

If this is not the source of the issue, please send the complete error trace (it appears that some part of it is missing from the screenshot). Are there any known changes to the dataset between last week and this week on your side?
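One way to run this check, assuming the default column names (the frame here is placeholder data standing in for the real current_df):

```python
import pandas as pd

# placeholder data standing in for the real current_df
current_df = pd.DataFrame({"target": [0, 1, 1], "prediction": [1, 1, 0]})

target_labels = set(current_df["target"].unique())
prediction_labels = set(current_df["prediction"].unique())

# values present in one column but missing from the other are a
# common cause of a KeyError in the classification metrics
print("only in target:", target_labels - prediction_labels)
print("only in prediction:", prediction_labels - target_labels)
```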

Everything appears to work correctly on our simple test datasets, so we'd need some more information to know how to reproduce it.

sameeryadav commented 1 year ago

Hi, @elenasamuylova

image

The full error information:

```
KeyError                                  Traceback (most recent call last)
File :1
----> 1 performance_report.show(mode='inline')

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3f46fea-f94c-4027-85d3-8c78354699ba/lib/python3.9/site-packages/evidently/suite/base_suite.py:169, in Display.show(self, mode)
    168 def show(self, mode="auto"):
--> 169     dashboard_id, dashboard_info, graphs = self._build_dashboard_info()
    170     template_params = TemplateParams(
    171         dashboard_id=dashboard_id,
    172         dashboard_info=dashboard_info,
    173         additional_graphs=graphs,
    174     )
    175     # pylint: disable=import-outside-toplevel

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3f46fea-f94c-4027-85d3-8c78354699ba/lib/python3.9/site-packages/evidently/report/report.py:171, in Report._build_dashboard_info(self)
    169 # set the color scheme from the report for each render
    170 renderer.color_options = color_options
--> 171 html_info = renderer.render_html(test)
    173 for info_item in html_info:
    174     for additional_graph in info_item.get_additional_graphs():

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3f46fea-f94c-4027-85d3-8c78354699ba/lib/python3.9/site-packages/evidently/metrics/classification_performance/classification_quality_metric.py:74, in ClassificationQualityMetricRenderer.render_html(self, obj)
     73 def render_html(self, obj: ClassificationQualityMetric) -> List[BaseWidgetInfo]:
---> 74     metric_result = obj.get_result()
     75     target_name = metric_result.target_name
     76     result = []

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3f46fea-f94c-4027-85d3-8c78354699ba/lib/python3.9/site-packages/evidently/base_metric.py:184, in Metric.get_result(self)
    182 result = self._context.metric_results.get(self, None)
    183 if isinstance(result, ErrorResult):
--> 184     raise result.exception
    185 if result is None:
    186     raise ValueError(f"No result found for metric {self} of type {type(self).__name__}")

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3f46fea-f94c-4027-85d3-8c78354699ba/lib/python3.9/site-packages/evidently/suite/base_suite.py:393, in Suite.run_calculate(self, data)
    391 logging.debug(f"Executing {type(calculation)}...")
    392 try:
--> 393     calculations[calculation] = calculation.calculate(data)
    394 except BaseException as ex:
    395     calculations[calculation] = ErrorResult(ex)

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3f46fea-f94c-4027-85d3-8c78354699ba/lib/python3.9/site-packages/evidently/metrics/classification_performance/classification_quality_metric.py:46, in ClassificationQualityMetric.calculate(self, data)
     44     raise ValueError("The columns 'target' and 'prediction' columns should be present")
     45 target, prediction = self.get_target_prediction_data(data.current_data, data.column_mapping)
---> 46 current = calculate_metrics(
     47     data.column_mapping,
     48     self._confusion_matrix_metric.get_result().current_matrix,
     49     target,
     50     prediction,
     51 )
     53 reference = None
     54 if data.reference_data is not None:

File /local_disk0/.ephemeral_nfs/envs/pythonEnv-e3f46fea-f94c-4027-85d3-8c78354699ba/lib/python3.9/site-packages/evidently/calculations/classification_performance.py:316, in calculate_metrics(column_mapping, confusion_matrix, target, prediction)
    311 if len(prediction.labels) == 2:
    312     confusion_by_classes = calculate_confusion_by_classes(
    313         np.array(confusion_matrix.values),
    314         confusion_matrix.labels,
    315     )
--> 316     conf_by_pos_label = confusion_by_classes[pos_label]
    317     precision = metrics.precision_score(target, prediction.predictions, pos_label=pos_label)
    318     recall = metrics.recall_score(target, prediction.predictions, pos_label=pos_label)

KeyError: 1
```

`performance_report.as_dict()` also fails with `KeyError: 1`.

Here the value_counts differ between reference_df and current_df, but that should not be causing the error.

elenasamuylova commented 1 year ago

Thanks @sameeryadav, could you share the prediction value counts (not only target) to double-check?

cc @mike0sv to help figure out what might be wrong here.

sameeryadav commented 1 year ago

Sure @elenasamuylova @mike0sv

(screenshot attached)
sameeryadav commented 1 year ago

Hi @elenasamuylova, I have been stuck on this error for several days and I am using this in a live project. Could you please help me resolve it?

mike0sv commented 1 year ago

Hi @sameeryadav! The error itself is probably caused by the target value being the wrong type (str instead of int, or vice versa). It is probably our internal bug, but you can try casting the target column explicitly to one of those types (try both). I will investigate further in the meantime. Also, what version of evidently are you using?
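A minimal casting sketch (the frame and column names are placeholders; try int first, then str):

```python
import pandas as pd

# placeholder frame where the target and prediction dtypes disagree
current_df = pd.DataFrame({"target": ["0", "1", "1"], "prediction": [0, 1, 1]})

# cast both columns to one common type before building the report
current_df["target"] = current_df["target"].astype(int)
current_df["prediction"] = current_df["prediction"].astype(int)

print(current_df.dtypes)
```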

sameeryadav commented 1 year ago

Hi, @mike0sv

I tried both versions, 0.4.0 and 0.3.3.

I also changed the dtype to str, but the issue is still there.

(screenshot attached)

And I also found this while trying to solve the error; maybe it can help you: the KeyError you are encountering is likely related to the pos_label variable when calculating conf_by_pos_label. To fix this issue, you should ensure that pos_label is correctly set when calculating the classification metrics.

In the ClassificationQualityMetric class, where you calculate the current metrics, you should provide a value for the pos_label parameter when calling the calculate_metrics function. The pos_label parameter is the label of the positive class in your classification problem. It is used in metrics like precision and recall.

To do this, you can modify the calculate method in the ClassificationQualityMetric class.

mike0sv commented 1 year ago

I could not reproduce your issue in my environment.

```python
from evidently.report import Report
from evidently import ColumnMapping
from evidently.metrics.classification_performance.classification_quality_metric import ClassificationQualityMetric
import pandas as pd

ref = cur = pd.DataFrame([
    {"a": 1, "b": 1},
    {"a": 1, "b": 1},
])

report = Report([
    ClassificationQualityMetric()
])
report.run(
    current_data=cur,
    reference_data=ref,
    column_mapping=ColumnMapping(target="a", prediction="b", target_names={0: "aa", 1: "bb"}, pos_label=1)
)

report.show()
```

Can you run this and confirm that it works? If it does, can you modify it a bit with your data so that it fails again?

sameeryadav commented 1 year ago

Hey @mike0sv, I tried your code above in my env and it runs fine, but when I tried it with a sample of my data I got the same error again. One more thing I found: when I give pos_label=1 I get KeyError: 1, and when I give pos_label=0 I get KeyError: 0.

Due to some client restrictions, I cannot expose the original data to recreate the error in your example code. I can try to explain how I prepared my dataset:

1. Joins
2. Filling null values with 0
3. Converting PySpark dataframes into pandas
4. Renaming columns
5. Mapping true: 1, false: 0 in our reference df
6. After that, my reference_df and current_df have only three columns: 'cust_id', 'prediction', 'target' (attaching snippets of code). (screenshots attached)

Note: since your code snippet ran successfully in my environment, does it mean there is an issue with my data? (I made sure it is in the format required by Evidently.) That is why I am not able to find the reason behind the error.

I also checked the same thing with a sample of my current_df and got the same error.

mike0sv commented 1 year ago

Can you put your data in my example and my data into yours? I mean, instead of

```python
ref = cur = pd.DataFrame([
    {"a": 1, "b": 1},
    {"a": 1, "b": 1},
])
```

put something like `ref = cur = ref_df[["target", "prediction"]][:2]`? If this fails, it means something is off with your data (probably types, as I said before). Also, try running your example on my example data above; if that fails, it means something is off with the report configuration.
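Concretely, the substitution suggested above might look like this (ref_df here is placeholder data standing in for the real reference frame):

```python
import pandas as pd

# placeholder for the user's real reference frame
ref_df = pd.DataFrame({
    "cust_id": [101, 102, 103],
    "target": [0, 1, 1],
    "prediction": [0, 1, 0],
})

# keep only the two relevant columns and a couple of rows,
# then drop this into the known-good report configuration
ref = cur = ref_df[["target", "prediction"]][:2]
print(ref.dtypes)
```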

master-pro commented 1 year ago

This bug is about existing zeros in the target or prediction columns.

If we try this dataframe, everything works fine:

```python
ref = cur = pd.DataFrame([
    {"a": 1, "b": 1},
    {"a": 1, "b": 1},
])
```

but if we use this dataframe:

```python
ref = cur = pd.DataFrame([
    {"a": 0, "b": 0},
    {"a": 1, "b": 1},
])
```

the confusion_matrix labels will be strings ['1', '0'] while pos_label is the int 1, so in calculate_metrics the line conf_by_pos_label = confusion_by_classes[pos_label] (line 319 in evidently/calculations/classification_performance.py) raises KeyError: 1.

But when all values are 1, this function is not called.
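The type mismatch described above can be reproduced in isolation with plain Python (the dict values here are hypothetical stand-ins for the library's internal confusion stats):

```python
# stand-in for confusion_by_classes: the keys became strings ("0", "1")
confusion_by_classes = {"0": {"tp": 1}, "1": {"tp": 1}}

# ... while pos_label stayed an int, so the dict lookup fails
pos_label = 1
try:
    confusion_by_classes[pos_label]
except KeyError as err:
    print("KeyError:", err)  # prints: KeyError: 1
```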

mike0sv commented 1 year ago

I tried with this data and it works for me :/ @master-pro can you share full code and what version are you on?

master-pro commented 1 year ago

> I tried with this data and it works for me :/ @master-pro can you share full code and what version are you on?

@mike0sv that's weird. The problem is, it happens when importing the mlflow library before evidently:

```python
import mlflow

from evidently.report import Report
from evidently import ColumnMapping
from evidently.metrics.classification_performance.classification_quality_metric import ClassificationQualityMetric
import pandas as pd

ref = cur = pd.DataFrame([
    {"a": 0, "b": 0},
    {"a": 1, "b": 1},
])

report = Report([
    ClassificationQualityMetric()
])
report.run(
    current_data=cur,
    reference_data=ref,
    column_mapping=ColumnMapping(target="a", prediction="b", pos_label=1)
)

print(report.as_dict())
```

evidently==0.4.0 mlflow==2.5.0

If you move `import mlflow` to the end of the import section, the problem is solved. I have no time to investigate why mlflow causes this error.

mike0sv commented 1 year ago

OK, I successfully reproduced this and will investigate.

sameeryadav commented 1 year ago

Hi, @mike0sv

Please let me know the solution to this issue.

elenasamuylova commented 1 year ago

Hi @sameeryadav, could you confirm if you also use MLflow or import any other additional libraries before running Evidently? What is the Jupyter environment you run it in (e.g. Jupyter notebook, Databricks notebook, AWS Sagemaker notebook)?

sameeryadav commented 1 year ago

Hi @elenasamuylova, I am using Azure Databricks (DBR 12.2 LTS ML, Spark 3.3.2, Scala 2.12).

I am not using mlflow in my notebook, but some additional libraries (the json, pyspark functions, and datetime modules) are imported in my notebook.

mike0sv commented 1 year ago

I think I solved the mystery: it seems that mlflow uses typing annotations, and a wrong annotation was cached by the @lru_cache on typing.List, which in turn broke our code. Details are here: https://github.com/pydantic/pydantic/issues/7022
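For context, the caching behaviour in question can be seen with plain `typing` (a sketch of the mechanism only; "SomeName" is a made-up forward reference, not the actual mlflow annotation):

```python
import typing

# parameterizing typing.List with equal forward references hits an
# internal cache, so unrelated call sites share one alias object;
# if the first caller's ForwardRef is evaluated in its own namespace,
# later callers can inherit that resolution
a = typing.List["SomeName"]
b = typing.List["SomeName"]
print(a is b)
```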

@sameeryadav @master-pro can you install from this PR and confirm that the problem is solved? https://github.com/evidentlyai/evidently/pull/712

elenasamuylova commented 1 year ago

Hi @sameeryadav, @master-pro, the fix is now in the new Evidently version (0.4.1). Could you check if this solves the issue for you?

anh-le-profinit commented 1 year ago

Hi, I've encountered the same issue, and for me upgrading to 0.4.1 worked.

elenasamuylova commented 1 year ago

Thanks for sharing @anh-le-profinit!

sameeryadav commented 1 year ago

> Hi @sameeryadav, could you confirm if you also use MLflow or import any other additional libraries before running Evidently? What is the Jupyter environment you run it in (e.g. Jupyter notebook, Databricks notebook, AWS Sagemaker notebook)?

sameeryadav commented 1 year ago

It worked for me, thank you so much guys @mike0sv & @elenasamuylova! We can close this issue now.