evidentlyai / evidently

Evidently is an open-source ML and LLM observability framework. Evaluate, test, and monitor any AI-powered system or data pipeline. From tabular data to Gen AI. 100+ metrics.
https://www.evidentlyai.com/evidently-oss
Apache License 2.0

Unable to understand Evidently report plot #741

Open ananda-duetto opened 11 months ago

ananda-duetto commented 11 months ago

Dear Evidently team, I am using Evidently to compare two time series arrays to check whether there is drift over time. I use the following code snippet:

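from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

data_drift_report = Report(metrics=[
    DataDriftPreset(),
])
# Pass the current and reference data for comparison
data_drift_report.run(current_data=df_july, reference_data=df_jan, column_mapping=None)
data_drift_report  # renders the report inline in a notebook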

Here df_july and df_jan are the current and reference dataframes, each with 7 columns (Sunday through Saturday) to compare. Each column is a time series. I get a nice report in which each pair of columns is compared and KS p-values are reported. On clicking each column comparison, I see the Data Drift and Data Distribution plots. In the Data Drift plot, there is the current data curve and, below it, a green band with a bold green line. What do the green band and the bold green line indicate? Or is there documentation for the function where I can look up the details of the output plots? Thank you. I am attaching a screenshot for reference:

Screenshot 2023-08-25 at 10 31 29 AM
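
For readers unfamiliar with the KS p-values mentioned above: the two-sample Kolmogorov-Smirnov statistic is the largest gap between the empirical CDFs of the reference and current samples. The following is a minimal standard-library sketch of that statistic only, not Evidently's own implementation; the function name `ks_statistic` is hypothetical. In practice, `scipy.stats.ks_2samp` returns both the statistic and the p-value.

```python
from bisect import bisect_right


def ks_statistic(reference, current):
    """Two-sample KS statistic: the maximum gap between the empirical CDFs.

    Illustrative helper (hypothetical name), not part of Evidently.
    """
    ref = sorted(reference)
    cur = sorted(current)
    if not ref or not cur:
        raise ValueError("both samples must be non-empty")
    d = 0.0
    # The empirical CDFs only change at observed values, so it is enough
    # to evaluate the gap at every point of the pooled sample.
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        d = max(d, abs(cdf_ref - cdf_cur))
    return d
```

A statistic near 0 means the two distributions look alike; a value near 1 means they barely overlap. Note that the test compares distributions of values and ignores the order of points in time.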

elenasamuylova commented 11 months ago

Hi @ananda-duetto,

It appears you are using an earlier Evidently version - could you upgrade to the latest one? It will include additional explanations on the legend.

Screenshot 2023-08-30 at 15 50 57

You can also check the description of the Data Drift Report in the docs https://docs.evidentlyai.com/presets/data-drift#4.-data-drift-by-feature

ananda-duetto commented 10 months ago

Hello @elenasamuylova,

Thanks very much for letting me know about the new version and for explaining the bold green line and the green band. My team and I would prefer to use the old version because it shows the plot of the time series itself (how it progresses). It seems that in the new version the actual curve is gone and only the mean is plotted. In our understanding, plotting just the mean of the reference data does not add information; rather, it takes away information about the nature or behavior of the curve.

ananda-duetto commented 10 months ago

I have a question on the old version plot. How can I get rid of the bold green line and green band in the old version? Is there a flag or parameter which I can use to do so? Thank you.

elenasamuylova commented 10 months ago

Hi @ananda-duetto,

1. Seeing the value plot with complete data

If you work with reasonably small datasets and want to keep all the raw data on the plot, you can also achieve this in the new version by passing the raw data parameter as an option. In this case, there will be no aggregation (it will look like the "old" plot - but the report will be large in size if you pass a large dataset). https://docs.evidentlyai.com/user-guide/customization/report-data-aggregation

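from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

report = Report(
    metrics=[
        DataDriftPreset(),
    ],
    # Disable aggregation so the plots show the raw data points
    options={"render": {"raw_data": True}},
)
report.run(reference_data=df_ref, current_data=df_cur)
report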

2. Getting rid of the green line and green band.

In all Evidently versions, the reference dataset on this plot inside the Data Drift report has been represented by the green line / green band. There is no way to get rid of it, unless you create an entirely custom metric with your own visualization. https://docs.evidentlyai.com/user-guide/customization/add-custom-metric-or-test

Could you share a bit more about what you are trying to achieve? Do you want to see only the current dataset distribution? Evidently has multiple other metrics that include distribution visualization (such as DataQualityPreset, ColumnDistributionMetric etc.) that you might find useful - that would show only one dataset if you prefer.

ananda-duetto commented 10 months ago

Hello @elenasamuylova,

Thanks very much for your prompt reply. Really appreciate it. Please see replies below:

  1. That is right, we are not working with a huge dataset and would prefer to see the curve rather than the overall mean or summary. Thank you for sharing the code snippet. If you scroll up to my first question in this conversation, you will notice that I had the same code snippet and it gave me the nice curve plot. I didn't use options={"render": {"raw_data": True}} because I wanted the line plot instead of the point plot or scatter plot. We liked the resulting plot very much but were curious about what the bold green line and green band meant, and you answered that clearly. Thanks.

  2. I see, that is good to know. So the bold green line and green band stay in the plot. Yes, you are right: in the test we are checking for drift between two curves (reference and current), and in the visualization we wish to see both curves. Instead, we see the current curve, the bold green line, and the green band. That is why I wanted to know what the green band and green line mean and whether we can get rid of them. You answered both questions. Thanks again.

  3. This leads to my last two questions in this discussion: Can we compare the drift of two time series datasets as is, while assigning weights to the points? For example, two time series plus a weight vector passed as a parameter. I checked the "Data drift parameters" section in the documentation but didn't find anything, so I wanted to check with you. Can you please let me know when possible? Thanks.

  4. Last question: In the plot above (in my first question), I do see the current curve being plotted (which is great). I also see a slight band along the current curve, something like an error band, confidence band, or variation band. It does not appear throughout the current plot, only in parts of it. Can you please let me know what this band means? Thanks.

Thanks again for your help in understanding the plots and parameters.

elenasamuylova commented 10 months ago

Hi @ananda-duetto,

Question 1 and 4. Explaining the plot.

Copying the initial plot to clarify:

On this plot, the data IS aggregated. It shows the mean value of the feature binned into 150 bins. The slightly visible "pink band" shows 1 standard deviation of the value inside a given bin.
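
The aggregation described above can be sketched roughly as follows. This is a standard-library illustration under stated assumptions, not Evidently's actual code: the helper name `aggregate_into_bins` is hypothetical, the bins are assumed to be contiguous chunks of the data in its original order, and the population standard deviation is used for the band.

```python
from statistics import fmean, pstdev


def aggregate_into_bins(values, n_bins=150):
    """Split values (kept in their original order) into at most n_bins
    contiguous chunks and return one (mean, std) pair per chunk.

    Hypothetical helper for illustration; binning by position and using
    the population standard deviation are assumptions.
    """
    n = len(values)
    if n == 0:
        return []
    # Never use more bins than data points, so every chunk is non-empty.
    n_bins = min(n_bins, n)
    aggregated = []
    for i in range(n_bins):
        start = i * n // n_bins
        end = (i + 1) * n // n_bins
        chunk = values[start:end]
        aggregated.append((fmean(chunk), pstdev(chunk)))
    return aggregated
```

Plotting the means as a line and mean ± std as a shaded area gives a curve with a band like the one in the screenshot. The band is only visible where values within a bin vary noticeably, which would explain why it shows up in parts of the plot only.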

Basically, the only difference between the visualization on the screenshot and the default visualization you get with the current Evidently version is that the plot now has a legend explaining it. The content of the visualization is the same, and it shows the mean value.

If you want to see the raw data, it can only be achieved through the "raw_data": True option. It will appear as a scatter plot.

(Some backstory: This "raw_data": True version used to be the default in earlier Evidently versions until we added the aggregated visuals. Basically, the screenshot you posted initially refers to the interim version where the default visualization was "new" and aggregated, but the legend was "old" and partially referred to the version that showed the scatter plot.)

Q3. Comparing two time series.

I am not sure I correctly understand the type of visualization you want to add: could you explain how you'd expect the "weight" parameter to work? Maybe you have an example of the plot?

Here is what we have that might be related:

  1. The following visualization in the DataQualityPreset. It is also available as ColumnSummaryMetric for individual columns: https://docs.evidentlyai.com/presets/data-quality#how-it-looks. It requires the datetime index, and shows the mean value of a numerical feature over time for reference and current.

    image

  2. The ColumnValuePlot metric (the default aggregated version):

    Screenshot 2023-09-15 at 18 29 16
  3. The ColumnValuePlot metric (with raw_data set as true):

    Screenshot 2023-09-15 at 19 28 10

In this case it is pretty hard to make anything of it due to the large number of data points.
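
As a rough illustration of the first option in the list above (the mean value of a numerical feature over time), here is a standard-library sketch that groups timestamped values by period and averages them. The helper name `mean_per_period` and the daily default are assumptions for illustration; this is not how Evidently computes the plot internally.

```python
from collections import defaultdict
from datetime import datetime
from statistics import fmean


def mean_per_period(timestamps, values, fmt="%Y-%m-%d"):
    """Group values by a formatted timestamp key and return {period: mean}.

    Hypothetical helper; the default format groups by calendar day.
    """
    buckets = defaultdict(list)
    for ts, value in zip(timestamps, values):
        buckets[ts.strftime(fmt)].append(value)
    # Sort by period key so the result reads in chronological order.
    return {period: fmean(vals) for period, vals in sorted(buckets.items())}
```

Running this once on the reference data and once on the current data gives two short series of means that can be drawn on the same axes, which is usually much easier to read than a raw scatter plot.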