NannyML / nannyml

nannyml: post-deployment data science in python
https://www.nannyml.com/
Apache License 2.0

Incorrect Thresholds and Confidence Bands for Regression Metrics #127

Closed nikml closed 1 year ago

nikml commented 2 years ago

Describe the bug The thresholds and confidence bands for some regression metrics can

[image: newplot]

What is wrong with the above plot?

To Reproduce Steps to reproduce the behavior:

  1. Download the UCI Superconductivity data available here
  2. Run the following code:
import pandas as pd
import nannyml as nml
from sklearn.ensemble import GradientBoostingRegressor

# change location below as appropriate for your machine
data = pd.read_csv("/var/home/nannyml/Downloads/superconduct/train.csv", header=0)

features = list(data.columns)[:-1]
data = data.assign(partition = 'train')
data.loc[data.shape[0]//3:, 'partition'] = 'reference'
data.loc[data.shape[0]//3+1:(data.shape[0] - data.shape[0]//3), 'partition'] = 'analysis'

gbm = GradientBoostingRegressor(random_state=14)
gbm.fit(
    X=data.loc[data.partition == 'train', features],
    y=data.loc[data.partition == 'train', 'critical_temp']
)
data = data.assign(y_pred = gbm.predict(X=data[features]))

reference = data.loc[data.partition == 'reference', :].reset_index(drop=True)
analysis = data.loc[data.partition == 'analysis', :].reset_index(drop=True)

estimator = nml.DLE(
    feature_column_names=features,
    y_pred='y_pred',
    y_true='critical_temp',
    # timestamp_column_name='timestamp',
    metrics=['mae', 'mse'],
    chunk_size=data.shape[0]//30,
    tune_hyperparameters=False
)

estimator.fit(reference)
results = estimator.estimate(analysis)

metric_fig = results.plot(kind='performance', metric='mse', plot_reference=False)
metric_fig.show()
  3. Inspect the resulting plot

Expected behavior

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

jsandroos commented 1 year ago

This is one of those issues where there is a simple fix and several more involved fixes with increasing degrees of 'correctness'.

The simple solution is to truncate the limits on the plot at zero and be done with it.

However, this neglects the profile of the underlying distribution. A more complete approach is to apply the truncation to the underlying distribution and then calculate the bands from percentiles of that distribution. More correct still would be to recalculate the distribution while accounting for the output domain being limited to >= 0.
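To make the first two options concrete, here is a minimal sketch in plain NumPy. It is not NannyML's implementation: the function names are hypothetical, the per-chunk MSE values are made up for illustration, and it assumes the bands are computed as mean ± 3 standard deviations of a non-negative metric. The third option (refitting the distribution on a restricted domain) is not covered.

```python
import numpy as np


def naive_band(values, n_std=3.0):
    """Symmetric band: mean +/- n_std * std. Can dip below zero."""
    values = np.asarray(values, dtype=float)
    center = values.mean()
    spread = n_std * values.std()
    return float(center - spread), float(center + spread)


def clipped_band(values, n_std=3.0, lower_limit=0.0):
    """Simple fix: same band, with the lower limit truncated at zero."""
    lower, upper = naive_band(values, n_std)
    return max(lower, lower_limit), upper


def percentile_band(values, lower_pct=0.135, upper_pct=99.865):
    """Percentile-based band taken from the observed values themselves.

    The default percentiles match the coverage of +/- 3 std under a
    normal distribution, but the result always stays inside the
    observed range, so it cannot go negative for a non-negative metric.
    """
    values = np.asarray(values, dtype=float)
    lower, upper = np.percentile(values, [lower_pct, upper_pct])
    return float(lower), float(upper)


# Made-up, right-skewed per-chunk MSE values (illustrative only).
chunk_mse = [0.1, 0.2, 0.1, 5.0, 0.3, 0.2]

naive_lower, naive_upper = naive_band(chunk_mse)
clipped_lower, clipped_upper = clipped_band(chunk_mse)
pct_lower, pct_upper = percentile_band(chunk_mse)

print("naive:     ", naive_lower, naive_upper)
print("clipped:   ", clipped_lower, clipped_upper)
print("percentile:", pct_lower, pct_upper)
```

With skewed data like this, the naive lower limit falls below zero, clipping only hides that on the plot, and the percentile band follows the asymmetry of the data.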

kshitiz305 commented 1 year ago

Hi @jsandroos @nikml @nnansters @baskervilski @rfrenoy, I have been a Python developer for the last four years and am looking forward to contributing to open source. I have experience building data-oriented products with Python as the main programming language, as well as hands-on experience with FastAPI. I am a quick learner and have some bandwidth to contribute to the project.

Could you tell me where to set up the code and how to start contributing?
