deel-ai / puncc

👋 Puncc is a Python library for predictive uncertainty quantification using conformal prediction.
https://deel-ai.github.io/puncc/

Unexpected behaviour: SplitCP seems to ignore my pretrained model #36

Closed · lmossina closed 11 months ago

lmossina commented 11 months ago

I am training my predictor (LinearRegression()) with my own data, and then I want to create a conformal predictor with SplitCP that takes my pre-trained model and does conformalization.

Here is my problem:

from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

# puncc imports (needed for the snippets below)
from deel.puncc.api.prediction import BasePredictor
from deel.puncc.regression import SplitCP

X_fit, y_fit = make_regression(n_samples=200, n_features=1, noise=50, random_state=42, bias=200)
X_cal, y_cal = make_regression(n_samples=200, n_features=1, noise=50, random_state=42, bias=200)
X_test, y_test = make_regression(n_samples=100, n_features=1, noise=50, random_state=42, bias=200)

mod = LinearRegression()
mod.fit(X_fit, y_fit)
print(mod.coef_)

> [85.88287056]

So my mod has been trained correctly. Below I show two different cases:

Case 1

base = BasePredictor(mod, is_trained=True)
cp = SplitCP(base)
cp.fit(X_calib=X_cal, y_calib=y_cal)

Now, I expect SplitCP to have "learned" that my mod is ready for prediction (since I manually set is_trained=True). However, the values returned by cp.predict are unexpected, while cp.predictor.predict(...) behaves as expected and returns the predictions of mod.

Good:

internal_call_pred = cp.predictor.predict(X_test)
print(internal_call_pred[:10])
> [287.12357943 214.6184216  116.30331871 234.13103249 165.98971046
 262.76792038 167.34292778 253.7391835  259.67508504 293.32885547]

Bad:

preds, lo, hi = cp.predict(X_test, alpha=0.1)
print(preds[:10])
> [ nan  nan -inf  nan -inf  nan -inf  nan  nan  nan]

Remark: during a previous run, the code cell above returned small values around 0.0, so maybe it is returning something that was not initialized properly.

Case 2

On the other hand, this seems to work correctly:

base = BasePredictor(mod, is_trained=True)
cp = SplitCP(base, train=False)
cp.fit(X_calib=X_cal, y_calib=y_cal)
internal_call_pred = cp.predictor.predict(X_test)
print(internal_call_pred[:10])
> [287.12357943 214.6184216  116.30331871 234.13103249 165.98971046
 262.76792038 167.34292778 253.7391835  259.67508504 293.32885547]
preds, lo, hi = cp.predict(X_test, alpha=0.1)
print(preds[:10])
> [287.12357943 214.6184216  116.30331871 234.13103249 165.98971046
 262.76792038 167.34292778 253.7391835  259.67508504 293.32885547]

Problem:

  1. Either is_trained in BasePredictor or train in SplitCP is redundant, or the latter ignores the former.
  2. It is unexpected that cp.predict returns dummy values while a direct call to the underlying fitted sklearn model via cp.predictor.predict(...) still works as expected.
M-Mouhcine commented 11 months ago

Here is the point of having two arguments:

In principle, you can retrain a model that is already pretrained, which is consistent with the way you wrote the code in Case 1. However, no training dataset was supplied (only calibration data), and puncc failed to catch this inconsistency.
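Concretely, the two consistent ways to use these arguments would be (a minimal sketch reusing the variables above, and assuming fit accepts X_fit/y_fit keywords for the training data, mirroring the naming in your snippet):

# Intent A: retrain the pretrained model on new data, then calibrate
cp = SplitCP(base)  # train defaults to True, so training data is required
cp.fit(X_fit=X_fit, y_fit=y_fit, X_calib=X_cal, y_calib=y_cal)

# Intent B: keep the pretrained model as-is and only calibrate
cp = SplitCP(base, train=False)
cp.fit(X_calib=X_cal, y_calib=y_cal)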

I'll fix this bug by adding a consistency check between the two arguments in that case.
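A minimal sketch of what such a check could look like inside SplitCP.fit (hypothetical code, not the actual patch; the attribute and message wording are placeholders):

# Hypothetical consistency check, not the actual puncc patch:
# if fit() is asked to (re)train but receives no training data, fail
# loudly instead of silently producing uninitialized predictions.
if self.train and X_fit is None:
    raise RuntimeError(
        "train=True but no training data was provided. "
        "Pass X_fit/y_fit to (re)train the model, or set train=False "
        "to only calibrate a pretrained predictor."
    )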

M-Mouhcine commented 11 months ago

Could you please re-run your test using the most recent version (commit 4c7c568)?

M-Mouhcine commented 11 months ago

This bug is fixed. I'm closing the issue.