deel-ai / puncc

👋 Puncc is a Python library for predictive uncertainty quantification using conformal prediction.
https://deel-ai.github.io/puncc/

Unexpected behaviour: SplitCP seems to ignore my pretrained model #36

Closed · lmossina closed 11 months ago

lmossina commented 11 months ago

I am training my predictor (LinearRegression()) with my own data, and then I want to create a conformal predictor with SplitCP that takes my pre-trained model and does conformalization.

Here is my problem:

from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

# puncc imports (needed for the snippets below)
from deel.puncc.api.prediction import BasePredictor
from deel.puncc.regression import SplitCP

X_fit, y_fit = make_regression(n_samples=200, n_features=1, noise=50, random_state=42, bias=200)
X_cal, y_cal = make_regression(n_samples=200, n_features=1, noise=50, random_state=42, bias=200)
X_test, y_test = make_regression(n_samples=100, n_features=1, noise=50, random_state=42, bias=200)

mod = LinearRegression()
mod.fit(X_fit, y_fit)
print(mod.coef_)

> [85.88287056]

So my mod has been trained correctly. Below I show two different cases:

Case 1

base = BasePredictor(mod, is_trained=True)
cp = SplitCP(base)
cp.fit(X_calib=X_cal, y_calib=y_cal)

Now, I expect SplitCP to have "learned" that my mod is ready for prediction (since I manually set is_trained=True). However, the values returned by cp.predict are unexpected, while cp.predictor.predict(...) behaves as expected and returns the predictions of mod.

Good:

internal_call_pred = cp.predictor.predict(X_test)
print(internal_call_pred[:10])
> [287.12357943 214.6184216  116.30331871 234.13103249 165.98971046
 262.76792038 167.34292778 253.7391835  259.67508504 293.32885547]

Bad:

preds, lo, hi = cp.predict(X_test, alpha=0.1)
print(preds[:10])
> [ nan  nan -inf  nan -inf  nan -inf  nan  nan  nan]

Remark: during a previous run, the code cell above returned small values around 0.0, so maybe it is returning something that was not initialized properly.

Case 2

On the other hand, this seems to work correctly:

base = BasePredictor(mod, is_trained=True)
cp = SplitCP(base, train=False)
cp.fit(X_calib=X_cal, y_calib=y_cal)
internal_call_pred = cp.predictor.predict(X_test)
print(internal_call_pred[:10])
> [287.12357943 214.6184216  116.30331871 234.13103249 165.98971046
 262.76792038 167.34292778 253.7391835  259.67508504 293.32885547]
preds, lo, hi = cp.predict(X_test, alpha=0.1)
print(preds[:10])
> [287.12357943 214.6184216  116.30331871 234.13103249 165.98971046
 262.76792038 167.34292778 253.7391835  259.67508504 293.32885547]

Problem:

  1. Either is_trained in BasePredictor or train in SplitCP is redundant, or the latter ignores the former.
  2. It is unexpected that cp.predict returns dummy values while a direct call to the underlying fitted sklearn model via cp.predictor.predict(...) still works as expected.
M-Mouhcine commented 11 months ago

Here is the point of having two arguments:

In principle, you can retrain a model that is already pretrained, which is consistent with the way you wrote the code in Case 1. However, no training dataset was supplied (only calibration data), and puncc failed to catch this inconsistency.
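Concretely, the two consistent ways to use these arguments would be (a minimal sketch reusing the variables above, and assuming fit accepts X_fit/y_fit keywords for the training data, mirroring the naming in your snippet):

# Intent A: retrain the pretrained model on new data, then calibrate
cp = SplitCP(base)  # train defaults to True, so training data is required
cp.fit(X_fit=X_fit, y_fit=y_fit, X_calib=X_cal, y_calib=y_cal)

# Intent B: keep the pretrained model as-is and only calibrate
cp = SplitCP(base, train=False)
cp.fit(X_calib=X_cal, y_calib=y_cal)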

I'll fix this bug by adding a consistency check between the two arguments in that case.
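A minimal sketch of what such a check could look like inside SplitCP.fit (hypothetical code, not the actual patch; the attribute and message wording are placeholders):

# Hypothetical consistency check, not the actual puncc patch:
# if fit() is asked to (re)train but receives no training data, fail
# loudly instead of silently producing uninitialized predictions.
if self.train and X_fit is None:
    raise RuntimeError(
        "train=True but no training data was provided. "
        "Pass X_fit/y_fit to (re)train the model, or set train=False "
        "to only calibrate a pretrained predictor."
    )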

M-Mouhcine commented 11 months ago

Could you please re-run your test using the most recent version (commit 4c7c568)?

M-Mouhcine commented 11 months ago

This bug is fixed. I'm closing the issue.