deel-ai / puncc

👋 Puncc is a python library for predictive uncertainty quantification using conformal prediction.
https://deel-ai.github.io/puncc/

[Bug]: Instance variable is_trained is used incorrectly #49

Open fink-stanislav opened 7 months ago

fink-stanislav commented 7 months ago

Module

Prediction (API)

Contact Details

stanislau.fink@gmail.com

Current Behavior

DualPredictor is supposed to be initialized with an is_trained parameter, which should be a list of booleans. There are a few issues related to this variable:

  1. Using a list as a default value is not recommended because it can lead to unwanted side effects: the default object is created only once and is then shared across all calls. The side effect may never occur here as long as the parameter is not mutated inside the constructor, but a tuple would be the safer choice.
  2. The instance variable is_trained is not updated after the fit method completes:
for count, model in enumerate(self.models):
    if not self.is_trained[count]:
        model.fit(X, y, **dictargs[count])
#end of method

Instead, you may want to rewrite it like this:

for count, model in enumerate(self.models):
    if not self.is_trained[count]:
        model.fit(X, y, **dictargs[count])
        self.is_trained[count] = True
#end of method
  3. An exception that should be raised may not be, in some cases where is_trained is part of the condition of an if statement:
...
if self.train:
    if self.splitter is None:
        raise RuntimeError(
                "The splitter argument is None but train is set to "
                + "True. Please provide a correct splitter to train "
                + "the underlying model."
        )
    logger.info(f"Fitting model on fold {i+cached_len}")
    predictor.fit(X_fit, y_fit, **kwargs)  # Fit K-fold predictor

    # Make sure that predictor is already trained if train arg is False
elif self.train is False and predictor.is_trained is False:
    raise RuntimeError(
        "'train' argument is set to 'False' but model is not pre-trained"
    )

else:  # Skipping training
    logger.info("Skipping training.")
...

In the elif statement, predictor.is_trained is used as a boolean, but it can be a list when predictor is an instance of DualPredictor. A non-empty list is always truthy, and the identity check predictor.is_trained is False never matches a list, so the RuntimeError is not raised even though a model may still be untrained.
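
To illustrate, a minimal standalone snippet (not puncc code; the flag values are hypothetical):

is_trained = [False, False]   # e.g. per-model flags held by a DualPredictor

print(is_trained is False)    # False -> the elif guard above is silently skipped
print(bool(is_trained))       # True  -> a non-empty list is truthy regardless of its contents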

The fix suggested in point 2 is only partial. I think the best approach would be to keep the is_trained variable private (rename it to _is_trained) and introduce an is_trained property, which behaves like an instance attribute but is implemented as a method under the hood. For DualPredictor it could look like this:

@property
def is_trained(self) -> bool:
    return self._is_trained[0] and self._is_trained[1]
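
To make the idea concrete, here is a minimal self-contained sketch; ToyDualPredictor and its fit method are illustrative stand-ins, not puncc's actual API:

class ToyDualPredictor:
    def __init__(self, is_trained=(False, False)):
        # keep the per-model flags private; copy them into a list so fit can update them
        self._is_trained = list(is_trained)

    def fit(self):
        # stand-in for fitting both underlying models
        self._is_trained = [True, True]

    @property
    def is_trained(self) -> bool:
        # expose a single boolean, so a check like predictor.is_trained is False behaves as intended
        return self._is_trained[0] and self._is_trained[1]

predictor = ToyDualPredictor()
assert predictor.is_trained is False  # untrained: the guard in the conformalization code would now fire
predictor.fit()
assert predictor.is_trained is True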

Expected Behavior

is_trained is expected to be a boolean.
is_trained is expected to change after the models are trained.

Version

v0.9

Environment

No response

Relevant log output

No response

To Reproduce

I do not have an example of code that actually fails because of this, but the test named test_locally_adaptive_cp does exercise the part of the code containing the aforementioned elif statement.

M-Mouhcine commented 7 months ago

Thanks @fink-stanislav for posting this issue and suggesting fixes.

  1. I agree that switching to tuples would be a better option (a short sketch of the pitfall follows this list).
  2. We already considered updating the is_trained status, but we must make sure it has no side effects elsewhere. Basically, this status serves as a double check during the conformalization process: the train attribute of a ConformalPredictor instance (defined in the module deel.puncc.api.conformalization) needs to be consistent with the training status of the underlying predictor. I will look further into that.
  3. When predictor.is_trained is an iterable, the test predictor.is_trained is False is indeed inconsistent. Using an attribute getter, as you suggest, looks like a great solution 👍 !
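
To spell out the pitfall behind point 1, a generic illustration (the Wrapper class below is hypothetical and unrelated to puncc's API):

class Wrapper:
    def __init__(self, is_trained=[False, False]):
        # the default list is created once, at function definition time,
        # and shared by every instance that relies on it
        self.is_trained = is_trained

    def fit(self):
        self.is_trained[0] = True  # mutates the shared default list

a = Wrapper()
a.fit()
b = Wrapper()
print(b.is_trained)  # [True, False] -> b silently inherits a's training status

A tuple default avoids this because it cannot be mutated in place; each instance can copy it into its own list if the flags need to be updated.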

We will keep you updated on the progress regarding these issues. Thanks again for your findings.

M-Mouhcine commented 6 months ago

Hi @fink-stanislav

We have fixed the issue that you reported. However, regarding your first point, switching to tuples would be a breaking change to the current API. Therefore, I will reopen this issue and consider it an enhancement for the future version 0.8.x.

Thank you for your valuable contribution, and please let us know if you have any further concerns or suggestions.