SelfExplainML / PiML-Toolbox

PiML (Python Interpretable Machine Learning) toolbox for model development & diagnostics
https://selfexplainml.github.io/PiML-Toolbox
Apache License 2.0

predict after get_model() #52

Closed srbPhy closed 2 months ago

srbPhy commented 2 months ago

Hi,

It seems model.predict() gives different results than using exp.get_model().predict(). I tried multiple things, but am unable to figure out what's causing these differences. Could you please help me with this?

Please see the following example:

from piml import Experiment
from piml.models import XGB2Regressor

exp = Experiment(highcode_only=True)
exp.data_loader(data='BikeSharing', silent=True)
exp.data_prepare(target='cnt', task_type='regression', test_ratio=0.2, random_state=0, silent=True)

model = XGB2Regressor()
exp.model_train(model=model, name='XGB2')

print(model.predict(exp.get_data(test=True)[0]))
print(exp.get_model("XGB2").predict(exp.get_data(test=True)[0]))
[-0.04393188  0.03837352  0.4268577  ...  0.02106261 -0.00260242
  0.34881094]
[0.03791483 0.00768625 0.05819578 ... 0.01468494 0.06016384 0.03554394]
ZebinYang commented 2 months ago

Hi @srbPhy

The object returned by exp.get_model("XGB2") is a pipeline, i.e., data preprocessing + model, while exp.get_data returns data that has already been preprocessed.

The following example produces consistent prediction results:

from piml import Experiment
from piml.models import XGB2Regressor

exp = Experiment(highcode_only=True)
exp.data_loader(data='BikeSharing', silent=True)
exp.data_prepare(target='cnt', task_type='regression', test_ratio=0.2, random_state=0, silent=True)

model = XGB2Regressor()
exp.model_train(model=model, name='XGB2')

print(model.predict(exp.get_data(test=True)[0]))
print(exp.get_model("XGB2").estimator.predict(exp.get_data(test=True)[0]))
print(exp.get_model("XGB2").predict(exp.get_raw_data().test_x))
srbPhy commented 2 months ago

Thanks very much for your response! I was wondering how to use data other than the train/test sets for predictions, and your third print answers that as well. If possible, it would be very helpful to include some examples with predict in the user guide.

srbPhy commented 2 months ago

I have another quick question. My understanding is that PiML normalizes the data using a min-max scaler at the data preparation stage. Is there a quick way to inverse-scale the output of predict? Currently, I am pulling min/max values from the numerical data summary for the inversion, but I was wondering if there is a smarter way to do this.

ZebinYang commented 2 months ago

@srbPhy

We don't have an API for doing so. But instead of pulling the min/max values from the summary table, you can extract them directly with the following code:

ymin = exp._Experiment__data_api.dataset.yt.ntransformer.named_steps.scaling.data_min_
ymax = exp._Experiment__data_api.dataset.yt.ntransformer.named_steps.scaling.data_max_
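To illustrate the inversion itself, here is a minimal, self-contained sketch using scikit-learn's MinMaxScaler as a stand-in for the fitted target scaler inside PiML's pipeline (assuming the default feature_range of (0, 1), in which case the inverse is y_scaled * (ymax - ymin) + ymin):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Stand-in for the fitted target scaler; in the experiment above,
# ymin/ymax come from the `scaling` step shown in the snippet above.
scaler = MinMaxScaler()
y_train = np.array([[100.0], [250.0], [400.0]])  # hypothetical raw targets
scaler.fit(y_train)

ymin = scaler.data_min_[0]
ymax = scaler.data_max_[0]

# Scaled predictions, e.g. the output of model.predict(...)
y_pred_scaled = np.array([0.0, 0.5, 1.0])

# Inverse of min-max scaling with feature_range=(0, 1)
y_pred = y_pred_scaled * (ymax - ymin) + ymin
print(y_pred)  # [100. 250. 400.]
```

The same two arrays could also be inverted with scaler.inverse_transform on a column vector; the manual formula is shown here because, in the PiML case, only ymin and ymax are extracted rather than the scaler object's transform interface.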
srbPhy commented 2 months ago

Thanks! This clears up a few other inconsistencies I was observing. It seems the min/max values from the data summary are slightly different from the ones obtained using the code you shared. Does data_prepare also remove outliers? Would it be possible for you to share an end-to-end sample project code with me at saurabhbansal20@gmail.com? I looked at all the examples on your website, but none seem to discuss the predict method.

ZebinYang commented 2 months ago

@srbPhy

The difference is that data_summary calculates the statistics using the whole dataset, while data preparation fits the scaler on the training set only.
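This distinction can be reproduced with a small sketch (hypothetical target values; scikit-learn's MinMaxScaler stands in for PiML's internal scaler): statistics over the full column differ from those the scaler learns from the training split alone.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical target column; the largest value falls in the test split.
y = np.array([[10.0], [20.0], [30.0], [50.0]])
y_train = y[:3]  # data preparation fits the scaler on the training set only

# Whole-data statistics, analogous to what data_summary reports
summary_min, summary_max = y.min(), y.max()

# Training-set statistics, analogous to what the fitted scaler stores
scaler = MinMaxScaler().fit(y_train)

print(summary_min, summary_max)                  # 10.0 50.0
print(scaler.data_min_[0], scaler.data_max_[0])  # 10.0 30.0
```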

At this stage, we cannot share any of the source code.