SelfExplainML / PiML-Toolbox

PiML (Python Interpretable Machine Learning) toolbox for model development & diagnostics
https://selfexplainml.github.io/PiML-Toolbox
Apache License 2.0

Elasticnet inconsistencies #42

xloffree closed this issue 11 months ago

xloffree commented 12 months ago

This is a follow-up to issue #26.

[Five screenshots attached, taken 2023-07-29 between 7:04 PM and 7:06 PM]

When I use GLMRegressor directly, I get very different results from the method shown in the README, which gives a very poor R2 and reports all coefficients as 0. I am certain the README-method result is wrong: we have run elastic net on this data in both R and Python and obtained similar results with a much better R2. The GLMRegressor results look more reasonable. I am trying to use PiML to compare many models at once on the same train/test split, including elastic net models with specific parameters. We would like to use the PiML results in an upcoming paper, but this problem with PiML's elastic net is preventing us from doing so. Please let me know if there is a solution.

In summary, I want to find a way to run elastic net using the same train/test split as when I run all of the other models in PiML. Is there a way to do this?

Thank you

ZebinYang commented 12 months ago

Hi, in the default PiML workflow, all variables are scaled to lie between 0 and 1 before modeling. Elastic net results can differ substantially when the data are on different scales, even with the same regularization strength.

In the screenshot you provided, a GLMRegressor is fitted on the raw data, so the coefficients look reasonable. However, if you want to get similar results using the piml workflow, you may need to decrease the regularization strengths accordingly.
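To illustrate the scale effect described above, here is a minimal sketch in plain Python. It is not PiML's implementation: it uses the closed-form elastic net solution for a single feature, with the objective written in the form scikit-learn documents for `ElasticNet`. The synthetic data, the function names, and the `alpha` / `l1_ratio` values are all assumptions chosen for illustration.

```python
def soft_threshold(z, t):
    """Soft-thresholding operator that arises from the L1 penalty."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def elastic_net_1d(x, y, alpha, l1_ratio):
    """Closed-form elastic net slope for one feature (intercept unpenalised).

    Minimises (1/2n) * sum((y - b - w*x)^2)
              + alpha * l1_ratio * |w|
              + 0.5 * alpha * (1 - l1_ratio) * w^2
    """
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    xc = [v - x_mean for v in x]
    yc = [v - y_mean for v in y]
    cov = sum(a * b for a, b in zip(xc, yc)) / n
    var = sum(a * a for a in xc) / n
    return soft_threshold(cov, alpha * l1_ratio) / (var + alpha * (1 - l1_ratio))


def min_max_scale(x):
    """Rescale values to the [0, 1] interval."""
    lo, hi = min(x), max(x)
    return [(v - lo) / (hi - lo) for v in x]


# Synthetic (made-up) data: one feature on a 0..1000 scale, true slope 0.01.
x_raw = [float(v) for v in range(0, 1001, 10)]
y = [0.01 * v + 0.5 * ((i % 3) - 1) for i, v in enumerate(x_raw)]
x_scaled = min_max_scale(x_raw)

alpha, l1_ratio = 2.0, 0.5  # hypothetical penalty settings

w_raw = elastic_net_1d(x_raw, y, alpha, l1_ratio)         # close to the true slope
w_scaled = elastic_net_1d(x_scaled, y, alpha, l1_ratio)   # shrunk all the way to zero
w_scaled_weak = elastic_net_1d(x_scaled, y, 0.001, l1_ratio)  # weaker penalty, non-zero

print("raw features:             ", w_raw)
print("scaled, same penalty:     ", w_scaled)
print("scaled, weaker penalty:   ", w_scaled_weak)
```

In this toy case the same penalty that barely affects the raw-scale fit sets the coefficient to exactly zero once the feature is squeezed into [0, 1], and a much smaller `alpha` brings it back. Your data and PiML's parameterization will differ, so treat the numbers as illustrative only.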

xloffree commented 11 months ago

Ok I see, thank you. So the GLMRegressor is a normal glm function whereas the piml workflow standardizes the data? Therefore, the GLM results from piml workflow and the results from GLMRegressor should not be the same?

Thank you!

ZebinYang commented 11 months ago

GLMRegressor is a high-code estimator, and you may use it anywhere just like other machine learning models.

The PiML workflow is a pipeline of data preprocessing followed by GLMRegressor, so the two give different results.

You can get the same results if you scale the data manually and then fit GLMRegressor on the scaled data.
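A rough sketch of that manual route, which also keeps one fixed train/test split for every model being compared. It uses scikit-learn's `MinMaxScaler` and `ElasticNet` as stand-ins, on the assumption that GLMRegressor can be dropped in the same way since it is described above as usable like other estimators. The synthetic data, the `random_state`, and the penalty values are made up. Whether PiML fits its scaler on the full dataset or only on the training part is not stated in this thread; the sketch fits it on the training part.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic (made-up) data with features on a 0..1000 scale.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1000, size=(300, 3))
y = X @ np.array([0.01, -0.02, 0.0]) + rng.normal(0, 1, size=300)

# One fixed split, reused for every model that should be compared.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Manual min-max scaling: fit on the training part, apply to both parts.
scaler = MinMaxScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Stand-in estimator; the small alpha is a hypothetical value chosen
# because the features now lie in [0, 1].
model = ElasticNet(alpha=0.001, l1_ratio=0.5).fit(X_train_s, y_train)
r2 = model.score(X_test_s, y_test)

print("coefficients on scaled features:", model.coef_)
print("test R2:", r2)
```

Any other estimator fitted on `X_train_s` / `y_train` and scored on `X_test_s` / `y_test` then sees exactly the same split and the same scaling.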