ModelOriented / forester

Trees are all you need
https://modeloriented.github.io/forester/
GNU General Public License v3.0
112 stars · 15 forks

strange behaviour: perfect performance #36

Closed · matteodefelice closed this issue 2 years ago

matteodefelice commented 2 years ago

Perhaps this is linked to https://github.com/ModelOriented/forester/issues/34

I am using the Heart Failure Prediction Dataset (https://www.kaggle.com/fedesoriano/heart-failure-prediction/).

This is my code:

library(forester)
library(tidyverse)
library(here)
library(DALEX) # this is needed for the function model_performance (issue #34)
library(rsample)

df = read.csv(here('heart.csv'))
df_split = initial_split(df)

best_model <- forester(data = training(df_split),
                       data_test = testing(df_split),
                       target = "HeartDisease",
                       type = "classification",
                       metric = "precision",
                       tune = FALSE)

This is the output:

__________________________
FORESTER
Original shape of train data frame: 688 rows, 12 columns
_____________
NA values
There is no NA values in your data.
__________________________
CREATING MODELS
--- Ranger model has been created ---
Parameter 'cat_features' is meaningless because column types are taken from data.frame.
Please, convert categorical columns to factors manually.
--- Catboost model has been created ---
--- Xgboost model has been created ---
Warning in (function (params = list(), data, nrounds = 100L, valids = list(),  :
  lgb.train: Found the following passed through '...': learning_rate, objective. These will be used, but in future releases of lightgbm, this warning will become an error. Add these to 'params' instead. See ?lgb.train for documentation on how to call this function.
--- LightGBM model has been created ---
__________________________
COMPARISON
Results of compared models:

model       precision      recall          f1    accuracy         auc
---------  ----------  ----------  ----------  ----------  ----------
Ranger      1.0000000   1.0000000   1.0000000   1.0000000   1.0000000
XGboost     0.9870466   0.9921875   0.9896104   0.9883721   0.9878701
Catboost    0.9740260   0.9765625   0.9752926   0.9723837   0.9718339
LightGBM    0.9589744   0.9739583   0.9664083   0.9622093   0.9606634
The best model based on precision metric is Ranger.

So, according to forester, the Ranger model is perfect. Obviously, if I compute the metrics on the test set myself I get normal values (< 1). What is happening here?
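(For reference, precision on a held-out split can be checked by hand in a few lines of base R. This is a toy sketch with synthetic prediction vectors, not the heart.csv data, just to show the computation being compared against forester's table:)

```r
preds <- c(1, 1, 0, 1, 0, 0)  # model predictions on held-out rows
truth <- c(1, 0, 0, 1, 1, 0)  # true labels for the same rows

tp <- sum(preds == 1 & truth == 1)  # true positives
fp <- sum(preds == 1 & truth == 0)  # false positives

# precision = TP / (TP + FP)
precision <- tp / (tp + fp)
precision  # 2/3 here; a perfect 1.0 across every metric is a red flag
```

A score of exactly 1.0 on precision, recall, F1, accuracy, and AUC simultaneously almost always means the model is being evaluated on data it was trained on.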

lhthien09 commented 2 years ago

@matteodefelice Thank you for your info. Those results look suspiciously perfect. Our team will check and inform you later.

lhthien09 commented 2 years ago

@matteodefelice thank you for reporting the issue. We fixed the problem in the latest commit to the package: we forgot to pass data_test into the explainer structure, so the metrics were computed on the training data. Now it works well.
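(For anyone who hits this on an older version: the symptom matches a DALEX explainer being built on the training frame, so model_performance() scores the model on rows it has already seen. The sketch below illustrates the distinction with a toy glm on synthetic data; it is not forester's actual internals, and the variable names are made up for illustration:)

```r
library(DALEX)

set.seed(1)
df <- data.frame(x = rnorm(200))
df$y <- as.numeric(df$x + rnorm(200) > 0)
train <- df[1:150, ]
test  <- df[151:200, ]

fit <- glm(y ~ x, data = train, family = binomial)

# Wrong: explainer built on the training rows -> inflated metrics
exp_train <- explain(fit, data = train["x"], y = train$y, verbose = FALSE)

# Right: explainer built on the held-out rows -> honest metrics
exp_test <- explain(fit, data = test["x"], y = test$y, verbose = FALSE)

model_performance(exp_test)
```

Passing the test split as `data`/`y` to `explain()` is what makes `model_performance()` report generalization performance rather than training fit.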

[screenshot: model comparison table after the fix, showing test-set metrics below 1.0]