ModelOriented / forester

Trees are all you need
https://modeloriented.github.io/forester/
GNU General Public License v3.0

tibble causes all columns to drop during pre_rm_static_cols call #99

Closed mmoramarco closed 1 year ago

mmoramarco commented 1 year ago

I attempted to test the forester package on the concrete dataset from the modeldata package (I wanted to mirror some of the tutorials/models from the tidymodels book).

> library(forester)

Attaching package: ‘forester’

The following object is masked from ‘package:base’:

    save

> data(concrete, package = "modeldata")
> class(concrete)
[1] "tbl_df"     "tbl"        "data.frame"
> best_model <- train(concrete, "compressive_strength")
✔ Type guessed as:  regression 

 -------------------- CHECK DATA REPORT -------------------- 

The dataset has 1030 observations and 9 columns, which names are: 
cement; blast_furnace_slag; fly_ash; water; superplasticizer; coarse_aggregate; fine_aggregate; age; compressive_strength; 

With the target value described by a column compressive_strength.

✔ No static columns. 

✔ No duplicate columns.

✔ No target values are missing. 

✔ No predictor values are missing. 

✔ No issues with dimensionality. 

✔ No strongly correlated, by Spearman rank, pairs of numerical values. 

✖ These obserwation migth be outliers due to their numerical columns values: 
 100 123 13 146 169 18 25 26 27 3 31 32 34 35 36 4 42 43 5 554 560 57 572 585 605 61 611 617 62 621 623 64 66 67 7 757 77 770 793 799 815 821 874 937 ;

✔ Target data is evenly distributed. 

✔ Columns names suggest that none of them are IDs. 

✔ Columns data suggest that none of them are IDs. 

 -------------------- CHECK DATA REPORT END -------------------- 

Error in `df[, i]`:
! Can't subset columns past the end.
ℹ Location 1 doesn't exist.
ℹ There are only 0 columns.
Run `rlang::last_error()` to see where the error occurred.
> preprocessing(concrete, "compressive_strength")
Error in `df[, i]`:
! Can't subset columns past the end.
ℹ Location 1 doesn't exist.
ℹ There are only 0 columns.
Run `rlang::last_error()` to see where the error occurred.
> pre_rm_static_cols(concrete, "compressive_strength")
# A tibble: 1,030 × 0
# ℹ Use `print(n = ...)` to see more rows

I believe the issue is the tibble format itself when it is passed into the following chunk of pre_rm_static_cols:

del <- c()
for (i in 1:ncol(data)) {
  if (length(unique(data[, i])) == 1) {
    del <- c(del, i)
  }
}
del

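When this loop runs on a tibble, data[, i] returns a one-column tibble rather than a vector, and unique() of that is again a one-column tibble. length() of a tibble is its number of columns, so the result is always 1 and every column is marked for removal.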

> unique(concrete[,1])
# A tibble: 278 × 1
   cement
    <dbl>
 1   540 
 2   332.
 3   199.
 4   266 
 5   380 
 6   475 
 7   428.
 8   190 
 9   304 
10   140.
# … with 268 more rows
# ℹ Use `print(n = ...)` to see more rows
> length(unique(concrete[,1]))
[1] 1

If the tibble is coerced to a data frame first, the subset returns a vector, which then has a length of 278, so that column is not marked for deletion.

> unique(as.data.frame(concrete)[,1])
  [1] 540.0 332.5 198.6 266.0 380.0 475.0 427.5 190.0 304.0 139.6 342.0 237.5 349.0 310.0 485.0 374.0 313.3 425.0 375.0 469.0 388.6
 [22] 531.3 318.8 401.8 362.6 323.7 379.5 286.3 439.0 389.9 337.9 222.4 233.8 194.7 190.7 212.1 230.0 190.3 166.1 168.0 213.7 213.8
 [43] 229.7 238.1 250.0 212.5 212.6 212.0 231.8 251.4 181.4 182.0 168.9 290.4 277.1 295.7 251.8 249.1 252.3 246.8 275.1 297.2 213.5
 [64] 277.2 218.2 214.9 218.9 376.0 500.0 315.0 505.0 451.0 516.0 520.0 528.0 385.0 500.1 450.1 397.0 333.0 334.0 405.0 200.0 145.0
 [85] 160.0 234.0 285.0 356.0 275.0 165.0 178.0 167.4 172.4 173.5 167.0 173.8 446.0 387.0 355.0 491.0 424.0 202.0 284.0 359.0 436.0
[106] 289.0 393.0 480.0 255.0 158.8 239.6 238.2 181.9 193.5 255.5 272.8 220.8 382.5 210.7 295.8 203.5 381.4 228.0 316.1 135.7 339.2
[127] 290.2 170.3 186.2 252.5 339.0 236.0 277.0 254.0 307.0 225.0 325.0 300.0 400.0 350.0 250.2 157.0 141.3 166.8 122.6 183.9 102.0
[148] 108.3 305.3 116.0 133.0 173.0 192.0 153.0 288.0 331.0 238.0 296.0 297.0 281.0 382.0 295.0 302.0 525.0 252.0 322.0 522.0 273.0
[169] 162.0 154.0 147.0 152.0 144.0 159.0 305.0 151.0 142.0 298.0 321.0 366.0 280.0 156.0 318.0 287.0 326.0 132.0 164.0 314.0 140.0
[190] 265.0 166.0 276.0 149.0 261.0 237.0 313.0 155.0 146.0 148.0 262.0 158.0 143.0 260.0 336.0 150.0 135.0 136.0 184.0 236.9 154.8
[211] 145.9 133.1 151.6 153.1 139.9 149.5 299.8 148.1 326.5 152.7 261.9 158.4 150.7 272.6 259.9 312.9 336.5 144.8 143.7 330.5 134.7
[232] 266.2 312.7 145.7 143.8 298.1 155.2 147.8 145.4 136.4 255.3 153.6 146.5 151.8 309.9 143.6 303.6 374.3 158.6 152.6 304.8 150.9
[253] 141.9 297.8 321.3 279.8 252.1 164.6 155.6 160.2 317.9 287.3 325.6 355.9 322.5 164.2 313.8 321.4 139.7 288.4 298.2 264.5 159.8
[274] 276.4 322.2 148.5 159.1 260.9
> length(unique(as.data.frame(concrete)[,1]))
[1] 278

I'm not sure if there is a larger effect of coercing a tibble into a traditional data.frame before processing, but that seems to resolve the issue:

best_model <- train(as.data.frame(concrete), "compressive_strength")
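
An alternative that avoids coercion is to extract each column with `[[`, which returns a plain vector for both data frames and tibbles. The following is a minimal sketch, not forester's actual code: `find_static_cols` is a hypothetical name, and it assumes data frame-like input (for a matrix, `data[[i]]` would pick a single element, so `data[, i]` would still be needed there).

```r
# Hypothetical helper illustrating a tibble-safe static-column check.
# Not part of the forester package.
find_static_cols <- function(data) {
  del <- integer(0)
  # seq_len() also behaves correctly when data has zero columns,
  # whereas 1:ncol(data) would iterate over c(1, 0).
  for (i in seq_len(ncol(data))) {
    # `[[` extracts the column as a vector for data.frame and tibble alike,
    # so length() counts distinct values rather than columns.
    if (length(unique(data[[i]])) == 1) {
      del <- c(del, i)
    }
  }
  del
}

# Small made-up example: only column b is constant.
df <- data.frame(a = c(1, 2, 3), b = c(5, 5, 5), c = c("x", "y", "x"))
static_idx <- find_static_cols(df)
print(static_idx)
```

With this form of indexing, the check no longer depends on whether the input drops dimensions when subset with `[, i]`.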
HubertR21 commented 1 year ago

This behaviour is not unexpected, because the documentation states explicitly that the required formats are data.frame and matrix. Nevertheless, I will add a cast from tibble to data.frame in the next version. This transformation won't have any negative impact on the performance of the package.
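
A cast of the kind described above could look roughly like the sketch below. It is an assumption about the general approach, not the code that was added to forester; `ensure_data_frame` is a hypothetical name.

```r
# Hypothetical input normaliser: converts tibbles to plain data frames
# and leaves every other input (data.frame, matrix) untouched.
ensure_data_frame <- function(data) {
  if (inherits(data, "tbl_df")) {
    # as.data.frame() drops the tibble classes, so data[, i] returns a vector again
    data <- as.data.frame(data)
  }
  data
}

# A plain data frame passes through unchanged.
plain <- data.frame(a = c(1, 2, 3), b = c(5, 5, 5))
unchanged <- identical(ensure_data_frame(plain), plain)
print(unchanged)

# Usage with the call from the report (requires forester and modeldata):
# best_model <- train(ensure_data_frame(concrete), "compressive_strength")
```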