easystats / performance

:muscle: Models' quality and performance metrics (R2, ICC, LOO, AIC, BF, ...)
https://easystats.github.io/performance/
GNU General Public License v3.0

check_model fails if dependent variable is labelled #727

Closed sjewo closed 1 month ago

sjewo commented 1 month ago

Hi there,

I ran into a bug with labelled data, similar to https://github.com/easystats/performance/issues/629 .

check_model() will fail if the dependent variable is labelled:

library(labelled)
library(performance)
library(see)

var_label(mtcars$wt) <- "Weight (1000 lbs)"
var_label(mtcars$mpg) <- "Miles/(US) gallon"
mtcars$am <- labelled(mtcars$am, c("automatic" = 0, "manual" = 1))

# this variable causes the error
mtcars$mpg <- labelled(mtcars$mpg, c("21" = 21))

m <- lm(mpg ~ wt + cyl + gear + disp + am, data = mtcars)

check_model(m)
> check_model(m)
Error: `check_model()` returned following error: Can't combine `..1` <character> and `..2` <double>.

If the error message does not help identifying your problem, another reason why `check_model()` failed might be that models of class `lm` are not yet
  supported.
> sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.5

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Berlin
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] see_0.8.4            performance_0.11.0.9 labelled_2.13.0     

loaded via a namespace (and not attached):
 [1] vctrs_0.6.5       cli_3.6.2         rlang_1.1.3       forcats_1.0.0     haven_2.5.4       generics_0.1.3    glue_1.7.0        colorspace_2.1-0  datawizard_0.10.0
[10] hms_1.1.3         scales_1.3.0      fansi_1.0.6       grid_4.4.0        munsell_0.5.1     tibble_3.2.1      lifecycle_1.0.4   insight_0.19.11   compiler_4.4.0   
[19] dplyr_1.1.4       pkgconfig_2.0.3   rstudioapi_0.16.0 R6_2.5.1          tidyselect_1.2.1  utf8_1.2.4        pillar_1.9.0      magrittr_2.0.3    tools_4.4.0      
[28] gtable_0.3.5      bayestestR_0.13.2 ggplot2_3.5.1    

strengejacke commented 1 month ago

Thanks, this should be fixed in insight, which will be submitted to CRAN in the next few days.

strengejacke commented 3 weeks ago

@larmarange Is it necessary to preserve haven_labelled and vctrs class attributes when labelled::labelled() is used?

See from ?haven::labelled:

This class provides few methods, as I expect you'll coerce to a standard R class (e.g. a factor) soon after importing.

Label attributes can be used with standard R classes, so there is no need to keep the vctrs class attribute. The latter behaves differently from standard R classes, which can cause errors (like the one described in this issue) that are a pain to debug. It's also not at all clear to users where the error comes from: R language behaviour is "broken", and there's no bug in the package's code.

If not really necessary in your package, maybe it's possible to remove the haven_labelled and vctrs class attributes?

larmarange commented 3 weeks ago

Hi. labelled::labelled() is identical to haven::labelled().

The labelled package just provides functions to manipulate such vectors.

Such vectors are not intended to be used in a model. They should be transformed into factors with to_factor() or numeric/character vectors with unclass() before modelling (You could also use unlabelled()).
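
As a rough sketch of that advice applied to the example from this issue (assuming the labelled package is installed; the choice of which variable becomes a factor and which becomes numeric is my reading of the data, not something prescribed above):

```r
library(labelled)

# Recreate the labelled variables from the reprex
dat <- mtcars
dat$am <- labelled(dat$am, c("automatic" = 0, "manual" = 1))
dat$mpg <- labelled(dat$mpg, c("21" = 21))

# Categorical predictor: turn the value labels into factor levels
dat$am <- to_factor(dat$am)

# Numeric outcome: drop the haven_labelled class, then strip the
# remaining label attributes so a plain double vector is left
dat$mpg <- as.numeric(unclass(dat$mpg))

# The model now only sees standard R classes
m <- lm(mpg ~ wt + cyl + gear + disp + am, data = dat)
```

With the data converted like this, the model frame no longer contains haven_labelled vectors, so downstream functions should not hit the vctrs combine error reported above.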

In performance, I do not see the need to support such vectors. In gtsummary, for example, there is a warning asking users whether they forgot to transform these vectors before analysis.

larmarange commented 3 weeks ago

So the error here is to use a haven_labelled vector in a model. The variable am should have been transformed into a factor to be correctly considered as categorical by the model.

strengejacke commented 3 weeks ago

Yes, I agree. The problem often is that users aren't aware that labelled data can be of classes haven_labelled and vctrs, and thus problems can arise. We fixed this issue in our packages by removing those class attributes whenever model data is extracted.
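
To illustrate the idea only (this is not the actual code used in insight, and strip_labelled_class() is a hypothetical helper name), removing the class attributes while keeping the label attributes could look roughly like this:

```r
library(labelled)

# Hypothetical helper: drop the haven_labelled / vctrs classes from a
# vector, but leave its "label" and "labels" attributes untouched
strip_labelled_class <- function(x) {
  if (inherits(x, "haven_labelled")) {
    x <- unclass(x)
  }
  x
}

dat <- mtcars
dat$mpg <- labelled(dat$mpg, c("21" = 21))

# Apply the helper to every column of the data
dat[] <- lapply(dat, strip_labelled_class)

# mpg is a plain numeric vector again, with its value labels still attached
m <- lm(mpg ~ wt + cyl, data = dat)
```

The value labels survive as an ordinary attribute, so they stay available for plots or tables, while the vector itself behaves like a standard R numeric.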