easystats / insight

:crystal_ball: Easy access to model information for various model objects
https://easystats.github.io/insight/
GNU General Public License v3.0

`get_data.coxph` returns data without labels #790

Open iago-pssjd opened 1 year ago

iago-pssjd commented 1 year ago

get_data.coxph returns data without labels. As a consequence, when used for parameters::parameters, the attribute pretty_labels is not useful at all.

Indeed, in the function https://github.com/easystats/insight/blob/a95325c169707b37ddba06f0cd7bf5c84fc9edb5/R/get_data.R#L1827-L1852

the issue happens with .prepare_get_data, where labels are removed from variables.

strengejacke commented 1 year ago

Labels are not removed in general inside .prepare_get_data(), so maybe there's a specific issue with coxph models. I will look into this.

library(easystats)
#> # Attaching packages: easystats 0.6.0.10
#> ✔ bayestestR  0.13.1.2   ✔ correlation 0.8.4   
#> ✔ datawizard  0.8.0.3    ✔ effectsize  0.8.3.11
#> ✔ insight     0.19.3     ✔ modelbased  0.8.6.3 
#> ✔ performance 0.10.4.1   ✔ parameters  0.21.1.2
#> ✔ report      0.5.7.9    ✔ see         0.8.0.2
data(efc)
m <- lm(neg_c_7 ~ e42dep, data = efc)
str(get_data(m))
#> 'data.frame':    94 obs. of  2 variables:
#>  $ neg_c_7: num  12 20 11 12 19 15 11 15 10 28 ...
#>   ..- attr(*, "label")= chr "Negative impact with 7 items"
#>  $ e42dep : Factor w/ 4 levels "1","2","3","4": 3 3 3 4 4 4 4 4 4 4 ...
#>   ..- attr(*, "label")= chr "elder's dependency"
#>   ..- attr(*, "labels")= Named num [1:4] 1 2 3 4
#>   .. ..- attr(*, "names")= chr [1:4] "independent" "slightly dependent" "moderately dependent" "severely dependent"
str(get_data(m, source = "mf"))
#> 'data.frame':    94 obs. of  2 variables:
#>  $ neg_c_7: num  12 20 11 12 19 15 11 15 10 28 ...
#>   ..- attr(*, "label")= chr "Negative impact with 7 items"
#>  $ e42dep : Factor w/ 4 levels "1","2","3","4": 3 3 3 4 4 4 4 4 4 4 ...
#>   ..- attr(*, "labels")= Named num [1:4] 1 2 3 4
#>   .. ..- attr(*, "names")= chr [1:4] "independent" "slightly dependent" "moderately dependent" "severely dependent"
#>   ..- attr(*, "label")= chr "elder's dependency"
#>  - attr(*, "terms")=Classes 'terms', 'formula'  language neg_c_7 ~ e42dep
#>   .. ..- attr(*, "variables")= language list(neg_c_7, e42dep)
#>   .. ..- attr(*, "factors")= int [1:2, 1] 0 1
#>   .. .. ..- attr(*, "dimnames")=List of 2
#>   .. .. .. ..$ : chr [1:2] "neg_c_7" "e42dep"
#>   .. .. .. ..$ : chr "e42dep"
#>   .. ..- attr(*, "term.labels")= chr "e42dep"
#>   .. ..- attr(*, "order")= int 1
#>   .. ..- attr(*, "intercept")= int 1
#>   .. ..- attr(*, "response")= int 1
#>   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
#>   .. ..- attr(*, "predvars")= language list(neg_c_7, e42dep)
#>   .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "factor"
#>   .. .. ..- attr(*, "names")= chr [1:2] "neg_c_7" "e42dep"
#>  - attr(*, "na.action")= 'omit' Named int [1:6] 4 27 33 46 58 97
#>   ..- attr(*, "names")= chr [1:6] "4" "27" "33" "46" ...
#>  - attr(*, "is_subset")= logi FALSE

Created on 2023-07-09 with reprex v2.0.2

iago-pssjd commented 1 year ago

I should point out that get_data is called with the option source = "mf", since that is how it is called here:

https://github.com/easystats/parameters/blob/71a5271a3f90c4707f67e5d2b5b07bd458ffe94b/R/format_parameters.R#L364-L373

(called by parameters:::.add_model_parameters_attributes, which is called in https://github.com/easystats/parameters/blob/71a5271a3f90c4707f67e5d2b5b07bd458ffe94b/R/1_model_parameters.R#L616-L631)

For a minimal example:

library(survival)
library(insight) # provides get_data()
dat_regression_test <- data.frame(
    time = c(4, 3, 1, 1, 2, 2, 3),
    status = c(1, 1, 1, 0, 1, 1, 0),
    x = c(0, 2, 1, 1, 1, 0, 0),
    sex = c(0, 0, 0, 0, 1, 1, 1)
)
attr(dat_regression_test$x, "label") <- "Pred"
mod <- survival::coxph(Surv(time, status) ~ x + strata(sex),
                       data = dat_regression_test,
                       ties = "breslow"
)

str(get_data(mod, source = "mf"))
'data.frame':   7 obs. of  4 variables:
 $ time  : num  4 3 1 1 2 2 3
 $ status: num  1 1 1 0 1 1 0
 $ x     : num  0 2 1 1 1 0 0
 $ sex   : num  0 0 0 0 1 1 1
 - attr(*, "is_subset")= logi FALSE

For your example str(parameters(m)) includes

 - attr(*, "pretty_labels")= Named chr [1:4] "(Intercept)" "elder's dependency [slightly dependent]" "elder's dependency [moderately dependent]" "elder's dependency [severely dependent]"

However, for str(parameters(mod))

- attr(*, "pretty_labels")= Named chr "x"
  ..- attr(*, "names")= chr "x"

iago-pssjd commented 1 year ago

Maybe the issue is that, in get_data.coxph, the call to .prepare_get_data goes through stats::na.omit (line 1840), and stats::na.omit removes all labels.
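
This hypothesis can be checked without insight or survival. A minimal sketch in base R, using a made-up data frame `d` (hypothetical, not the data from the model above): stats::na.omit subsets the rows with `[`, and that drops custom attributes such as "label" from plain vectors.

```r
# Hypothetical toy data, only for illustration
d <- data.frame(x = c(0, 2, NA, 1), y = c(1, 1, 0, 1))
attr(d$x, "label") <- "Pred"

# na.omit() removes the row with the missing value by subsetting with `[`
d_complete <- stats::na.omit(d)

attr(d$x, "label")          # label on the original column
attr(d_complete$x, "label") # NULL: the label did not survive na.omit()
```

The label is lost even when the data contain no missing values, because the row subsetting happens in either case.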

iago-pssjd commented 1 year ago

So, @strengejacke, why do some of the get_data methods call stats::na.omit inside .prepare_get_data while others do not? Is there an alternative?

strengejacke commented 1 year ago

The original idea of get_data() was to retrieve the data that was used to fit the model, matching the same number of observations (i.e. NA removed). Meanwhile, since there are so many edge cases, and because for updating the model or calculating predictions it's not necessary to remove missings, the default now is to retrieve the data from the environment, i.e. the original data. When this doesn't work, get_data() falls back to retrieving data from the model frame.

strengejacke commented 1 year ago

> However, for str(parameters(mod))

Yes, but that data isn't labelled, so no surprise here?

iago-pssjd commented 1 year ago

@strengejacke The issue is that stats::na.omit removes the labels. Replacing it with tidyr::drop_na solves the issue, but I know you avoid dependencies, and I did not find any other base R way to remove the missing values while keeping the labels (beyond copying the labels and restoring them after removing the missing values).

> Yes, but that data isn't labelled, so no surprise here?

Wrong, it is labelled, since I had previously run

attr(dat_regression_test$x, "label") <- "Pred"
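
For reference, the copy-and-restore approach mentioned above could look roughly like this in base R. This is only a sketch: `na_omit_keep_labels` is a hypothetical helper, not a function in insight, and it assumes that only the "label" and "labels" attributes need to be carried over.

```r
# Hypothetical helper (not part of insight): drop incomplete rows,
# then copy "label" and "labels" back from the original columns
na_omit_keep_labels <- function(data) {
  out <- stats::na.omit(data)
  for (col in names(out)) {
    for (a in c("label", "labels")) {
      # exact = TRUE avoids partial matching between "label" and "labels"
      value <- attr(data[[col]], a, exact = TRUE)
      if (!is.null(value)) {
        attr(out[[col]], a) <- value
      }
    }
  }
  out
}

# Usage with made-up data
d <- data.frame(x = c(0, 2, NA, 1), y = c(1, 1, 0, 1))
attr(d$x, "label") <- "Pred"
res <- na_omit_keep_labels(d)
str(res)
```

Restoring the attributes after the fact keeps the helper free of dependencies, at the cost of having to list the attributes to preserve.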