easystats / insight

:crystal_ball: Easy access to model information for various model objects
https://easystats.github.io/insight/
GNU General Public License v3.0
392 stars 39 forks source link

`sampleSelection` and `find_variables` #424

Open vincentarelbundock opened 3 years ago

vincentarelbundock commented 3 years ago

A possible bug: I could not use find_variables on a model produced by sampleSelection::heckit or sampleSelection::selection. Example:

library(insight)
library(sampleSelection)

data(Mroz87, package = "sampleSelection")
Mroz87$kids <- (Mroz87$kids5 + Mroz87$kids618 > 0)
model <- heckit(
  lfp ~ age + I(age^2) + faminc + kids + educ,
  wage ~ exper + I(exper^2) + educ + city,
  data = Mroz87)

find_variables(model)
# Error in f[["conditional"]][[3]]: subscript out of bounds

> traceback()
4: .prepare_predictors(x, f, elements)
3: find_predictors.default(x, effects = effects, component = component,
       flatten = flatten, verbose = verbose)
2: find_predictors(x, effects = effects, component = component,
       flatten = flatten, verbose = verbose)
1: find_variables(model)
strengejacke commented 3 years ago

Your initial issue should be fixed:

library(insight)
library(sampleSelection)
#> Loading required package: maxLik
#> Loading required package: miscTools
#> 
#> Please cite the 'maxLik' package as:
#> Henningsen, Arne and Toomet, Ott (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26(3), 443-458. DOI 10.1007/s00180-010-0217-1.
#> 
#> If you have questions, suggestions, or comments regarding the 'maxLik' package, please use a forum or 'tracker' at maxLik's R-Forge site:
#> https://r-forge.r-project.org/projects/maxlik/

data(Mroz87, package = "sampleSelection")
Mroz87$kids <- (Mroz87$kids5 + Mroz87$kids618 > 0)
model <- heckit(
  lfp ~ age + I(age^2) + faminc + kids + educ,
  wage ~ exper + I(exper^2) + educ + city,
  data = Mroz87)

find_variables(model)
#> $response
#> [1] "lfp"  "wage"
#> 
#> $conditional
#> $conditional$selection
#> [1] "age"    "faminc" "kids"   "educ"  
#> 
#> $conditional$outcome
#> [1] "exper" "educ"  "city"

Created on 2021-09-14 by the reprex package (v2.0.1)

But what would you suggest according to the response? Return a list of two as well?

strengejacke commented 3 years ago
library(insight)
library(sampleSelection)
#> Loading required package: maxLik
#> Loading required package: miscTools
#> 
#> Please cite the 'maxLik' package as:
#> Henningsen, Arne and Toomet, Ott (2011). maxLik: A package for maximum likelihood estimation in R. Computational Statistics 26(3), 443-458. DOI 10.1007/s00180-010-0217-1.
#> 
#> If you have questions, suggestions, or comments regarding the 'maxLik' package, please use a forum or 'tracker' at maxLik's R-Forge site:
#> https://r-forge.r-project.org/projects/maxlik/

data(Mroz87, package = "sampleSelection")
Mroz87$kids <- (Mroz87$kids5 + Mroz87$kids618 > 0)
model <- heckit(
  lfp ~ age + I(age^2) + faminc + kids + educ,
  wage ~ exper + I(exper^2) + educ + city,
  data = Mroz87)

# could be a list with "$selection" and "$outomce"
find_response(model)
#> [1] "lfp"  "wage"

Created on 2021-09-14 by the reprex package (v2.0.1)

vincentarelbundock commented 3 years ago

Thanks for the fix! (I don't know why I'm still surprised/impressed by your efficiency, but I am...)

Your question about find_response is a very good one. The list option would be more informative, but the current documentation says:

Value:

 The name(s) of the response variable(s) from ‘x’ as character
 vector, or ‘NULL’ if response variable could not be found.

The plural already accommodates multiple response variables, so this would require no change, whereas the list might break code that expects a vector. So for backward compatibility reasons, I would lean toward keeping the vector, but both options are probably fine.

bwiernik commented 3 years ago

I would suggest a named list of 2 for the response here. The documentation as written I think was intended more for something like an mlm, where there are multiple undifferentiated responses

vincentarelbundock commented 3 years ago

Sounds good. Just so I know what to expect, you are proposing a list for all models, not just for those with many responses?

Or are you suggesting a string in most cases and a list in cases like sampleSelection.

The first option is a breaking change but might be better in the future for consistency. The second is non-breaking, but it's not great in terms of consistency because users have to check every time if they get a string or list.

Edit: meant a string not a vector

strengejacke commented 3 years ago

Actually, I think we have a named vector for find_response() and could use a list as list-element for find_variables() (i.e. having $response$selection and $response$outcome)