bbolker / broom.mixed

tidy methods for mixed models in R
227 stars 23 forks source link

Add nlme::varFunc tidiers #115

Closed billdenney closed 2 years ago

billdenney commented 2 years ago

Fix #114

The main thing that I'm less sure of is how to handle fixed parameters. I added " ; fixed" to the term which seemed like the best way. But, I've never used them before, so I'm not sure that it's correct.

I think that I have handled all classes (via the generic varFunc and a separate varComb method). I have included the formula in the groups description, and I am extracting the coef() using the same arguments used in printing the output.

bbolker commented 2 years ago

How to handle "fixed" (as opposed to "fixed effect") parameters is a much more general issue that I think has not been tackled within broom.mixed. It certainly seems useful to have a consistent way for these different parameter types to be delimited/for users to specify whether or not they want both estimated and fixed parameters (although in general it's easiest as a first step to provide everything and allow users to filter() subsequently if they want). One way to identify fixed parameters is via an NA standard error, but that's not a reliable marker (standard errors might also be NA because they're not estimated by a particular package for a particular variable type, or because of numerical problems). I think the best solution is to add yet another column ... adding meta-data within the term names in a consistent way will work, of course, but doesn't seem (to me) to be good design.

Can we see if this has come up/raise this as an issue at the broom repo ... ?

billdenney commented 2 years ago

I'm happy to add the broom check (i.e. raise an issue there). But, I do think that the parameter should be shown since it is part of the coef(). If it were omitted, there would not be a simple way to see that it was part of the model (so, the tidy() results would seem to be incomplete and misleading relative to the actual model).

bbolker commented 2 years ago

I agree that they should be included. The questions are (1) how to indicate which parameters are fixed; (2) whether to allow a built-in argument for excluding fixed parameters (there are plenty of use cases where we want to do this, even if the default should be "include both"). I think your solution is OK, I am just a little bit paranoid about making interface/API decisions and being stuck with them/having to break backward compatibility if & when we change our mind.

billdenney commented 2 years ago

I fully agree. It's best to have the well-thought-out solution rather than the quick solution for something like this.

billdenney commented 2 years ago

Moving the conversation back here now that we're into specifics for broom.mixed...

I will add the estimated column to tidy.varFunc() output. The only open questions to me are:

  1. Do we add estimated to all outputs or only when some varFunc parameters are not estimated? I tend not to like outputs that have differing structure based on the inputs, but in this case, I think that it is better to only add estimated when there are fixed values. The reason is that I think fixed is very rarely used. (I'm using the egocentric definition of "rarely" that I'd never used it and I've not seen it used by others.) Since I think that it's very rarely used, I think that the difference between estimate and estimated would be confusing more often than helpful.
  2. For other estimates (ran_pars and fixed), should we set estimated to TRUE instead of NA? I think that if the column is there, it should be set to something other than NA, but we should not add the column just to set it to TRUE.
  3. I think that in recent versions of nlme, you can fix the residual error to a specific value (usually 1). So, we should consider adding the estimated column to that residual error value when it is fixed. Do you agree? (If so, I'll probably make a separate PR for that unless it's just a few lines more.)
billdenney commented 2 years ago

Whoops, I didn't mean to close this.

bbolker commented 2 years ago

As usual there are two conflicting rubrics here: (1) simplicity (don't clutter up all tidied models with every possible column that could apply to every possible category of model that broom.mixed covers); (2) uniformity of output (output format/dimensions should be consistent across inputs).

The availability of dplyr::bind_rows() (which easily handles incongruent sets of columns in its arguments) makes the inconsistency problem much less annoying.

I think I agree with your suggestions (1: add estimated only when there are at least some non-estimated parameters; 2: set non-estimated parameters to TRUE; 3: to be consistent with point 1, do add the estimated column if there are any non-estimated parameters).

In documentation (or as a first-class function, or in auxiliary code in inst/...) we could add a "helper function" that binds together tidied frames with and without fixed (non-estimated) parameters:

bind_tidy <- function(...) {
   add_estimated <- function(x) {
        if (!"estimated" %in% names(x)) mutate(x, estimated=TRUE) else x))    
   L <- list(...)
  ## was a list passed as an argument?
   if (length(L) == 1 && !is.data.frame(L)) L <- L[[1]]
   L %>% map(add_estimated) %>% bind_rows()
}

or something like that (easier to use replace_na(estimated, TRUE) but that would fail to distinguish some case where we actually wanted estimated to be NA ... ???

Maybe overkill ...

billdenney commented 2 years ago

I think that the bind_tidy() function does seem like overkill. I would add something like the following to the documentation. It works regardless of the column existing, and it's one line of added code without requiring a lot more code on our side.

I'll make the updates as you indicate above, and I'll include an example like the below into the documentation.

library(tidyverse)

not_really_a_tidied_model <-
  tibble(
    term=c("A", "B")
  )

not_really_a_tidied_model_with_estimated <-
  tibble(
    term=c("A", "B"),
    estimated=c(FALSE, TRUE)
  )
# Learned from https://stackoverflow.com/questions/45146688/execute-dplyr-operation-only-if-column-exists
not_really_a_tidied_model %>%
  filter(across(any_of("estimated"), ~.x))
#> # A tibble: 2 x 1
#>   term 
#>   <chr>
#> 1 A    
#> 2 B

not_really_a_tidied_model_with_estimated %>%
  filter(across(any_of("estimated"), ~.x))
#> # A tibble: 1 x 2
#>   term  estimated
#>   <chr> <lgl>    
#> 1 B     TRUE

Created on 2021-08-12 by the reprex package (v2.0.0)

billdenney commented 2 years ago

I just made the changes agreed above. The only change that I didn't make is adding the "estimated" column when sigma is fixed. The reason I didn't add that is that I didn't see how that information is stored in the model object. Without trying to parse the call itself, I don't see how to figure it out. (And, parsing the call seems like it would be error-prone.)