benbhansen-stats / propertee

Prognostic Regression Offsets with Propagation of ERrors, for Treatment Effect Estimation (IES R305D210029).
https://benbhansen-stats.github.io/propertee/
Other
2 stars 0 forks source link

`expand.model.frame` changes #112

Closed josherrickson closed 1 year ago

josherrickson commented 1 year ago

@jwasserman2 and I will have more to add about this issue in the future; just opening now to save some URLs.

Short version: stats::expand.model.frame changed in R 4.3.0: https://bugs.r-project.org/show_bug.cgi?id=18414. It now looks for a call slot, which we don't currently have, causing errors. Adding one is non-trivial due again to this bug. (I also briefly looked into writing our own getCall.DirectAdjusted but was unsuccessful; with more time this might be an appropriate solution.)

Diff of stats::expand.model.frame changes:

>>> diff expand.model.frame.R ~/Downloads/R-4.3.0/src/library/stats/R/expand.model.frame.R 
4c4
< #  Copyright (C) 1995-2012 The R Core Team
---
> #  Copyright (C) 1995-2022 The R Core Team
22a23,24
>     cl <- getCall(model)
>
25d26
<     data <- eval(model$call$data, envir)
35a37
>     environment(ff) <- envir
38,41c40,41
<         naa <- model$call$na.action
<         subset <- model$call$subset
<         rval <- eval(call("model.frame", ff, data = data, subset = subset,
<                           na.action = naa), envir)
---
>         rval <- eval(call("model.frame", ff, data = cl$data, subset = cl$subset,
>                           na.action = cl$na.action), envir)
43,44c43
<         subset <- model$call$subset
<         rval <- eval(call("model.frame", ff, data = data, subset = subset,
---
>         rval <- eval(call("model.frame", ff, data = cl$data, subset = cl$subset,
jwasserman2 commented 1 year ago

The non-trivial bug report Josh linked has a comment that offers a workaround: quote the call when storing it as a slot call. That would mean changing the code here to look like:

  if (inherits(lm_model, "glm")) {
    lm_model$formula <- as.formula(lm_model, env = eval_env)
  }
  da_call <- str2lang(paste0("quote(", paste(deparse(lm_model$call), collapse = ""), ")"))

  return(new("DirectAdjusted",
             lm_model,
             Design = design,
             absorbed_intercepts = absorbed_intercepts,
             absorbed_moderators = absorbed_moderators,
             call = da_call))

lm_model is loaded with calls that can be evaluated in parent.frame(), however, which in the case of the dichotomy call in a Design object, leads to errors. For example, if we run the following code after making the change I propose above, we get the following error:

> data(simdata)
> cmod <- lm(y ~ x, simdata)
> des <- rct_design(z ~ cluster(cid1, cid2), data = simdata)
> mod <- lmitt(y ~ 1, data = simdata, design = des, weights = ate(des), offset = cov_adj(cmod))
Error in str2lang(paste0("quote(", paste(deparse(lm_model$call), collapse = ""), :
<text>:1:7339: unexpected '<'
1: class = "data.frame"), column_index = c("t", "u", "u"), type = "RCT", unit_of_assignment_type = "cluster", call = rct_design(formula = z ~ cluster(cid1, cid2), data = simdata), di

This is because deparse causes translates des to this:

Design = new(\"Design\", "                      
[102] "    structure = structure(list(z = c(0L, 0L, 0L, 1L, 0L, 0L, "                                     
[103] "    1L, 0L, 1L, 1L), cid1 = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, "                                    
[104] "    5L, 5L), cid2 = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L)), row.names = c(NA, "                
[105] "    -10L), class = \"data.frame\"), column_index = c(\"t\", \"u\", "                               
[106] "    \"u\"), type = \"RCT\", unit_of_assignment_type = \"cluster\", "                               
[107] "    call = rct_design(formula = z ~ cluster(cid1, cid2), data = simdata), "                        
[108] "    dichotomy = structure(list(), class = \"formula\", .Environment = <environment>)), "           
[109] "

If we wanted to use this deparse code, we might consider storing the weights and offset objects in the eval_env created here. The code here would then look like:

  assign("data", data, envir = eval_env)
  assign("design", design, envir = eval_env)
  assign("weights", lm_model$call$weights, envir = eval_env)
  assign("offset", lm_model$call$offset, envir = eval_env)
  lm_model$call$data <- quote(data)
  lm_model$call$weights <- quote(weights)
  lm_model$call$offset <- quote(offset)
  environment(lm_model$terms) <- eval_env
  if (inherits(lm_model, "glm")) {
    lm_model$formula <- as.formula(lm_model, env = eval_env)
  }
  da_call <- str2lang(paste0("quote(", paste(deparse(lm_model$call), collapse = ""), ")"))

  return(new("DirectAdjusted",
             lm_model,
             Design = design,
             absorbed_intercepts = absorbed_intercepts,
             absorbed_moderators = absorbed_moderators,
             call = da_call))

In R 4.3 we can run this code without error (which I verified in a Docker image):

> data(simdata)
> simdata$id <- seq_len(nrow(simdata))
> cmod <- lm(y ~ x, simdata)
> des <- rct_design(z ~ cluster(cid1, cid2), simdata)
> mod <- lmitt(y ~ 1, data = simdata, design = des, weights = ate(des), offset = cov_adj(cmod))
> expand.model.frame(mod, "id")
     flexida_y flexida::assigned() id
1  -0.97202973                -0.5  1
2   0.27969544                -0.5  2
3  -0.19241487                -0.5  3
4   0.38409370                -0.5  4
5  -0.10447669                -0.5  5
6  -0.95786444                -0.5  6
7   1.25275817                -0.5  7
8   0.72103572                -0.5  8
9   1.00177153                -0.5  9
10 -1.46079240                -0.5 10
11  0.94523053                -0.5 11
12 -1.74651896                -0.5 12
13 -0.58412620                -0.5 13
14 -1.42302351                -0.5 14
15 -2.25867384                 0.5 15
16  1.77136846                 0.5 16
17 -0.70414747                 0.5 17
18 -0.33543528                 0.5 18
19 -0.43770366                 0.5 19
20  0.33594091                 0.5 20
21  1.54963679                -0.5 21
22  1.63040090                -0.5 22
23 -1.23436045                -0.5 23
24 -1.40921131                -0.5 24
25 -1.56342486                -0.5 25
26 -1.30385896                -0.5 26
27  1.90860302                -0.5 27
28 -0.04310819                -0.5 28
29 -0.89336926                -0.5 29
30 -0.65191417                -0.5 30
31  1.02370535                 0.5 31
32  0.20984377                 0.5 32
33 -0.36502604                 0.5 33
34 -0.80038416                 0.5 34
35 -0.91295239                -0.5 35
36  1.99728624                -0.5 36
37  0.88916602                -0.5 37
38  1.95793306                -0.5 38
39 -0.47212763                -0.5 39
40 -0.40158848                -0.5 40
41 -1.07813466                 0.5 41
42 -0.30127319                 0.5 42
43  0.42110541                 0.5 43
44  1.30818576                 0.5 44
45  0.51341454                 0.5 45
46  0.40522603                 0.5 46
47  1.18019960                 0.5 47
48  1.09638279                 0.5 48
49  0.05584398                 0.5 49
50 -0.83407072                 0.5 50

and mod@call looks like:

stats::lm(formula = flexida_y ~ 0 + `flexida::assigned()`, data = data, 
    weights = weights, offset = offset)

This did cause some errors in the test suite for both R 4.2.3 and R 4.3, but I haven't had a chance look into them. I'm going to push my code to a branch for people to look at. If this is a route of interest, I can look into the errors it causes in the test suite and if they seem like major impasses.

josherrickson commented 1 year ago

Would da_call <- call("quote", lm_model$call) be sufficient to avoid the deparsing complexity? It does appear to pass basic checks to allow an @call slot.

I looked at one of the failing tests; it was failing on https://github.com/benbhansen-stats/flexida/blob/fc3b11c6e939258c7498f19363d08b80a33159e3/R/SandwichLayerVariance.R#L434-L435 specifically the na.action = na.pass argument - without it, the matrix is returned.

The number of errors that are appearing from this fix give me pause. I wonder if a better solution would be along the lines we had discussed; keeping our own version of expand.model.frame that we maintain closer to the 4.2.3 branch. Part of my motivation here is that I think the 4.3 functionality of stats::expand.model.frame might be considered a bug; it will straight up fail on any object that does not have an @call slot (see the source) without any fallback to 4.2.3 functionality.

The other option is to explore adding our own getCall.DirectAdjusted. As I mentioned, in the 5 minutes I played around with it I was unsuccessful, but I think that might be more fruitful.

Either of these would avoid having to monkey around with calls.

josherrickson commented 1 year ago

Hmm, I don't know what I was doing before to muck it up so badly, but adding

#' @export
getCall.DirectAdjusted <- function(x, ...) {
  return(x$call)
}

passes all tests sans one:

✖ | 1      91 | test.SandwichLayer [0.2s]
─────────────────────────────────────────
Failure (test.SandwichLayer.R:585:3): .get_ca_and_prediction_gradient warns about less than full rank model fit
`stats_preds <- stats::predict(cmod, newdata)` did not throw the expected warning.
─────────────────────────────────────────

I'm not sure whether that's sufficient (should we return a call with lmitt as the function rather than lm?) but this is at least a simpler approach.

jwasserman2 commented 1 year ago

I agree with your feeling about it being kind of a buggy change. Your getCall.DirectAdjusted approach is simpler than mine, and on its face seems less likely to cause other issues (it feels natural for getCall to look for a call attribute rather than fail if there's no slot named call anyways). The call itself I think should have lmitt rather than lm so that if someone were to evaluate it, it would return a DirectAdjusted object, but otherwise this change seems reasonable to me

josherrickson commented 1 year ago

Going to bed last night, I was mentally composing a comment suggesting that we combine these two; add getCall and add a slot (either @call with some quote nonsense, or just call it something different).

However, I've reconsidered for the following reasons:

  1. It'd be easy to save the lmitt.formula call and pass it down to .convert_to_lmitt to attach to @call. However, it's less clear what as.lmitt would keep as it's @call. The as.lmitt(lm(...)) call? We'd need to somehow pass the arguments in lm up to as.lmitt. Or do we just use the lm call? In which case it doesn't really represent what it's supposed to, namely calling eval(as.lmitt(lm(...))@call) wouldn't return a DirectAdjusted objected.
  2. Putting as.lmitt aside for a second, even just considering lmitt.formula, the actual call attached the lm inside a DirectAdjusted object is quite different from the call to lmitt.formula due to all the monkeying around I do in there. We'd need to do all sort of special casing to even get expand.model.frame to work with the lmitt.formula call the user actually wrote.

Given that DirectAdjusted is a S4 object with slots (@) containing an S3 object with elements ($), we have a somewhat "private" call object right now in terms of the lm$call which can be used for things like expand.model.frame without directly exposing users. We could potentially add a separate @call to DirectAdjusted objects and keep getCall pointed towards the $call, but if we go that route, I suspect it won't be the last time the existence of both a @call and $call causes a headache. If we come across a strong argument for the existence of @call we can revisit, but right now I don't see one. If we do go down that route, we should probably considering using the ... in getCall to allow a choice of $call vs @call.

So ultimately, I think we go with getCall and modify it to change lm to lmitt so if a user really need to eval it.

(On a related note, update(lmitt(...)) doesn't work currently on "main" [giving the same error about unable to find "call"], but does work after adding the getCall.DirectAdjusted above - albeit generating an lm as expected. I'm not sure update.DirectAdjusted should work [given that there's only varieties of it, y ~ 1 and y ~ sbgrp], should we generate a dummy function that just errors? Just switching the lm to lmitt in the returned call won't solve the issue because lm's first argument is "formula" whereas lmitt's first argument is "obj" since it can be either a formula or an lm object. @benthestatistician )

josherrickson commented 1 year ago

Actually, I think we probably shouldn't change lm to lmitt in getCall.DirectAdjusted.

  1. As stated above, "obj" vs "formula".
  2. The formula inside a DirectAdjusted wouldn't actually be accepted into lmitt.formula! (It contains both a 0 (for no intercept) and a call to assigned().) We could of course work around this, but for what gain?

Though perhaps this is an argument for a separate @call slot.....

josherrickson commented 1 year ago

I'm assigning myself for this issue - I disabled my check of consistency between stats::expand.model.frame and flexida:::.expand.model.frame.DA pending an offline discussion of whether we need these two to be in sync going forward.

josherrickson commented 1 year ago

Per offline discussion:

josherrickson commented 1 year ago

There is now a @lmitt_call slot on DA objects that stores the users actual call. getCall() continues to return the $call from the internal lm.

josherrickson commented 1 year ago

Turns out that expand.model.frame is not a generic. We can still define expand.model.frame.DirectAdjusted, but stats::expand.model.frame won't dispatch to it. Do we still want to export it (and thus probably rename the existing .expand.model.frame.DA)?

josherrickson commented 1 year ago

Despite what I said at our call, I ended up implementing both the <=4.2.3 and >=4.3 versions of expand.model.frame. Since both work for us (we weren't hitting the error that spurred the 4.3 change) I figure there is no harm just keeping things version-specific.