benbhansen-stats / propertee

Prognostic Regression Offsets with Propagation of ERrors, for Treatment Effect Estimation (IES R305D210029).
https://benbhansen-stats.github.io/propertee/
Other
2 stars 1 forks source link

`cov_adj()` additional arguments #4

Open benthestatistician opened 3 years ago

benthestatistician commented 3 years ago

cov_adj() is intended as a variant of predict() that assembles independent variables to predict off of in manner better suited to our purposes, and also sanitizes values of the treatment variable (cf #3).

  1. Should we present it as a wrapper to predict(), so that it can be expected to function whenever there's a suitable predict() method and it accepts whatever ... arguments that method does? Update: :-1: I'd like to offer users a way to set those arguments, but I don't have a specific use case. And we'd routinely be setting predict()'s type= argument to "response", perhaps confounding user expectations based on use of predict().
  2. Also, given its function of overwriting values of treatment variable(s), cov_adj() should have a (mandatory?) argument with which way to identify any post-treatment variables appearing as predictors in its first argument. (The funding proposal presents sample code in which cov_adj() calls only for a fitted model, not also the names of the treatment variable(s). This works only if the treatment variable doesn't appear among the fitted model's independent variables, as occurs e.g. in Peters-Belson modeling. ) This makes cov_adj()'s signature even less like that of predict().
josherrickson commented 3 years ago

To get some preliminary testing code, I added a newdata argument to cov_adj() which just passes its argument down to predict. We'll want to eventually make it smarter about obtaining the new data.

josherrickson commented 2 years ago

We'll need to add cov_adj(..., design = ...) optional argument.

We should also examine whether we can avoid double specifying the Design, e.g. lm( ..., offset = cov_adj(..., design = des), weights = ate(design = des)).

Ideas for avoiding the double specification:

  1. Have cov_adj find the Design object in the ate/ett call. (I think the work already done in .get_data_from_model combined with both offset and weights being the same call could make this easy.)

  2. When user creates the covariance adjustment, can we pass something like weights = noweight(des).

benthestatistician commented 2 years ago

It'd be really great if we could have cov_adj find the Design object from the ate/ett call occuring elsewhere within the same invocation of lm() (or glm() or whatever).

If that doesn't work, I don't think it would solve our problem to ask the user to pass or weights=noweight(des) or similar in the weights slot of the covariance adjustment model. (Or more specifically, I think that even if it did solve this problem it would create others that are arguably bigger.)

If hunting down the Design doesn't work out for us, or doesn't work reliably, how's the following for a fallback? Make it possible to fit one's covariance model like so:

cmod <- glm(grad_degree_goal ~ itt(nm_semifinals) + forcing(psat_sum), fam=binomial )

or

cmod <- glm(grad_degree_goal ~ assignments(nm_semifinals) + forcing(psat_sum), fam=binomial )

or even

cmod <- glm(grad_degree_goal ~ post_assignment(nm_finalist) + forcing(psat_sum), fam=binomial )

where neither itt(), assignments(), post_assignment()[^1] nor forcing() does anything to its argument, but perhaps marks its term in the model formula as a "special", and in any event gets referenced later as cov_adj() decides which predictor variables to overwrite and which to leave as-is.

Tagging @adamSales to hear his thoughts about the fallback idea. (Update: Adam perhaps this thread of conversation belongs on the #14 thread; I've commented there accordingly.

[^1]: The operator names assignments() and post_assignment() are just placeholders that I made up on the fly.

josherrickson commented 2 years ago

Finding the Design in the weights argument turned out pretty easy; I've pushed up a commit that does it. It needs hardening and some more testing but otherwise cov_adj can now find the Design if weights = ate(des) is in the call stack somewhere. (In ittestimate, we just pass the Design directly.)

benthestatistician commented 2 years ago

Cool! if the user did weights = pop_size * ate(des), would des still get found?

josherrickson commented 2 years ago

Yup, I need to put in a test, but we luck out that weights is evaluated prior to offset, so as long as the argument to weights = evaluates to a WeightedDesign object, we can find it.

On Feb 19, 2022, at 3:43 PM, Ben @.***> wrote:

Cool! if the user did weights = pop_size * ate(des), would des still get found?

— Reply to this email directly, view it on GitHub https://github.com/benbhansen-stats/flexida/issues/4#issuecomment-1046099275, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAMYXO75H7HJ6SXQ357Q2P3U376GNANCNFSM45XR3PPA. You are receiving this because you commented.

benthestatistician commented 2 years ago

(Is it also the case that the weights= arg gets evaluated before the right-hand side variables in the formula get tracked down? So that if for example we wanted to build logic into forcing() that would check for the presence of a design, then we could do something similar there also?)

josherrickson commented 2 years ago

You mean in the lm call? Unfortunately not:

> we <- function() {
+ print("in weights")
+ return(rep(1, 32))
+ }
> os <- function() {
+ print("in offset")
+ return(rep(1, 32))
+ }
> log2 <- function(x) {
+ print("in log2")
+ return(log(x))
+ }
> m <- lm(disp ~ mpg + log2(drat), data = mtcars, weights = we(), offset = os())
[1] "in log2"
[1] "in weights"
[1] "in offset"
josherrickson commented 2 years ago

Oh interesting - it looks like using get("weights", sys.frame(#)) forces an evaluation of weights even if it hasn't hit naturally yet, so yes, we can exploit this (and the fact that weights gets evaluated prior to offset is irrelevant).

> log2 <- function(x) {
+ browser()
+ return(log(x))
+ }
> m <- lm(disp ~ mpg + log2(drat), data = mtcars, weights = we(), offset = os())
Called from: log2(drat)
Browse[1]> debug at #3: return(log(x))
Browse[2]> get("weights", sys.frame(1))
[1] "in weights"
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
josherrickson commented 2 years ago

Note to self: can we avoid double evaluation of 'weights'? After using 'get', can we then set it in the same frame so when weights is accessed in model.frame we save a bit of time? Or does it even double evaluate, or does the 'get' store the evaluation in the proper frame as well?

Edit: Initial experiments aren't promising. The evaluation of both functions on the RHS of the formula and the weights occurs inside model.frame.default. weights is not a formal argument, instead being passed into .... Inside model.frame.default, we have:

extras <- substitute(list(...))
extras <- eval(extras, data, env)

where the first line obtains the call including weights = we() (from example above), and the second does the actual evaluation. The RHS of the formula gets evaluated earlier (mostly irrelevant, but line is: variables <- eval(predvars, data, env)). We can use

eval(expression(list(...)), sys.frame(#))$weights

to obtain the weights even though they don't formally exist, however, I can't find a way to modify the ... arguments to replace the expression we() (as in above example) with the vector of 1's. Because ultimately what we'd like to do is replace the call to ate(des) with the WeightedDesign object.

[What doesn't work: We can't modify it e.g. substitute(list(...))$weights <- WeightedDesign(...) because that modifies the extracted weights, not the weight that was passed as an argument. We need to modify the actual ... argument because substitute(list(...)) pulls directly from that. We also can't just replace the weights argument in the initial lm as the initial version has already been passed down to model.frame.default.]

How much does this matter? Not that much, as long as weight calculations stay fast and low memory. We're guaranteed that both evaluations of the weights (inside log2 above and directly in weights = argument, or in the real case, in forcing() and directly in weights = argument) will be identical unless we really mangle the data in the log2/forcing work. Currently weight calculations are very fast, but we haven't profiled them to see what would happen if someone had a Design with an extremely large n.

I think if we decide to use forcing as described above, we punt the double evaluation issue down the road (unless anyone else has bright ideas) and leave it as a no-action issue to a) profile and b) think of a solution around it.

benthestatistician commented 1 year ago

The statement of this issue is due for an update. I'd like to retain the idea of an argument with which to name post-treatment variables. I may also want to add another new argument. I propose to take out the stuff about resembling the predict() UI.

Noting this for comment (in particular on removing the resemblance to predict()) in anticipation of updating the head of this issue. @josherrickson

benthestatistician commented 1 year ago

Noting a side-discussion on #100 that's really about cov_adj() args, so maybe belongs here. I suggested that maybe design= should come before newdata= in the order of args, so that users could provide the design via positional arg matching and omit a "newdata=". @josherrickson replied that

We need at least one of ate()/ett() or cov_adj() to have a design= argument, not both. In the 2x2 table of weights yes/no and cov_adj yes/no, if we expect the weights no/cov_adj yes box to have a high proportion of all models, then perhaps that would make sense.

josherrickson commented 1 year ago

No issues with moving cov_adj() away from predict()'s API.

I'm not following the post-treatment stuff, can you provide any insight? Regardless, if we need a way to specify, we can choose either formula or string input. I'd say formula if it's expected to have multiple variable names (as ~ x + y + z is shorter and simpler than c("x", "y", "z")) or string if it's a single. (Or we can support both of course, but perhaps only if there's some more advanced functionality that the formula version supports, e.g. ~ log(x) or somesuch, similar to how lmitt() supports either weights = "ate" or weights = mywt*ate().) If there needs to be a pairing, the formula version could also be new ~ old. (Of course we could insist on something like c(new = "old") or even list(new = "old")).

I suspect that newdata will be more commonly used that design in the argument because we don't "require" a Design inside the call to cov_adj(). (I.e. a user calling cov_adj() external to the model must provide newdata but doesn't have to provide design.) But I don't feel strongly and we can order however you'd like.

benthestatistician commented 1 year ago

Thanks for the feedback, @josherrickson . Motivation for an argument to identify post-treatment variables: In situations where assignment functions as an instrumental variable, such that treatment assignment (z) and treatment received (d) are not the same, I'd like us to support cov_adj models along the lines of y ~ d + x, but with actual d values replaced by an imputation value when we're generating predictions. Eventually, that is.

benthestatistician commented 1 year ago

Potentially a new issue here, but I'll file w/ this related one: @adamSales has some code including

des=rd_design(Z~forcing(R)+unitid(id),data=dat)

    modHet=glmrob(Y~R+Z*x,family=binomial,data=dat)
    QR <- qr(model.matrix(modHet)) #workaround for glmrob oddity
    modHet$qr <- QR
    modHet$rank <- QR$rank

    resHet2 <- lmitt(Y~x,design=des,offset=cov_adj(modHet),data=dat,weights="ate")

I'm worried that perhaps cov_adj() may not be setting Z to the reference level in each of its contribs ot modHet predictions. I.e., it's not just Z that need to be zeroed out but also Z:x. Is it? From what I see in test.cov_adj.R, this scenario isn't currently being tested. Could you take a look, @josherrickson ?

adamSales commented 1 year ago

I checked this (sorry Ben, forgot to mention in my email just now) It appears that cov_adj() does what we want it to in this situation

here's my code:

devtools::load_all()
library(robustbase)

Make data (re-purposed function from a simulation)

makeData <- function(n=1000,Rslope=1,eff=0.5,Xslope=.2,EffSlope=.2,invLink=plogis){
    R <- sample(seq(-1,1,length.out=30),n,replace=TRUE)
    Z <- R>0
    x <- rnorm(n)
    pc <- invLink(Rslope*R+Xslope*x)
    pt <- invLink(Rslope*R+Xslope*x+eff+EffSlope*x)
    P <- ifelse(Z,pt,pc)
    Y <- rbinom(1000,1,P)
    out <- data.frame(R=R,Z=Z,x=x,Y=Y, P,pc,pt,eff=pt-pc)
    out
}

make the data

dat <- makeData()

covariance adjustment model w/ interaction

dat$id=1:nrow(dat)
des=rd_design(Z~forcing(R)+unitid(id),data=dat)

modHet=glmrob(Y~R+Z*x,family=binomial,data=dat)
QR <- qr(model.matrix(modHet))
modHet$qr <- QR
modHet$rank <- QR$rank

resHet2 <- lmitt(Y~x,design=des,offset=cov_adj(modHet),data=dat,weights="ate")

propertee cov_adj (offset is in the @.Data slot)

ca=cov_adj(modHet, newdata=dat,design=des )

what we want cov_adj to be (first double check that model matrix and coef give us what we think) (predict.glmRob doesn't work cuz I messed with the QR stuff)

mm0 = model.matrix(modHet)
mm0[,'ZTRUE'] <- 0
mm0[,'ZTRUE:x'] <- 0

myca=plogis(crossprod(t(mm0),coef(modHet))[,1])

are they the same?

all.equal(myca,ca@.Data, check.attributes=FALSE, check.names=FALSE)

TRUE

benthestatistician commented 1 year ago

Phew. Belated thanks, Adam! The odd simulation results in the background remain to be understood, but there are other potential explanations to explore. (As noted in separate correspondence on that issue.)

adamSales commented 1 year ago

Would it be helpful for me to turn the code I wrote into a formal test?

benthestatistician commented 1 year ago

That would be great, yes! (I take you to be referring to the checks you reported 2 comments up, here.)

adamSales commented 1 year ago

Added test in commit ee27720

benthestatistician commented 4 months ago

Getting back to point 2 in the issue statement, offering users as a way to flag post-treatment variables other than the treatment variable itself that are to be overwritten with a reference value before creating predictions.

My original idea was to have an optional cov_adj() argument in which such variables would be named explicitly. As another alternative, how about exporting something like

post_tx <- function(x) x

and inviting people to use it in model formulas to flag post treatment variables? So a typical use would be something like

des <- rct_design(z~block(clinic)+uoa(patientID), data=D)
cmod <- glm(y ~ post_tx(dose) + age + comorbidities, data=C)
tmod <- lmitt(y~., design=des, offset=cov_adj(cmod), data=Q)

In this usage, each of the predictions would be made using the smallest recorded value of dose. (At least in the first implementation; we might later consider setting newdata$dose to the median or mode of dose, or its median or mode among controls.)

jwasserman2 commented 4 months ago

I think I like the original proposal more, @benthestatistician, because the optional argument could take a list that specifies each post-treatment variable with its reference value; I'm thinking like tidyr::replace_na(). Our current behavior for finding the reference value for the treatment variable specifically could be done by default, but if the user specifies the treatment variable in this list then we can use the reference value they provide. The additional (minor) benefit of the original approach is that users have to update less of their existing code (i.e. fitting the covariance model) if we add this argument to cov_adj().

josherrickson commented 4 months ago

I agree with Josh, I like the separate argument better. That said, we could let post_tx() take in an optional reference= argument to handle that issue.

benthestatistician commented 4 months ago

I like @jwasserman2's idea also, as well as the post_tx() concept w/ @josherrickson's suggestion. If we go the JW route, we might also consider the UI of base::transform() as opposed to tidyr::replace_na()'s, adding a ... to cov_adj() where users put one or more expressions stating which variables are to be overwritten before creating the predictions.

  1. If we're offering to compute the replacement value for var2 when the user doesn't care to specify one, we could let them just give var2 as one of the expressions. But when specifying a named list, one would have to write out var2=NULL, which seems verbose and non-intuitive.
    
    > list(v1=5, v2=NULL)
    $v1
    [1] 5

$v2 NULL

list(v1=5, v2) Error: object 'v2' not found

(I suppose this could be mitigated by allowing `list(v1=5, v2="default")`.)
2.  `transform()` offers you more options, you can do stuff like the following.  
```r
transform(airquality, Ozone=mean(Ozone[Month==5], na.rm=T)) |> head()
Ozone Solar.R Wind Temp Month Day
1  23.6     190  7.4   67     5   1
2  23.6     118  8.0   72     5   2
3  23.6     149 12.6   74     5   3
4  23.6     313 11.5   62     5   4
5  23.6      NA 14.3   56     5   5
6  23.6      NA 14.9   66     5   6
airquality |> head()
Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
  1. What we're doing is more like what transform() does than what replace_na() does: we're overwriting all values of the variable, not just those with NAs, or those in the treatment group. So the analogy to transform() points the user in the right direction.

OTOH, transform() comes with a warning label I don't fully understand:

For programming it is better to use the standard subsetting arithmetic functions, and in particular the non-standard evaluation of argument transform can have unanticipated consequences.

I didn't see anything specific to scare me off in this discussion of dplyr::mutate() vs base::transform(), but I also didn't consider scenarios that might come up in programming (as Wickham discusses here for base::subset()). If one of you does understand it and thinks it should be scaring us, do speak up.

josherrickson commented 4 months ago

The fact that the example for transform in the documentation uses attach makes me very wary of it...

Are you suggesting syntax like this?

cmod <- glm(y ~ post_tx(dose) + age + comorbidities, data=C)
tmod <- lmitt(y~., design=des, offset=cov_adj(cmod, age = 45), data=Q)

First, how does that solve the problem of users letting us pick the value? Were you thinking cov_adj(cmod, age = 45, comborbidities)? I think that would error, you'd need to pass comorbidities as a string. If we're going this approach, I'd suggest we define a function default, so then the user could do cov_adj(cmod, age = 45, comorbidites = default). (This is similar to the family argument to glm and similar).

Second, I worry about us locking into the .... What if down the road we want to use the ... to pass arguments through cov_adj to something lower, e.g. .make_PreSandwichlayer which does in fact take in a ... that we're not currently passing down.

I remain on the side of a list, perhaps with that default() function. Something like cov_adj(cmod, post_tx = list(age = 45, comorbidities = default)).