benbhansen-stats / propertee

Prognostic Regression Offsets with Propagation of ERrors, for Treatment Effect Estimation (IES R305D210029).
https://benbhansen-stats.github.io/propertee/
Other
2 stars 0 forks source link

glmrob methods #125

Closed benthestatistician closed 3 months ago

benthestatistician commented 1 year ago

We could accommodate cov_adj models fit using robust binary regression in much the same way we accommodate cov_adj's fitt with robust linear regression, by setting up estfun.glmrob() and bread.glmrob() methods, accepting certain objects fit with robustbase::glmrob().

The default method used by robustbase::glmrob(<...>, family=binomial) is "Mqle", which

fits a generalized linear model using Mallows or Huber type robust estimators, as described in Cantoni and Ronchetti (2001) and Cantoni and Ronchetti (2006)

Following section 2 of cantoniRonchetti01, Robust Generalized Linear Models.pdf, the variance-covariance of the coefficients issuing from such a model is $M^{-1}QM^{-1}$. Among the artifacts stored with the model are "matM" and "matQ", with dimensions matching the length of the coefficient vector. After a check to ensure that the fitted object has method equal to "Mqle", bread.glmrob() might just return the inverse of matM. Reconstructing the appropriate estfun could take a little more work, as matQ isn't the estfun so much as the inner product of the estfun with itself.

benthestatistician commented 1 year ago

This would be to support @adamSales's and @kirkvanacore's work, involving an RDD with a binary outcome. After speaking with them I realized that I had overestimated the package's readiness for what they wanted to do, suggesting that it was ready for cov_adj's fit using robustbase::glmrob(). It's not, or not yet; this issue records my notes on what would have to happen.

I haven't explored what would have to happen to accommodate other robustbase::glmrob() fitting methods than "Mqle". It's probably not necessary to restrict to family=binomial; the same things will probably work for anything fit with the "Mqle" method.

Adam and Kirk, it was robustbase::glmrob() you were exploring, correct? (As opposed to say robust::glmRob(), which I haven't looked into, and whose justification may or may not make the connections to $\epsilon$-contamination that are important to us, and that Cantoni and Ronchetti do seem to make.) Also, please close this issue if you decide this isn't the direction you'd like to go in.

benthestatistician commented 1 year ago

@kirkvanacore pls note query to you above on this thread, posted before you had joined the repo. Also:

kirkvanacore commented 1 year ago

Great thanks @benthestatistician, I will talk with @adamSales soon about working on this issue.

benthestatistician commented 1 year ago

I may have been making a mountain out of a molehill in my meeting earlier today with @kirkvanacore and @adamSales. We were discussing this block of robustbase:::glmrobMqle():

    eval(comp.Epsi.init)
    alpha <- colMeans(eval(Epsi) * w.x * sni/sV * dmu.deta * X)
    DiagA <- eval(Epsi2) / (ni*sV^2)* w.x^2* (ni*dmu.deta)^2
    matQ  <- crossprod(X, DiagA*X)/nobs - tcrossprod(alpha, alpha)

Turns out it's easy to align most of these variable names with those used in the Cantoni & Ronchetti paper I linked to in the above. From p.1023 and 1024 (ie pp 3-4 of the PDF): image and image So estfun.glmrob() just needs to return X*sqrt(DiagA) - matrix(alpha, nrow(X), length(alpha), byrow=TRUE), I'm pretty sure[^fn].

[^fn]: Pending either a second pair of eyes or a second look from me; coding could proceed in the meantime.

From reading the surrounding code, variables ni and sni are weights and square-root weights, while sV is the square-root of the product of psi and the glm's variance function. (EDIT: It's better to read the source code itself than the actual function, eg as delivered by page(robustbase:::glmrobMqle). By the time you invoke the function in R, Martin Maechler's helpful inline comments have all been stripped out.)

In another boo-boo, I said these models have a separately estimated scale parameter, so that the dimension of the Q matrix is one plus the number of coefficients. Re-reading the paper reminds me that these authors avoided estimating the scale parameter. Consequently, corresponding estfun() values should have as many columns as there are regression parameters, and that's also the dimension of the square matrix matQ that's returned with the glmrob object.

adamSales commented 1 year ago

Just added code for estfun and bread methods for glmrob objects. (see commit 8063401a3c7e35f15f37a2a1ff04a8a32fb4159a)

I want to note something, though: The estfun method does not, as suggested by @benthestatistician,

return X*sqrt(DiagA) - matrix(alpha, nrow(X), length(alpha), byrow=TRUE)

which is derived from the formula for the Q matrix in Cantoni & Ronchetti p. 1024. That formula used a theoretical calculation for $E[\psi_c(r_i)^2]$ (where $\psic(\cdot)$ is the Huber function), which is appropriate for estimating the Q matrix, but not (I think) for estimating $E[\Psi\Omega^T]$ from Carrol et al.'s $B{12}$ matrix (where $\Omega$ and $\Psi$ are estimating equations from two models), which is what we're after, right?

(While @benthestatistician is right that all of the components of that calculation are recoverable from the fitted glmrob model, so that one may re-calculate the Q matrix using the formula in the paper, if you try to derive the estimating equations as psi <- X*sqrt(DiagA) - matrix(alpha, nrow(X), length(alpha), byrow=TRUE)/n, then crossprod(psi)/n does not give you exactly the right Q matrix, I think because the formula in Cantoni & Ronchetti assumes that alpha and X*sqrt(DiagA) are exactly uncorrelated, while in a finite sample, they will only be approximately uncorrelated.)

Instead, my code is based on equation (7) from Cantoni & Ronchetti: image

along with some snippets from the robustbase source code, which are annotated in the code comments. Unfortunately, you can't check this by using this to recover the exact matQ that's returned in the fitted glmrob model object, since my formula doesn't make use of distributional theory to calculate $E[\psi_c(r_i)^2]$, so the correspondence is only approximate.

But I'm pretty sure it's right...

OH also--I put it in a file called R/glmrobMethods.R which is probably the wrong place.

josherrickson commented 1 year ago

@adamSales File location is fine. A few issues:

  1. Can you add a basic test, just so the code gets run during testing? (E.g. a single call to it is sufficient.) No need for full coverage at this point. The test can go in a file such as tests/testthat/test.estfun.glmrob.R and need not even be a formal testthat test, just as long as the code is run.
  2. Using internal functions (:::) is very frowned upon from CRAN - that's going to be a challenge for each submission. I have a vague recollection of finding a workaround in the past, but a quick search online wasn't helpful. Can we find an alternative?
  3. In the final line, there's a w.x object that is not being defined at any earlier place in the code that I can identify.
adamSales commented 1 year ago

Thanks @josherrickson

As for internal functions, can I copy-and-paste code from robustbase (with proper attribution, whatever that would be)? The code is pretty simple so it would be easy to do

josherrickson commented 1 year ago

That may work (license permitting). Before going down that route, let me do a little more research. I still have a nagging sense that this was something I solved before more elegantly.

josherrickson commented 1 year ago

Ok figured it out. Rather than pkg:::fn, use utils::getFromNamespace("fn", "pkg"). This passes all CRAN checks that I can run. I can't promise CRAN won't flag us on this down the road though - perhaps we should have a discussion of whether we should just do the copy-and-paste approach? @benthestatistician feel free to weigh in here. "robustbase" is GPL3.

I'll push up a commit shortly addressing this and a few other minor issues to fix checks. 1. and 3. from my list above remain.

benthestatistician commented 1 year ago

Thanks kindly for the commit and informative comment, Adam. Your analysis of the situation does strike me as correct. In particular, you're quite right that our need for the cross product of its and another set of estfuns precludes us from taking the $B(\hat{\theta})$ approach, and our $\hat{B}(\hat{\theta})$ are not expected to line up exactly with their $B(\hat{\theta})$.

Re internal functions, I like Josh's solution over copy/paste. (If CRAN decides they don't like that, presumably lots of people will be in the same boat and hopefully another solution will turn up on forums.)

adamSales commented 1 year ago

There are still some issues due to differences between glmrob objects and other types of covariance models. I think they are all pretty straightforward to solve, but I don't want to break anything!

Example:

library(robustbase)
data(lsoSynth)
lsoSynth$Z=lsoSynth$R<0
lsoSynth$id=1:nrow(lsoSynth)
lsoSynth$ybin=lsoSynth$nextGPA>1.18

des=rd_design(Z~forcing(R)+unitid(id),data=lsoSynth)
mod <- glmrob(ybin~R+Z,family=binomial,data=lsoSynth)

res <- lmitt(ybin~1,design=des,offset=cov_adj(mod),data=lsoSynth)

This returns an error:

Error in seq_len(p) : argument must be coercible to non-negative integer
In addition: Warning message:
In seq_len(p) : first element used of 'length.out' argument

This error comes from the .get_ca_and_prediction_gradient. The reason is that there is no rank element in a glmrob object. This fix doesn't work:

mod$rank = qr(model.matrix(mod))$rank # = 3
res <- lmitt(ybin~1,design=des,offset=cov_adj(mod),data=lsoSynth)

because (1) setValidity("PreSandwichLayer" ... also looks for a rank element, and (2) .get_ca_and_prediction_gradient requires other stuff from the qr element that isn't present. The following fix avoids the error message:

QR <- qr(model.matrix(mod))
mod$qr <- QR
mod$rank <- QR$rank
res <- lmitt(ybin~1,design=des,offset=cov_adj(mod),data=lsoSynth)

I was trying to think of how to make this sort of thing automatic in the package, but I decided to leave that to the experts. Anyway, our troubles are not yet over:

> summary(res)

Call:
lmitt(ybin ~ 1, design = des, offset = cov_adj(mod), data = lsoSynth)

 Treatment Effects :
       Estimate Std. Error t value Pr(>|t|)
Z.TRUE  0.07783        NaN     NaN      NaN
Std. Error calculated via type "CR0"

The problem here is the following lines from .get_a11_inverse:

  nc <- sum(summary(cmod)$df[1L:2L])

  out <- sandwich::bread(cmod) / nc

but summary.glmrob doesn't return a df element, so nc = 0.

I think if we set another value for nc it will work.

benthestatistician commented 1 year ago

Thanks for the informative update & statement of outstanding issues, @adamSales . I might be able to advise on some of the outstanding issues, but I propose to wait to give @jwasserman2 a chance to chime in, once he returns from some time away and has a chance to get caught up. (Josh feel free to pass back to me, online or offline, after addressing what you're in a position to address.)

benthestatistician commented 4 months ago

@jwasserman2 I intended a shoutout to you here some months ago, but it looks like my timing was probably off. Can you take a look at Adam's last comment above and let us know if you see any hidden wrinkles? It looks to me like it should be straightforward to clean up, interested to know whether you'll agree.

Another comment: Back in August Adam had committed a file hitting some tests, as @josherrickson had requested and later addressed by creating test.glmrob_methods.R. It's just that Adam's file was called test.estfun.glmrob.R~, so I imagine you and Josh E overlooked it. My sense is that the test.estfun.glmrob.R~ material should either go into test.glmrob_methods.R, or simply overwrite test.glmrob_methods.R. Thoughts, either Josh?

jwasserman2 commented 4 months ago

I'll use my question on #142 about our updated .get_a11_inverse() as my reply to the first comment here as well.

As for the tests, I think the content in test.estfun.glmrob.R~ should be added to what's in test.glmrob_methods.R so we have tests for the format of the outputs as well as the calculations. The only discussion point is that the tests intest.estfun.glmrob.R~ use lsoSynth, a dataset we don't load in propertee. The results of the tests don't look reliant on that dataset though, so could we re-write the tests so they use either simdata or STARdata?

benthestatistician commented 4 months ago
josherrickson commented 4 months ago

@adamSales just to clarify why Ben is asking, files ending in ~ can often come from backup files saved from editors like Emacs or vim. There's no issue including the tests, we just wanted to make sure that neither a) you have a more recent version saved but uncommitted, nor that b) you purposely deleted those tests but the backup file hung around.

benthestatistician commented 4 months ago

To fix .get_ca_and_prediction_gradient()'s problem with glmrob objects, it seems to Josh W and I that .get_ca_and_prediction_gradient() should become an S3 generic with the current implementation as default method but also a separate method for glmrob objects. The glmrob method will have a lot of code in common with the default method, but we can bear this cost. (JW had some ideas about mitigating it that I'll let him post about if/as appropriate.)