IQSS / Zelig

A statistical framework that serves as a common interface to a large range of models
http://zeligproject.org
110 stars 43 forks source link

Tobit expected values incorrect? #314

Open izahn opened 7 years ago

izahn commented 7 years ago

Consider these two models (from the tobit vignette)

m1 <- survreg(Surv(durable, durable>0, type='left') ~ age + quant,
              data=tobin, dist='gaussian')

z.out <- zelig(durable ~ age + quant, model = "tobit", data = tobin)

coefs and vcov are the same:

all.equal(vcov(z.out)[[1]], vcov(m1))
## TRUE
all.equal(coef(z.out), coef(m1))
## TRUE

but the expected values as very different

predict(m1,
        newdata = data.frame(lapply(model.frame(m1)[-1], mean)),
        se.fit = TRUE)
## $fit
##        1 
## -2.067679 
##
## $se.fit
##        1 
## 1.962933
ev <- sim(setx(z.out))$get_qi(qi = "ev")
c(mean = mean(ev), sd = sd(ev))
##      mean        sd 
## 1.5536944 0.6302974 

Do we expect these to be so different? If so, why?

cchoirat commented 7 years ago

We wrap 'AER' ( https://stats.stackexchange.com/questions/149091/censored-regression-in-r). We should take a closer look, but it might boil down to predicting Y vs Y*.

On Thu, Nov 9, 2017 at 8:59 PM, Ista Zahn notifications@github.com wrote:

Consider these two models (from the tobit vignette)

m1 <- survreg(Surv(durable, durable>0, type='left') ~ age + quant, data=tobin, dist='gaussian')

z.out <- zelig(durable ~ age + quant, model = "tobit", data = tobin)

coefs and vcov are the same:

all.equal(vcov(z.out)[[1]], vcov(m1))

TRUE

all.equal(coef(z.out), coef(m1))

TRUE

but the expected values as very different

predict(m1, newdata = data.frame(lapply(model.frame(m1)[-1], mean)), se.fit = TRUE)

$fit

1

-2.067679

$se.fit

1

1.962933

ev <- sim(setx(z.out))$get_qi(qi = "ev") c(mean = mean(ev), sd = sd(ev))

mean sd

1.5536944 0.6302974

Do we expect these to be so different? If so, why?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/IQSS/Zelig/issues/314, or mute the thread https://github.com/notifications/unsubscribe-auth/AD-wZS4P2Hguu997yScPFP3RsTzaoAuGks5s064HgaJpZM4QY9mo .

christophergandrud commented 7 years ago

Before digging into this further, is this issue adequately addressed by https://github.com/IQSS/Zelig/issues/315?

izahn commented 7 years ago

No, expected values for tobit models are not as documented. Either the documentation or the implimentation is wrong. I suspect the implimentation.

On Nov 11, 2017 5:05 AM, "Christopher Gandrud" notifications@github.com wrote:

Before digging into this further, is this issue adequately addressed by #315 https://github.com/IQSS/Zelig/issues/315?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/IQSS/Zelig/issues/314#issuecomment-343654016, or mute the thread https://github.com/notifications/unsubscribe-auth/AATm0zy2X2d595XJKx1i4H9FIImg5mP2ks5s1XGBgaJpZM4QY9mo .

christophergandrud commented 7 years ago

Thanks for the info. I'll take a look.

christophergandrud commented 7 years ago

For reference:

https://github.com/IQSS/Zelig/blob/0f4e1f17f79be4d3aafa238f45dd6c195d37a424/R/model-tobit.R#L148-L163

christophergandrud commented 7 years ago

This is a hasty test to see if the Y, rather than Y* is the issue. The example and Y calculation procedure is taken directly from: https://stats.stackexchange.com/a/149529

# Create data 
N = 10
f = rep(c("s1","s2","s3","s4","s5","s6","s7","s8"),N)
fcoeff = rep(c(-1,-2,-3,-4,-3,-5,-10,-5),N)
set.seed(100) 
x = rnorm(8*N)+1
beta = 5
epsilon = rnorm(8*N,sd = sqrt(1/5))
y.star = x*beta+fcoeff+epsilon ## latent response
y = y.star 
y[y<0] <- 0 ## censored response
test_data <- data.frame(y = y, zero = 0, x = x, f = f)

# Estimate with AER -------------------------------------------------------
library(AER)
fit <- tobit(y~ 0 + x + f, data = test_data)

# E[Y*] ---------------
mean(predict(fit))

## [1] 0.9823243

# E[Y] -----------
mu <- fitted(fit)
sigma <- fit$scale
p0 <- pnorm(mu/sigma)
lambda <- function(x) dnorm(x)/pnorm(x)
ey0 <- mu + sigma * lambda(mu/sigma)
ey <- p0 * ey0
c(mean = mean(ey, na.rm = T), sd = sd(ey, na.rm = T))

##    mean       sd 
## 2.544692 3.354458 

# Estimate with Zelig -------------------------------------------------------
zfit <- zelig(y~ 0 + x + f, data = test_data, model = "tobit", cite = FALSE)

# E[Y*] ---------------
mean(unlist(predict(zfit)))

## [1] 0.9823243

ev <- sim(setx(zfit))$get_qi(qi = "ev")
c(mean = mean(ev), sd = sd(ev))

# E[Y] ???? -----------
ev <- sim(setx(zfit))$get_qi(qi = "ev")
c(mean = mean(ev), sd = sd(ev))

##     mean        sd 
## 0.6255104 0.1359505 

Note that in this example:

mean(y)

## [1] 2.51392

sd(y)

## [1] 3.353757

which is almost identical to the E[Y] using Achim's procedure from Stack Exchange, but very different from Zelig. The ev from Zelig is also fairly different from the E[Y*] from predict.

@cchoirat am I missing something about what Zelig is trying to achieve (or misinterpreted the results from this test)?