HertieDataScience / SyllabusAndLectures

Hertie School of Governance Introduction to Collaborative Social Science Data Analysis
MIT License
37 stars 60 forks source link

CI for predicted probabilities #117

Open MarieAgosta opened 7 years ago

MarieAgosta commented 7 years ago

Hi all,

So we manually did the predicted probabilities and now have to estimate the confidence intervals for our predicted probabilities… but the predict argument does not work for us.

We’ve searched around for ways to go about it and have yet to find anything that works. Do you have any suggestions? Below is the code, the bolded bit at the end is the new trouble spot.

Many thanks,

Marie

#manually collecting probabilities

####Black Millennial Woman
#1-probability for below-high-school educated black employed woman who didn't vote in 2008
prob_genY_ed1 <- (exp(-0.3448 - 0.3448 + 0.1249 + 0.6389))/(1+exp(-0.3448 - 0.3448 + 0.1249 + 0.6389)) 
#2-probability for high-school educated black employed woman who didn't vote in 2008
prob_genY_ed2 <- (exp(-0.3448 + 0.5263 + 0.1249 + 0.6389))/(1+exp(-0.3448 + 0.5263 + 0.1249 + 0.6389)) 
#3-probability for some post-high-school educated black employed woman who didn't vote in 2008
prob_genY_ed3 <- (exp(-0.3448 + 0.7955 + 0.1249 + 0.6389))/(1+exp(-0.3448 + 0.7955 + 0.1249 + 0.6389)) 
#4-probability for bachelor-educated black employed woman who didn't vote in 2008
prob_genY_ed4 <- (exp(-0.3448 + 0.8990 + 0.1249 + 0.6389))/(1+exp(-0.3448 + 0.8990 + 0.1249 + 0.6389)) 
#5-probability for graduate-educated black employed woman who didn't vote in 2008
prob_genY_ed5 <- (exp(-0.3448 + 1.5099 + 0.1249 + 0.6389))/(1+exp(-0.3448 + 1.5099 + 0.1249 + 0.6389)) 

####Black Baby Boomer Woman
#1-probability for below-high-school educated black employed woman who didn't vote in 2008
prob_boomer_ed1 <- (exp(-0.87481 - 0.87481 + 0.22553 -0.76681))/(1+exp(-0.87481 - 0.87481 + 0.22553 -0.7668)) 
#2-probability for high-school educated black employed woman who didn't vote in 2008
prob_boomer_ed2 <- (exp(-0.87481 + 0.81340 + 0.22553 -0.76681))/(1+exp(-0.87481 + 0.81340 + 0.22553 -0.76681)) 
#3-probability for some post-high-school educated black employed woman who didn't vote in 2008
prob_boomer_ed3 <- (exp(-0.87481 + 1.12412 + 0.22553 -0.76681))/(1+exp(-0.87481 + 1.12412 + 0.22553 -0.76681)) 
#4-probability for bachelor-educated black employed woman who didn't vote in 2008
prob_boomer_ed4 <- (exp(-0.87481 + 1.69306 + 0.22553 -0.76681))/(1+exp(-0.87481 + 1.69306 + 0.22553 -0.76681)) 
#5-probability for graduate-educated black employed woman who didn't vote in 2008
prob_boomer_ed5 <- (exp(-0.87481 + 1.54687 + 0.22553 -0.76681))/(1+exp(-0.87481 + 1.54687 + 0.22553 -0.76681)) 

#creating a dataframe of predicted variables for plotting and other uses
blackfem_genY_ed <- c(prob_genY_ed1, prob_genY_ed2, prob_genY_ed3, prob_genY_ed4, prob_genY_ed5)
blackfem_boomer_ed <- c(prob_boomer_ed1, prob_boomer_ed2, prob_boomer_ed3, prob_boomer_ed4, prob_boomer_ed5)
education <- cbind(blackfem_genY_ed, blackfem_boomer_ed)

#creating a data frame of fitted values for a hypothetical black woman

blackfem_genY <- with(anes_genY, data.frame(female = 1, race = 2, vote_2008 = 0, unemployed = 0, education = factor(1:5))) 

#this is an alternative we've played around with that includes the weighting columns
#used by the survey package as well
blackfem_genY <- with(anes_genY, data.frame(caseID = factor(1:5), weights = .506, psu = 1, strata = 6, female = 1, race = 2, vote_2008 = 0, unemployed = 0, education = factor(1:5))) 

#predict probability point estimates for each fitted value.

blackfem_genY$predicted_prob <- prediction(M_genY, data = blackfem_genY, type = "response")

blackfem_genY$predicted_prob <- predict(M_genY, newdata = blackfem_genY, type = "response", se = TRUE)
#another alternative
blackfem_genY$predicted_prob <- predict(M_genY, newdata = blackfem_genY, type = "response", 
                                        design = ANESdesign_genY)

#Now to plot this with confidence intervals.
#First, add the confidence intervals

blackfem_genY <- cbind(blackfem_genY, predict(M_genY, newdata = blackfem_genY, type="link", se=TRUE))
blackfem_genY <- within(blackfem_genY, {
  prob <- plogis(fit)
  LL <- plogis(fit - (1.96 * se.fit))
  UL <- plogis(fit + (1.96 * se.fit))
})
christophergandrud commented 7 years ago

Hi @MarieAgosta

Some small notes to make it easier for others to help:

Re predicted probabilities: you are repeating a lot of code. Whenever you find yourself repeating a lot of code, it's time to create a function, e.g. something like:

# Note assumes a model with 3 explanatory variables
predicted_fit <- function(alpa, beta1, beta2, beta3, x1, x2, x3){
    fit <- alpha + beta1 * x1 + beta2 * x2 + beta3 * x3
    pred_prob <- (exp(fit)) / (1 + exp(fit)) 
    return(pred_prob)
}

You then enter the values like: sample_fitted <- predicted_fit(alpha = 1, beta1 = 0.434 and so on.

To really cut down on the code repetition you could then stick this in a loop, but don't worry about that now.

Confidence intervals

I'm a little confused about what is going on here. It seems like you are manually finding the predicted probabilities and then also using prediction to find them? Why do both?

For the confidence intervals, I'm confused about your use of plogis.

Wouldn't you instead want find the fitted values, then use these to come up with the confidence intervals? You could combine these steps with the above function, something like:

# Note assumes a model with 3 explanatory variables
predicted_fit <- function(alpha, beta1, beta2, beta3, x1, x2, x3, se){
    fit <- alpha + beta1 * x1 + beta2 * x2 + beta3 * x3
    lb <- fit - (1.96 * se)
    ub <- fit + (1.96 * se) 
    predictions <- c(lb, fit, ub)

    probs <- function(x) {(exp(x)) / (1 + exp(x) }
    predictions <- probs(predictions)
    return(predictions)
}

You could then rbind multiple results of this into a data frame and plot.

Note that I haven't actually run this code, but it probably is a good start.

MarieAgosta commented 7 years ago

Thanks professor and sorry, the predict argument is in the old code, we just haven't removed it yet. I shouldn't have included it.

christophergandrud commented 7 years ago

No worries. Let me know how it goes.

MarieAgosta commented 7 years ago

Thank you professor. We tried out the code, we adapted the code from what you wrote but we're struggling with the standard errors. There are five coefficients in our model so there will be five standard errors. How would we account for all of the standard errors?

Attempt using suggestions from prof

predicted_fit <- function(alpha, beta1, beta2, beta3, beta4, beta5, x1, x2, x3, x4, x5, se1, se2, se3, se4, se5) { fit <- alpha + beta1 x1 + beta2 x2 + beta3 x3 + beta4 x4 + beta5 x5 lb <- fit - (1.96 se) ub <- fit + (1.96 * se) predictions <- c(lb, fit, ub)

probs <- function(x) {(exp(x)) / (1 + exp(x)) }
predictions <- probs(predictions)
return(predictions) 

}

christophergandrud commented 7 years ago

Good point, sorry I neglected that earlier.

predicted_fit <- function(alpha, beta1, beta2, beta3, beta4, beta5, 
    x1, x2, x3, x4, x5, 
    se1, se2, se3, se4, se5)
{
lb_fun <- function(coef, se){
    lb <- coef - (1.96 * se)
    return(lb)
}

ub_fun <- function(coef, se){
    ub <- coef + (1.96 * se)
    return(ub)
}

fit <- alpha + beta1 * x1 + beta2 * x2 + beta3 * x3 + beta4 x4 + beta5 x5
lb <- alpha + lb_fun(beta1, se1) * x1  + lb_fun(beta2, se2) * x2  + lb_fun(beta3, se3) * x3
ub <- alpha + ub_fun(beta1, se1) * x1  + ub_fun(beta2, se2) * x2  + ub_fun(beta3, se3) * x3
predictions <- c(lb, fit, ub)

probs <- function(x) {(exp(x)) / (1 + exp(x)) }
predictions <- probs(predictions)
return(predictions) 
}

This code is really nasty and really the best way to do this would be to hack into the fitted model object and extract these programatically, but I don't have much more time today. Give this a shot and see what happens.

MarieAgosta commented 7 years ago

Thank you!!

kardishcj commented 7 years ago

Actually one quick follow-up question. For the upper and lower bounds, should the code stop at the 3rd beta or go all the way to the 5th (or however many coefficients there are in the model)? It does appear to work, though, yes. fit <- alpha + beta1 x1 + beta2 x2 + beta3 x3 + beta4 x4 + beta5 x5 lb <- alpha + lb_fun(beta1, se1) x1 + lb_fun(beta2, se2) x2 + lb_fun(beta3, se3) x3 ub <- alpha + ub_fun(beta1, se1) x1 + ub_fun(beta2, se2) x2 + ub_fun(beta3, se3) * x3 predictions <- c(lb, fit, ub)

christophergandrud commented 7 years ago

Sorry, right it should include all of the coefficient point estimates beta I just did 3 for expediency.

kardishcj commented 7 years ago

Right, I figured. Is it possible that there's an error in the confidence-interval calculation? It's giving very wide confidence bands, which of course look odd when plotting. Or is that simply a function of having very high standard errors and imprecise estimates? Looking at the model output, they certainly don't seem that high at least.

christophergandrud commented 7 years ago

Can you given an example, including both the coefficient estimates/standard errors and predicted probabilities?

Remember that it is difficult to mentally relate these two as they are are different scales--hence the need to do predicted probabilities.

Also, stepping back, in class I did cover a simulation approach to estimating uncertainty that is especially useful when it is numerically tricky to mathematically come up with the uncertainty bounds. Do take a look.

kardishcj commented 7 years ago

I can take a look at the simulation approach. Below is a text-version stargazer table of the model across the four generations, followed by predicted probabilities (w/ upper and lower bounds) for a black, female Millennial of moderate income. I apologize for how the table looks but it doesn't really copy-paste well. Parentheses are standard errors.

                  Dependent variable:         
         -------------------------------------
             Probability of voting in 2012    
          Gen Y    Gen X    Boomers   Silent  
           (1)            (2)         (3)            (4)   

education 1.317 1.133 1.377 1.404 (0.134) (0.118) (0.120) (0.192)

female 1.112 1.360 1.206*** 0.484
(0.235) (0.247) (0.253) (0.481)

vote_2008 2.956 6.727 14.474 23.935 (0.248) (0.271) (0.254) (0.625)

black 2.640 1.567 0.482 6.277*** (0.393) (0.426) (0.357) (0.869)

hispanic 1.878* 0.680 0.883** 0.262
(0.352) (0.298) (0.390) (0.832)

income 1.217 1.161 1.196 1.665 (0.079) (0.087) (0.069) (0.215)

Constant 0.240 0.378 0.224 0.144
(0.413) (0.495) (0.444) (0.812)

Predicted probabilities for black, female Millennial at different levels of education and moderate income

Below high school lower: 0.1458708 predicted: 0.4814261 upper: 0.4199645

High school lower: 0.1476030 predicted: 0.5501755 upper: 0.5534615

Some post-high lower: 0.1493521 predicted: 0.6170605 upper: 0.6796692

Bachelor lower: 0.1511182 predicted: 0.6797885 upper: 0.7841197

Graduate lower: 0.1529015 predicted: 0.7366269 upper: 0.8614545

christophergandrud commented 7 years ago

Those don't look wildly off. We might be missing something, but I would check by comparing them to the simulated intervals.

kardishcj commented 7 years ago

Strange, but my Zelig simulations look much different than my predicted probabilities (shown below as plotted on ggplot).

This is the ggplot of the predictions for a black female of average income at various levels of education (top line is Millennial, bottom is Baby Boomer, still need to work on the aesthetics, I know)

millennialvboomer

Then below is a Millennial followed by Baby Boomer simulation using the same characteristics as the ggplot above.

Millennial millennial

Baby Boomer babyboomer

These Zelig simulations look quite different from the ggplot above. And there's not much of a distinction between the two simulations, which is another stark contrast with the ggplot.

christophergandrud commented 7 years ago

What are the solid lines in the ggplot one?

christophergandrud commented 7 years ago

btw, I would go with the Zelig predictions (we may have messed up the formula somewhere along the way) and they are much more straightforward to interpret.

kardishcj commented 7 years ago

Sorry, I was having trouble labeling those lines but wanted to submit my question. The top line in the ggplot is the Millennial at various levels of education, and the bottom is Baby Boomer. The big blob behind them are the confidence intervals. Both have the same qualities (black female of moderate income who didn't vote in 2008).

christophergandrud commented 7 years ago

Is there largely no difference in the coefficients for Millennial and Baby Boomer?

kardishcj commented 7 years ago

Yes, largely, with a few notable exceptions (strength of affect of voting in 2008, being black, or being hispanic). But effects of education, income, and being a woman are quite similar.

christophergandrud commented 7 years ago

Ah, I wonder then why the predicted probability lines are so different (though the shaded intervals overlap) in the numeric ggplot version

kardishcj commented 7 years ago

Yeah, I hadn't thought about it that way. So it seems like you're saying Zelig is more reliable, so perhaps we just run with that. It looks like the code for the predicted probabilities is fine, though. I mean, it should be simply [exp(coef)/(1+exp(coef))], which it is, so I'm not sure where there's room for error.

christophergandrud commented 7 years ago

I would just go with Zelig. We could definitely find the bug, but this would only get us what we already have from Zelig.

Hopefully, you learned some useful things about programming with functions though.

kardishcj commented 7 years ago

So we really wanted to be able to have lines of predicted probability for two generations on one plot for the purpose of visual comparison, which makes ggplot attractive. It doesn't appear that we can do that from Zelig, which led me to wonder whether we could use Zelig objects in ggplot or extract the info from the objects for use in ggplot, which led me to here (something you wrote before, something you probably already knew):

https://gist.github.com/christophergandrud/2503364

Except this part below doesn't work for me: Extract expected values from simulations Model.demAge.e <- Model.DemSim$qi Model.demAge.e <- data.frame(Model.demAge.e$ev) Model.demAge.e <- melt(Model.demAge.e, measure = 1:86)

I try it like this and get the following error message: Z_genY <- zelig(vote_2012 ~ education + female + vote_2008 + black + hispanic + income, model = "logit.survey", weights=~weights, data = anes_genY) setZ_genY <- setx(Z_genY, education = 1:5, black = 1, hispanic = 0, income = 3, female = 1, vote_2008 = 0) simZ_genY <- sim(Z_genY, x = setZ_genY) values <- simZ_genY$qi values <- as.data.frame(values$ev)
Error in values$ev : object of type 'closure' is not subsettable

Any insight there or recommendations about whether trying to do what we'd like to do is ill-advised for whatever reason. It just seems to make sense to make better visual comparisons, because that's kind of the crux of what we're doing -- making comparisons between generations. The regular ci.plot stuff works, of course, but it doesn't allow us to make these comparisons using a single plot.

kardishcj commented 7 years ago

Sorry, last question (I hope): I was taught that really the best goodness-of-fit measure for a logit model is % correctly predicted, but seeing as we're avoiding the predict function because of our issues with that function when we use our survey-weighted model, how would you advise doing that?

Would the test at the end of this UCLA page (comparing the model vs a "null model" of just an intercept) suffice? http://www.ats.ucla.edu/stat/r/dae/logit.htm

christophergandrud commented 7 years ago

I believe Zelig has a function--rocplot--for creating ROC plots. You could try this out.

christophergandrud commented 7 years ago

Re the earlier question about putting two sets of Zelig results on the same plot:

kardishcj commented 7 years ago

You mean make sure the scale is the same but maintain two separate plots and simply present those two plots side-by-side? Do you have any recommendations on the goodness-of-fit question I presented after the plot question?

christophergandrud commented 7 years ago
kardishcj commented 7 years ago

Ah, didn't realize the ROC plots were used for that purpose. I've never used them before. Thanks!

christophergandrud commented 7 years ago

No problem

kardishcj commented 7 years ago

So we wrapped up our presentation but I also created an ROC plot to talk goodness of fit but am not totally comfortable with interpretation of it. My understanding is that you want your lines to be as close to the top-right corner as possible, (far beyond the diagonal), which would indicate that our models are not so so great in this department. Is that generally correct, and is there anything you'd add?

rocplot(Z_genY, Z_boomer, col1 = 'blue', col2 = 'red') legend("left", c("Millennial","Baby Boomer"), lty=c(1,1), lwd=c(2.5,2.5), col=c('blue','red'))

rocplot

christophergandrud commented 7 years ago

For the presentation that is mostly fine (and pretty much all I would imagine you will have time to do).

Maybe also mention something about how the graph shows what proportion of "Votes" and "Not Votes" are correctly predicted using different probability thresholds. The straight dashed line represents what we would expect in an uninformative model. Or something like that, but for the presentation keep it brief, you only have 10 minutes.