Open MarieAgosta opened 7 years ago
Hi @MarieAgosta
Some small notes to make it easier for others to help:
Provide a link to the data so that the issue is reproducible.
More clearly specify what you mean by it "doesn't work".
It's good practice not to overload the question. Put only the minimal amount of code and data needed to reproduce the problem.
Re predicted probabilities: you are repeating a lot of code. Whenever you find yourself repeating a lot of code, it's time to create a function, e.g. something like:
# Note assumes a model with 3 explanatory variables
predicted_fit <- function(alpa, beta1, beta2, beta3, x1, x2, x3){
fit <- alpha + beta1 * x1 + beta2 * x2 + beta3 * x3
pred_prob <- (exp(fit)) / (1 + exp(fit))
return(pred_prob)
}
You then enter the values like: sample_fitted <- predicted_fit(alpha = 1, beta1 = 0.434
and so on.
To really cut down on the code repetition you could then stick this in a loop, but don't worry about that now.
Confidence intervals
I'm a little confused about what is going on here. It seems like you are manually finding the predicted probabilities and then also using prediction
to find them? Why do both?
For the confidence intervals, I'm confused about your use of plogis
.
Wouldn't you instead want find the fitted values, then use these to come up with the confidence intervals? You could combine these steps with the above function, something like:
# Note assumes a model with 3 explanatory variables
predicted_fit <- function(alpha, beta1, beta2, beta3, x1, x2, x3, se){
fit <- alpha + beta1 * x1 + beta2 * x2 + beta3 * x3
lb <- fit - (1.96 * se)
ub <- fit + (1.96 * se)
predictions <- c(lb, fit, ub)
probs <- function(x) {(exp(x)) / (1 + exp(x) }
predictions <- probs(predictions)
return(predictions)
}
You could then rbind
multiple results of this into a data frame and plot.
Note that I haven't actually run this code, but it probably is a good start.
Thanks professor and sorry, the predict argument is in the old code, we just haven't removed it yet. I shouldn't have included it.
No worries. Let me know how it goes.
Thank you professor. We tried out the code, we adapted the code from what you wrote but we're struggling with the standard errors. There are five coefficients in our model so there will be five standard errors. How would we account for all of the standard errors?
predicted_fit <- function(alpha, beta1, beta2, beta3, beta4, beta5, x1, x2, x3, x4, x5, se1, se2, se3, se4, se5) { fit <- alpha + beta1 x1 + beta2 x2 + beta3 x3 + beta4 x4 + beta5 x5 lb <- fit - (1.96 se) ub <- fit + (1.96 * se) predictions <- c(lb, fit, ub)
probs <- function(x) {(exp(x)) / (1 + exp(x)) }
predictions <- probs(predictions)
return(predictions)
}
Good point, sorry I neglected that earlier.
predicted_fit <- function(alpha, beta1, beta2, beta3, beta4, beta5,
x1, x2, x3, x4, x5,
se1, se2, se3, se4, se5)
{
lb_fun <- function(coef, se){
lb <- coef - (1.96 * se)
return(lb)
}
ub_fun <- function(coef, se){
ub <- coef + (1.96 * se)
return(ub)
}
fit <- alpha + beta1 * x1 + beta2 * x2 + beta3 * x3 + beta4 x4 + beta5 x5
lb <- alpha + lb_fun(beta1, se1) * x1 + lb_fun(beta2, se2) * x2 + lb_fun(beta3, se3) * x3
ub <- alpha + ub_fun(beta1, se1) * x1 + ub_fun(beta2, se2) * x2 + ub_fun(beta3, se3) * x3
predictions <- c(lb, fit, ub)
probs <- function(x) {(exp(x)) / (1 + exp(x)) }
predictions <- probs(predictions)
return(predictions)
}
This code is really nasty and really the best way to do this would be to hack into the fitted model object and extract these programatically, but I don't have much more time today. Give this a shot and see what happens.
Thank you!!
Actually one quick follow-up question. For the upper and lower bounds, should the code stop at the 3rd beta or go all the way to the 5th (or however many coefficients there are in the model)? It does appear to work, though, yes. fit <- alpha + beta1 x1 + beta2 x2 + beta3 x3 + beta4 x4 + beta5 x5 lb <- alpha + lb_fun(beta1, se1) x1 + lb_fun(beta2, se2) x2 + lb_fun(beta3, se3) x3 ub <- alpha + ub_fun(beta1, se1) x1 + ub_fun(beta2, se2) x2 + ub_fun(beta3, se3) * x3 predictions <- c(lb, fit, ub)
Sorry, right it should include all of the coefficient point estimates beta
I just did 3 for expediency.
Right, I figured. Is it possible that there's an error in the confidence-interval calculation? It's giving very wide confidence bands, which of course look odd when plotting. Or is that simply a function of having very high standard errors and imprecise estimates? Looking at the model output, they certainly don't seem that high at least.
Can you given an example, including both the coefficient estimates/standard errors and predicted probabilities?
Remember that it is difficult to mentally relate these two as they are are different scales--hence the need to do predicted probabilities.
Also, stepping back, in class I did cover a simulation approach to estimating uncertainty that is especially useful when it is numerically tricky to mathematically come up with the uncertainty bounds. Do take a look.
I can take a look at the simulation approach. Below is a text-version stargazer table of the model across the four generations, followed by predicted probabilities (w/ upper and lower bounds) for a black, female Millennial of moderate income. I apologize for how the table looks but it doesn't really copy-paste well. Parentheses are standard errors.
Dependent variable:
-------------------------------------
Probability of voting in 2012
Gen Y Gen X Boomers Silent
(1) (2) (3) (4)
education 1.317 1.133 1.377 1.404 (0.134) (0.118) (0.120) (0.192)
female 1.112 1.360 1.206*** 0.484
(0.235) (0.247) (0.253) (0.481)
vote_2008 2.956 6.727 14.474 23.935 (0.248) (0.271) (0.254) (0.625)
black 2.640 1.567 0.482 6.277*** (0.393) (0.426) (0.357) (0.869)
hispanic 1.878* 0.680 0.883** 0.262
(0.352) (0.298) (0.390) (0.832)
income 1.217 1.161 1.196 1.665 (0.079) (0.087) (0.069) (0.215)
Constant 0.240 0.378 0.224 0.144
(0.413) (0.495) (0.444) (0.812)
Predicted probabilities for black, female Millennial at different levels of education and moderate income
Below high school lower: 0.1458708 predicted: 0.4814261 upper: 0.4199645
High school lower: 0.1476030 predicted: 0.5501755 upper: 0.5534615
Some post-high lower: 0.1493521 predicted: 0.6170605 upper: 0.6796692
Bachelor lower: 0.1511182 predicted: 0.6797885 upper: 0.7841197
Graduate lower: 0.1529015 predicted: 0.7366269 upper: 0.8614545
Those don't look wildly off. We might be missing something, but I would check by comparing them to the simulated intervals.
Strange, but my Zelig simulations look much different than my predicted probabilities (shown below as plotted on ggplot).
This is the ggplot of the predictions for a black female of average income at various levels of education (top line is Millennial, bottom is Baby Boomer, still need to work on the aesthetics, I know)
Then below is a Millennial followed by Baby Boomer simulation using the same characteristics as the ggplot above.
Millennial
Baby Boomer
These Zelig simulations look quite different from the ggplot above. And there's not much of a distinction between the two simulations, which is another stark contrast with the ggplot.
What are the solid lines in the ggplot one?
btw, I would go with the Zelig predictions (we may have messed up the formula somewhere along the way) and they are much more straightforward to interpret.
Sorry, I was having trouble labeling those lines but wanted to submit my question. The top line in the ggplot is the Millennial at various levels of education, and the bottom is Baby Boomer. The big blob behind them are the confidence intervals. Both have the same qualities (black female of moderate income who didn't vote in 2008).
Is there largely no difference in the coefficients for Millennial and Baby Boomer?
Yes, largely, with a few notable exceptions (strength of affect of voting in 2008, being black, or being hispanic). But effects of education, income, and being a woman are quite similar.
Ah, I wonder then why the predicted probability lines are so different (though the shaded intervals overlap) in the numeric ggplot version
Yeah, I hadn't thought about it that way. So it seems like you're saying Zelig is more reliable, so perhaps we just run with that. It looks like the code for the predicted probabilities is fine, though. I mean, it should be simply [exp(coef)/(1+exp(coef))], which it is, so I'm not sure where there's room for error.
I would just go with Zelig. We could definitely find the bug, but this would only get us what we already have from Zelig.
Hopefully, you learned some useful things about programming with functions though.
So we really wanted to be able to have lines of predicted probability for two generations on one plot for the purpose of visual comparison, which makes ggplot attractive. It doesn't appear that we can do that from Zelig, which led me to wonder whether we could use Zelig objects in ggplot or extract the info from the objects for use in ggplot, which led me to here (something you wrote before, something you probably already knew):
https://gist.github.com/christophergandrud/2503364
Except this part below doesn't work for me: Extract expected values from simulations Model.demAge.e <- Model.DemSim$qi Model.demAge.e <- data.frame(Model.demAge.e$ev) Model.demAge.e <- melt(Model.demAge.e, measure = 1:86)
I try it like this and get the following error message:
Z_genY <- zelig(vote_2012 ~ education + female + vote_2008 + black + hispanic + income, model = "logit.survey", weights=~weights, data = anes_genY)
setZ_genY <- setx(Z_genY, education = 1:5, black = 1, hispanic = 0, income = 3, female = 1,
vote_2008 = 0)
simZ_genY <- sim(Z_genY, x = setZ_genY)
values <- simZ_genY$qi
values <- as.data.frame(values$ev)
Error in values$ev : object of type 'closure' is not subsettable
Any insight there or recommendations about whether trying to do what we'd like to do is ill-advised for whatever reason. It just seems to make sense to make better visual comparisons, because that's kind of the crux of what we're doing -- making comparisons between generations. The regular ci.plot stuff works, of course, but it doesn't allow us to make these comparisons using a single plot.
Sorry, last question (I hope): I was taught that really the best goodness-of-fit measure for a logit model is % correctly predicted, but seeing as we're avoiding the predict function because of our issues with that function when we use our survey-weighted model, how would you advise doing that?
Would the test at the end of this UCLA page (comparing the model vs a "null model" of just an intercept) suffice? http://www.ats.ucla.edu/stat/r/dae/logit.htm
I believe Zelig has a function--rocplot
--for creating ROC plots. You could try this out.
Re the earlier question about putting two sets of Zelig results on the same plot:
The instructions in that gist are for a previous version of Zelig. They changed the data structure and it broke that code.
I wouldn't actually do this. It looks like from the results you presented that the different sets of simulations will overlap considerably and so be hard to distinguish. What I would do instead, is put them on the same y-axis scale and plot them side-by-side. That way direct comparisons are possible.
You mean make sure the scale is the same but maintain two separate plots and simply present those two plots side-by-side? Do you have any recommendations on the goodness-of-fit question I presented after the plot question?
Yes
See my earlier comment about ROC plots.
Ah, didn't realize the ROC plots were used for that purpose. I've never used them before. Thanks!
No problem
So we wrapped up our presentation but I also created an ROC plot to talk goodness of fit but am not totally comfortable with interpretation of it. My understanding is that you want your lines to be as close to the top-right corner as possible, (far beyond the diagonal), which would indicate that our models are not so so great in this department. Is that generally correct, and is there anything you'd add?
rocplot(Z_genY, Z_boomer, col1 = 'blue', col2 = 'red') legend("left", c("Millennial","Baby Boomer"), lty=c(1,1), lwd=c(2.5,2.5), col=c('blue','red'))
For the presentation that is mostly fine (and pretty much all I would imagine you will have time to do).
Maybe also mention something about how the graph shows what proportion of "Votes" and "Not Votes" are correctly predicted using different probability thresholds. The straight dashed line represents what we would expect in an uninformative model. Or something like that, but for the presentation keep it brief, you only have 10 minutes.
Hi all,
So we manually did the predicted probabilities and now have to estimate the confidence intervals for our predicted probabilities… but the predict argument does not work for us.
We’ve searched around for ways to go about it and have yet to find anything that works. Do you have any suggestions? Below is the code, the bolded bit at the end is the new trouble spot.
Many thanks,
Marie