ProjectMOSAIC / mosaic

Project MOSAIC R package
http://mosaic-web.org/

add plotModel() support for logistic regression from summarized data? #584

Open rpruim opened 8 years ago

rpruim commented 8 years ago

An example like this is not currently supported by plotModel():

GoosePermits <-
  data.frame(
    bid = c(1,5,10,20,30,40,50,75,100,150,200),
    keep = c(31,29,27,25,23,21,19,17,15,15,15),
    sell = c(0,3,6,7,9,13,17,12,11,14,13)
  )

glm( cbind(keep, sell) ~ bid, data = GoosePermits, family = binomial())

There is also some question about what sort of plot one should produce, since there will often be lots of overplotting.

Finally, I'm likely going to add this data set either to fastR2 or to mosaicData. Anyone care where it goes?

dtkaplan commented 8 years ago

Randy,

I don’t have an opinion about where to put GoosePermits, but I am interested in the model and in plotModel(). For the DataCamp course I’m writing, I’ve prototyped a function fmodel() that is something like plotModel() but with a formula interface for specifying the variables to display and some logic for picking discrete values for variables used for color and for faceting.

It works with glm(), but not when the LHS of the formula isn’t a variable name, so something to fix ….

Still, in looking at what I would need to do to get fmodel() to work, I tried out the model in your letter:

[image: Inline image 1] That’s way too stiff. I thought, since money is involved (bid?) that taking logs is appropriate, so that it’s the proportionate change in bid that has the effect. This model is better, I think …

mod2 <- glm(sold ~ poly(log(bid),3), data = GPP, family = "binomial")

Where sold is a 0/1 variable replicated the number of times indicated in your keep/sell variables.
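The thread does not show how GPP was built, so the following is only a sketch of one way to expand the keep/sell counts into one row per respondent. The column layout of GPP and the coding sold = 1 for a sale are assumptions based on the description above.

```r
GoosePermits <- data.frame(
  bid  = c(1, 5, 10, 20, 30, 40, 50, 75, 100, 150, 200),
  keep = c(31, 29, 27, 25, 23, 21, 19, 17, 15, 15, 15),
  sell = c(0, 3, 6, 7, 9, 13, 17, 12, 11, 14, 13)
)

# Assumed construction of GPP: one row per respondent.
# The first block of rows are sellers (sold = 1), the second block keepers (sold = 0).
GPP <- data.frame(
  bid  = c(rep(GoosePermits$bid, times = GoosePermits$sell),
           rep(GoosePermits$bid, times = GoosePermits$keep)),
  sold = rep(c(1, 0), times = c(sum(GoosePermits$sell), sum(GoosePermits$keep)))
)

# The model from the comment, fit to the expanded data.
mod2 <- glm(sold ~ poly(log(bid), 3), data = GPP, family = "binomial")
```

Note that this model treats selling as the success, whereas cbind(keep, sell) treats keeping as the success, so the fitted probabilities are complements of each other for the same linear predictor.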

[image: Inline image 2]

Best, Danny

In reply to Randall Pruim's message of Sat, Apr 9, 2016 at 11:05 AM.

DeWitt Wallace Professor of Mathematics, Statistics, and Computer Science Macalester College

rpruim commented 8 years ago

I see I was typing too fast. The model of interest is

glm( cbind(keep, sell) ~ log(bid), data = GoosePermits, family = binomial())

or something like that. But the issue here is independent of the particular form of the model. The issue is how best to deal with data presented as success and failure counts rather than as a single row for each observation.
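To make the distinction between the data layouts concrete, here is a small sketch that fits the same model from three layouts: success/failure counts, observed proportions with group sizes as weights, and one row per observation. The names n, long, kept and the fit_* objects are made up for illustration.

```r
GoosePermits <- data.frame(
  bid  = c(1, 5, 10, 20, 30, 40, 50, 75, 100, 150, 200),
  keep = c(31, 29, 27, 25, 23, 21, 19, 17, 15, 15, 15),
  sell = c(0, 3, 6, 7, 9, 13, 17, 12, 11, 14, 13)
)
GoosePermits$n <- with(GoosePermits, keep + sell)

# (1) Two-column response: successes (keep) and failures (sell).
fit_counts <- glm(cbind(keep, sell) ~ log(bid),
                  data = GoosePermits, family = binomial())

# (2) Observed proportion as the response, group size as prior weights.
fit_props <- glm(keep / n ~ log(bid), weights = n,
                 data = GoosePermits, family = binomial())

# (3) One row per observation, with a 0/1 response (kept = 1 means "keep").
long <- data.frame(
  bid  = with(GoosePermits, c(rep(bid, times = keep), rep(bid, times = sell))),
  kept = rep(c(1, 0), times = c(sum(GoosePermits$keep), sum(GoosePermits$sell)))
)
fit_long <- glm(kept ~ log(bid), data = long, family = binomial())

# Side-by-side coefficients; these should agree up to numerical tolerance.
cbind(counts = coef(fit_counts), props = coef(fit_props), long = coef(fit_long))
```

The coefficient estimates match across layouts, but the residual deviance and degrees of freedom of the row-per-observation fit differ from the grouped fits, which is part of why a plotting function has to decide which layout it is showing.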

Before you get too far along with fmodel(), you should take a closer look at plotModel() since (a) I'm guessing it does more than you know and (b) there is relatively little value in developing two of these unless there are use cases where one actually wants different behavior. In any case, better to have one good function than two mediocre ones.

Your images didn't show up in the message above, so I'm not sure what they were intended to contain.

Finally, here is a plot I created in ggplot2 as I was musing about best ways to handle this situation:

image

The labeling could be improved; this was just a quick attempt to see if I liked it. (In the past I have typically used jittered dots to deal with overplotting in these plots, and that is another option -- although a little less convenient when working from summarized data.)

rpruim commented 8 years ago

Another option would be to display the percentages at each level. That makes good sense when the data are in this form since (a) there are typically multiple observations at each level of the predictor -- perhaps enough to justify computing a percent, and (b) it is easier to judge fit that way than from the relative sizes of the dots compared to the curve.

Of course, it gets noisy or even silly if the predictor values are unique, or there are only a few observations at each predictor value.

rpruim commented 8 years ago

Here is an example of the sort of plot I was just describing -- using size to indicate the volume of data involved in the proportion calculation.

image

rpruim commented 8 years ago

image

Better choice of scale for size. (I always forget that the ggplot2 defaults for size are often bad.)
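The code behind these images is not included in the thread. As a rough sketch of the kind of plot described (observed proportion at each bid, point size showing how many observations went into it, fitted curve overlaid), something like the following would work in ggplot2. The column names n and prop_keep, the grid of bids, and the use of scale_size_area() as the size scale are all assumptions, not necessarily what was used for the images.

```r
library(ggplot2)

GoosePermits <- data.frame(
  bid  = c(1, 5, 10, 20, 30, 40, 50, 75, 100, 150, 200),
  keep = c(31, 29, 27, 25, 23, 21, 19, 17, 15, 15, 15),
  sell = c(0, 3, 6, 7, 9, 13, 17, 12, 11, 14, 13)
)
GoosePermits$n         <- with(GoosePermits, keep + sell)  # observations per bid
GoosePermits$prop_keep <- with(GoosePermits, keep / n)     # observed proportion

fit <- glm(cbind(keep, sell) ~ log(bid), data = GoosePermits, family = binomial())

# Fitted probability of keeping the permit on a fine grid of bids.
grid <- data.frame(bid = seq(1, 200, length.out = 200))
grid$prop_keep <- predict(fit, newdata = grid, type = "response")

p <- ggplot(GoosePermits, aes(x = bid, y = prop_keep)) +
  geom_line(data = grid, colour = "steelblue") +
  geom_point(aes(size = n), alpha = 0.6) +
  scale_size_area(max_size = 6) +  # point area proportional to n
  labs(x = "bid", y = "proportion keeping permit", size = "observations")

print(p)
```

scale_size_area() maps a value of zero to zero area, so the dot sizes are easier to compare than with the default size scale.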