MUCollective / multiverse

R package for creating explorable multiverse analysis
https://mucollective.github.io/multiverse/
GNU General Public License v3.0
62 stars 5 forks source link

Confidence intervals from each universe #33

Closed abhsarma closed 5 years ago

abhsarma commented 5 years ago

Related to #9 Allow the user to extract point estimates and confidence intervals for a particular variable from each universe

mjskay commented 5 years ago

This might not even need your own API, so much as showing how to do it with mutate and map

E.g. if the universe has variables lower, upper, etc you could do something like:

M %>%
  multiverse_table() %>%
  mutate(
    lower = map_dbl(.results, "lower"),
    upper = map_dbl(.results, "upper"),
    estimate = map_dbl(.results, "estimate")
  )

Could always add a function to pull those out in parallel, but in the short term that should work for examples right?

abhsarma commented 5 years ago

That works as a temporary fix.

If I have a fit object as such M$fit <- ~ lm( y ~ x1 + x2, data = df), I'd probably use broom::tidy: M$model_summary <- ~ M$fit %>% broom::tidy( conf.int = TRUE )

Then, we can then extract the results as:

multiverse_table(M) %>%
    mutate( index = seq(1:nrow(.)) ) %>%
    filter (index <= 6) %>%
    mutate( summary = map(.results, "model_summary") ) %>% 
    unnest(summary)  %>%
    filter( term != "(Intercept)") %>%
    ggplot() +
    geom_point( aes(x = estimate, y = term)) +
    geom_errorbarh( aes(xmin = conf.low, xmax = conf.high, y = term), height = 0) + 
    facet_wrap( ~ index )

This seems very complicated tho

mjskay commented 5 years ago

If within each multiverse you can construct a tidy dataframe with one row corresponding to the estimates you want, you can just unnest + map it. E.g.:

M$tidy_df = ~ tibble(lower = 0, upper = 2, estimate = 1)

Then later:

M %>%
  multiverse_table() %>%
  unnest(map(.results, "tidy_df"))

?

mjskay commented 5 years ago

In general, I think that we should be able to come up with examples of good strategies to do this kind of thing with tidyverse stuff, and then only if those examples are cumbersome (or the same pattern keeps cropping up) should we think about creating functions for them. Part of the goal is to capitalize on the fact that this stuff should be mostly easy to do with base tidyverse stuff.

ntaback commented 5 years ago

I agree. Tidyverse code is intentionally verbose, so your example looks good to me!

abhsarma commented 5 years ago
M %>%
  multiverse_table() %>%
  unnest(map(.results, "tidy_df"))

Yeah, this approach would work here. I guess I can provide a wrapper function for this to make it easier to access.

mjskay commented 5 years ago

I mean, it's pretty straightforward, does it need a wrapper?

Like I said, I would concentrate on making the examples with the existing functions until you encounter cases that are painful enough or common enough to warrant simplification. There's always a cost in adding more words functions) to a vocabulary (API) in that it adds more things for people to learn and more unfamiliar terms for readers who haven't seen them before. That cost is often not outweighed for things that can already be done simply and is more worth it for more complex things.

ntaback commented 5 years ago

This will produce a plot of regression parameter estimates with their 95% CIs. Could you build on this example, to use multidy to produce plots of 90%, 95%, and 99% confidence intervals?

`set.seed(22) x1 <- rnorm(100) x2 <- rnorm(100) y <- x1 + x2 + runif(100)

M <- lm(y ~ x1+ x2)

add_column(as_tibble(confint(M)), M$coefficients, type = c("Intercept", "x1", "x2")) %>% rename("low" = 2.5 %, "high" = 97.5 %, "est" = M$coefficients) %>% ggplot(aes(x = low, y = high)) + geom_point(aes(x = type, y = est)) + geom_errorbar(aes(ymin = low, ymax = high, x = type)) + labs(x = "Parameter", y = "Parameter Values", title = "Estimates with 95% confidence intervals") `

abhsarma commented 5 years ago

@mjskay I guess I have two concerns which is why I thought of pushing for a function:

  1. do we need an interface to extract a variable from all the universes in the multiverse in a tidy format? we do not have that right now and this could solve that
  2. how intuitive is it to arrive at this solution? In this case, I'd say it seems fairly complex because it is extracting variables / data-frames from a environment nested within a data-frame
abhsarma commented 5 years ago

@ntaback Sure, I'll draft that up quickly today!

mjskay commented 5 years ago

I got excited and drafted something (mostly to see if I could use the API) ---

M = multiverse()

inside(M, {
  # using tidyeval to do this with `!!!` though I don't think we should allow it
  # see issue #36
  set.seed(branch(seed, !!!1:100))

  x1 <- rnorm(100)
  x2 <- rnorm(100)
  y <- x1 + x2 + runif(100)

  m <- lm(y ~ x1+ x2)

  intervals <- broom::tidy(m, conf.int = TRUE)
})

execute_multiverse(M)

Each universe now contains (amongst everything else) a three-row table called intervals, each with columns term, estimate, conf.low, and conf.high in it. So if we unnest those tables from the universes we get a long-format table where the seed parameter indexes the universe and we can plot them all:

M %>%
  multiverse_table() %>%
  unnest(map(.results, "intervals"), .drop = FALSE) %>%
  unnest(seed) %>%  # won't be necessary once issue #34 is fixed
  ggplot(aes(x = seed, y = estimate, ymin = conf.low, ymax = conf.high)) +
  geom_pointrange() +
  facet_grid(. ~ term) +
  labs(x = "Universe (i.e. the seed parameter)", y = "Model Parameter Values", title = "Estimates with 95% confidence intervals")

image

Some issues cropped up:

  1. I used a sort of hack with tidy-eval and set.seed to get different universes. I'm not sure this approach is the right one --- I think we shouldn't be tidy-eval-ing the universe code by default since I'm pretty sure it will break other people's tidy-eval code if we do it (see #36). So perhaps we need a syntax for letting people pass options as a list? That would make the tidy-eval usage here unnecessary.

  2. This definitely convinces me that parameter values in the multiverse table should not be list columns, because I had to add an extra unnest call to get this to work that I don't think should need to be there (see #34).

  3. I happened to do this in a way where each universe gets a unique number determined by the seed column, but this vis would have been harder to construct if that weren't the case (and with multiverses with more than one parameter, that generally would not be the case). This suggests it might be useful for us to add something like a ".universe" column to the multiverse table that always has a unique identifier for each universe in it.

mjskay commented 5 years ago
  1. do we need an interface to extract a variable from all the universes in the multiverse in a tidy format? we do not have that right now and this could solve that

I think we only need it if it is not easy to do with existing functions. Currently I have not encountered examples where tidyverse stuff doesn't work.

  1. how intuitive is it to arrive at this solution? In this case, I'd say it seems fairly complex because it is extracting variables / data-frames from a environment nested within a data-frame

I agree that arriving at the solution may not be immediately obvious, but I think the pattern is not too complicated once you learn it. So I might argue it is a pattern that needs to be well-documented, but that learning the pattern is about the same barrier as learning a function would be and probably about as easy to remember. But adding a function has additional costs of maintenance and unfamiliarity to people who are reading the code and are familiar with the tidyverse but not our package.

In any case I would keep building examples without making a custom function and then see if there are commonalities across the examples that could be combined into a simple, reusable function. This is how I've tended to build things like the tidyverse API, and in my experience its easier to find those commonalities by building the examples first and then deriving the function, and not by trying to imagine the function first. Inevitably you get the function wrong and then have to rewrite it anyway.