visualisation_recipe design

DominiqueMakowski commented 3 years ago

Follow up of https://github.com/easystats/see/issues/38 (note that the scope of that proposition is broader than just the plot for estimate_predicted).

To summarize, plotting is one of the fundamental skill of data science, often overlooked in favour of results values (them sweetsweet p-values 🤤). The grammar of graphics offers an incredible amount of freedom of expression, creativity, and composability and for me is the best approach to plotting, of the ones that I tried.

That said, understanding how to think in terms of GG is a hard process, as (especially in psychology) we are usually taught to think in terms of types of plots and one-to-one correspondence with some statistical models (e.g., correlation - scatterplot, anovas - bar chart 🤮, t-test - boxplot etc.) rather than in terms of layers of geometries coupled with data.

On top of that inherent steep path to having the GG-mindset, its main implementation in R (ggplot2) adds its own layer (no pun intended) of complexity with a lot of new different argument names and idiosyncratic logic.

This double complexity makes it hard for students and beginners, and it could explain the creation (and popularity) of pre-baked plots creators, such as see or ggstatsplot. Don't get me wrong, these pre-baked plots are absolutely necessary, not just to help beginners to get stuff done but also for advanced users: I don't see myself re-writing the ggplot code for check_model plots at each model 😁

But the limitation that I see from that landscape is that there are two opposed paths to plots creation: either you build everything from the ground-up using geoms (which really makes you think about how and what to put into), or you get a plot done for you (which really doesn't make you think and doesn't necessarily show you how and what to put into).

Having been immersed within a Buddhist environment, it is not surprising that I'm longing for a 3rd noble path that would achieve some form balance 😅, for instance by exposing and getting access to the elements directly underlying a given plot.

One of the early attempts at something like that was the data_plot() and how_to_plot()methods in see. The goal of the former was to transform the input into the data that would be plotted, and the second was to expose the code used to make the plot, so that users can copy and edit and this way learn how we make our plots. But this didn't really go anywhere, in part because: 1) It was in practice difficult to have a robust and working how_to_plot() function that would print its internal code. 2) there is often no unique source of data for complex plots. For instance, for a plot for estimate_predicted(), the plot has a line and a ribbon, and eventually some points on top of that, not necessarily coming from the same data. So that was problematic for data_plot()

A plot, within the GG framework, is made of layers. Each layers has information pertaining to the form (the geom), aesthetics, data etc. So I thought, why not having that as data, instead of as a plot.

In other words, why not have an ensemble of lists (layers) that would each contain the relevant stuff. That would make it easy for people to get a sense of how a plot is made (and how to make it, customize it etc), and at the same time it would make it easier for a third-party such as see to simply "render" this information (e.g., loop through the layers and programmatically create them and add them).

Here's a first example of visualisation_recipe():

library(modelbased)
library(ggplot2)

model <- lm(mpg ~ wt, data = mtcars)
vizdata <- estimate_expectation(model)

# Intermediate steps with information and data for plotting
layers <- visualisation_recipe(vizdata)
layers
#> Layer 1
#> --------
#> Type: geom_line
#> data, y, x
#> 
#> Layer 2
#> --------
#> Type: geom_ribbon
#> data, y, x, ymin, ymax, alpha
#> 
#> Layer 3
#> --------
#> Type: labs
#> x, y, title

# Plotting (which could be implementing programatically)
ggplot() +
  geom_line(data = layers$l1$data,
            aes_string(x = layers$l1$x, 
                       y = layers$l1$y)) +
  geom_ribbon(data = layers$l2$data, 
              alpha = layers$l2$alpha,
              aes_string(x = layers$l2$x, 
                         y = layers$l2$y,
                         ymin = layers$l2$ymin, 
                         ymax = layers$l2$ymax)) +
  labs(x = layers$l3$x,
       y = layers$l3$y, 
       title = layers$l3$title)

^{Created on 2021-05-30 by the reprex package (v1.0.0)}

bwiernik commented 3 years ago

I guess I don't really see this as meaningfully more accessible than writing the ggplot code originally. If anything, I obscuring the variable names probably makes writing the final plotting code more difficult for new coders.

(As an aside, I think most of the problems new ggplot2 users have stems from approaching it as a copy-paste recipe task versus taking the time to understand the layer-building logic underlying the syntax. I've found my students have found ggplot fairly intutive once I give a 15 minute walkthrough of the principles underlying it.)

I guess my preference here would be to follow the general idea of see in that just passing the result of a modelbased function to plot() should generate an effective visualization. We can provide some good defaults and arguments to map variables to various features, but beyond that, I think we can let the existing universe of tutorials, teaching guides, and ggplot helpers (e.g., there is a Shiny app giving a click and drag interface) handle the education.

bwiernik commented 3 years ago

Copied from the other issue:

I think a basic approach for estimate_predicted() would be to produce a plot with:

x-axis
1. If model has one predictor, the predictor
2. If model has multiple predictors, default to the fitted value? or maybe the first predictor in the formula?
3. Have an x argument that takes the name of a variable to map to the x axis.
geom_line
1. y = Fitted
geom_ribbon
1. ymin = CI_low, ymax = CI_high
points
1. If x is numeric, geom_point(aes(y = response))
2. If x is factor or ordered, geom_jitter(aes(y = response), height = 0, width = (length(levels(x)) - 1) / (4 * length(levels(x))))
other aeshetics
1. Arguments for shape, color, alpha, linetype, facet, etc. that pass these along

Then, we apply nice see defaults. If a user wants more customization, they can take the resulting ggplot and add more layers.

DominiqueMakowski commented 3 years ago

Somehow I didn't convey the fact that my "recipe approach" to plotting was first and foremost to facilitate the implementation of the plots, so that it is more robust, readable and debuggable. That is its first advantage beyond others like flexibility, didactic value and customizability.

You will find below are a more complex example of how it could look like for estimate_response (of linear models only for now).

(Note: currently the plotting only does lines (which might not be appropriate when only factors / factor is the first predictor) - but this we can address once we have the plot for estimate_means which we could call in some cases to plot the output of estimate_predicted)

Note2: the call to visualise_recipe can actually be bypassed altogether (as in the last examples), and calling plot(estimate_response(m)) would suffice. The visualise_recipe middle step is useful because it separates the plot render from the plot info, which makes IMO it easier to code, debug and implement.

library(modelbased)
library(see)

# Simple ------------------------------------------------------------------

# Default
x <- estimate_relation(lm(mpg ~ wt, data = mtcars))
layers <- visualisation_recipe(x)
layers
#> Layer 1
#> --------
#> Geom type: point
#> data, aes, stroke, shape
#> 
#> Layer 2
#> --------
#> Geom type: ribbon
#> data, aes, alpha
#> 
#> Layer 3
#> --------
#> Geom type: line
#> data, aes
#> 
#> Layer 4
#> --------
#> Geom type: labs
#> x, y, title
plot(layers)

# Customize aesthetics
x <- estimate_relation(lm(mpg ~ wt, data = mtcars))
layers <- visualisation_recipe(x,
                               points = list(color = "red", alpha = 0.6, size = 3),
                               line = list(color = "blue", size = 4),
                               ribbon = list(fill = "green", alpha = 0.7),
                               labs = list(subtitle = "Oh yeah!"))
layers
#> Layer 1
#> --------
#> Geom type: point
#> data, aes, stroke, shape, color, alpha, size
#> 
#> Layer 2
#> --------
#> Geom type: ribbon
#> data, aes, alpha, fill
#> 
#> Layer 3
#> --------
#> Geom type: line
#> data, aes, color, size
#> 
#> Layer 4
#> --------
#> Geom type: labs
#> x, y, title, subtitle
plot(layers)


# 2-ways interaction ------------------------------------------------------------

# Numeric * numeric
x <- estimate_relation(lm(mpg ~ wt * qsec, data = mtcars))
layers <- visualisation_recipe(x)
plot(layers)

# Factor * numeric
x <- estimate_relation(lm(Sepal.Width ~ Species * Sepal.Length, data = iris))
plot(x)


# 3-ways interaction ------------------------------------------------------------
data <- mtcars
data$vs <- as.factor(data$vs)
data$cyl <- as.factor(data$cyl)
data$new_factor <- as.factor(rep(c("A", "B"), length.out = nrow(mtcars)))

# Numeric * numeric * numeric
x <- estimate_relation(lm(mpg ~ wt * qsec * hp, data = data))
plot(x)

# Numeric * numeric * factor
x <- estimate_relation(lm(mpg ~ wt * am * vs, data = data))
plot(x)

# Numeric * factor * factor
x <- estimate_relation(lm(mpg ~ wt * cyl * new_factor, data = data))
plot(x)

^{Created on 2021-05-31 by the reprex package (v1.0.0)}

bwiernik commented 3 years ago

I am worried about two things now:

Scope management with ggplot() is tricky. Does calling vectors directly open us up to weirdness (versus calling with data frames and aes(); e.g., for each layer).
With this approach, what happens when a user adds a another layer? Does the omission of the original data argument and variable names mean they can't easily add another layer with the intuitive syntax?

DominiqueMakowski commented 3 years ago

For themes it seems to work

library(modelbased)
library(see)

plot(estimate_relation(lm(mpg ~ wt, data = mtcars))) +
  see::theme_modern()

^{Created on 2021-05-31 by the reprex package (v1.0.0)}

do you have an example of something that could not work? If the issue is the missing "main" data (and "main" aesthetics?), it could be easily added I think

what do you mean calling vectors?

DominiqueMakowski commented 3 years ago

I think the recipe approach is on the contrary much more safer than previous ones, because all the elements remain as data (specifically, lists) until the very end, til it uses ggplot2::layer() to create the geoms (which has been made for that purpose), layer by layer. The only "sensitive" step is probably the steps where it puts multiple geoms into a list, but we used that approach in see in other instances and it seems to work

bwiernik commented 3 years ago

I think I misread something earlier. Didn't realize you were using aes_string().

DominiqueMakowski commented 3 years ago

Essentially the other approach (used in other places in see) is to create a plot and then add directly geoms to it conditionally (depending on args etc.), and this becomes quite messy due to ggplot's inherent declarative syntax and the way it treats geoms and their addition with +. In the recipe approach, instead of adding / editing / working with geoms, we work with lists, which is imo quite convenient

bwiernik commented 3 years ago

Regading further extension by users, what about:

What about

x <- estimate_relation(lm(mpg ~ wt * cyl * new_factor, data = data))
plot(x) + scale_color_viridis_d()

and

x <- estimate_relation(lm(mpg ~ wt * cyl * new_factor, data = data))
plot(x) + geom_ribbon(aes(ymin = CI_low, ymax = CI_high))

or

x <- estimate_relation(lm(mpg ~ wt * cyl * new_factor, data = data))
plot(x) + geom_ribbon(aes(ymin = CI_low, ymax = CI_high))

(the last one what I want to know is can a user add another geom without issue?)

bwiernik commented 3 years ago

Essentially the other approach (used in other places in see) is to create a plot and then add directly geoms to it conditionally (depending on args etc.), and this becomes quite messy due to ggplot's inherent declarative syntax and the way it treats geoms and their addition with +.

I'm not really following. If you + messy, you can do:

Reduce(`+`, list_of_geoms)

DominiqueMakowski commented 3 years ago

library(modelbased)
library(ggplot2)

data <- mtcars
data$vs <- as.factor(data$vs)
data$cyl <- as.factor(data$cyl)
data$new_factor <- as.factor(rep(c("A", "B"), length.out = nrow(mtcars)))

x <- estimate_relation(lm(mpg ~ wt * cyl * new_factor, data = data))
plot(x) + scale_color_viridis_d()

x <- estimate_relation(lm(mpg ~ wt * cyl * new_factor, data = data))
plot(x) + geom_ribbon(aes(ymin = CI_low, ymax = CI_high))
#> Error in FUN(X[[i]], ...): object 'CI_low' not found

x <- estimate_relation(lm(mpg ~ wt * cyl * new_factor, data = data))
plot(x) + geom_ribbon(aes(ymin = CI_low, ymax = CI_high))
#> Error in FUN(X[[i]], ...): object 'CI_low' not found

^{Created on 2021-05-31 by the reprex package (v1.0.0)}

The two last don't work as expected. We could indeed add the main object as main data, let me try

DominiqueMakowski commented 3 years ago

library(modelbased)
library(ggplot2)

data <- mtcars
data$vs <- as.factor(data$vs)
data$cyl <- as.factor(data$cyl)
data$new_factor <- as.factor(rep(c("A", "B"), length.out = nrow(mtcars)))

x <- estimate_relation(lm(mpg ~ wt * cyl * new_factor, data = data))
plot(x) + geom_ribbon(aes(x = wt, y = Predicted, ymin = CI_low, ymax = CI_high, fill = cyl))

^{Created on 2021-05-31 by the reprex package (v1.0.0)}

It works "in principle" now that the main data is passed to ggplot(), but still requires all the aesthetics. We also could in principle add some "main" x and y aesthetics, but I'm worried it will be misleading - if people want to add custom layers, they might as well specify explicitly the data and the aesthetics (especially since they are so easily retrievable from the ingredients-lists created by visualise_recipe)

bwiernik commented 3 years ago

I think it's okay so long as the main plot has data.

mattansb commented 3 years ago

Side note:

Instead of aes_string() you can index directly from the .data option in aes():

x_name <- "varX"

ggplot2::aes(x = .data[[x_name]])
#> Aesthetic mapping: 
#> * `x` -> `.data[["varX"]]`

^{Created on 2021-05-31 by the reprex package (v2.0.0)}

DominiqueMakowski commented 3 years ago

I like aes_string why do you want to take it away from me 😂

I made a first draft for estimate_means:

library(magrittr)

lm(Sepal.Width ~ Species, data = iris) %>% 
  modelbased::estimate_means() %>% 
  plot()


lm(Sepal.Width ~ Species, data = iris) %>% 
  modelbased::estimate_means() %>% 
  plot(jitter = list(width = 0.03, color = "red"))

^{Created on 2021-05-31 by the reprex package (v1.0.0)}

DominiqueMakowski commented 3 years ago

# Fully custom using layers
library(modelbased)
library(ggplot2)
library(see)

means <- modelbased::estimate_means(lm(Sepal.Width ~ Species, data = iris))
d <- modelbased::visualisation_recipe(means)
d
#> Layer 1
#> --------
#> Geom type: jitter
#> data, aes, stroke, shape, width
#> 
#> Layer 2
#> --------
#> Geom type: line
#> data, aes
#> 
#> Layer 3
#> --------
#> Geom type: pointrange
#> data, aes
#> 
#> Layer 4
#> --------
#> Geom type: labs
#> x, y, title

ggplot() +
  # Collect elements from layer 1 and do something else
  geom_violin(data = d$l1$data, aes_string(x = d$l1$aes$x, y = d$l1$aes$y)) +
  # Only plot a subset of layers
  see::geoms_from_list(d[c("l3", "l4")])

^{Created on 2021-05-31 by the reprex package (v1.0.0)}

I think it's a really flexible and easy-to-program-with solution

DominiqueMakowski commented 3 years ago

I added support for two factors and its modulation by a numeric:

library(modelbased)
library(see)

data <- mtcars
data$cyl <- as.factor(data$cyl)
data$new_factor <- as.factor(rep(c("A", "B"), length.out = nrow(mtcars)))

# Two factors ---------------
model <- lm(mpg ~ new_factor * cyl * wt, data = data)
x <- estimate_means(model, levels = c("new_factor", "cyl"))
#> NOTE: Results may be misleading due to involvement in interactions
plot(visualisation_recipe(x))

# Modulations --------------
x <- estimate_means(model, levels = c("new_factor"), modulate = "wt")
#> NOTE: Results may be misleading due to involvement in interactions
plot(visualisation_recipe(x))

x <- estimate_means(model, levels = c("new_factor", "cyl"), modulate = "wt")
plot(visualisation_recipe(x))

^{Created on 2021-05-31 by the reprex package (v1.0.0)}

There's a slight problem with the dodging of the lines / pointrange, not sure what's going on, feel free to check out the visualisation_recipe.estimate_means.R code to help me fix

And in general, feel free to comment / roast / improve; the code is pretty straightforward so it should be easy to add options and all!

DominiqueMakowski commented 3 years ago

I added the possibility of changing the display of raw data:

library(modelbased)

x <- estimate_means(lm(Sepal.Width ~ Species, data = iris))

layers <- visualisation_recipe(x, show_data = c("violin", "boxplot", "points"))

plot(layers)

^{Created on 2021-06-01 by the reprex package (v1.0.0)}

But it would be even more awesome if we had a super raincloud-like geom (https://github.com/easystats/see/issues/135)

DominiqueMakowski commented 3 years ago

Similarly, for large data in which having points might not be the best representation, now is added the possibility of 2D density geoms instead (or in combination with):

library(modelbased)

x <- estimate_relation(lm(mpg ~ wt, data = mtcars))
plot(x, show_data = c("density_2d", "points"))

plot(x, show_data = "density_2d_filled")

plot(x, show_data = "density_2d_polygon")

plot(x, show_data = "density_2d_raster", line = list(color = "white"), ribbon = list(fill = "white"))

^{Created on 2021-06-01 by the reprex package (v1.0.0)}

DominiqueMakowski commented 3 years ago

I must admit I'm quite happy about how the visualisation for modelbased is shaping up 😁 , it's quite convenient!

The problem is that now modelbased depends on the latest see, so for the CRAN release... either I disable these functions until see is updated, or idk 🤔 @strengejacke what do you think?

bwiernik commented 3 years ago

I'm a little confused. Does this plot code live in see or in modelbased?

DominiqueMakowski commented 3 years ago

because the creation of the visualisation ingredients (i.e., the layers) is independent from rendering and only necessitates some code logic, it is present in modelbased, close to the main functions so that it's always up-to-date. (That also means that, in principle, users can use modelbased to generate the plotting scheme, and then render it with whatever alternative option they want - like some Python GG module because reasons). That said, the default plotting that most users will go with requires see for its plotting functions (geoms_from_list and, someday, the rainclouds geoms etc.).

So the code to design the plot lives in modelbased, but the code to render it in see

mattansb commented 3 years ago

I think this new update with all of its changes, especially the ones regarding visualization, require a blog post... I, for one, have no idea how to use any of it!

DominiqueMakowski commented 3 years ago

the blogpost's whole content could be: "you can now run plot() on estimate_means and estimate_predicted" ^^ but yeah I agree once it's on CRAN and all I'll slowly start documenting all that more thoroughly

strengejacke commented 3 years ago

Does modelbased depend on see, or is it only required for plotting? Else, I would include that code and it starts working once see is updated

DominiqueMakowski commented 3 years ago

required only (conditional) - you mean add the geoms_from_list functions to modelbased?

The only call to see/ggplot2 is here:

https://github.com/easystats/modelbased/blob/14101335205340f0f28bbc4d7b7575baac31da52/R/visualisation_recipe.R#L31-L37

strengejacke commented 3 years ago

Sorry, confused. You're using a new function in see, that doesn't work of course. I would disable that code for now

DominiqueMakowski commented 3 years ago

note: the printing of the layers has been improved and detailed:

bwiernik commented 3 years ago

We could add minimum version argument to check_if_installed()

strengejacke commented 3 years ago

Yes, but I think that calling a non-existent function like see::geoms_from_list() still results in a failure (because it's not exported by the see namespace). So I'm not sure this would resolve this issue?

bwiernik commented 3 years ago

I think if the call is inside a conditional, it should work?

DominiqueMakowski commented 3 years ago

For now cannot use the new check_if_installed with the dependency arg because otherwise it would need me to require the newest version of insight 😅

anyway, since to works on winbuilders as is, I'll submit as soon as the tests (#115) are dealt with, and we'll see

DominiqueMakowski commented 3 years ago

Closing this, will start adding vignettes/ examples / posts to showcase all this once see is updated

easystats / modelbased

visualisation_recipe design #110