I guess I don't really see this as meaningfully more accessible than writing the ggplot code originally. If anything, obscuring the variable names probably makes writing the final plotting code more difficult for new coders.
(As an aside, I think most of the problems new ggplot2 users have stem from approaching it as a copy-paste recipe task rather than taking the time to understand the layer-building logic underlying the syntax. I've found my students find ggplot fairly intuitive once I give a 15-minute walkthrough of the principles underlying it.)
I guess my preference here would be to follow the general idea of see, in that just passing the result of a modelbased function to plot() should generate an effective visualization. We can provide some good defaults and arguments to map variables to various features, but beyond that, I think we can let the existing universe of tutorials, teaching guides, and ggplot helpers (e.g., there is a Shiny app giving a click-and-drag interface) handle the education.
Copied from the other issue:
I think a basic approach for estimate_predicted() would be to produce a plot with:

- an x argument that takes the name of a variable to map to the x axis
- geom_point(aes(y = response)), or geom_jitter(aes(y = response), height = 0, width = (length(levels(x)) - 1) / (4 * length(levels(x)))) when x is a factor
- shape, color, alpha, linetype, facet, etc. arguments that pass these along

Then, we apply nice see defaults. If a user wants more customization, they can take the resulting ggplot and add more layers.
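A minimal sketch of a helper along those lines (purely illustrative; plot_predicted and its arguments are hypothetical, only Predicted / CI_low / CI_high are actual modelbased column names):

library(ggplot2)

plot_predicted <- function(data, x, y = "Predicted", color = NULL, alpha = 0.3) {
  p <- ggplot(data, aes(x = .data[[x]], y = .data[[y]]))
  if (!is.null(color)) {
    p <- p + aes(colour = .data[[color]])  # only map extra aesthetics when requested
  }
  p +
    geom_ribbon(aes(ymin = CI_low, ymax = CI_high), alpha = alpha) +
    geom_line()
}

# e.g., plot_predicted(as.data.frame(estimate_relation(lm(mpg ~ wt, data = mtcars))), x = "wt")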
Somehow I didn't convey the fact that my "recipe approach" to plotting was first and foremost meant to facilitate the implementation of the plots, so that it is more robust, readable and debuggable. That is its main advantage, on top of others like flexibility, didactic value and customizability.
You will find below a more complex example of how it could look for estimate_response (of linear models only for now).
(Note: currently the plotting only draws lines, which might not be appropriate when the first predictor is a factor or when there are only factors; but we can address this once we have the plot for estimate_means, which we could call in some cases to plot the output of estimate_predicted.)
Note 2: the call to visualisation_recipe can actually be bypassed altogether (as in the last examples), and calling plot(estimate_response(m)) would suffice. The visualisation_recipe middle step is useful because it separates the plot rendering from the plot info, which IMO makes it easier to code, debug and implement.
library(modelbased)
library(see)
# Simple ------------------------------------------------------------------
# Default
x <- estimate_relation(lm(mpg ~ wt, data = mtcars))
layers <- visualisation_recipe(x)
layers
#> Layer 1
#> --------
#> Geom type: point
#> data, aes, stroke, shape
#>
#> Layer 2
#> --------
#> Geom type: ribbon
#> data, aes, alpha
#>
#> Layer 3
#> --------
#> Geom type: line
#> data, aes
#>
#> Layer 4
#> --------
#> Geom type: labs
#> x, y, title
plot(layers)
# Customize aesthetics
x <- estimate_relation(lm(mpg ~ wt, data = mtcars))
layers <- visualisation_recipe(x,
points = list(color = "red", alpha = 0.6, size = 3),
line = list(color = "blue", size = 4),
ribbon = list(fill = "green", alpha = 0.7),
labs = list(subtitle = "Oh yeah!"))
layers
#> Layer 1
#> --------
#> Geom type: point
#> data, aes, stroke, shape, color, alpha, size
#>
#> Layer 2
#> --------
#> Geom type: ribbon
#> data, aes, alpha, fill
#>
#> Layer 3
#> --------
#> Geom type: line
#> data, aes, color, size
#>
#> Layer 4
#> --------
#> Geom type: labs
#> x, y, title, subtitle
plot(layers)
# 2-ways interaction ------------------------------------------------------------
# Numeric * numeric
x <- estimate_relation(lm(mpg ~ wt * qsec, data = mtcars))
layers <- visualisation_recipe(x)
plot(layers)
# Factor * numeric
x <- estimate_relation(lm(Sepal.Width ~ Species * Sepal.Length, data = iris))
plot(x)
# 3-ways interaction ------------------------------------------------------------
data <- mtcars
data$vs <- as.factor(data$vs)
data$cyl <- as.factor(data$cyl)
data$new_factor <- as.factor(rep(c("A", "B"), length.out = nrow(mtcars)))
# Numeric * numeric * numeric
x <- estimate_relation(lm(mpg ~ wt * qsec * hp, data = data))
plot(x)
# Numeric * numeric * factor
x <- estimate_relation(lm(mpg ~ wt * am * vs, data = data))
plot(x)
# Numeric * factor * factor
x <- estimate_relation(lm(mpg ~ wt * cyl * new_factor, data = data))
plot(x)
Created on 2021-05-31 by the reprex package (v1.0.0)
I am worried about two things now: how the aes() is handled programmatically (e.g., for each layer), and whether the resulting plot can still be extended with regular ggplot2 code, for example:
library(modelbased)
library(see)
plot(estimate_relation(lm(mpg ~ wt, data = mtcars))) +
see::theme_modern()
Created on 2021-05-31 by the reprex package (v1.0.0)
Do you have an example of something that would not work? If the issue is the missing "main" data (and "main" aesthetics?), it could easily be added, I think.
I think the recipe approach is, on the contrary, much safer than previous ones, because all the elements remain as data (specifically, lists) until the very end, when ggplot2::layer() is used to create the geoms (which is what it was made for), layer by layer. The only "sensitive" step is probably the one where multiple geoms are put into a list, but we used that approach in see in other instances and it seems to work.
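To make the idea concrete, here is a rough, hypothetical sketch of such a rendering loop (this is not the actual see code; it assumes each layer is a list with geom, data, aes and extra parameters, and ignores non-geom entries such as labs):

render_recipe <- function(layers) {
  geoms <- lapply(layers, function(l) {
    # everything that is not geom / data / aes is passed as a fixed parameter
    extra <- l[setdiff(names(l), c("geom", "data", "aes"))]
    ggplot2::layer(
      geom = l$geom, stat = "identity", position = "identity",
      data = l$data,
      mapping = do.call(ggplot2::aes_string, l$aes),
      params = extra
    )
  })
  # add the layers one by one to an empty ggplot
  Reduce(`+`, geoms, ggplot2::ggplot())
}

# e.g.:
recipe <- list(
  l1 = list(geom = "point", data = mtcars, aes = list(x = "wt", y = "mpg"), shape = 16),
  l2 = list(geom = "line", data = mtcars[order(mtcars$wt), ], aes = list(x = "wt", y = "mpg"))
)
render_recipe(recipe)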
I think I misread something earlier. Didn't realize you were using aes_string().
Essentially, the other approach (used in other places in see) is to create a plot and then add geoms to it directly and conditionally (depending on args etc.), and this becomes quite messy due to ggplot's inherent declarative syntax and the way it treats geoms and their addition with +. In the recipe approach, instead of adding / editing / working with geoms, we work with lists, which is IMO quite convenient.
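For instance, tweaking a layer before rendering is just ordinary list manipulation (the layer structure shown here is hypothetical):

layer_spec <- list(geom = "point", data = mtcars, aes = list(x = "wt", y = "mpg"))
# user-supplied options can be merged in without touching any ggplot object yet
layer_spec <- utils::modifyList(layer_spec, list(color = "red", size = 3))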
Regarding further extension by users, what about:
x <- estimate_relation(lm(mpg ~ wt * cyl * new_factor, data = data))
plot(x) + scale_color_viridis_d()
and
x <- estimate_relation(lm(mpg ~ wt * cyl * new_factor, data = data))
plot(x) + geom_ribbon(aes(ymin = CI_low, ymax = CI_high))
or
x <- estimate_relation(lm(mpg ~ wt * cyl * new_factor, data = data))
plot(x) + geom_ribbon(aes(ymin = CI_low, ymax = CI_high))
(For the last one, what I want to know is: can a user add another geom without issues?)
Essentially, the other approach (used in other places in see) is to create a plot and then add geoms to it directly and conditionally (depending on args etc.), and this becomes quite messy due to ggplot's inherent declarative syntax and the way it treats geoms and their addition with +.
I'm not really following. If you find + messy, you can do:
Reduce(`+`, list_of_geoms)
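For example, with a plain data frame and stock ggplot2 geoms (just to illustrate the pattern):

library(ggplot2)
base <- ggplot(mtcars, aes(x = wt, y = mpg))
list_of_geoms <- list(geom_point(), geom_smooth(method = "lm", se = FALSE))
Reduce(`+`, list_of_geoms, base)  # equivalent to base + geom_point() + geom_smooth(...)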
library(modelbased)
library(ggplot2)
data <- mtcars
data$vs <- as.factor(data$vs)
data$cyl <- as.factor(data$cyl)
data$new_factor <- as.factor(rep(c("A", "B"), length.out = nrow(mtcars)))
x <- estimate_relation(lm(mpg ~ wt * cyl * new_factor, data = data))
plot(x) + scale_color_viridis_d()
x <- estimate_relation(lm(mpg ~ wt * cyl * new_factor, data = data))
plot(x) + geom_ribbon(aes(ymin = CI_low, ymax = CI_high))
#> Error in FUN(X[[i]], ...): object 'CI_low' not found
x <- estimate_relation(lm(mpg ~ wt * cyl * new_factor, data = data))
plot(x) + geom_ribbon(aes(ymin = CI_low, ymax = CI_high))
#> Error in FUN(X[[i]], ...): object 'CI_low' not found
Created on 2021-05-31 by the reprex package (v1.0.0)
The last two don't work as expected. We could indeed add the main object as the main data, let me try:
library(modelbased)
library(ggplot2)
data <- mtcars
data$vs <- as.factor(data$vs)
data$cyl <- as.factor(data$cyl)
data$new_factor <- as.factor(rep(c("A", "B"), length.out = nrow(mtcars)))
x <- estimate_relation(lm(mpg ~ wt * cyl * new_factor, data = data))
plot(x) + geom_ribbon(aes(x = wt, y = Predicted, ymin = CI_low, ymax = CI_high, fill = cyl))
Created on 2021-05-31 by the reprex package (v1.0.0)
It works "in principle" now that the main data is passed to ggplot()
, but still requires all the aesthetics. We also could in principle add some "main" x and y aesthetics, but I'm worried it will be misleading - if people want to add custom layers, they might as well specify explicitly the data and the aesthetics (especially since they are so easily retrievable from the ingredients-lists created by visualise_recipe
)
I think it's okay so long as the main plot has data.
Side note: instead of aes_string(), you can index directly with the .data pronoun in aes():
x_name <- "varX"
ggplot2::aes(x = .data[[x_name]])
#> Aesthetic mapping:
#> * `x` -> `.data[["varX"]]`
Created on 2021-05-31 by the reprex package (v2.0.0)
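And, just to show the same thing used in a full (illustrative) plot call, since aes_string() is soft-deprecated in recent ggplot2 in favour of the .data pronoun:

library(ggplot2)
x_name <- "wt"
y_name <- "mpg"
ggplot(mtcars, aes(x = .data[[x_name]], y = .data[[y_name]])) +
  geom_point()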
I like aes_string
why do you want to take it away from me 😂
I made a first draft for estimate_means:
library(magrittr)
lm(Sepal.Width ~ Species, data = iris) %>%
modelbased::estimate_means() %>%
plot()
lm(Sepal.Width ~ Species, data = iris) %>%
modelbased::estimate_means() %>%
plot(jitter = list(width = 0.03, color = "red"))
Created on 2021-05-31 by the reprex package (v1.0.0)
# Fully custom using layers
library(modelbased)
library(ggplot2)
library(see)
means <- modelbased::estimate_means(lm(Sepal.Width ~ Species, data = iris))
d <- modelbased::visualisation_recipe(means)
d
#> Layer 1
#> --------
#> Geom type: jitter
#> data, aes, stroke, shape, width
#>
#> Layer 2
#> --------
#> Geom type: line
#> data, aes
#>
#> Layer 3
#> --------
#> Geom type: pointrange
#> data, aes
#>
#> Layer 4
#> --------
#> Geom type: labs
#> x, y, title
ggplot() +
# Collect elements from layer 1 and do something else
geom_violin(data = d$l1$data, aes_string(x = d$l1$aes$x, y = d$l1$aes$y)) +
# Only plot a subset of layers
see::geoms_from_list(d[c("l3", "l4")])
Created on 2021-05-31 by the reprex package (v1.0.0)
I think it's a really flexible and easy-to-program-with solution
I added support for two factors and their modulation by a numeric predictor:
library(modelbased)
library(see)
data <- mtcars
data$cyl <- as.factor(data$cyl)
data$new_factor <- as.factor(rep(c("A", "B"), length.out = nrow(mtcars)))
# Two factors ---------------
model <- lm(mpg ~ new_factor * cyl * wt, data = data)
x <- estimate_means(model, levels = c("new_factor", "cyl"))
#> NOTE: Results may be misleading due to involvement in interactions
plot(visualisation_recipe(x))
# Modulations --------------
x <- estimate_means(model, levels = c("new_factor"), modulate = "wt")
#> NOTE: Results may be misleading due to involvement in interactions
plot(visualisation_recipe(x))
x <- estimate_means(model, levels = c("new_factor", "cyl"), modulate = "wt")
plot(visualisation_recipe(x))
Created on 2021-05-31 by the reprex package (v1.0.0)
There's a slight problem with the dodging of the lines / pointranges; not sure what's going on. Feel free to check out the visualisation_recipe.estimate_means.R code to help me fix it.
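One untested guess, offered as a pointer rather than a diagnosis of the actual code: position_dodge() only lines geoms up when every geom uses the same width and the lines / pointranges share a group aesthetic, e.g.:

library(ggplot2)
means <- as.data.frame(estimate_means(model, levels = c("new_factor", "cyl")))  # model from the snippet above
pos <- position_dodge(width = 0.3)
ggplot(means, aes(x = new_factor, y = Mean, colour = cyl, group = cyl)) +
  geom_line(position = pos) +
  geom_pointrange(aes(ymin = CI_low, ymax = CI_high), position = pos)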
And in general, feel free to comment / roast / improve; the code is pretty straightforward so it should be easy to add options and all!
I added the possibility of changing the display of raw data:
library(modelbased)
x <- estimate_means(lm(Sepal.Width ~ Species, data = iris))
layers <- visualisation_recipe(x, show_data = c("violin", "boxplot", "points"))
plot(layers)
Created on 2021-06-01 by the reprex package (v1.0.0)
But it would be even more awesome if we had a super raincloud-like geom (https://github.com/easystats/see/issues/135)
Similarly, for large datasets where points might not be the best representation, it is now possible to use 2D density geoms instead (or in combination with points):
library(modelbased)
x <- estimate_relation(lm(mpg ~ wt, data = mtcars))
plot(x, show_data = c("density_2d", "points"))
plot(x, show_data = "density_2d_filled")
plot(x, show_data = "density_2d_polygon")
plot(x, show_data = "density_2d_raster", line = list(color = "white"), ribbon = list(fill = "white"))
Created on 2021-06-01 by the reprex package (v1.0.0)
I must admit I'm quite happy about how the visualisation for modelbased is shaping up 😁 , it's quite convenient!
The problem is that now modelbased depends on the latest see
, so for the CRAN release... either I disable these functions until see
is updated, or idk 🤔 @strengejacke what do you think?
I'm a little confused. Does this plot code live in see or in modelbased?
Because the creation of the visualisation ingredients (i.e., the layers) is independent from the rendering and only requires some code logic, it lives in modelbased, close to the main functions, so that it's always up to date. (That also means that, in principle, users can use modelbased to generate the plotting scheme and then render it with whatever alternative they want, like some Python GG module, because reasons.) That said, the default plotting that most users will go with requires see for its plotting functions (geoms_from_list and, someday, the raincloud geoms etc.).
So the code to design the plot lives in modelbased, but the code to render it lives in see.
I think this new update with all of its changes, especially the ones regarding visualization, requires a blog post... I, for one, have no idea how to use any of it!
the blog post's whole content could be: "you can now run plot() on estimate_means and estimate_predicted" ^^ But yeah, I agree; once it's on CRAN and all, I'll slowly start documenting all that more thoroughly.
Does modelbased depend on see, or is it only required for plotting? If the latter, I would include that code, and it will start working once see is updated.
It's only required (conditionally) - you mean adding the geoms_from_list function to modelbased?
The only call to see/ggplot2 is here:
Sorry, I was confused. You're using a new function in see, which of course doesn't work. I would disable that code for now.
Note: the printing of the layers has been improved and made more detailed:
We could add a minimum-version argument to check_if_installed().
Yes, but I think that calling a non-existent function like see::geoms_from_list()
still results in a failure (because it's not exported by the see namespace). So I'm not sure this would resolve this issue?
I think if the call is inside a conditional, it should work?
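A sketch of the conditional pattern being discussed (the version number is a placeholder, and layers stands for a visualisation_recipe object): the see namespace is only touched when the branch actually runs, so loading modelbased does not break against an older see:

if (requireNamespace("see", quietly = TRUE) &&
    utils::packageVersion("see") >= "0.6.4") {  # placeholder minimum version
  see::geoms_from_list(layers)
} else {
  stop("Plotting this object requires a more recent version of the 'see' package.")
}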
For now I cannot use the new check_if_installed() with the dependency arg, because otherwise it would require the newest version of insight 😅
Anyway, since it works on winbuilder as is, I'll submit as soon as the tests (#115) are dealt with, and we'll see.
Closing this, will start adding vignettes/ examples / posts to showcase all this once see is updated
Follow-up of https://github.com/easystats/see/issues/38 (note that the scope of that proposal is broader than just the plot for estimate_predicted).
To summarize, plotting is one of the fundamental skills of data science, often overlooked in favour of results values (them sweet sweet p-values 🤤). The grammar of graphics offers an incredible amount of freedom of expression, creativity, and composability, and for me it is the best approach to plotting of the ones I have tried.
That said, learning to think in terms of GG is a hard process, as (especially in psychology) we are usually taught to think in terms of types of plots with a one-to-one correspondence to statistical models (e.g., correlation - scatterplot, ANOVA - bar chart 🤮, t-test - boxplot, etc.) rather than in terms of layers of geometries coupled with data.
On top of that inherently steep path to the GG mindset, its main implementation in R (ggplot2) adds its own layer (no pun intended) of complexity, with a lot of new argument names and idiosyncratic logic.
This double complexity makes it hard for students and beginners, and it could explain the creation (and popularity) of pre-baked plot creators such as see or ggstatsplot. Don't get me wrong, these pre-baked plots are absolutely necessary, not just to help beginners get stuff done but also for advanced users: I don't see myself re-writing the ggplot code for check_model plots for each model 😁
But the limitation that I see in that landscape is that there are two opposed paths to plot creation: either you build everything from the ground up using geoms (which really makes you think about how and what goes into the plot), or you get a plot made for you (which doesn't really make you think, and doesn't necessarily show you how it was made).
Having been immersed in a Buddhist environment, it is not surprising that I'm longing for a third, noble path that would achieve some form of balance 😅, for instance by exposing, and giving access to, the elements directly underlying a given plot.
One of the early attempts at something like that was the data_plot() and how_to_plot() methods in see. The goal of the former was to transform the input into the data that would be plotted, and of the latter to expose the code used to make the plot, so that users could copy and edit it and this way learn how we make our plots. But this didn't really go anywhere, in part because: 1) it was in practice difficult to have a robust, working how_to_plot() function that would print its internal code; and 2) there is often no unique source of data for complex plots. For instance, the plot for estimate_predicted() has a line and a ribbon, and possibly some points on top of that, not necessarily coming from the same data, so that was problematic for data_plot().
A plot, within the GG framework, is made of layers. Each layer has information pertaining to the form (the geom), the aesthetics, the data, etc. So I thought, why not have that as data, instead of as a plot?
In other words, why not have an ensemble of lists (layers) that would each contain the relevant information? That would make it easy for people to get a sense of how a plot is made (and how to make it, customize it, etc.), and at the same time it would make it easier for a third party such as see to simply "render" this information (e.g., loop through the layers and programmatically create and add them).
Here's a first example of visualisation_recipe():

Created on 2021-05-30 by the reprex package (v1.0.0)