Only show facets for which R squared >= specified value

aphalo / ggpmisc

R package ggpmisc is an extension to ggplot2 and the Grammar of Graphics

https://docs.r4photobiology.info/ggpmisc


Only show facets for which R squared >= specified value #46

Closed ggrothendieck closed 10 months ago

ggrothendieck commented 10 months ago

I would like to be able to show only the facets with a high R squared. The code below does it, but it would be easier if it could be done entirely within ggplot2, i.e. within the second pipeline. In particular, we run lm in the first pipeline and then again, implicitly, in the second pipeline. More importantly, simpler code would be nice. A variation would be to show only the top k panels by R squared, where k is specified.

library(broom)
library(dplyr)
library(ggplot2)
library(ggpmisc)

# find Trees for which R squared >= 0.97.  Here all but tree 4.
Trees <- Orange %>%
  nest_by(Tree) %>%
  summarize(model = list(lm(age ~ circumference, data)), glance(model)) %>%
  filter(r.squared >= 0.97) %>%
  pull(Tree)

# plot
if (length(Trees)) {
  p <- Orange %>%
    filter(Tree %in% Trees) %>%
    ggplot(aes(circumference, age)) +
      geom_point() +
      stat_poly_eq() +
      geom_smooth(method = "lm", se = FALSE) +
      facet_wrap(~ Tree)
  plot(p)
 }
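The top-k variation mentioned above can be sketched in base R. This is a minimal illustration, not part of ggplot2 or ggpmisc: `top_k` is a hypothetical helper name, and the per-tree R squared values are computed with `split()` and `lm()` instead of the nest_by pipeline.

```r
# R squared of age ~ circumference for each tree, using base R only
r2_by_tree <- sapply(
  split(Orange, Orange$Tree, drop = TRUE),
  function(d) summary(lm(age ~ circumference, data = d))$r.squared
)

# Hypothetical helper: names of the k groups with the largest R squared
top_k <- function(r2, k) {
  names(sort(r2, decreasing = TRUE))[seq_len(min(k, length(r2)))]
}

Trees <- top_k(r2_by_tree, k = 3)

# Rows for the selected trees only; this data frame can be passed to ggplot()
Orange_top <- Orange[Orange$Tree %in% Trees, ]
```

The resulting `Trees` vector can replace the one produced by the first pipeline, so the plotting code stays unchanged.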

aphalo commented 10 months ago

I cannot imagine a way of doing the selection within the ggplot code without modifying the data. I am not an expert on how facets are implemented, but in the grammar of graphics the data computed within a layer is not expected to be visible outside that layer. As far as I know, faceting always depends on a variable in the argument passed to data in the call to ggplot(), not on data returned by statistics.

ggrothendieck commented 10 months ago

If that is not possible, what about running ggplot twice, with the first instance generating all panels and the second instance generating only the R squared >= 0.97 panels, making use of the computations done in the first? The idea is that the two ggplot instances would be nearly the same, making the coding simpler.

aphalo commented 10 months ago

The code in ggplot2 statistics runs when the plot is rendered into graphical objects, not before. What is wrong with the approach of subsetting the data before plotting? Furthermore, in most cases not plotting all the data would mislead the viewer.
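To illustrate the point about when statistics run, here is a small sketch using only ggplot2. It assumes nothing beyond `ggplot_build()`, which carries out the build step explicitly and returns the per-layer data, including the values computed by the statistics.

```r
library(ggplot2)

# Defining the plot does not fit any model yet
p <- ggplot(Orange, aes(circumference, age)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~ Tree)

# Building the plot is what runs the statistics; layer 2 is geom_smooth()
built <- ggplot_build(p)
fit_layer <- built$data[[2]]

# Column names of the fitted-line data, one set of rows per panel
names(fit_layer)
```

Because these values exist only after the build step, they cannot influence which rows of the input data are assigned to facets.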

ggrothendieck commented 10 months ago

Subsetting is not terrible, but I was hoping to simplify the code by eliminating the entire first pipeline. Also, in reality there could be many panels while interest lies only in the high R^2 panels. Here is a slightly better example. Without the filtering it generates 12 panels, but with the filtering it generates only 4, so it is easier to focus on what is relevant. Anyway, I will stick with my current solution or look around a bit more for alternatives.

library(broom)
library(dplyr)
library(ggplot2)
library(ggpmisc)

# find Plants for which R squared >= 0.60
Plants <-  CO2 %>%
  nest_by(Plant) %>%
  summarize(model = list(lm(uptake ~ conc, data)), glance(model)) %>%
  filter(r.squared >= 0.60) %>%
  pull(Plant)

# plot
if (length(Plants)) {
  p <- CO2 %>%
    filter(Plant %in% Plants) %>%  # omit this line to see all panels
    ggplot(aes(conc, uptake)) +
      geom_point() +
      stat_poly_eq() +
      geom_smooth(method = "lm", se = FALSE) +
      facet_wrap(~ Plant)
  plot(p)
 }

ggrothendieck commented 10 months ago

I have found ggplot_build and %+% and now have this solution for plotting only the panels with R^2 > 0.6. I assume ggpmisc put the R squared values there. Maybe it could provide an extraction function so that one could simplify the ugly line below that ends with ##.

library(broom)
library(dplyr)
library(ggplot2)
library(ggpmisc)

# build the full plot first; stat_poly_eq() is the second layer
p <- CO2 %>%
  ggplot(aes(conc, uptake)) +
    geom_point() +
    stat_poly_eq() +
    geom_smooth(method = "lm", se = FALSE) +
    facet_wrap(~ Plant)
plot(p)

Plants <- levels(CO2$Plant)[ggplot_build(p)$data[[2]]$r.squared > .6] ##
if (length(Plants)) p %+% filter(CO2, Plant %in% Plants)

If the function were called get.r.squared, say, and defined as

get.r.squared <- function(p) ggplot_build(p)$data[[2]]$r.squared

Then we could write the line marked ## above as

Plants <- levels(CO2$Plant)[get.r.squared(p) > .6]
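The hard-coded `[[2]]` works only while stat_poly_eq() is the second layer. A slightly more defensive sketch looks for the first layer whose built data has an `r.squared` column. It assumes, as reported in this thread, that such a column is present, and that the layer has one row per panel with a `PANEL` column. `find_r_squared` is a hypothetical name, not a ggpmisc function, and the mock list stands in for `ggplot_build(p)$data`.

```r
# Hypothetical helper: takes the list of per-layer data frames found in
# ggplot_build(p)$data and returns r.squared ordered by panel number.
find_r_squared <- function(layer_data) {
  has_r2 <- vapply(layer_data,
                   function(d) "r.squared" %in% names(d),
                   logical(1))
  if (!any(has_r2)) stop("no layer has an 'r.squared' column")
  d <- layer_data[[which(has_r2)[1]]]
  d$r.squared[order(as.integer(as.character(d$PANEL)))]
}

# Mock layer data standing in for ggplot_build(p)$data
mock <- list(
  data.frame(PANEL = factor(1:3), x = 1:3),
  data.frame(PANEL = factor(c(2, 1, 3)), r.squared = c(0.5, 0.9, 0.7))
)
r2 <- find_r_squared(mock)
```

With a real plot, the line marked ## would become `Plants <- levels(CO2$Plant)[find_r_squared(ggplot_build(p)$data) > .6]`.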

markbneal commented 10 months ago

If a neat way is created to help with the filtering of facets, it would be useful to provide a message saying how many facets there were originally and how many are returned. This goes some way toward addressing Pedro's concern about misleading results.
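Such a message could look like the following base R sketch. `keep_groups` is a hypothetical helper, not an existing ggpmisc function, and the R squared values are computed directly with `lm()` for the CO2 example used earlier in the thread.

```r
# R squared of uptake ~ conc for each plant
r2_by_plant <- sapply(
  split(CO2, CO2$Plant, drop = TRUE),
  function(d) summary(lm(uptake ~ conc, data = d))$r.squared
)

# Hypothetical helper: keep the groups at or above the threshold and
# report how many of the original groups remain
keep_groups <- function(r2, threshold) {
  keep <- names(r2)[r2 >= threshold]
  message(sprintf("Showing %d of %d panels with R squared >= %s",
                  length(keep), length(r2), format(threshold)))
  keep
}

Plants <- keep_groups(r2_by_plant, threshold = 0.60)
```

The message goes to the console, not into the figure, so a plot caption stating the same counts may still be needed when the figure is shared.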