corybrunson / ggalluvial

ggplot2 extension for alluvial plots
http://corybrunson.github.io/ggalluvial/
GNU General Public License v3.0

Enable "skipping" of columns #63

Closed RoganGrant closed 4 years ago

RoganGrant commented 4 years ago

First of all, thank you; this is a fantastic package!

I recognize that this is a rare use-case, but I am making an alluvial diagram to show a workflow. For one group, step 2 is not performed, whereas subsequent steps are. This unfortunately means that no alluvial lines are drawn for this group from column 1. Would it be possible to allow the flow lines to "skip" a column and directly point to a subsequent one? In the attached image, the group of interest would be the topmost (blue-grey).

[Image: Screen Shot 2020-07-30 at 2 42 30 PM]

Minimal code to encounter the same issue:

library(ggplot2)
library(ggalluvial)

# Samples 1 and 2 (group1) skip test2; samples 3 to 5 (group2) take all three tests
ex <- data.frame(
  sample = c(rep(1, 2), rep(2, 2), rep(3, 3), rep(4, 3), rep(5, 3)),
  test = c(rep(c("test1", "test3"), 2), rep(c("test1", "test2", "test3"), 3)),
  group = c(rep("group1", 4), rep("group2", 9))
)

ggplot(ex, aes(x = test, stratum = group, alluvium = sample,
               fill = group, label = group)) +
  geom_flow(stat = "alluvium") +
  geom_stratum()

corybrunson commented 4 years ago

Hi @RoganGrant, thanks for the endorsement!

What you're after is actually pretty common in Sankey plots generally, especially those that allow free-floating nodes rather than stacked strata. Many software packages enable this, but I don't know of any ggplot2 extensions that do.

One thing you can do in ggalluvial is complete the data frame with NA (missing) values of group, which will show up as grey graphical objects by default. Since those who skip one step but continue on to a future step are (I presume) different from those who stop at that step, this may be what you want: it will preserve the gradual shrinking of the stacked histograms, just with one off-color stratum at each axis.

Another option is to complete the data frame in the same way but use the y aesthetic to shrink those boxes to zero height. As a result, the flows will still be plotted but they will not contribute to the height of the stacked histogram. Both options are illustrated below. (I need to better document the difference in behavior between stat_alluvium() and stat_flow() with respect to NA.)

I hope this helps!

library(ggplot2)
library(ggalluvial)
#> Warning: package 'ggalluvial' was built under R version 4.0.2

ex <- data.frame(
  sample = c(rep(1, 2), rep(2,2), rep(3, 3), rep(4, 3), rep(5, 3)),
  test = c(rep(c("test1", "test3"), 2), rep(c("test1", "test2", "test3"), 3)),
  group = c(rep(c("group1"), 4), rep(c("group2"), 9))
)

ggplot(ex, aes(x = test, stratum = group, alluvium = sample,
               fill = group, label = group)) +
  geom_flow(stat = "alluvium") +
  geom_stratum() 


ex <- tidyr::complete(ex, sample, test)

ggplot(ex, aes(x = test, stratum = group, alluvium = sample,
               fill = group, label = group)) +
  geom_alluvium() +
  geom_stratum() 
#> Warning in f(...): Some differentiation aesthetics vary within alluvia, and will be diffused by their first value.
#> Consider using `geom_flow()` instead.


ex$n <- ifelse(is.na(ex$group), 0, 1)
ex <- tidyr::fill(ex, group)

ggplot(ex, aes(x = test, stratum = group, alluvium = sample,
               fill = group, label = group, y = n)) +
  geom_flow(stat = "alluvium") +
  geom_stratum() 

Created on 2020-07-30 by the reprex package (v0.3.0)
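
To make the data-completion step in the reprex above easier to follow, here is a small sketch that only inspects the data frame, without plotting. It reuses the example data from this thread; the name `completed` is introduced here for illustration. Note that `tidyr::fill()` fills downward by default, which works here only because every sample has a `test1` row directly above its missing `test2` row.

```r
library(tidyr)

ex <- data.frame(
  sample = c(rep(1, 2), rep(2, 2), rep(3, 3), rep(4, 3), rep(5, 3)),
  test = c(rep(c("test1", "test3"), 2), rep(c("test1", "test2", "test3"), 3)),
  group = c(rep("group1", 4), rep("group2", 9))
)

# Add a row for every sample/test combination; new rows get group = NA
completed <- complete(ex, sample, test)

# The rows that were added: the skipped test2 step for samples 1 and 2
completed[is.na(completed$group), ]

# Zero height for the added rows, unit height for the observed ones
completed$n <- ifelse(is.na(completed$group), 0, 1)

# Carry the previous group label down into the added rows
completed <- fill(completed, group)
completed[completed$n == 0, ]
```

Passing `y = n` then lets the flows pass through `test2` while the zero-height rows add nothing to the stratum height.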

RoganGrant commented 4 years ago

Thanks so much! Any of these will probably work.

RoganGrant commented 4 years ago

For anyone with the same question, I used a hybrid solution of this and my own to get what I was after:

  1. Set the "skipped" group to NA (or, in my case, just its own factor level) for that x value, as suggested above
  2. Add a new factor column, hide, that flags these samples (just factored as T/F)
  3. Use geom_alluvium() to get complete curves without breaks
  4. Remove the border and fill for the skipped group:
    geom_stratum(aes(alpha = hide, color = hide)) +
    scale_color_manual(name = "",
                      values = c("FALSE" = "black",
                               "TRUE" = alpha("white", 0))) +
    scale_alpha_manual(name = "",
                      values = c("FALSE" = 1,
                               "TRUE" = 0))

Flows straight past, as I wanted!
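
For reference, the four steps above might be assembled roughly as follows. This is only a sketch built on the example data from earlier in the thread, not the exact code used for the real figure: the `"skipped"` level name and the way `hide` is derived from the NA values are assumptions made for illustration.

```r
library(ggplot2)
library(ggalluvial)

ex <- data.frame(
  sample = c(rep(1, 2), rep(2, 2), rep(3, 3), rep(4, 3), rep(5, 3)),
  test = c(rep(c("test1", "test3"), 2), rep(c("test1", "test2", "test3"), 3)),
  group = c(rep("group1", 4), rep("group2", 9))
)

# Step 1: add the missing sample/test rows (group is NA there)
ex <- tidyr::complete(ex, sample, test)

# Step 2: flag the added rows; levels are "FALSE" and "TRUE"
ex$hide <- factor(is.na(ex$group))

# Give the skipped rows their own level ("skipped" is a hypothetical name)
ex$group <- ifelse(is.na(ex$group), "skipped", as.character(ex$group))

# Steps 3 and 4: unbroken alluvia, with the skipped stratum drawn invisibly
p <- ggplot(ex, aes(x = test, stratum = group, alluvium = sample,
                    fill = group, label = group)) +
  geom_alluvium() +
  geom_stratum(aes(alpha = hide, color = hide)) +
  scale_color_manual(name = "",
                     values = c("FALSE" = "black",
                                "TRUE" = alpha("white", 0))) +
  scale_alpha_manual(name = "",
                     values = c("FALSE" = 1, "TRUE" = 0))

p
```

Because `fill` varies within an alluvium here, `geom_alluvium()` is expected to emit the same "diffused by their first value" warning shown earlier in the thread; each alluvium takes the color of its `test1` group.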