corybrunson / ggalluvial

ggplot2 extension for alluvial plots
http://corybrunson.github.io/ggalluvial/
GNU General Public License v3.0

Alluvial plot with varying flow widths and constant strata sizes? #75

Closed helske closed 3 years ago

helske commented 3 years ago

There have been a few similar questions both here and on Stack Overflow, but I still haven't quite figured out whether it is possible to obtain a figure like this with ggalluvial:

[image: flow]

So the idea here is that the size of the strata stays constant, but the flows are scaled. Above they are scaled pretty arbitrarily, but in my actual application the scales would be proportional to the starting widths of the input flows to that particular stratum.

So essentially I'd like to be able to define the starting and end width of each of the A-A, A-B, A-C, B-A, B-B, B-C, etc. flows. Is this possible with ggalluvial?

corybrunson commented 3 years ago

@helske i think what this would require is a data transformation ahead of the ggplot() call, which would perform the within-stratum rescaling you have in mind.

There aren't any good examples in the package of how flows change size from axis to axis, but here is one from StackOverflow. It works by taking advantage of lodes (long) form, in which the height (magnitude) of each ribbon is specified separately at each axis (and can therefore change from axis to axis).

It should be possible to use base R (or tidyr + dplyr) to (1) put the data in lodes form and (2) normalize the magnitudes within each axis and stratum to have a constant sum. Once that's done, ggalluvial should produce the plot you want; you'll pass the appropriate variables to the x, stratum, alluvium, and y aesthetics.
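A minimal sketch of those two steps in base R, assuming toy data that is already in lodes form (one row per alluvium per axis). The column names `x`, `stratum`, `alluvium`, and `y` mirror the aesthetics named above; the data values and the `y_norm` column are made up for illustration.

```r
# Toy lodes-form data (hypothetical values): one row per alluvium per axis.
lodes <- data.frame(
  alluvium = rep(1:4, each = 2),
  x        = rep(c("t1", "t2"), times = 4),
  stratum  = c("A", "A",  "A", "B",  "B", "A",  "B", "B"),
  y        = c(3, 3,  1, 1,  2, 2,  2, 2)
)

# Rescale y within each axis-and-stratum cell so that every cell sums to 1.
lodes$y_norm <- ave(
  lodes$y, lodes$x, lodes$stratum,
  FUN = function(v) v / sum(v)
)

# Total height of each stratum at each axis after rescaling.
cell_totals <- tapply(lodes$y_norm, list(lodes$x, lodes$stratum), sum)
print(cell_totals)
```

The rescaled column would then be mapped to the `y` aesthetic, for example `aes(x = x, stratum = stratum, alluvium = alluvium, y = y_norm)`.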

If this doesn't work out, and if you can share a subset of the data to experiment with, i can try to help further!

helske commented 3 years ago

Thanks, I managed to get this to work with only two axes, but adding more breaks things. Here's some fake data:

   x stratum         y alluvium
1  1       1 0.8000000        1
2  2       1 0.5333333        1
3  2       1 0.4000000        2
4  3       1 0.3333333        2
5  1       1 0.1000000        3
6  2       2 0.1428571        3
7  2       1 0.4000000        4
8  3       2 0.3076923        4
9  1       1 0.1000000        5
10 2       3 0.1250000        5
11 2       1 0.2000000        6
12 3       3 0.4000000        6
13 1       2 0.5000000        7
14 2       1 0.3333333        7
15 2       2 0.3000000        8
16 3       1 0.2500000        8
17 1       2 0.4000000        9
18 2       2 0.5714286        9
19 2       2 0.6000000       10
20 3       2 0.4615385       10
21 1       2 0.1000000       11
22 2       3 0.1250000       11
23 2       2 0.1000000       12
24 3       3 0.2000000       12
25 1       3 0.2000000       13
26 2       1 0.1333333       13
27 2       3 0.5000000       14
28 3       1 0.4166667       14
29 1       3 0.2000000       15
30 2       2 0.2857143       15
31 2       3 0.3000000       16
32 3       2 0.2307692       16
33 1       3 0.6000000       17
34 2       3 0.7500000       17
35 2       3 0.2000000       18
36 3       3 0.4000000       18

For reading into R:

d <- structure(list(x = structure(c(1L, 2L, 2L, 3L, 1L, 2L, 2L, 3L,
1L, 2L, 2L, 3L, 1L, 2L, 2L, 3L, 1L, 2L, 2L, 3L, 1L, 2L, 2L, 3L, 
1L, 2L, 2L, 3L, 1L, 2L, 2L, 3L, 1L, 2L, 2L, 3L), .Label = c("1", 
"2", "3"), class = "factor"), stratum = structure(c(1L, 1L, 1L, 
1L, 1L, 2L, 1L, 2L, 1L, 3L, 1L, 3L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 
2L, 2L, 3L, 2L, 3L, 3L, 1L, 3L, 1L, 3L, 2L, 3L, 2L, 3L, 3L, 3L, 
3L), .Label = c("1", "2", "3"), class = "factor"), y = c(0.8, 
0.533333333333333, 0.4, 0.333333333333333, 0.1, 0.142857142857143, 
0.4, 0.307692307692308, 0.1, 0.125, 0.2, 0.4, 0.5, 0.333333333333333, 
0.3, 0.25, 0.4, 0.571428571428571, 0.6, 0.461538461538462, 0.1, 
0.125, 0.1, 0.2, 0.2, 0.133333333333333, 0.5, 0.416666666666667, 
0.2, 0.285714285714286, 0.3, 0.230769230769231, 0.6, 0.75, 0.2, 
0.4), alluvium = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 6L, 
6L, 7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L, 11L, 11L, 12L, 12L, 13L, 
13L, 14L, 14L, 15L, 15L, 16L, 16L, 17L, 17L, 18L, 18L)), class = "data.frame", row.names = c(NA, 
-36L))

And this is what I get:

library(ggalluvial)  # also attaches ggplot2
library(magrittr)    # provides %>%

d %>%
  ggplot(
    aes(
      x = x,
      stratum = stratum,
      alluvium = alluvium,
      y = y,
      fill = stratum
    )
  ) +
  geom_stratum() +
  geom_flow()

[image]

So the flows look like what I wanted, but the middle axis is twice the height of the two others, which makes sense as the height is based on the sum of y. I looked at the majors example in the vignette, where there is one flow across all x-axis points, but there the input and output for one stratum have the same weight, which is not the case here.
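The doubling can be checked directly by summing `y` per axis. This sketch re-enters the fake data above in compact form (values rounded as printed in the table) and uses only base R; `axis_totals` and `stratum_totals` are names introduced here for illustration.

```r
# Fake data from above, rebuilt compactly (y rounded as in the printed table).
d <- data.frame(
  x = factor(rep(c(1, 2, 2, 3), times = 9)),
  stratum = factor(c(1, 1, 1, 1,  1, 2, 1, 2,  1, 3, 1, 3,
                     2, 1, 2, 1,  2, 2, 2, 2,  2, 3, 2, 3,
                     3, 1, 3, 1,  3, 2, 3, 2,  3, 3, 3, 3)),
  y = c(0.8, 0.5333333, 0.4, 0.3333333,  0.1, 0.1428571, 0.4, 0.3076923,
        0.1, 0.125, 0.2, 0.4,            0.5, 0.3333333, 0.3, 0.25,
        0.4, 0.5714286, 0.6, 0.4615385,  0.1, 0.125, 0.1, 0.2,
        0.2, 0.1333333, 0.5, 0.4166667,  0.2, 0.2857143, 0.3, 0.2307692,
        0.6, 0.75, 0.2, 0.4),
  alluvium = rep(1:18, each = 2)
)

# Total height per axis: the middle axis carries both incoming and outgoing lodes.
axis_totals <- tapply(d$y, d$x, sum)
print(round(axis_totals, 3))  # approximately 3, 6, 3

# Height of each stratum (rows) at each axis (columns).
stratum_totals <- tapply(d$y, list(d$stratum, d$x), sum)
print(round(stratum_totals, 3))
```

Each stratum at axis 2 sums to about 2 because it holds one unit of incoming flow and one unit of outgoing flow.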

Separating input and output in the middle into different axes works of course, so if nothing else works perhaps I can hack around somehow to squeeze x=2 and x=3 together: [image]

helske commented 3 years ago

Well this seems to work in this example, but I expect problems if I want for example to have opacity in strata etc...

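# Data from the previous post, with the x-axis separated into 1-2 and 3-4
d$x <- as.numeric(as.character(d$x))  # convert factor labels "1".."4" to numbers
d$x[d$x == 3] <- 2 + 1e-5             # place axis 3 immediately to the right of axis 2
d$x[d$x == 4] <- 3                    # the last axis takes position 3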

[image]
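The same squeezed layout could perhaps be derived straight from the three-axis data rather than from a separate four-axis version. This is only a sketch under the assumption that every alluvium spans exactly two adjacent axes, as in the fake data; the toy rows and the names `start_axis`, `x_split`, and `eps` are illustrative and not part of ggalluvial.

```r
# Toy three-axis data: alluvium 1 runs 1 -> 2, alluvium 2 runs 2 -> 3.
toy <- data.frame(
  x        = c(1, 2, 2, 3),
  stratum  = c("A", "B", "B", "A"),
  y        = c(1, 0.5, 1, 0.4),
  alluvium = c(1, 1, 2, 2)
)

# The axis at which each alluvium starts.
start_axis <- ave(toy$x, toy$alluvium, FUN = min)

# Shift lodes that LEAVE the middle axis slightly to the right, so incoming
# and outgoing lodes no longer stack on the same axis.
eps <- 1e-5
toy$x_split <- ifelse(toy$x == 2 & start_axis == 2, 2 + eps, toy$x)
print(toy)
```

With more than three axes, the same shift would have to be applied at every interior axis.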

corybrunson commented 3 years ago

Oh, i'm sorry—i hadn't noticed that your illustration at the top does not preserve ribbons. So you're right to use geom_flow() rather than geom_alluvium().

Still, it's not clear to me how the data provided would produce the kind of plot in the illustration. In the data, each alluvium only spans two axes, so the flows that meet the middle axis only go leftward or rightward, not both ways. This is better seen using the alluvium geom:

library(ggalluvial)
#> Loading required package: ggplot2
d <- structure(list(
  x = structure(
    c(1L, 2L, 2L, 3L, 1L, 2L, 2L, 3L, 
      1L, 2L, 2L, 3L, 1L, 2L, 2L, 3L, 1L, 2L, 2L, 3L, 1L, 2L, 2L, 3L, 
      1L, 2L, 2L, 3L, 1L, 2L, 2L, 3L, 1L, 2L, 2L, 3L),
    .Label = c("1", "2", "3"),
    class = "factor"),
  stratum = structure(
    c(1L, 1L, 1L, 
      1L, 1L, 2L, 1L, 2L, 1L, 3L, 1L, 3L, 2L, 1L, 2L, 1L, 2L, 2L, 2L, 
      2L, 2L, 3L, 2L, 3L, 3L, 1L, 3L, 1L, 3L, 2L, 3L, 2L, 3L, 3L, 3L, 
      3L),
    .Label = c("1", "2", "3"),
    class = "factor"),
  y = c(0.8, 
        0.533333333333333, 0.4, 0.333333333333333, 0.1, 0.142857142857143, 
        0.4, 0.307692307692308, 0.1, 0.125, 0.2, 0.4, 0.5, 0.333333333333333, 
        0.3, 0.25, 0.4, 0.571428571428571, 0.6, 0.461538461538462, 0.1, 
        0.125, 0.1, 0.2, 0.2, 0.133333333333333, 0.5, 0.416666666666667, 
        0.2, 0.285714285714286, 0.3, 0.230769230769231, 0.6, 0.75, 0.2, 
        0.4),
  alluvium = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 6L, 
               6L, 7L, 7L, 8L, 8L, 9L, 9L, 10L, 10L, 11L, 11L, 12L, 12L, 13L, 
               13L, 14L, 14L, 15L, 15L, 16L, 16L, 17L, 17L, 18L, 18L)),
  class = "data.frame",
  row.names = c(NA, -36L))
ggplot(d, aes(
  x = x,
  stratum = stratum,
  alluvium = alluvium,
  y = y
)) +
  geom_alluvium(aes(fill = stratum)) +
  geom_stratum()
#> Warning in f(...): Some differentiation aesthetics vary within alluvia, and will be diffused by their first value.
#> Consider using `geom_flow()` instead.

Created on 2021-01-28 by the reprex package (v0.3.0)

Do you know what happens to a single unit of observation (corresponding to the "alluvium" column of d) across all three axes?

helske commented 3 years ago

Yes, the flows are completely distinct between axes, i.e. the observations between 1-2 are not connected to observations between 2-3. So in that sense, the figure above with time points 1, 2, 3, 4 with a gap between 2 and 3 might actually be more reasonable (or the one you drew). However, from the point of view of the application, it would make more sense to have the middle axis the same size as the others, with the start/end position of the flows stacked as in my example.

The outgoing rightward flow from each stratum is always unit size in total (consisting of one to three distinct flows here), but the total input from all strata to one stratum varies (consisting of zero to three input flows). But perhaps the non-standard thing here is that I'd like to standardize the stratum and the corresponding input flow sizes at each time point to 1 (per stratum).
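The standardization described here can be sketched in base R. The matrix `P` is read off the axis-1 widths in the fake data posted earlier (each row is one source stratum); `start_width` and `end_width` are names introduced for this illustration.

```r
# Transition matrix for axes 1 -> 2: rows are source strata, columns are
# target strata, and each row sums to 1 (unit outflow per stratum).
P <- matrix(
  c(0.8, 0.1, 0.1,
    0.5, 0.4, 0.1,
    0.2, 0.2, 0.6),
  nrow = 3, byrow = TRUE,
  dimnames = list(from = c("1", "2", "3"), to = c("1", "2", "3"))
)

# Width of each flow where it leaves its source stratum.
start_width <- P

# Width of each flow where it enters its target stratum: divide every column
# by its total so that the inflow to each stratum also sums to 1.
end_width <- sweep(P, 2, colSums(P), "/")

print(round(end_width, 4))
```

The resulting end widths match the axis-2 values of the odd-numbered alluvia in the fake data, for example 0.8 / 1.5 for the flow from stratum 1 to stratum 1.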

corybrunson commented 3 years ago

Hm. I think i understand what you want, and i don't think i've faced this issue before. I'm reasonably sure that you'd have to artificially "join" sequences of 2-axis strands into alluvia before using the flow layer to unjoin them for plotting purposes; that is, this is not a use case for which ggalluvial is designed.

The key distinction to make here is between "alluvial data", which tracks individual observations (or groups of them) across several axes, and the flow data represented in Sankey diagrams, in which the nodes (corresponding to the "strata" of an alluvial plot) can be considered "mixed" as in fluid dynamics models. The flow layer here is meant to simplify an alluvial plot by "mixing" the alluvia within each stratum, but it still only recognizes data that are alluvial to begin with.

There are some Sankey diagram tools for R, though i haven't used them myself. In the short term, that's another option you have. Though i don't think any take advantage of the vertical axis to keep track of the totals along the axes.

I'm not sure how reasonable it would be to enable ggalluvial to handle such flow data, since it would not be recognizable by the alluvium layer, and the flow layer would have to be redesigned to handle both types. The solution would probably be to introduce a third type of data recognized by ggalluvial, formatted as one row per flow (including different columns for the starting axis and the ending axis). The flow layer currently transforms alluvial data to this format internally. This would also mean exporting some new functions to convert between "flows form" and lodes form.
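To make the idea concrete, here is a rough sketch of what such a "flows form" and a conversion to lodes form might look like in base R. Everything here is hypothetical: the column names (`from_axis`, `to_axis`, and so on) and the helper `flows_to_lodes()` are invented for illustration and are not part of ggalluvial.

```r
# Hypothetical "flows form": one row per flow, with separate columns for the
# starting and ending axis, stratum, and width.
flows <- data.frame(
  from_axis    = c(1, 1, 2, 2),
  to_axis      = c(2, 2, 3, 3),
  from_stratum = c("A", "A", "A", "B"),
  to_stratum   = c("A", "B", "B", "B"),
  from_y       = c(0.7, 0.3, 1.0, 1.0),
  to_y         = c(1.0, 1.0, 0.5, 0.5)
)

# Convert to lodes form: each flow becomes its own two-row "alluvium".
flows_to_lodes <- function(flows) {
  id <- seq_len(nrow(flows))
  starts <- data.frame(alluvium = id, x = flows$from_axis,
                       stratum = flows$from_stratum, y = flows$from_y)
  ends <- data.frame(alluvium = id, x = flows$to_axis,
                     stratum = flows$to_stratum, y = flows$to_y)
  lodes <- rbind(starts, ends)
  lodes <- lodes[order(lodes$alluvium, lodes$x), ]
  rownames(lodes) <- NULL
  lodes
}

lodes <- flows_to_lodes(flows)
print(lodes)
```

The output has the same shape as the fake data posted earlier in the thread, so plotted as is it would still stack incoming and outgoing lodes on shared interior axes.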

Does that make sense @helske ? Do you think this would be a generally useful extension, or would it only be rarely used? If there's already a good Sankey generator for the kind of data you have, i'd like to at least point users to that package in case they have the same needs you do. Anyway, it would be a while before i could work on this, but i'll leave the issue open.

helske commented 3 years ago

Thanks, Cory, I see that this kind of data is not quite what alluvial plots are for. The data I'm using is relatively general (essentially transition matrices), but I'm not quite sure yet how useful this kind of visualization is in the end (there are typically a large number of time points, so things naturally get quite messy). I'm not aware of other tools (at least in R) for this kind of visualization, but I'll look around and test a bit more to see if it would make sense to extend the package (I could try to contribute as well in that case).

corybrunson commented 3 years ago

Sounds good. I'll close the issue for now, but feel free to reopen it if you think an extension would be appropriate.