corybrunson / ggalluvial

ggplot2 extension for alluvial plots
http://corybrunson.github.io/ggalluvial/
GNU General Public License v3.0

Removing gaps when uneven number of flows #105

Status: Closed (sonsoleslp closed 1 year ago)

sonsoleslp commented 1 year ago

Hi, first of all, thank you very much for your package. I find it very convenient and easy to use, and it is very useful for my current research work. I am working with longitudinal data and ran into an issue when dealing with subjects that have different numbers of time points. For example, for subject A I would have data for 10 time points, but for subject B only 5. I will try to illustrate it with code.

library(dplyr)
library(tidyr)
library(ggplot2)
library(ggalluvial)

dataf <- tibble::tribble(
     ~USER, ~TIMEPOINT, ~STATE,
        1L,         1L, "High",
        1L,         2L,  "Low",
        2L,         1L, "High",
        2L,         2L, "High",
        3L,         1L,  "Low",
        3L,         2L, "High",
        4L,         1L,  "Low",
        5L,         1L,  "Low",
        6L,         1L,  "Low"
     )

As you can see, users 1-3 have two time points and users 4-6 only one. As such, the first time point in the following figure is much taller than the second one, which is expected:

ggplot(dataf, aes(x = TIMEPOINT, stratum = STATE, y=USER, alluvium = USER, fill = STATE)) +
  scale_x_discrete(expand = c(.1, .1)) +
  scale_fill_manual(values=c("#3ea51f","#4285F4","#EA4335")) +
  geom_flow() +
  geom_stratum( ) +
  geom_text(stat = "stratum",  
            aes( label = format(round(after_stat(prop), 2)))) 

To try to make them the same height, I modified my code based on your response to this GitHub issue: https://github.com/corybrunson/ggalluvial/issues/31. That helped me a lot:

flows <- dataf %>% pivot_wider(id_cols=USER, names_from=TIMEPOINT, values_from=STATE)
flows$Freq = 1
flows <- to_lodes_form(flows, key = "TIMEPOINT", axes = 2:3)
flows <- flows %>% 
  na.omit() %>%  
  group_by(TIMEPOINT) %>%
  add_count(Freq, name = "Total") %>% 
  ungroup() %>%
  transform(Prop = Freq/Total)   

ggplot(flows, aes(x = TIMEPOINT, stratum = stratum, y=Prop, alluvium = alluvium, fill = stratum)) +
  scale_x_discrete(expand = c(.1, .1)) +
  scale_fill_manual(values=c("#3ea51f","#4285F4","#EA4335")) +
  geom_flow() +
  geom_stratum( ) +
  geom_text(stat = "stratum",  
            aes( label = format(round(after_stat(prop), 2)))) 

Still, I would also like to scale the flows so that the gap in the blue stratum at the first time point is not there, i.e., so that the gaps are ignored where there is no flow and the existing flows take up 100% of the space. I need something like this: [image]

Thank you so much in advance.

corybrunson commented 1 year ago

Hey, i wanted to get to this earlier, but it may still be a few days. This is just to let you know that i saw this and will address it ASAP!

sonsoleslp commented 1 year ago

Thank you very much!

corybrunson commented 1 year ago

Hi @sonsoleslp and thanks for raising the issue. I think i understand your concern, and i want to suggest that you either really do want the plot with varying axis heights or else want to subset your data to those cases (subjects) with records at all time points.

First, though, i urge a correction: Setting y = USER assigns each case (1L through 6L) a height equal to its numeric value, e.g. case 3L has a height of 3. This is why the first axis of the first plot has height 21 (= sum(seq(6L))) rather than 6. Here's the corrected plot, not bothering to pass a value to the y aesthetic because each case presumably has the same importance and merits the same height:

ggplot(dataf, aes(x = TIMEPOINT, stratum = STATE, alluvium = USER,
                  fill = STATE)) +
  scale_x_discrete(expand = c(.1, .1)) +
  scale_fill_manual(values=c("#3ea51f","#4285F4","#EA4335")) +
  geom_flow() +
  geom_stratum( ) +
  geom_text(stat = "stratum",  
            aes( label = format(round(after_stat(prop), 2)))) 
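
The height arithmetic described above can be checked directly. The sketch below uses base R only and rebuilds just the USER and TIMEPOINT columns of dataf as a plain data frame. The names height_user and height_unit are illustrative and not part of ggalluvial, and the sketch assumes that each row counts as 1 when no y is supplied, as the corrected plot does.

```r
# USER and TIMEPOINT columns from the example data (STATE is not needed here)
dataf <- data.frame(
  USER      = c(1L, 1L, 2L, 2L, 3L, 3L, 4L, 5L, 6L),
  TIMEPOINT = c(1L, 2L, 1L, 2L, 1L, 2L, 1L, 1L, 1L)
)

# Axis heights when y = USER: each row contributes its USER id
height_user <- tapply(dataf$USER, dataf$TIMEPOINT, sum)
# TIMEPOINT 1: 21, TIMEPOINT 2: 6

# Axis heights when every row contributes 1 (assumed default when y is omitted)
height_unit <- tapply(rep(1L, nrow(dataf)), dataf$TIMEPOINT, sum)
# TIMEPOINT 1: 6, TIMEPOINT 2: 3

print(height_user)
print(height_unit)
```

With unit heights the first axis is still twice as tall as the second, because six cases are recorded at time point 1 and only three at time point 2.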

Now, the reason for the gap on the left—still present in this revised plot—is not that the flows have failed to scale as the strata have, but rather that the data contain cases that are confined to one axis (as you explained at the top). To remove the part of the stratum from which no flow emanates would mean to remove these cases from the data:

dataf2 <- dataf[dataf$USER %in% c(1L, 2L, 3L), , drop = FALSE]
ggplot(dataf2, aes(x = TIMEPOINT, stratum = STATE, alluvium = USER,
                   fill = STATE)) +
  scale_x_discrete(expand = c(.1, .1)) +
  scale_fill_manual(values=c("#3ea51f","#4285F4","#EA4335")) +
  geom_flow() +
  geom_stratum( ) +
  geom_text(stat = "stratum",  
            aes( label = format(round(after_stat(prop), 2)))) 

What might make the original plot more palatable is to add flows that shrink to zero when data is missing from an axis. This requires passing a variable to the y aesthetic that takes the value 0 when the case has no value at some axis (i.e. no record at some time point). Working from the same assumption as above, that all cases are equally important, i assign the new variable NUM the value 1 when the case does have a value at the axis:

dataf %>%
  mutate(NUM = 1) %>%
  complete(USER, TIMEPOINT, fill = list(NUM = 0)) ->
  dataf3
ggplot(dataf3, aes(x = TIMEPOINT, stratum = STATE, y = NUM, alluvium = USER,
                   fill = STATE)) +
  scale_x_discrete(expand = c(.1, .1)) +
  scale_fill_manual(values=c("#3ea51f","#4285F4","#EA4335")) +
  geom_flow() +
  geom_stratum( ) +
  geom_text(stat = "stratum",  
            aes( label = format(round(after_stat(prop), 2)))) 

Does that suit your needs?

If i've misunderstood, please do let me know!

Created on 2022-12-08 with reprex v2.0.2

sonsoleslp commented 1 year ago

Hi, Thank you for your help. Unfortunately, none of those options work. The last one is close, as it does not have the gap on the left side, but both "bars" should be the same height. Is that not possible at all?

corybrunson commented 1 year ago

It's not possible using ggalluvial alone, since by design the heights of the bars encode the y totals at each axis. Maybe what you want is to combine your own pre-processing to calculate Prop with the complete() step and the use of y in my third example?
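
One way that combination might look is sketched below. It assumes the dataf from the top of the thread, reuses the NUM and complete() step from the third example, and rescales NUM within each time point so that every axis sums to 1. The names dataf4 and p are illustrative, and the rendered figure has not been checked against the one requested, so treat this as a starting point rather than a confirmed solution.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggalluvial)

# Example data from the top of the thread
dataf <- tibble::tribble(
  ~USER, ~TIMEPOINT, ~STATE,
     1L,         1L, "High",
     1L,         2L,  "Low",
     2L,         1L, "High",
     2L,         2L, "High",
     3L,         1L,  "Low",
     3L,         2L, "High",
     4L,         1L,  "Low",
     5L,         1L,  "Low",
     6L,         1L,  "Low"
)

dataf4 <- dataf %>%
  mutate(NUM = 1) %>%                                 # observed rows weigh 1
  complete(USER, TIMEPOINT, fill = list(NUM = 0)) %>% # padded rows weigh 0
  group_by(TIMEPOINT) %>%
  mutate(Prop = NUM / sum(NUM)) %>%                   # each axis now sums to 1
  ungroup()

# Same plot as the third example, with y = Prop instead of y = NUM
p <- ggplot(dataf4, aes(x = TIMEPOINT, stratum = STATE, y = Prop,
                        alluvium = USER, fill = STATE)) +
  scale_x_discrete(expand = c(.1, .1)) +
  scale_fill_manual(values = c("#3ea51f", "#4285F4", "#EA4335")) +
  geom_flow() +
  geom_stratum() +
  geom_text(stat = "stratum",
            aes(label = format(round(after_stat(prop), 2))))
```

Printing p draws the plot. With this data each of the six rows at time point 1 gets Prop = 1/6 and each of the three observed rows at time point 2 gets Prop = 1/3, so both axes total 1 while the padded rows contribute 0.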

sonsoleslp commented 1 year ago

Yeah, maybe that is the way to go, thank you so much for your time.

corybrunson commented 1 year ago

You're welcome! Feel free to reopen this issue if that option doesn't work or you want to share a better solution.