corybrunson / ggalluvial

ggplot2 extension for alluvial plots
http://corybrunson.github.io/ggalluvial/
GNU General Public License v3.0

Longitudinal data #51

Closed joe-jhou2 closed 4 years ago

joe-jhou2 commented 4 years ago

I have a mock example like this:

data
   TP1  TP2    TP3      V Freq
 1  D1 D1D1 D1D1D1  IGHV1   10
 2  D1 D1D1 D1D1D2  IGHV2   15
 3  D1 D1D2 D1D1D1  IGHV1   31
 4  D1 D1D2 D1D2D1 IGHV31   22
 5  D1 D1D1 D1D2D1  IGHV2    2
 6  D1 D1D2 D1D1D1  IGHV3    1
 7  D1 D1D2 D1D1D2  IGHV3  111
 8  D1 D1D2 D1D2D2  IGHV4   45
 9  D1 D1D1 D1D1D1  IGHV1   67
10  D1 D1D1 D1D2D2  IGHV4   89
11  D1 D1D2 D1D2D2  IGHV5   23
12  D1 D1D2 D1D2D1  IGHV4   48
13  D1 D1D1 D1D1D1  IGHV2   90
14  D1 D1D2 D1D2D2  IGHV4   12
15  D1 D1D2 D1D2D1  IGHV5   46
16  D1 D1D1 D1D1D2  IGHV3   78
17  D1 D1D2 D1D2D2 IGHV48   90
18  D1 D1D1 D1D2D1  IGHV1  100
19  D1 D1D2 D1D1D2  IGHV4    0
20  D1 D1D1 D1D1D2  IGHV3   20

TP1, TP2, and TP3 are time points. TP1 has a single "segment", D1; TP2 has two "segments", D1D1 and D1D2; and TP3 has four "segments": D1D1D1, D1D1D2, D1D2D1, and D1D2D2.

The plot I made looks like this: [image]

ggplot(data = data,
       aes(axis1 = TP1, axis2 = TP2, axis3 = TP3,
           y = Freq)) +
  scale_x_discrete(limits = c("TP1", "TP2", "TP3"), expand = c(.1, .05)) +
  xlab("Time point") +
  geom_alluvium(aes(fill = V)) +
  geom_stratum() + geom_text(stat = "stratum", infer.label = TRUE) +
  theme_minimal()

What I would like: at TP1 there is only one segment, D1, so I don't want the flows to split into so many substreams there before they continue downstream. For example, IGHV1 should appear only once at TP1 and then split into two at TP2 (D1D1 and D1D2).

corybrunson commented 4 years ago

Hi @mimisikai, thanks for raising the issue, and i think i have the solution. The code below first reproduces your example after reconstructing the data, then uses the aes.bind parameter of stat_alluvium() to rearrange the lodes within each stratum so that those with the same aesthetics (in this case only fill) are adjacent, before they are rearranged according to the default rules. Is this the plot you wanted?

In case you're not familiar with ggplot2 internals: Whenever a layer is produced by a stat or a geom, parameters can be passed to either the stat or geom itself or the geom or stat (respectively) that it is paired with. geom_alluvium() pairs with stat_alluvium() by default, so, when the geom fails to recognize the aes.bind parameter, it passes this parameter to the stat instead. The parameter is documented there, at help(stat_alluvium).

# default alluvium settings
ggplot(data = data,
       aes(axis1 = TP1, axis2 = TP2, axis3 = TP3,
           y = Freq)) +
  scale_x_discrete(limits = c("TP1", "TP2", "TP3"), expand = c(.1, .05)) +
  xlab("Time point") +
  geom_alluvium(aes(fill = V)) +
  geom_stratum() + geom_text(stat = "stratum", infer.label = TRUE) +
  theme_minimal()


# bind by aesthetics
ggplot(data = data,
       aes(axis1 = TP1, axis2 = TP2, axis3 = TP3,
           y = Freq)) +
  scale_x_discrete(limits = c("TP1", "TP2", "TP3"), expand = c(.1, .05)) +
  xlab("Time point") +
  geom_alluvium(aes(fill = V), aes.bind = "alluvia") +
  geom_stratum() + geom_text(stat = "stratum", infer.label = TRUE) +
  theme_minimal()

Created on 2020-04-02 by the reprex package (v0.3.0)
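
To illustrate the parameter passing described above, here is a minimal sketch (not part of the original reply) that sets aes.bind once through the geom and once through the stat directly. The toy data frame and the object names are hypothetical, and the sketch assumes a ggalluvial version in which aes.bind accepts "alluvia", as used in this thread.

```r
library(ggalluvial)

# Toy data (hypothetical), with the same kind of columns as the example above
toy <- data.frame(
  TP1  = c("D1", "D1", "D1", "D1"),
  TP2  = c("D1D1", "D1D2", "D1D1", "D1D2"),
  V    = c("IGHV1", "IGHV1", "IGHV2", "IGHV2"),
  Freq = c(10, 31, 15, 22),
  stringsAsFactors = FALSE
)

base <- ggplot(toy, aes(axis1 = TP1, axis2 = TP2, y = Freq))

# aes.bind given to the geom, which hands it on to its paired stat
p_geom <- base +
  geom_alluvium(aes(fill = V), aes.bind = "alluvia")

# aes.bind given to the stat itself, drawn with the alluvium geom
p_stat <- base +
  stat_alluvium(aes(fill = V), geom = "alluvium", aes.bind = "alluvia")
```

Both objects should describe the same layer; printing either one draws the plot.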

joe-jhou2 commented 4 years ago

Thanks a lot! That's pretty awesome! I'd like to escalate this challenge: ideally, each time point and stratum would have its own Freq data, like this:

data
   TP1  TP2    TP3      V Freq_TP1 Freq_TP2 Freq_TP3
 1  D1 D1D1 D1D1D1  IGHV1       10       12        5
 2  D1 D1D1 D1D1D2  IGHV2       15       12        9
 3  D1 D1D2 D1D1D1  IGHV1       31        3       16
 4  D1 D1D2 D1D2D1 IGHV31       22        4       15
 5  D1 D1D1 D1D2D1  IGHV2        2       15       16
 6  D1 D1D2 D1D1D1  IGHV3        1       18        6
 7  D1 D1D2 D1D1D2  IGHV3      111       19       12
 8  D1 D1D2 D1D2D2  IGHV4       45        3       14
 9  D1 D1D1 D1D1D1  IGHV1       67        3       20
10  D1 D1D1 D1D2D2  IGHV4       89        9       15
11  D1 D1D2 D1D2D2  IGHV5       23        3       14
12  D1 D1D2 D1D2D1  IGHV4       48       11        6
13  D1 D1D1 D1D1D1  IGHV2       90        7        4
14  D1 D1D2 D1D2D2  IGHV4       12       10        6
15  D1 D1D2 D1D2D1  IGHV5       46        8       16
16  D1 D1D1 D1D1D2  IGHV3       78       18       16
17  D1 D1D2 D1D2D2 IGHV48       90       12        7
18  D1 D1D1 D1D2D1  IGHV1      100       12        5
19  D1 D1D2 D1D1D2  IGHV4        0       13       15
20  D1 D1D1 D1D1D2  IGHV3       20       18       12

How can I arrange the data format and plot it?

Thx

corybrunson commented 4 years ago

Aha, this is a separate issue, having to do (as you suspect) with the format of the data. These data are in alluvia form, which ggalluvial currently supports only with fixed frequencies. To allow changes in the y value, you first need to restructure the data. When frequencies are fixed, this is straightforward, using these convenience functions. It took me a while to figure out how to extend the trick to variable frequencies in tidyr, so i'll flag this as a feature to incorporate into to_lodes_form() before the first major release. Thanks!
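
For the fixed-frequency case mentioned above, a minimal sketch of the restructuring with to_lodes_form() might look like the following. This is not part of the original reply: the toy data and the column names passed to key, value, and id are illustrative choices.

```r
library(ggalluvial)

# Toy alluvia-form data (hypothetical): one row per alluvium, one fixed Freq
toy <- data.frame(
  TP1  = c("D1", "D1"),
  TP2  = c("D1D1", "D1D2"),
  V    = c("IGHV1", "IGHV2"),
  Freq = c(10, 15),
  stringsAsFactors = FALSE
)

# Convert to lodes form: one row per alluvium per axis
lodes <- to_lodes_form(toy, axes = 1:2,
                       key = "TP", value = "Seq", id = "ID")

# Lodes-form data use the x / stratum / alluvium aesthetics
# instead of axis1, axis2, ...
p <- ggplot(lodes,
            aes(x = TP, stratum = Seq, alluvium = ID, y = Freq)) +
  geom_alluvium(aes(fill = V)) +
  geom_stratum()
```

Here every alluvium carries the same Freq at each time point; the reprex that follows handles frequencies that change between time points.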

data <- data.frame(
  TP1 = c("D1", "D1", "D1", "D1",
          "D1", "D1", "D1", "D1",
          "D1", "D1", "D1", "D1",
          "D1", "D1", "D1", "D1",
          "D1", "D1", "D1", "D1"),
  TP2 = c("D1D1", "D1D1", "D1D2", "D1D2",
          "D1D1", "D1D2", "D1D2", "D1D2",
          "D1D1", "D1D1", "D1D2", "D1D2",
          "D1D1", "D1D2", "D1D2", "D1D1",
          "D1D2", "D1D1", "D1D2", "D1D1"),
  TP3 = c("D1D1D1", "D1D1D2", "D1D1D1", "D1D2D1",
          "D1D2D1", "D1D1D1", "D1D1D2", "D1D2D2",
          "D1D1D1", "D1D2D2", "D1D2D2", "D1D2D1",
          "D1D1D1", "D1D2D2", "D1D2D1", "D1D1D2",
          "D1D2D2", "D1D2D1", "D1D1D2", "D1D1D2"),
  V = c("IGHV1", "IGHV2", "IGHV1", "IGHV31",
        "IGHV2", "IGHV3", "IGHV3", "IGHV4",
        "IGHV1", "IGHV4", "IGHV5", "IGHV4",
        "IGHV2", "IGHV4", "IGHV5", "IGHV3",
        "IGHV48", "IGHV1", "IGHV4", "IGHV3"),
  Freq_TP1 = c(10, 15, 31, 22, 2, 1, 111, 45, 67, 89,
               23, 48, 90, 12, 46, 78, 90, 100, 0, 20),
  Freq_TP2 = c(12, 12, 3, 4, 15, 18, 19, 3, 3, 9, 
               3, 11, 7, 10, 8, 18, 12, 12, 13, 18),
  Freq_TP3 = c(5, 9, 16, 15, 16, 6, 12, 14, 20, 15,
               14, 6, 4, 6, 16, 16, 7, 5, 15, 12),
  stringsAsFactors = FALSE
)

library(ggalluvial)
#> Loading required package: ggplot2

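# prefix the stratum columns so that they follow the same
# "<variable>_<time point>" naming pattern as the Freq_* columns
names(data)[1:3] <- paste("Seq_", names(data)[1:3], sep = "")
# give each row (alluvium) an identifier that survives the reshape
data$ID <- seq(nrow(data))
# reshape to lodes form: one row per alluvium per time point,
# with one Seq and one Freq column
data <- tidyr::pivot_longer(data, c(Seq_TP1:Seq_TP3, Freq_TP1:Freq_TP3),
                            names_to = c(".value", "TP"),
                            names_sep = "_")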

ggplot(data = data,
       aes(x = TP, stratum = Seq, alluvium = ID,
           y = Freq)) +
  xlab("Time point") +
  geom_alluvium(aes(fill = V), aes.bind = "alluvia") +
  geom_stratum() + geom_text(stat = "stratum", aes(label = Seq))

Created on 2020-04-03 by the reprex package (v0.3.0)
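
To make the reshaping step above easier to follow, here is a small sketch of the same pivot_longer() call pattern on a two-row toy data set (hypothetical values, not from the original reply). The ".value" entry in names_to means that the part of each column name before "_" becomes an output column, while the part after "_" is stored in TP.

```r
# Toy alluvia-form data (hypothetical): per-time-point strata and frequencies
toy <- data.frame(
  ID       = 1:2,
  Seq_TP1  = c("D1", "D1"),
  Seq_TP2  = c("D1D1", "D1D2"),
  Freq_TP1 = c(10, 31),
  Freq_TP2 = c(12, 3),
  stringsAsFactors = FALSE
)

# Seq_TP1 / Seq_TP2 collapse into one Seq column and
# Freq_TP1 / Freq_TP2 into one Freq column, keyed by TP
long <- tidyr::pivot_longer(
  toy,
  cols = -ID,
  names_to = c(".value", "TP"),
  names_sep = "_"
)

print(long)
```

The result has one row per ID and time point, which is the shape that the x, stratum, and alluvium aesthetics expect.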

joe-jhou2 commented 4 years ago

Thanks Cory! Fantastic plot! One small cosmetic suggestion for the new function: it would be really good to be able to separate D1D1, D1D2, etc. so that they look independent and are distributed evenly along the same time point axis.

mbojan commented 4 years ago

You can find a real data example at https://martakolczynska.com/post/polpan-voting-alluvial-plots/ that uses https://github.com/mbojan/alluvial. Perhaps it would be a nice example to show off ggalluvial too.

corybrunson commented 4 years ago

@mbojan it is a very cool example. I've noticed that political scientists have really taken to these diagram types.

@mimisikai to your point about the cosmetics, i think you're suggesting (in my terminology) the option of inserting gaps between the strata so that the stacks at each axis are the same height. Is that right? I've resisted that, since it would undermine the y axis and would not make sense when applied to plots with negative strata.

I've been hand-wavey about this property in the past (see #11, #28, and #30), but i've written it up more carefully for a software paper that should be out soon. I'll post a link to it here, as i'd be grateful for both of your feedback.

corybrunson commented 4 years ago

@mbojan i just opened #54 with more specs on a new data set to include (to also showcase some features that are still in development). If it sounds like any source you know, i'd be very interested!

corybrunson commented 4 years ago

(Closing, as the original issue has been resolved.)