corybrunson / ggalluvial

ggplot2 extension for alluvial plots
http://corybrunson.github.io/ggalluvial/
GNU General Public License v3.0

Error: Data is not in a recognized alluvial form #72

Closed andrewd789 closed 3 years ago

andrewd789 commented 3 years ago

Hi, I have some data that looks like this:

data
   group x_axis   strata freq
1      1     T0 Retained  142
2      2     T0 Retained   80
3      3     T0 Retained   38
43     1     T0      New    0
44     2     T0      New    0
45     3     T0      New    0
85     1     T0     Lost    0
86     2     T0     Lost    0
87     3     T0     Lost    0
7      1     T1 Retained   58
8      2     T1 Retained   23
9      3     T1 Retained   11
49     1     T1      New   88
50     2     T1      New   36
51     3     T1      New   18
91     1     T1     Lost   84
92     2     T1     Lost   57
93     3     T1     Lost   27

I am trying to make an alluvial plot from this, but ggalluvial reports that the data is not correctly formatted, and it isn't clear to me why. I can make a plot based on the UCB admission data, as follows:

library(ggplot2)
library(ggalluvial)
library(viridis)  # provides scale_fill_viridis()
UCB_lodes <- to_lodes_form(as.data.frame(UCBAdmissions), axes = 1:3, id = "Cohort")
ggplot(UCB_lodes, aes(y = Freq, x = x, alluvium = Cohort, stratum = stratum)) + 
    geom_alluvium(aes(fill = Cohort)) + 
    geom_stratum() +
    geom_label(stat = "stratum", aes(label = after_stat(stratum))) +
    scale_fill_viridis()

However, when I try to replicate this using the above data (exactly), it tells me there is a problem with the data.

ggplot(data, aes(y = freq, x = x_axis, alluvium = group, stratum = strata)) + 
    geom_alluvium(aes(fill = group)) + 
    geom_stratum() +
    geom_label(stat = "stratum", aes(label = after_stat(stratum))) +
    scale_fill_viridis()
Error in f(...) : 
  Data is not in a recognized alluvial form (see `help('alluvial-data')` for details).
# Check the data format:
is_alluvia_form(data, key = x_axis, value = strata, id = group) # TRUE
is_lodes_form(data, key = x_axis, value = strata, id = group) # FALSE!
  Duplicated id-axis pairings.

There is only one occurrence of each combination of group, x_axis, and strata, as far as I can tell. So which aspects of the data are duplicated? (I'm using R 4.0.3 and ggalluvial_0.12.3.)

corybrunson commented 3 years ago

Hi @andrewd789 and thanks for checking. I believe the issue is that some combinations of group and x_axis are duplicated, even though they have different values of strata. The alluvial plot tracks each group (alluvium) across each x-axis through some stratum, and that stratum must be unique. (The alluvia are not allowed to reverse direction.) So, each group-x_axis pair must appear at most once.

In the example data you provide, rows 1 and 43, for example, have the same group and x_axis, which i expect is triggering the error. (The test for alluvia form is less strict because, when data are in that form, the user is expected to specify each axis rather than a column containing all of the axes.)

If you remove duplicate group-x_axis pairs, do you still get an error? If you need multiple values of strata for some group-x_axis pairs, then you might want a different kind of diagram, though without knowing more i can't make an informed suggestion! But i can try to help further if this doesn't resolve the issue.
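One way to see the duplication described above is to count how often each group and x_axis pair occurs. The following is a minimal base R sketch, assuming the example data from the first post is rebuilt by hand; the helper names `pairs`, `dup`, and `pair_counts` are illustrative and not part of ggalluvial.

```r
# Rebuild the example data from the first post (18 rows).
data <- data.frame(
  group  = rep(1:3, times = 6),
  x_axis = rep(c("T0", "T1"), each = 9),
  strata = rep(rep(c("Retained", "New", "Lost"), each = 3), times = 2),
  freq   = c(142, 80, 38, 0, 0, 0, 0, 0, 0,
             58, 23, 11, 88, 36, 18, 84, 57, 27)
)

# Flag every row whose (group, x_axis) pair occurs more than once.
pairs <- data[, c("group", "x_axis")]
dup <- duplicated(pairs) | duplicated(pairs, fromLast = TRUE)

# Count rows per pair; lodes form needs every count to be at most 1.
pair_counts <- table(data$group, data$x_axis)
print(pair_counts)
```

Here every pair occurs three times (once per value of strata), so every row is flagged, which is consistent with the "Duplicated id-axis pairings" message.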

andrewd789 commented 3 years ago

Hi @corybrunson, thanks for the explanation. It seems that duplicate combinations of group and x_axis are indeed the problem. I found that concatenating the group and strata labels and specifying the result as the alluvium variable fixed the error, but this doesn't produce the plot that I'm after:

data$group_strata = paste(data$group, data$strata)
> head(data)
   freq group x_axis   strata group_strata
1   142     1     T0 Retained   1 Retained
2    80     2     T0 Retained   2 Retained
3    38     3     T0 Retained   3 Retained
43    0     1     T0      New        1 New
44    0     2     T0      New        2 New
45    0     3     T0      New        3 New

is_lodes_form(data, key = x_axis, value = strata, id = group_strata) # TRUE
ggplot(data, 
       aes(y = freq, x = x_axis, alluvium = group_strata, stratum = strata)) + 
  geom_alluvium(aes(fill = group)) + 
  geom_stratum() +
  geom_label(stat = "stratum", aes(label = strata)) +
  scale_fill_viridis()

Result: image

This isn't quite right. The flow from the "Retained" stratum at T0 should be split between the "Retained" and "Lost" strata at T1. Currently, the alluvia are group_strata, so the value for "1 Retained" at T0 (142) correctly flows to the value for "1 Retained" at T1 (58). The remainder (84) should flow to "1 Lost" at T1, but instead this derives from nothing at T0. Presumably, this is because "1 Retained" doesn't match "1 Lost".

This makes me think that the alluvium should be group, not group_strata, because group matches between T0 and T1, but this returns me to the original problem (data not in a recognized alluvial form).

How can I resolve this? Do I need to reorganise the data somehow?

corybrunson commented 3 years ago

@andrewd789 you're right that you need to specify a different alluvium aesthetic, but group is not granular enough for this purpose. ggalluvial is pretty low-level: It expects the user to carefully format the data rather than making any decisions on its own. So, it doesn't understand that row 1 (with freq 142) is being split into rows 7 and 91.

Does freq represent a count, so that the flows should stay the same size from one x-axis to the next? If so, then you might instead format the data with a single freq column and a column for each x-axis value, i.e. a "T0" column and a "T1" column. The "T0" column would always equal "Retained" (or NA, if "T1" is "New"), while the "T1" column would look like the bottom half of the original data frame. This would be "alluvia form" or "wide form".
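As a rough illustration of the wide layout described above, here is a base R sketch that rearranges the example counts by hand. The data frame name `wide` is an assumption made for illustration, and the T0 value is NA for units that only appear at T1.

```r
# One row per flow: a column per x-axis value plus a single freq column.
wide <- data.frame(
  group = rep(1:3, each = 3),
  T0    = rep(c("Retained", "Retained", NA), times = 3),
  T1    = rep(c("Retained", "Lost", "New"), times = 3),
  freq  = c(58, 84, 88,
            23, 57, 36,
            11, 27, 18)
)

# Sanity check: the flows leaving "Retained" at T0 should add up to the
# original T0 counts for each group (142, 80, 38).
at_t0 <- !is.na(wide$T0)
t0_totals <- tapply(wide$freq[at_t0], wide$group[at_t0], sum)
print(t0_totals)
```

In this form the axes would be mapped individually, for example with aes(y = freq, axis1 = T0, axis2 = T1), rather than through a single x column. How the NA entries are drawn is not settled by this sketch and may need separate handling.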

The better way, from a data analysis perspective, would be to put the data in "lodes form" or "long form". (This is necessary if the flows are intended to change size, i.e. height, from one axis to the next.) This would require more rows than the data set currently has, since each row would correspond to a single alluvium at a single x-axis. The freq column would presumably take the same value in every row corresponding to the same alluvium. (It would take different values only if the flows were meant to change size.) And you'd need a new alluvium identifier. For example, row 1 would become two rows, with different identifiers and freq values 58 and 84. The two identifiers would be the same as in the rows currently labeled 7 and 91.
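To make the long layout concrete, here is a minimal sketch for group 1 only, assuming a hand-made identifier column named `key` (a hypothetical name; any unique alluvium id would do). Row 1 of the original data (freq 142) becomes two T0 rows with freq 58 and 84, each sharing its key with the matching T1 row. The plotting part is only constructed if ggalluvial is installed.

```r
# One row per alluvium per axis; `key` identifies the alluvium.
lodes <- data.frame(
  key    = c(1, 1, 2, 2, 3),
  group  = 1,
  x_axis = c("T0", "T1", "T0", "T1", "T1"),
  strata = c("Retained", "Retained", "Retained", "Lost", "New"),
  freq   = c(58, 58, 84, 84, 88)
)

# Each key and x_axis pair now occurs once, as lodes form requires.
no_dups <- !any(duplicated(lodes[, c("key", "x_axis")]))

# The two T0 lodes together recover the original count of 142.
t0_total <- sum(lodes$freq[lodes$x_axis == "T0"])

# Build (but do not print) the plot, only if the packages are available.
if (requireNamespace("ggalluvial", quietly = TRUE)) {
  library(ggplot2)
  library(ggalluvial)
  p <- ggplot(lodes,
              aes(x = x_axis, y = freq, alluvium = key, stratum = strata)) +
    geom_alluvium(aes(fill = factor(key))) +
    geom_stratum()
}
```

The alluvium with key 3 ("New") has no T0 row at all, so it simply starts at T1.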

For some examples of how the data should look in long form, try reproducing the last three plots in the main vignette and examining the data frames in each case.

andrewd789 commented 3 years ago

Thank you! Putting the data into the correct lodes format with more rows, as per your suggestion, has resolved the problem. I failed to understand that an alluvium can be represented by multiple rows in the data, each row containing the alluvium values at a different x-axis. This is what happens with the correctly formatted data:

image

corybrunson commented 3 years ago

@andrewd789 great! That looks right, i assume with more groups than the example above.

By the way, since you include both incoming ("New") and outgoing ("Lost") subjects/units, you might consider two additional features:

  1. Redefine "strata" as a factor variable with "New" first and "Lost" last, or vice-versa.
  2. Negate the "Lost" stratum, either by making its "freq" values negative or by using the negate.strata parameter of the stat layers.
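The two suggestions above could be sketched as follows, assuming a small hand-made lodes-form data frame for one group with a hypothetical `key` identifier. Both ways of negating are shown: flipping the sign of freq in the data, or leaving the data alone and passing negate.strata to the stat layers. The plot is only constructed if ggalluvial is installed.

```r
# Small lodes-form example for one group; `key` is a hypothetical alluvium id.
lodes <- data.frame(
  key    = c(1, 1, 2, 2, 3),
  x_axis = c("T0", "T1", "T0", "T1", "T1"),
  strata = c("Retained", "Retained", "Retained", "Lost", "New"),
  freq   = c(58, 58, 84, 84, 88)
)

# 1. Order the strata so that "New" comes first and "Lost" comes last.
lodes$strata <- factor(lodes$strata, levels = c("New", "Retained", "Lost"))

# 2a. Negate "Lost" in the data by flipping the sign of its freq values.
lodes_neg <- lodes
is_lost <- lodes_neg$strata == "Lost"
lodes_neg$freq[is_lost] <- -lodes_neg$freq[is_lost]

# 2b. Alternatively, keep freq positive and let the stat layers negate it.
if (requireNamespace("ggalluvial", quietly = TRUE)) {
  library(ggplot2)
  library(ggalluvial)
  p <- ggplot(lodes,
              aes(x = x_axis, y = freq, alluvium = key, stratum = strata)) +
    stat_alluvium(aes(fill = factor(key)), negate.strata = "Lost") +
    stat_stratum(negate.strata = "Lost")
}
```

Only one of 2a and 2b should be applied to a given plot; combining them would negate the "Lost" values twice.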

These ideas were used to great effect in this paper (figure shared here), which originally prompted me to introduce the option to negate some strata, to make creating such plots easier. But they don't get used often, and i don't have a natural real-world data set to include with the package to illustrate them. If you know of a public data set with this kind of structure, or if you could share a subset of yours (with attribution, of course), i'd be very glad to be able to include it!

andrewd789 commented 3 years ago

@corybrunson excellent idea, thanks! Here is a further evolution of the above plot, including your suggestions (yes, with more groups). I also want to clearly show the composition of groups within the "New", "Retained" and "Lost" categories on each axis. On the above plot, the alluvia at T0 are grouped according to the two categories at T1, rather than the sole category at T0, for reasons somewhat unclear to me. So I made a new variable by combining "group" and "strata", and assigned that to stratum (instead of just "strata").

> data_g2$freq[data_g2$strata == "Lost"] = -(data_g2$freq[data_g2$strata == "Lost"]) # Make "Lost" values negative
> data_g2$str_group = paste(data_g2$strata, data_g2$group)
> head(data_g2, 12)
   freq           group x_axis   strata key                str_group
1    58      Ascomycota     T1 Retained  35      Retained Ascomycota
2    88      Ascomycota     T1      New  36           New Ascomycota
3   -84      Ascomycota     T1     Lost  37          Lost Ascomycota
4    58      Ascomycota     T0    Start  35         Start Ascomycota
5    84      Ascomycota     T0    Start  37         Start Ascomycota
6    23   Basidiomycota     T1 Retained  38   Retained Basidiomycota
7    36   Basidiomycota     T1      New  39        New Basidiomycota
8   -57   Basidiomycota     T1     Lost  40       Lost Basidiomycota
9    23   Basidiomycota     T0    Start  38      Start Basidiomycota
10   57   Basidiomycota     T0    Start  40      Start Basidiomycota
11   11 Chytridiomycota     T1 Retained  41 Retained Chytridiomycota
12   18 Chytridiomycota     T1      New  42      New Chytridiomycota

> ggplot(data_g2, aes(x = x_axis, y = freq, alluvium = key, stratum = str_group)) + 
  geom_alluvium(aes(fill = group)) + 
  geom_stratum(aes(fill = group)) + 
  scale_fill_viridis_d() + 
  theme_minimal()

image

So now the groups are sensibly organised at T0 and T1. The only possibly unsatisfying aspect is that the "New", "Retained", and "Lost" categories are no longer explicitly labelled and have to be inferred from their relative vertical positions and their flows to T0. This is more aesthetically pleasing but perhaps less simple to interpret.

I would be happy for you to use a portion of this data as an example in due course. Perhaps when the manuscript it's from is in a more developed state.

corybrunson commented 3 years ago

@andrewd789 thanks again, i would be glad to consider the data when it's ready to go public. : )

To showcase both the internal composition of the subjects and the categories of "New", "Retained", and "Lost", you could remove the fill aesthetic from geom_stratum() and add a geom_text(stat = "stratum") call as in several of the package examples. This should result in white, labeled boxes, still with viridis-colored ribbons between them.
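A sketch of that change, assuming only the five Ascomycota rows shown earlier in the thread are rebuilt by hand: the fill aesthetic is dropped from geom_stratum() and a text layer using the stratum stat is added. The label mapping through after_stat(stratum) follows the earlier calls in this thread; the plot is only constructed if ggalluvial is installed.

```r
# The Ascomycota rows of data_g2 as printed above.
data_g2 <- data.frame(
  freq   = c(58, 88, -84, 58, 84),
  group  = "Ascomycota",
  x_axis = c("T1", "T1", "T1", "T0", "T0"),
  strata = c("Retained", "New", "Lost", "Start", "Start"),
  key    = c(35, 36, 37, 35, 37)
)
data_g2$str_group <- paste(data_g2$strata, data_g2$group)

if (requireNamespace("ggalluvial", quietly = TRUE)) {
  library(ggplot2)
  library(ggalluvial)
  p <- ggplot(data_g2,
              aes(x = x_axis, y = freq, alluvium = key, stratum = str_group)) +
    geom_alluvium(aes(fill = group)) +          # ribbons stay coloured by group
    geom_stratum() +                            # no fill aesthetic on the strata
    geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
    scale_fill_viridis_d() +
    theme_minimal()
}
```

With str_group as the stratum, the labels would read like "Retained Ascomycota". Mapping stratum to strata instead would give the shorter category labels, at the cost of the per-group boxes.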

corybrunson commented 3 years ago

Closing this issue to clear up the repo, but do check back @andrewd789 with any updates!