Closed andrewd789 closed 3 years ago
Hi @andrewd789 and thanks for checking. I believe the issue is that some combinations of group and x_axis are duplicated, even though they may have different values of stratum. The alluvial plot will track each group (alluvium) across each x-axis through some stratum, and that stratum must be unique. (The alluvia are not allowed to reverse direction.) So, each group–x_axis pair must appear at most once.
In the example data you provide, rows 1 and 43, for example, have the same group and x_axis, which i expect is triggering the error. (The test for alluvia form is less strict because, when data are in that form, the user is expected to specify each axis rather than a column containing all of the axes.)
If you remove duplicate group–x_axis pairs, do you still get an error? If you need multiple values of stratum for some group–x_axis pairs, then you might want a different kind of diagram, though without knowing more i can't make an informed suggestion! But i can try to help further if this doesn't resolve the issue.
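A minimal base-R sketch of the check described above, assuming a toy data frame that reuses the six T0 rows quoted later in this thread (the T1 rows are omitted). It flags every row whose group–x_axis pair occurs more than once. Dropping duplicates, as in the last step, discards data and is shown only to illustrate the test.

```r
# Toy data: the six T0 rows quoted later in this thread (rows 1-3 and 43-45).
data <- data.frame(
  freq   = c(142, 80, 38, 0, 0, 0),
  group  = c(1, 2, 3, 1, 2, 3),
  x_axis = "T0",
  strata = rep(c("Retained", "New"), each = 3)
)

# Flag every row whose group-x_axis pair appears more than once.
pair <- data[, c("group", "x_axis")]
dup  <- duplicated(pair) | duplicated(pair, fromLast = TRUE)
data[dup, ]

# Keep only the first row of each pair (illustration only: this drops rows).
deduped <- data[!duplicated(pair), ]
```

In this toy data every row is flagged, because each group appears once under "Retained" and once under "New" at the same axis.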
Hi @corybrunson, thanks for the explanation. It would seem that duplicate combinations of group and x_axis are indeed a problem. I found that concatenating the group and strata labels, and specifying that as the alluvium variable, fixed the error, but this doesn't result in the plot that I'm after:
data$group_strata = paste(data$group, data$strata)
> head(data)
freq group x_axis strata group_strata
1 142 1 T0 Retained 1 Retained
2 80 2 T0 Retained 2 Retained
3 38 3 T0 Retained 3 Retained
43 0 1 T0 New 1 New
44 0 2 T0 New 2 New
45 0 3 T0 New 3 New
library(ggplot2)
library(ggalluvial)
library(viridis)  # provides scale_fill_viridis()
is_lodes_form(data, key = x_axis, value = strata, id = group_strata) # TRUE
ggplot(data,
       aes(y = freq, x = x_axis, alluvium = group_strata, stratum = strata)) +
  geom_alluvium(aes(fill = group)) +
  geom_stratum() +
  geom_label(stat = "stratum", aes(label = strata)) +
  scale_fill_viridis()
Result:
This isn't quite right. The flow from the "Retained" stratum at T0 should be split between the "Retained" and "Lost" strata at T1. Currently, the alluvia are group_strata, so the value for "1 Retained" at T0 (142) correctly flows to the value for "1 Retained" at T1 (58). The remainder (84) should flow to "1 Lost" at T1, but instead this derives from nothing at T0. Presumably, this is because "1 Retained" doesn't match "1 Lost".
This makes me think that the alluvium should be group, not group_strata, because this matches between T0 and T1, but this returns me to the original problem (data not in recognized alluvium form).
How can I resolve this? Do I need to reorganise the data somehow?
@andrewd789 you're right that you need to specify a different alluvium aesthetic, but group is not granular enough for this purpose. ggalluvial is pretty low-level: It expects the user to carefully format the data rather than making any decisions on its own. So, it doesn't understand that row 1 (with freq 142) is being split into rows 7 and 91.
Does freq represent a count, so that the flows should stay the same size from one x-axis to the next? If so, then you might instead format the data with a single freq column and a column for each x-axis value, i.e. a "T0" column and a "T1" column. The "T0" column would always equal "Retained" (or NA, if "T1" is "New"), while the "T1" column would look like the bottom half of the original data frame. This would be "alluvia form" or "wide form".
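A rough sketch of what such a wide-form table could look like for a single group. The counts 58 and 84 are the ones quoted above; 88 is the "New" count from the Ascomycota rows posted later in this thread, assumed here to belong to the same group. The object names wide and p_wide are made up for illustration, and the plot is only built if ggalluvial is installed.

```r
# Hypothetical wide-form ("alluvia form") table: one row per flow,
# one column per axis, and a single freq column.
wide <- data.frame(
  freq = c(58, 84, 88),
  T0   = c("Retained", "Retained", NA),  # NA: not present at T0
  T1   = c("Retained", "Lost", "New")
)

# Flow sizes are conserved: the two flows leaving T0 add up to the T0 total.
t0_total <- sum(wide$freq[!is.na(wide$T0)])

if (requireNamespace("ggalluvial", quietly = TRUE)) {
  library(ggplot2)
  library(ggalluvial)
  # In wide form each axis is mapped separately (axis1, axis2, ...).
  p_wide <- ggplot(wide, aes(y = freq, axis1 = T0, axis2 = T1)) +
    geom_alluvium(aes(fill = T1)) +
    geom_stratum()
}
```

How the NA entry is drawn at T0 depends on the layers' missing-value handling, so the resulting figure should be checked rather than assumed.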
The better way, from a data analysis perspective, would be to put the data in "lodes form" or "long form". (It is necessary if the flows are intended to change size, i.e. height, from one axis to the next.) This would require more rows than the data set currently has, since each row would correspond to a single alluvium at a single x-axis. The freq column would presumably take the same value in every row corresponding to the same alluvium. (They would take different values in order for the flows to change size.) And you'd need a new alluvium identifier. For example, row 1 would become two rows, with different identifiers and freq values 58 and 84. The two identifiers would be the same as in the rows currently labeled 7 and 91.
For some examples of how the data should look in long form, try reproducing the last three plots in the main vignette and examining the data frames in each case.
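A small sketch of the long-form layout described above, for one group only. The counts and identifiers (35 to 37) are copied from the Ascomycota rows posted later in this thread. The column name flow_id and the object names are illustrative, and the T0 stratum is labeled "Retained" as suggested above.

```r
# Lodes ("long") form: one row per alluvium per axis.
lodes <- data.frame(
  freq    = c(58, 88, 84, 58, 84),
  x_axis  = c("T1", "T1", "T1", "T0", "T0"),
  strata  = c("Retained", "New", "Lost", "Retained", "Retained"),
  flow_id = c(35, 36, 37, 35, 37)  # hypothetical alluvium identifier
)

# The original row with freq 142 is now two T0 rows (58 and 84), each tied
# by flow_id to its own T1 row.
t0_total <- sum(lodes$freq[lodes$x_axis == "T0"])

# Each flow_id-x_axis pair must appear at most once.
no_dups <- anyDuplicated(lodes[, c("flow_id", "x_axis")]) == 0

if (requireNamespace("ggalluvial", quietly = TRUE)) {
  library(ggplot2)
  library(ggalluvial)
  form_ok <- is_lodes_form(lodes, key = x_axis, value = strata, id = flow_id)
  p_long <- ggplot(lodes, aes(x = x_axis, y = freq,
                              alluvium = flow_id, stratum = strata)) +
    geom_alluvium(aes(fill = strata)) +
    geom_stratum()
}
```

The "New" flow (36) has no T0 row, which is why the table has five rows rather than six.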
Thank you! Putting the data into the correct lodes format with more rows, as per your suggestion, has resolved the problem. I failed to understand that an alluvium can be represented by multiple rows in the data, each row containing the alluvium values at a different x-axis. This is what happens with the correctly formatted data:
@andrewd789 great! That looks right, i assume with more groups than the example above.
By the way, since you include both incoming ("New") and outgoing ("Lost") subjects/units, you might consider two additional features:
negate.strata parameter of the stat layers.
These ideas were used to great effect in this paper (figure shared here), which originally prompted me to introduce the option to negate some strata, to make creating such plots easier. But they don't get used often, and i don't have a natural real-world data set to include with the package to illustrate them. If you know of a public data set with this kind of structure, or if you could share a subset of yours (with attribution, of course), i'd be very glad to be able to include it!
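One way the negate.strata parameter might be used is sketched below, under the assumption that it can be passed through geom_alluvium() and geom_stratum() to their stat layers. The data reuse the Ascomycota counts from this thread with a "Start" stratum at T0. The names flows, flow_id, signed and p_neg are illustrative only.

```r
# All frequencies stay positive; the "Lost" stratum is negated by the layers.
flows <- data.frame(
  freq    = c(58, 88, 84, 58, 84),
  x_axis  = c("T1", "T1", "T1", "T0", "T0"),
  strata  = c("Retained", "New", "Lost", "Start", "Start"),
  flow_id = c(35, 36, 37, 35, 37)
)

# For comparison, the manual alternative: flip the sign of "Lost" by hand.
signed <- ifelse(flows$strata == "Lost", -flows$freq, flows$freq)

if (requireNamespace("ggalluvial", quietly = TRUE)) {
  library(ggplot2)
  library(ggalluvial)
  p_neg <- ggplot(flows, aes(x = x_axis, y = freq,
                             alluvium = flow_id, stratum = strata)) +
    geom_alluvium(aes(fill = strata), negate.strata = "Lost") +
    geom_stratum(negate.strata = "Lost")
}
```

Keeping freq positive and naming the negated strata in the layers leaves the data untouched, whereas the manual sign flip changes the stored values.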
@corybrunson excellent idea, thanks! Here is a further evolution of the above plot, including your suggestions (yes, with more groups). Also, I want to clearly show the composition of groups within the "New", "Retained" and "Lost" categories on each axis. In the above plot, the alluvia at T0 are grouped according to the two categories at T1, rather than the sole category at T0, for reasons somewhat unclear to me. I made a new variable by combining "group" and "strata", and assigned that to stratum (instead of just "strata").
> data_g2$freq[data_g2$strata == "Lost"] = -(data_g2$freq[data_g2$strata == "Lost"]) # Make "Lost" values negative
> data_g2$str_group = paste(data_g2$strata, data_g2$group)
> head(data_g2, 12)
freq group x_axis strata key str_group
1 58 Ascomycota T1 Retained 35 Retained Ascomycota
2 88 Ascomycota T1 New 36 New Ascomycota
3 -84 Ascomycota T1 Lost 37 Lost Ascomycota
4 58 Ascomycota T0 Start 35 Start Ascomycota
5 84 Ascomycota T0 Start 37 Start Ascomycota
6 23 Basidiomycota T1 Retained 38 Retained Basidiomycota
7 36 Basidiomycota T1 New 39 New Basidiomycota
8 -57 Basidiomycota T1 Lost 40 Lost Basidiomycota
9 23 Basidiomycota T0 Start 38 Start Basidiomycota
10 57 Basidiomycota T0 Start 40 Start Basidiomycota
11 11 Chytridiomycota T1 Retained 41 Retained Chytridiomycota
12 18 Chytridiomycota T1 New 42 New Chytridiomycota
> ggplot(data_g2, aes(x = x_axis, y = freq, alluvium = key, stratum = str_group)) +
geom_alluvium(aes(fill = group)) +
geom_stratum(aes(fill = group)) +
scale_fill_viridis_d() +
theme_minimal()
So now the groups are sensibly organised at T0 and T1. The only possibly unsatisfying aspect is that the "New", "Retained", and "Lost" categories are no longer explicitly labelled and have to be inferred from their relative vertical positions and their flows to T0. This is more aesthetically pleasing but perhaps less simple to interpret.
I would be happy for you to use a portion of this data as an example in due course, perhaps when the manuscript it's from is in a more developed state.
@andrewd789 thanks again, i would be glad to consider the data when it's ready to go public. : )
To showcase both the internal composition of the subjects and the categories of "New", "Retained", and "Lost", you could remove the fill aesthetic from geom_stratum() and add a geom_text(stat = "stratum") call as in several of the package examples. This should result in white, labeled boxes, still with viridis-colored ribbons between them.
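A sketch of that suggestion, using the first ten rows of data_g2 shown above (Ascomycota and Basidiomycota). The object names demo, n_labels and p_lab are illustrative. The label is mapped to the strata column, as in the earlier geom_label() call in this thread, on the assumption that the label only needs to be constant within each stratum.

```r
# First ten rows of data_g2 as posted above.
demo <- data.frame(
  freq   = c(58, 88, -84, 58, 84, 23, 36, -57, 23, 57),
  group  = rep(c("Ascomycota", "Basidiomycota"), each = 5),
  x_axis = rep(c("T1", "T1", "T1", "T0", "T0"), times = 2),
  strata = rep(c("Retained", "New", "Lost", "Start", "Start"), times = 2),
  key    = c(35, 36, 37, 35, 37, 38, 39, 40, 38, 40)
)
demo$str_group <- paste(demo$strata, demo$group)

# Sanity check: each stratum at each axis carries exactly one label value.
n_labels <- tapply(demo$strata,
                   interaction(demo$str_group, demo$x_axis, drop = TRUE),
                   function(s) length(unique(s)))

if (requireNamespace("ggalluvial", quietly = TRUE)) {
  library(ggplot2)
  library(ggalluvial)
  p_lab <- ggplot(demo, aes(x = x_axis, y = freq,
                            alluvium = key, stratum = str_group)) +
    geom_alluvium(aes(fill = group)) +
    geom_stratum() +                                   # no fill aesthetic
    geom_text(stat = "stratum", aes(label = strata)) + # category labels
    scale_fill_viridis_d() +
    theme_minimal()
}
```

The ribbons keep the group colours while the boxes carry the category names, so both pieces of information remain visible.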
Closing this issue to clear up the repo, but do check back @andrewd789 with any updates!
Hi, I have some data that looks like this:
I am trying to make an alluvial plot from this, but it tells me the data is not correctly formatted, and it isn't clear to me why. I can make a plot based on the UCB admission data, as follows:
However, when I try to replicate this using the above data (exactly), it tells me there is a problem with the data.
There is only one occurrence of each combination of group, x_axis, and strata, as far as I can tell. So what aspects of the data are duplicated? (I'm using R 4.0.3 and ggalluvial_0.12.3.)