corybrunson / ggalluvial

ggplot2 extension for alluvial plots
http://corybrunson.github.io/ggalluvial/
GNU General Public License v3.0

matching strata names to correct position #87

Closed kaseyzapatka closed 3 years ago

kaseyzapatka commented 3 years ago

Hi @corybrunson,

Thanks so much for this wonderful package. It's so well documented.

I'm having a bit of trouble matching strata names to their correct positions on the alluvial plot and was wondering if you could point me in the right direction.

I'm using ggalluvial to visualize changes in rank of the top 100 metro areas by population over time from 2000 to 2019. Here is a glimpse of the data structure, the code used to produce the plot below, and a link to the data.

# print data
head(rank2000_2019)

  CBSA_name                                     `2000`   `2019` rank00 rank19  change class
  <chr>                                          <dbl>    <dbl>  <dbl>  <dbl>   <dbl> <chr>
1 New York-Newark-Jersey City, NY-NJ-PA       18323037 19294236      1      1  971199 NA   
2 Los Angeles-Long Beach-Anaheim, CA          12365346 13249614      2      2  884268 NA   
3 Chicago-Naperville-Elgin, IL-IN-WI           9098314  9508605      3      3  410291 NA   
4 Philadelphia-Camden-Wilmington, PA-NJ-DE-MD  5687141  6079130      4      8  391989 down 
5 Dallas-Fort Worth-Arlington, TX              5156324  7320663      5      4 2164339 up   
6 Miami-Fort Lauderdale-Pompano Beach, FL      5007562  6090660      6      7 1083098 down 

# plot -------------------------------------------------------------------------
alluvial <-  ggplot(data = rank2000_2019,
       aes(axis1 = rank00, 
           axis2 = rank19)) +
  geom_alluvium(aes(fill = class)) +
  geom_stratum(aes(fill = class), width = .3, fill = "black", alpha=0.1) +
  geom_lode(width = .32) +
  geom_text(stat = "stratum", aes(label = after_stat(forcats::fct_reorder(names$CBSA_name, desc(names$rank19)))), size = 1.9) +
  theme_void()

alluvial

The main problem is that the strata labels aren't in their correct positions. You can see from their rank that the New York metro area was the largest in both 2000 and 2019, so it didn't change position. On the plot, it is at the bottom when it should be at the top, and it is shown as having changed position. I'm not 100% sure inverting them will solve this issue, but I suspect it might, since one of the smallest metros is at the top.

I can't figure out how to reference CBSA_name to plot it correctly (I gather the data is being transformed under the hood?), so I created another object (names) that is just a data frame of CBSA_name and rank and referenced it in geom_text(), but that didn't work. Next, I tried re-ordering using desc(), but that didn't work either.

Any thoughts on how to locate the strata names in their correct positions? Thanks so much.

# create names df to specify strata names
names <- rank2000_2019 %>% 
  select(CBSA_name, rank19)  %>% 
  print()

Rows: 103
Columns: 2

  CBSA_name                                   rank19
  <chr>                                        <dbl>
1 New York-Newark-Jersey City, NY-NJ-PA            1
2 Los Angeles-Long Beach-Anaheim, CA               2
3 Chicago-Naperville-Elgin, IL-IN-WI               3
4 Philadelphia-Camden-Wilmington, PA-NJ-DE-MD      8
5 Dallas-Fort Worth-Arlington, TX                  4
6 Miami-Fort Lauderdale-Pompano Beach, FL          7

Here's the plot so far:

[Figure: alluvial plot]

corybrunson commented 3 years ago

Hi @kaseyzapatka, thanks for raising the issue. I would take the following steps to try to resolve it:

  1. Omit the new object names and set up the plot using only rank2000_2019. This will ensure that whatever data transformations or plotting parameters you use will result in consistent behavior (unless you encounter a bug!).
  2. Set package options globally that will give you the desired ordering of the rectangles. The effects of the options are described in the ordering of the rectangles vignette and any stat_*() documentation, e.g. stat_stratum(), describes how to set them globally. (To locate higher-rank metro areas at greater vertical positions, i think you need to set decreasing = FALSE. Since the strata are treated categorically, you might also have to convert the axis variables rank00 and rank19 to factors in the correct numerical order.)
  3. If (2) fails, then take a preliminary step to convert your data from alluvia (wide) form to lodes (long) form. See the documentation on alluvial formats for how to do this. (This conversion is being done internally anyway, so doing it manually would bring more of the plotting process into the open and under your control.)

If these don't work, please let me know and i'll take a closer look.
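
Steps 2 and 3 above might look roughly like the following. This is only a sketch: `metros` is a three-row excerpt of the ranks quoted in this thread with shortened names, the column names `year`, `rank`, and `metro` are made up for illustration, and the option name `ggalluvial.decreasing` is taken from the package options listed in the `stat_stratum()` documentation.

```r
library(ggalluvial)

# Toy data in alluvia (wide) form: an excerpt of the ranks quoted in this
# thread, with shortened metro names (illustration only).
metros <- data.frame(
  CBSA_name = c("New York", "Philadelphia", "Dallas"),
  rank00 = c(1, 4, 5),
  rank19 = c(1, 8, 4)
)

# Step 2: set the ordering option once, globally, so every layer agrees.
old_opts <- options(ggalluvial.decreasing = FALSE)

# Step 3: convert to lodes (long) form, one row per metro per axis.
# Columns 2 and 3 (rank00, rank19) are the axis variables; the names given
# to key, value, and id are hypothetical.
metros_lodes <- to_lodes_form(
  metros,
  axes = 2:3,
  key = "year", value = "rank", id = "metro"
)
metros_lodes

# Restore the previous option values when done.
options(old_opts)
```

Whether `decreasing = FALSE` gives the desired order for the full data set would still need to be checked against the actual plot.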

kaseyzapatka commented 3 years ago

hmm... I'm still having problems.

I think it's better if the data are organized in lodes (long) form, so I converted them manually (new data is here).

Here's my code and the figure I have now:

# plot
alluvial <-  rank2000_2019_lodes %>% 
      mutate(CBSA_name = as_factor(CBSA_name)) %>% 
      mutate(year = as.numeric(year)) %>%
  ggplot(data = .,
  aes(x = as_factor(year),
           stratum = forcats::fct_reorder(CBSA_name, rank),
           alluvium = rank)) +
  theme_void() +
  geom_flow(aes(fill = class), width = .5,  alpha = 0.7) +
  geom_lode(aes(fill = forcats::fct_reorder(class, year)), width = .5, alpha = 0.7)  +
  # labels
  stat_stratum(geom = "text", aes(label = forcats::fct_reorder(CBSA_name, year), order = rank, decreasing = FALSE)) +
  scale_fill_manual(values = c("down" = "#D2413C", "up" = "#3F5941", "NA" = "white")) 

alluvial

#> Ignoring unknown aesthetics: order, decreasing

[Figure: alluvial plot]

Switching to lodes form gave me more control over ordering, so now they are ordered correctly, except that both 2000 (left-hand side) and 2019 (right-hand side) have the same label order when they shouldn't. The whole point of this sankey is to show how metros have changed in rank between years. So, for example, Dallas moves from 5th (in 2000) to 4th (in 2019). You can see it in the correct place on the right-hand side (2019 axis) but not on the left (2000). I think I need to specify some filter to order the labels by each year's rank instead of assigning the same labels to both axes. order and decreasing don't seem to be recognized and didn't work.

The second problem is the coloring (flow) from 2000 to 2019. All green should be going up, while all red should be going down. I imagine this will be corrected when the first problem is fixed?

Thanks for your help again. I really appreciate it.

Best, Kasey

corybrunson commented 3 years ago

Oh, i should have said previously: Unless all other avenues have been exhausted, don't use variable transformations within a plot layer. I expect the fct_reorder() calls are causing the mismatches between the flows, 2019 lodes, and labels. Same for as_factor(year). Even if the plot doesn't look quite right, it is almost always best to begin with a plot that is consistent and then gradually make aesthetic changes to it. So see if you can do all of the data transformations first, then create the plot using the same aesthetics for every layer. That includes order: Passing rank to that aesthetic was, i think, the right call, but it needs to be done either in ggplot(aes()) or else in every plot layer. Keeping the aesthetics consistent is similar to keeping the options consistent (see previous comment) in that it must be done either once upstream or everywhere downstream.

I don't have time tonight, but if you still can't get it looking the way you want, then i can tinker with it myself this weekend! Glad to be of help where i can, and also gratified that you're making good use of the extension. : )
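
To illustrate the advice in this comment, here is a minimal sketch in which every transformation happens before `ggplot()` and the shared aesthetics are declared once in `ggplot(aes())`. The data frame `metros_lodes` is a made-up excerpt (three metros with shortened names and the ranks quoted in this thread), not the full data set, and the sketch only shows the pattern; it is not meant as the final fix.

```r
library(ggalluvial)

# Toy lodes-form data: excerpt of the ranks quoted in this thread, with
# shortened metro names (illustration only).
metros_lodes <- data.frame(
  CBSA_name = rep(c("New York", "Philadelphia", "Dallas"), times = 2),
  year = rep(c("2000", "2019"), each = 3),
  rank = c(1, 4, 5, 1, 8, 4),
  class = rep(c(NA, "down", "up"), times = 2)
)

# Do every transformation up front, outside the plot layers.
metros_lodes$year <- factor(metros_lodes$year, levels = c("2000", "2019"))
metros_lodes$CBSA_name <- factor(metros_lodes$CBSA_name)

# Declare the shared aesthetics once so that every layer inherits the same
# mapping; no fct_reorder() or as_factor() calls inside aes().
p <- ggplot(metros_lodes,
            aes(x = year, stratum = CBSA_name, alluvium = CBSA_name,
                fill = class)) +
  geom_flow(width = .4, alpha = .7) +
  geom_stratum(width = .4, alpha = .7)
p
```

Because the mapping lives in `ggplot(aes())`, the flow and stratum layers cannot drift apart the way they can when each layer reorders factors on its own.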

kaseyzapatka commented 3 years ago

@corybrunson, I moved all the transformations to before the plotting begins, as you suggested, and now pass order in ggplot(aes()); however, I'm left with a slightly worse plot because the labels are even more messed up now.

alluvial <-  rank2000_2019_lodes %>% 
      mutate(CBSA_name = as_factor(CBSA_name)) %>% 
  ggplot(data = .,
         aes(x = year,
             stratum = CBSA_name,
             alluvium = rank, 
             order = rank,
             decreasing = TRUE)) +
  theme_classic() +
  geom_flow(aes(fill = class), width = .4,  alpha = 0.7) +
  geom_lode(aes(fill = class), width = .4, alpha = 0.7)  +
  # labels
  stat_stratum(geom = "text", aes(label = CBSA_name), decreasing = TRUE) +
  scale_fill_manual(values = c("down" = "#D2413C", "up" = "#3F5941", "NA" = "white")) 

alluvial

[Figure: alluvial plot]

The test for whether the plot is correct: Philly should be in the 4th position in 2000 and move to the 8th position in 2019, while Dallas should move from the 5th position in 2000 to the 4th in 2019. The flows were correct in the previous post's figure but are now out of order, along with the labels.

Thanks for your help, I'd much appreciate it if you could look at it over the weekend. I'm a little exasperated at this point.

Best, Kasey

corybrunson commented 3 years ago

Sure, i'll be glad to try it out myself this weekend! I see a few remaining problems but i'm not completely sure that resolving them will be the end of the story.

corybrunson commented 3 years ago

Hi @kaseyzapatka, i thought more carefully about your data, and i think the code below generates the plot you want:

ggplot(rank2000_2019_lodes,
       aes(x = year, stratum = year, alluvium = CBSA_name, order = rank)) +
  geom_alluvium(aes(fill = class), width = .4, alpha = .7) +
  stat_alluvium(geom = "text", aes(label = CBSA_name)) +
  scale_fill_manual(values = c(down = "#D2413C", up = "#3F5941", `NA` = "white"))

Does it work for you?

For reference, here's how i arrived at it (refer to the ordering of the rectangles vignette for more detail on specific steps):

  1. There really are no "strata" in this plot, only alluvia (with their lodes and flows). For convenience, if only alluvium is specified, then its value is internally passed to stratum. To avoid this, i set stratum to year, simply because it forces the plot to use only one stratum on each axis.
  2. For the same reason, i replaced the separate flow and lode layers with a single alluvium layer.
  3. Again for the same reason, i used the alluvium stat, rather than the stratum stat, to render the text labels.
  4. The ordering parameters like decreasing only apply to strata, so they are no longer appropriate. To order the lodes within each stratum, i specified the aesthetic order = rank. (To reverse the vertical positions, you could specify order = -rank.)
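
The approach described in this comment can be tried on a small stand-in data set. The sketch below assumes a lodes-form data frame with the columns `CBSA_name`, `year`, `rank`, and `class`; the four rows per year are an excerpt of the ranks quoted earlier in this thread with shortened names, and `na.value = "white"` is used here (instead of a named `NA` entry) on the assumption that missing classes are stored as real `NA` values.

```r
library(ggalluvial)

# Stand-in for rank2000_2019_lodes: four metros, two years (illustration only).
metros_lodes <- data.frame(
  CBSA_name = rep(c("New York", "Philadelphia", "Dallas", "Miami"), times = 2),
  year = rep(c("2000", "2019"), each = 4),
  rank = c(1, 4, 5, 6, 1, 8, 4, 7),
  class = rep(c(NA, "down", "up", "down"), times = 2)
)

# One stratum per axis (stratum = year), one alluvium per metro, and the
# lodes within each stratum ordered by that year's rank.
p <- ggplot(metros_lodes,
            aes(x = year, stratum = year, alluvium = CBSA_name, order = rank)) +
  geom_alluvium(aes(fill = class), width = .4, alpha = .7) +
  stat_alluvium(geom = "text", aes(label = CBSA_name)) +
  scale_fill_manual(values = c(down = "#D2413C", up = "#3F5941"),
                    na.value = "white")
p
```

On this toy data, Philadelphia and Dallas should swap relative order between the two axes; if the vertical direction comes out reversed, `order = -rank` flips it.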

kaseyzapatka commented 3 years ago

@corybrunson, this worked perfectly. Thanks for all your help and the detailed explanation too. I guess I assumed the CBSA_name values were the "strata", with every element being its own stratum, but now that you mention it, it seems obvious that there are no "strata". I think the rest of the code makes sense now that the strata are out of the equation.

Thanks again, really appreciate it. I think I'll use several different iterations of these alluvial plots in my work. Will make sure to cite appropriately!

Posting the final plot for posterity:

[Figure: final alluvial plot]