duplicate id–axis pairings should only throw errors when found within PANELs #65

Closed liupfskygre closed 4 years ago

liupfskygre commented 4 years ago

Hi, I ran into this error with ggalluvial 0.12.1 when running the following code. The same code worked a few days ago, and I cannot figure out why it fails now.

summarized_arch_for_alluvial<-read.delim("alluvia.txt",header=T, check.names = FALSE)

is_alluvia_form(summarized_arch_for_alluvial, weight = "mean") 


ggplot(data = summarized_arch_for_alluvial,
       aes(x = Month, y = mean, alluvium = gene_type_META)) +
  geom_alluvium(aes(fill = gene_type_META, colour = gene_type_META),
                alpha = .75, decreasing = FALSE) +
  theme_bw() +
  scale_fill_manual(values = color_gene) +
  scale_color_manual(values = color_gene) +
  theme(axis.text.x = element_text(angle = -45, hjust = 0),
        text = element_text(size = 16)) +
  facet_grid(Depth ~ Eco_sites, scales = "fixed") +
  theme(legend.position = "bottom") +
  theme(legend.text = element_text(size = 14))

error info

Error in f(...) : Data is not in a recognized alluvial form (see help('alluvial-data') for details).


is_alluvia_form(summarized_arch_for_alluvial)
Missing alluvia for some stratum combinations.
[1] TRUE

but when I run the example code here:

data(Refugees, package = "alluvial")
country_regions <- c(
  Afghanistan = "Middle East",
  Burundi = "Central Africa",
  `Congo DRC` = "Central Africa",
  Iraq = "Middle East",
  Myanmar = "Southeast Asia",
  Palestine = "Middle East",
  Somalia = "Horn of Africa",
  Sudan = "Central Africa",
  Syria = "Middle East",
  Vietnam = "Southeast Asia"
)
Refugees$region <- country_regions[Refugees$country]
ggplot(data = Refugees,
       aes(x = year, y = refugees, alluvium = country)) +
  geom_alluvium(aes(fill = country, colour = country),
                alpha = .75, decreasing = FALSE) +
  scale_x_continuous(breaks = seq(2003, 2013, 2)) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = -30, hjust = 0)) +
  scale_fill_brewer(type = "qual", palette = "Set3") +
  scale_color_brewer(type = "qual", palette = "Set3") +
  facet_wrap(~ region, scales = "fixed") +
  ggtitle("refugee volume by country and region of origin")

all fine (strange)


is_alluvia_form(Refugees)
Missing alluvia for some stratum combinations.
[1] TRUE

Since my dataset has a structure similar to the example dataset, I am not sure what is going wrong here.

> sessionInfo()

R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grid stats graphics grDevices utils datasets methods base

other attached packages:
[1] remotes_2.2.0 vegan_2.5-6 lattice_0.20-41 permute_0.9-5 RColorBrewer_1.1-2 colorspace_1.4-1
[7] ggalluvial_0.12.1 phyloseq_1.30.0 hrbrthemes_0.8.0 viridis_0.5.1 viridisLite_0.3.0 forcats_0.5.0
[13] stringr_1.4.0 dplyr_1.0.1 purrr_0.3.4 readr_1.3.1 tidyr_1.1.1 tibble_3.0.3
[19] tidyverse_1.3.0 ggplot2_3.3.2 alluvial_0.1-2

example file here



corybrunson commented 4 years ago

Hi @liupfskygre and thank you for raising the issue. I'm able to replicate the problem: Your code runs fine in v0.11.3 (the most recent before v0.12.0), but it fails in v0.12.0 as well as v0.12.1. You can revert to v0.11.3 for the time being as follows:

remotes::install_version("ggalluvial", version = "0.11.3")

In fact, the problem seems to have arisen at commit d31ff48204016da7737680d16484c97498d05c09. As of this commit, is_lodes_form() (which is used inside the stats) returns FALSE if any combination of id and axis appears in more than one row of the data—that is, if any alluvium flows through the same axis more than once. It is this checking function, not is_alluvia_form(), that corresponds to setting x, alluvium, and stratum in the aes() part of the ggplot() call, which is why the problem is not detected in the above code.

However, i see what you're trying to do: The same alluvia flow across the same axes in different facets, so in fact there's no conceptual problem with translating the data into an alluvial plot. Rather, the problem is that the check that no duplicate id–axis pairings exist is performed without first grouping the data by the faceting variable. You're right that it's a bug. I'll come back to this within a couple of weeks and send a patch to CRAN ASAP.
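
To illustrate the difference between the two checks, here is a minimal base-R sketch on made-up data. It is not the package's implementation; the data frame `lodes` and its columns `panel`, `alluvium`, `x`, and `y` are hypothetical names chosen for this example.

```r
# Toy lodes-form data: the same alluvium ids recur at the same axes,
# but in different facets (all names here are illustrative).
lodes <- data.frame(
  panel    = rep(c("A", "B"), each = 4),
  alluvium = rep(c("a1", "a2"), times = 4),
  x        = rep(rep(c("t1", "t2"), each = 2), times = 2),
  y        = 1:8
)

# Ungrouped check: does any id-axis pairing appear in more than one row?
dup_global <- anyDuplicated(lodes[, c("alluvium", "x")]) > 0

# Panel-aware check: look for duplicates within each panel separately.
dup_by_panel <- vapply(
  split(lodes, lodes$panel),
  function(d) anyDuplicated(d[, c("alluvium", "x")]) > 0,
  logical(1)
)

dup_global         # TRUE: the ungrouped check flags the data
any(dup_by_panel)  # FALSE: no duplicates inside any single panel
```

The ungrouped check rejects data that is perfectly valid once it is split by facet, which is the behavior described in this issue.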

liupfskygre commented 4 years ago

Thanks for your quick response; I will move back to v0.11.3.


corybrunson commented 4 years ago

@liupfskygre a patch is on its way to CRAN as v0.12.2. I verified that it works on the example you shared, but i should have also asked you if you would give it a try on your original problem. If you have the chance, would you install from main and see if the problem is resolved? Here's how to install if the patch is not on CRAN yet:

epjungd commented 2 years ago

Hello! I'm having trouble making a plot. I want to have 3 axes, as shown in the plot: alluvial_loans_fdi1

I made this plot and it worked, but when I use a dataset in long form (as suggested in issue #72), the plot doesn't show the alluvia.

This is a part of my data: top30companies.csv

This is my code:

library(tidyverse)
library(ggplot2)
library(reprex)
library(ggalluvial)
library(readxl)
library(ggfittext)  # provides geom_fit_text(), used below

top30companies_reprex <- read_csv("top30companies.csv")

top30companies_reprex$variable <- factor(top30companies_reprex$variable, levels = c("corporate_name", "country_name", "bank_name"))

top30companies_reprex$variable2 <- factor(top30companies_reprex$variable2, levels = c("Company", "Country", "Bank"))

is_lodes_form(top30companies_reprex, key = "variable", value = "value", id = "group_strata")

ggplot(top30companies_reprex,
       aes(x = variable2, stratum = value, alluvium = group_strata,
           fill = value, y = freq, label = value)) +
  geom_flow(stat = "alluvium") +
  geom_stratum(na.rm = TRUE) +
  guides(fill = FALSE) +
  geom_fit_text(stat = "stratum", width = 1/4, min.size = 3,
                reflow = T, grow = T)

This is the plot without the alluvia: alluvial_reprex

I don't know how to proceed. Maybe the problem is related to the warning message that I get when I use is_lodes_form() to check my data frame: "Missing id-axis pairings (at some sites)."
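
For reference, that warning indicates that some alluvium identifier does not appear at every axis. A minimal base-R sketch of how one might locate such gaps, on made-up data (the data frame `lodes` and its columns `id` and `axis` are assumptions for illustration, not the columns of the attached file):

```r
# Toy data: id 2 has no row at the "Country" axis (illustrative only).
lodes <- data.frame(
  id   = c(1, 1, 1, 2, 2),
  axis = c("Company", "Country", "Bank", "Company", "Bank")
)

# Cross-tabulate ids against axes; a zero cell is a missing pairing.
pairings <- table(id = lodes$id, axis = lodes$axis)
gaps <- as.data.frame(pairings, stringsAsFactors = FALSE)
gaps <- gaps[gaps$Freq == 0, c("id", "axis")]

gaps  # one row: id "2" is missing at axis "Country"
```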

I would appreciate any kind of help! Thanks in advance

corybrunson commented 2 years ago

Hi @epjungd, thank you for the very clear description of the problem. I cannot get to it right now but should have time within a week.

I see why you commented on this issue rather than open a new issue. Depending on how it plays out, i might ask to make this a new issue.

corybrunson commented 2 years ago

Hi @epjungd, i hope i've resolved the issue. Some alternative code is below, with changes commented to explain what i did.

Basically, i realized that the column "group_strata" took a unique value for each row, making it useless for the alluvium aesthetic, since the alluvium should be the identifier that links values taken at different axes. It looked like "unique_alluvium_entries" was better suited for this role, but it failed the is_lodes_form() test due to the presence of duplicate pairings of the alluvium identifier and the axis variable "variable". There turned out to be only one duplicated identifier, however, so i removed it from the data set and fed the result into your ggplot() call, with only the aforementioned aesthetic specifications changed.
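
The point about identifiers can be seen in a small base-R sketch on made-up data. The data frame `toy` and its columns are hypothetical: `row_id` plays the role of "group_strata" and `flow_id` the role of "unique_alluvium_entries".

```r
# Toy long-format data with three axes (names are illustrative only).
toy <- data.frame(
  axis    = rep(c("Company", "Country", "Bank"), times = 2),
  value   = c("Acme", "US", "BankOne", "Globex", "DE", "BankTwo"),
  row_id  = 1:6,                # unique per row, so it links nothing
  flow_id = rep(1:2, each = 3)  # shared across axes, so it defines a flow
)

# Number of distinct axes that each identifier touches
axes_per_row_id  <- tapply(toy$axis, toy$row_id,  function(a) length(unique(a)))
axes_per_flow_id <- tapply(toy$axis, toy$flow_id, function(a) length(unique(a)))

all(axes_per_row_id == 1)   # TRUE: nothing connects one axis to the next
all(axes_per_flow_id == 3)  # TRUE: each identifier spans all three axes
```

An identifier that touches only one axis cannot produce a flow between axes, which is consistent with the strata being drawn while the alluvia are missing.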

Please let me know if this is not what you're after!

library(tidyverse)
#> Warning: package 'tidyr' was built under R version 4.1.2
#> Warning: package 'readr' was built under R version 4.1.2
#> Warning: package 'dplyr' was built under R version 4.1.2
library(ggalluvial)
# attach library for fit-text geom
library(ggfittext)
# set working directory to ggalluvial local repo

top30companies_reprex <- read_csv("sandbox/issues/top30companies.csv")
#> Rows: 351 Columns: 6
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (4): variable, value, variable2, group_strata
#> dbl (2): unique_alluvium_entries, freq
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

top30companies_reprex$variable <- factor(
  top30companies_reprex$variable,
  levels = c("corporate_name", "country_name", "bank_name")
)

top30companies_reprex$variable2 <- factor(
  top30companies_reprex$variable2,
  levels = c("Company", "Country", "Bank")
)

# check whether using `unique_alluvium_entries` as id satisfies lodes form
is_lodes_form(top30companies_reprex,
              key = "variable",
              value = "value",
              id = "unique_alluvium_entries")
#> Duplicated id-axis pairings.
#> [1] FALSE
# how many duplicated id-axis pairings are there?
top30companies_reprex %>%
  count(variable, unique_alluvium_entries, name = "count") %>%
  count(count)
#> # A tibble: 2 × 2
#>   count     n
#>   <int> <int>
#> 1     1   345
#> 2     2     3
# which `unique_alluvium_entries` appear twice at any axis?
top30companies_reprex %>%
  group_by(unique_alluvium_entries) %>%
  add_count(name = "count") %>%
  filter(count > 3L) %>%
  select(unique_alluvium_entries, variable, count)
#> # A tibble: 6 × 3
#> # Groups:   unique_alluvium_entries [1]
#>   unique_alluvium_entries variable       count
#>                     <dbl> <fct>          <int>
#> 1                      33 country_name       6
#> 2                      33 corporate_name     6
#> 3                      33 bank_name          6
#> 4                      33 corporate_name     6
#> 5                      33 country_name       6
#> 6                      33 bank_name          6
# remove 33 from consideration and render the alluvial plot
top30companies_reprex %>%
  filter(unique_alluvium_entries != 33) %>%
  ggplot(aes(x = variable2,
             stratum = value,
             alluvium = unique_alluvium_entries,
             fill = value,
             y = freq,
             label = value)) +
  geom_flow(stat = "alluvium") +
  geom_stratum(na.rm = TRUE) +
  guides(fill = FALSE) +
  geom_fit_text(stat = "stratum",
                width = 1/4,
                min.size = 3,
                reflow = T,
                grow = T)
#> Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
#> "none")` instead.
#> Warning: Removed 2 rows containing missing values (geom_fit_text).

Created on 2022-02-25 by the reprex package (v2.0.1)

epjungd commented 2 years ago

Thank you very much, I could finally get my plot with your indications!

I leave you the final plot that I was trying to do :)


corybrunson commented 2 years ago

You're welcome! And that is an intense plot! Feel free to raise a new issue if you encounter a new problem.