corybrunson / ggalluvial

ggplot2 extension for alluvial plots
http://corybrunson.github.io/ggalluvial/
GNU General Public License v3.0

[Question] Bacterial Relative Abundance Over Time #136

Open AntonKjellberg opened 1 month ago

AntonKjellberg commented 1 month ago

Hi!

What an amazing package.

I'm trying to display how the mean relative abundance of different bacterial genera develops over time. I filtered for IDs that have samples for all three time points; however, I still can't get it to work.

Here is a fraction of the dataset. abundance has one entry for every combination of time (3), id (only 2 here), and genus (11), so 3 × 2 × 11 = 66 rows in total.

library(tibble)
library(ggplot2)
library(ggalluvial)

toy <- tibble(
    id = c(rep(1, 33), rep(2, 33)),
    abundance = c(
      5.097338e-01, 1.447320e-01, 1.391562e-01, 6.961131e-02, 3.244924e-02, 2.139261e-02, 7.220953e-02,
      5.860208e-03, 2.465460e-03, 2.361152e-03, 2.844761e-05, 9.675987e-01, 1.484639e-02, 1.070846e-02,
      3.937304e-03, 8.777429e-04, 8.275862e-04, 1.203762e-03, 0.000000e+00, 0.000000e+00, 0.000000e+00,
      0.000000e+00, 5.081549e-01, 2.959873e-01, 8.429322e-02, 4.622756e-02, 2.640779e-02, 2.235469e-02,
      1.338496e-02, 2.144936e-03, 1.044612e-03, 0.000000e+00, 0.000000e+00,
      9.718995e-01, 2.220788e-02, 5.055938e-03, 4.302926e-04, 1.434309e-04, 1.195257e-04, 1.195257e-04,
      2.390514e-05, 0.000000e+00, 0.000000e+00, 0.000000e+00, 7.839328e-01, 1.552875e-01, 5.078613e-02,
      5.729054e-03, 2.110704e-03, 8.184364e-04, 5.169072e-04, 4.738316e-04, 2.153780e-04, 1.292268e-04,
      0.000000e+00, 8.063558e-01, 9.668371e-02, 4.877554e-02, 2.358499e-02, 2.120435e-02, 3.081920e-03,
      2.768192e-04, 3.690922e-05, 0.000000e+00, 0.000000e+00, 0.000000e+00),
    genus = c(
      "Staphylococcus", "Haemophilus", "Moraxella", "Corynebacterium", "Streptococcus", "Veillonella", "Other",
      "Gemella", "Escherichia-Shigella", "Neisseria", "Dolosigranulum", "Staphylococcus", "Streptococcus", 
      "Corynebacterium", "Veillonella", "Moraxella", "Gemella", "Other", "Escherichia-Shigella", "Neisseria", 
      "Haemophilus", "Dolosigranulum", "Moraxella", "Staphylococcus", "Corynebacterium", "Streptococcus", 
      "Dolosigranulum", "Veillonella", "Other", "Gemella", "Haemophilus", "Escherichia-Shigella", "Neisseria", 
      "Staphylococcus", "Streptococcus", "Gemella", "Other", "Corynebacterium", "Haemophilus", "Moraxella", 
      "Escherichia-Shigella", "Veillonella", "Neisseria", "Dolosigranulum", "Other", "Streptococcus", 
      "Staphylococcus", "Moraxella", "Veillonella", "Gemella", "Corynebacterium", "Haemophilus", "Dolosigranulum", 
      "Neisseria", "Escherichia-Shigella", "Streptococcus", "Moraxella", "Other", "Staphylococcus", "Dolosigranulum", 
      "Corynebacterium", "Haemophilus", "Gemella", "Veillonella", "Escherichia-Shigella", "Neisseria"),
    time = rep(c("1w", "1m", "3m"), each = 11, times = 2)
  )

ggplot(toy, aes(x = time, stratum = genus, alluvium = id, y = abundance)) +
  geom_stratum() +
  geom_flow()

Error in geom_stratum():
! Problem while computing stat.
ℹ Error occurred in the 1st layer.
Caused by error in setup_data():
! Data is not in a recognized alluvial form (see help('alluvial-data') for details).
Run rlang::last_trace() to see where the error occurred.

corybrunson commented 1 month ago

Hi @AntonKjellberg, thanks for raising the issue.

Looking back, i think the query functions is_alluvia_form() and is_lodes_form() need to be better documented and their parameters overhauled to match the aesthetic mappings. Here's the check you want to run, based on the aesthetic mappings you've specified:

is_lodes_form(toy, key = time, value = genus, id = id)

When i run it, i get the following message:

#> Duplicated id-axis pairings.
#> [1] FALSE

So, the problem is that some values of id appear with the same value of time more than once, which is not allowed in an alluvial plot. In fact, there are many such duplications:

count(toy, time, id)
#> # A tibble: 6 × 3
#>   time     id     n
#>   <chr> <dbl> <int>
#> 1 1m        1    11
#> 2 1m        2    11
#> 3 1w        1    11
#> 4 1w        2    11
#> 5 3m        1    11
#> 6 3m        2    11

You'll need to think carefully about what information you want to convey in the plot. What are the individuals or groups (alluvium) that you want to track across multiple measurements (x), and what values can they take (stratum)? Is there a plot in the examples that is similar to what you want?
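
The duplication check above can be reproduced in base R, which also makes it easy to test candidate identifiers. The sketch below uses a hypothetical stand-in, toy_like, with the same shape as toy (2 ids × 3 time points × 11 genera) and placeholder genus names. The composite identifier id_genus is an assumption for illustration only, not a recommendation from this thread.

```r
# Hypothetical stand-in with the same shape as `toy`:
# 2 ids x 3 time points x 11 genera = 66 rows.
toy_like <- expand.grid(
  genus = paste0("genus_", 1:11),   # placeholder genus names
  time = c("1w", "1m", "3m"),
  id = 1:2,
  stringsAsFactors = FALSE
)

# Each (id, time) pair occurs 11 times, once per genus, so `id` alone
# cannot serve as the alluvium identifier.
dup_id_time <- anyDuplicated(toy_like[c("id", "time")]) > 0

# Candidate identifier (an assumption): one alluvium per id-genus combination.
toy_like$id_genus <- interaction(toy_like$id, toy_like$genus, drop = TRUE)
dup_idgenus_time <- anyDuplicated(toy_like[c("id_genus", "time")]) > 0

print(c(id = dup_id_time, id_genus = dup_idgenus_time))
#>       id id_genus
#>     TRUE    FALSE
```

Whether an id-genus combination is the right unit to track depends on what the plot is meant to convey; the check only confirms that such an identifier would appear at most once per time point.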

AntonKjellberg commented 1 month ago

Thank you for your reply, Cory.

That makes sense! Unfortunately, I still struggle to display the data the way I want.

I want a plot like this, where streams connect the blocks based on the genus abundance within the different ids:

ggplot(toy, aes(x = time, stratum = genus, y = abundance, fill = genus)) +
  geom_stratum()

[image: output of the geom_stratum() call above]

The plot below has the same overall structure, but its data wasn't available. (It uses wave as time, n as abundance, key as genus, and id as the alluvium.)

[image: alluvial plot from the article linked below]

https://longitudinalanalysis.com/visualizing-transitions-in-time-using-r-and-alluvial-graphs/

I couldn't find a similar plot in the examples.

corybrunson commented 1 month ago

Hi @AntonKjellberg—notice from the source that the second plot is based on an id variable derived from a row index when the data were in wide (or "alluvia") form, which is why each value of id only appears once in the same row with any value of wave. In your data, id is manually defined to be several repetitions of only two values, which would only allow for two alluvia in the plot. You'll need a different identifier if you want a similar plot; since i don't know the provenance of your data i don't want to speculate on how it's structured, and therefore how the identifiers should be defined.
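
One reading of the original question (mean relative abundance per genus over time) is that each genus is itself the unit being tracked. The following is a minimal sketch under that assumption, not the maintainer's suggested solution. It uses a hypothetical stand-in for toy with made-up abundances; toy_like, mean_abund, and p are illustrative names. time is converted to a factor because, as a character column, it sorts alphabetically (1m, 1w, 3m), as the count() output earlier in the thread shows.

```r
library(ggplot2)
library(ggalluvial)

# Hypothetical stand-in for `toy`: 2 ids x 3 time points x 3 genera,
# with made-up abundances.
set.seed(1)
toy_like <- expand.grid(
  genus = c("Staphylococcus", "Moraxella", "Other"),
  time = c("1w", "1m", "3m"),
  id = 1:2,
  stringsAsFactors = FALSE
)
toy_like$abundance <- runif(nrow(toy_like))
# Rescale so abundances sum to 1 within each id x time sample.
toy_like$abundance <- ave(
  toy_like$abundance, toy_like$id, toy_like$time,
  FUN = function(x) x / sum(x)
)

# Average over ids: one row per (time, genus), so each genus appears
# exactly once at each time point.
mean_abund <- aggregate(abundance ~ time + genus, data = toy_like, FUN = mean)

# Put the time points in chronological rather than alphabetical order.
mean_abund$time <- factor(mean_abund$time, levels = c("1w", "1m", "3m"))

# Assumption: the genus serves as both stratum and alluvium.
p <- ggplot(
  mean_abund,
  aes(x = time, stratum = genus, alluvium = genus, y = abundance, fill = genus)
) +
  geom_flow() +
  geom_stratum()

p
```

Averaging discards the per-id information, so this shows how the mean composition shifts between time points rather than how individual samples move between genera.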