davidsjoberg / ggsankey

Make sankey, alluvial and sankey bump plots in ggplot

question: What makes a ribbon cross over other ribbons in a sankey plot? #37

Open Valentin-Bio opened 1 year ago

Valentin-Bio commented 1 year ago

Hello developer!

I'm using geom_sankey() to plot microbial taxonomies by given taxonomic ranks. This is what I did:

colnames(taxonomy_table)

"Domain" "Phylum" "Class" "Order" "Family" "Genus"

tableforsankey <- taxonomy_table %>% make_long(Domain, Phylum, Class, Order, Family, Genus)


phylum_colors <- c(
  "Bacteria" = "cadetblue3",
  "Proteobacteria" = "antiquewhite2",
  "Cyanobacteria" = "chocolate1",
  "Bacteroidota" = "aquamarine3",
  "Actinobacteriota" = "bisque4",
  "Gammaproteobacteria" = "antiquewhite2",
  "Burkholderiales" = "antiquewhite2",
  "BACL14" = "antiquewhite2",
  "Amylibacter" = "antiquewhite2",
  "Alphaproteobacteria" = "antiquewhite2",
  "Thioglobus" = "antiquewhite2",
  "Rhodobacterales" = "antiquewhite2",
  "Rhizobiales_B" = "antiquewhite2",
  "Pseudomonadales" = "antiquewhite2", 
  "PS1" = "antiquewhite2",
  "Thioglobaceae" = "antiquewhite2",
  "TMED25" = "antiquewhite2",
  "Rhodobacteraceae" = "antiquewhite2",
  "Pseudohongiellaceae" = "antiquewhite2",
  "Methylophilaceae" = "antiquewhite2",
  "Bacteroidia" = "aquamarine3",
  "Flavobacteriales" = "aquamarine3",
  "Flavobacteriaceae" = "aquamarine3",
  "MED-G11" = "aquamarine3",
  "Algibacter_B" = "aquamarine3",
  "Cyanobacteriia" = "chocolate1",
  "PCC-6307" = "chocolate1",
  "Cyanobiaceae" = "chocolate1",
  "Synechococcus_E" = "chocolate1",
  "Synechococcus_C" = "chocolate1", 
  "Acidimicrobiia" = "bisque4",
  "Actinomarinales" = "bisque4",
  "Actinomarinaceae" = "bisque4",
  "Actinomarina" = "bisque4"
)

ggplot(tableforsankey, 
       aes(x = x,
           next_x = next_x,
           node = node,
           next_node = next_node,
           fill = factor(node),
           label = node)) + 
  geom_sankey(flow.alpha = 0.75,node.color = 1, type = "sankey") +
  geom_sankey_label(size = 2.5, color = 1, fill = "aliceblue") + 
  scale_fill_manual(values = phylum_colors) + 
  theme_sankey(base_size = 16) +
  theme(legend.position = "none", axis.text = element_text(size = 9)) +
  xlab("") + ggtitle("Bacteria")

and this is the sankey that I get:

image

The ribbons start to cross each other from the Phylum level onward. Is there a way to draw the sankey plot so that ribbons do not cross over other ribbons?

best regards,

Valentín.

giacomomutti commented 8 months ago

Hey Valentin, I am in a very similar situation, did you find a solution for this?

Valentin-Bio commented 8 months ago

Hello @giacomomutti, I could not figure out how to do it.

Best.

keithnewman commented 8 months ago

Your nodes have character names, so standard ggplot behaviour is to treat them as categories and display them in alphabetical order. If you look at each x coordinate (or column, if you prefer to think of it that way), the nodes are in alphabetical order, with A at the bottom and Z at the top, but with capital letters sorting before their lower-case equivalents (note that TMED25 comes before Thioglobaceae). This ordering determines the node locations, which is what causes the overlaps.
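As a small illustration of the ordering described above, the sketch below sorts three node names from the question with base R. It assumes a C-locale style comparison (forced here with `method = "radix"`), which matches the observed order of TMED25 before Thioglobaceae; whether ggsankey sorts in exactly this way internally is an assumption, not something confirmed in this thread.

```r
# Three node names taken from the colour vector in the question
nodes <- c("Thioglobaceae", "TMED25", "Rhodobacteraceae")

# method = "radix" always sorts character vectors in the C locale,
# where every upper-case letter sorts before every lower-case letter
c_order <- sort(nodes, method = "radix")
print(c_order)
```

In this ordering "TMED25" lands before "Thioglobaceae" because "M" compares lower than "h", even though a dictionary-style sort would put them the other way round.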

To control the order of character labels, you can convert your node and next_node data columns to factors and specify the ordering you want as the factor levels. The nodes will then be ordered by the factor levels rather than alphabetically. The forcats package may help with handling factors.

However, I'm finding factors can mess up the sankey label positioning, which is why I'm browsing the issue board in the first place.
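The factor conversion suggested above can be sketched with base R alone. The toy table below is hypothetical and only mimics the node and next_node columns of a make_long() result; `node_order` is an illustrative, hand-picked ordering, not one taken from the real data.

```r
# Toy long-format table (hypothetical data; only the two node columns are shown)
long <- data.frame(
  node      = c("Bacteria", "Bacteria", "Proteobacteria", "Cyanobacteria"),
  next_node = c("Proteobacteria", "Cyanobacteria", NA, NA),
  stringsAsFactors = FALSE
)

# Desired node order, chosen by hand instead of alphabetically
node_order <- c("Bacteria", "Proteobacteria", "Cyanobacteria")

# Use the same levels for both columns so they stay consistent
long$node      <- factor(long$node,      levels = node_order)
long$next_node <- factor(long$next_node, levels = node_order)

print(levels(long$node))
```

Giving both columns the same level vector matters: if node and next_node carried different levels, the start and end of a ribbon could be ordered inconsistently.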

giacomomutti commented 8 months ago

I solved this issue by converting the node and next_node columns to factors whose levels include all the names in the dataset.

First, arrange your dataset by all the columns you are interested in. Then collect the unique values of all those columns, in that order, into a single vector of levels, and apply it to every column as well as to the node and next_node variables. Both the labels and the sankey will then be positioned correctly.

This may not work if the same label appears at different taxonomic levels. In that case you can add a prefix to each clade, like "c_Haptophyta" and "f_Haptophyta", so that the names are unique, and then remove the prefix for display; in the code below that is the label column.

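# Sort rows by rank so that related taxa end up next to each other
df <- df %>% 
  arrange(phylum, class, order, family, genus, species, count)

# One shared level vector: names in the order they appear after sorting
lvls_tax <- c("Eukaryota", unique(c(unique(df$phylum), unique(df$class),
                                    unique(df$order), unique(df$family), unique(df$genus))))

# Apply the same levels to every rank column
# (species names are not included in lvls_tax, so unmatched values become NA)
df <- df %>% 
  mutate(phylum = factor(phylum, levels = lvls_tax, ordered = TRUE),
         class = factor(class, levels = lvls_tax, ordered = TRUE),
         order = factor(order, levels = lvls_tax, ordered = TRUE),
         family = factor(family, levels = lvls_tax, ordered = TRUE),
         genus = factor(genus, levels = lvls_tax, ordered = TRUE),
         species = factor(species, levels = lvls_tax, ordered = TRUE))

# Reshape to long format and give node and next_node the same levels
df_long <- df %>% 
  make_long(colnames(df)[1:6], value = count) %>% 
  mutate(node = factor(node, lvls_tax), next_node = factor(next_node, lvls_tax),
         label = gsub(".*_", "", node)) %>%  # drop everything up to the last underscore
  filter(!is.na(node))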

ggplot(df_long, aes(x = x, next_x = next_x, node = node, next_node = next_node, fill = node, label=label)) +
  geom_alluvial(space = 2, width = .3, flow.alpha = .6) +
  geom_alluvial_label(size = 2.5, space = 2, color = 1, fill = "aliceblue") +
  theme(legend.position = "none", axis.text.y = element_blank(),
        axis.ticks.y = element_blank(), axis.title.x = element_blank(),
        axis.text.x = element_text(angle=0, family = "Helvetica", colour = "black"))

This is the resulting plot:

image

Hope it helps!
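As a small sketch of the prefix approach described above, the base R snippet below adds a rank prefix and then strips it again for the label. The names and the "c_" prefix are illustrative assumptions. It also shows one caveat: the greedy pattern `".*_"` removes everything up to the last underscore, so a name that itself contains an underscore (such as "Rhizobiales_B" from the original question) would be cut down too far, while an anchored pattern removes only the leading prefix.

```r
# Hypothetical clade names; the second one contains its own underscore
clades   <- c("Haptophyta", "Rhizobiales_B")

# Add a one-letter rank prefix so names are unique across ranks
prefixed <- paste0("c_", clades)

# Greedy pattern: strips everything up to the LAST underscore
greedy   <- gsub(".*_", "", prefixed)

# Anchored pattern: strips only a leading "<letter>_" prefix
anchored <- sub("^[a-z]_", "", prefixed)

print(greedy)
print(anchored)
```

If none of the names contain underscores, both patterns give the same labels, so the anchored version is only needed for datasets like the one in the original question.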