corybrunson / ggalluvial

ggplot2 extension for alluvial plots
http://corybrunson.github.io/ggalluvial/
GNU General Public License v3.0

profile code and optimize bottlenecks #16

Open corybrunson opened 6 years ago

corybrunson commented 6 years ago

Description of the issue

Diagrams for large datasets take a long time to render. The bottlenecks might be due to inefficiencies in the code. Profile the code, identify the bottlenecks, and benchmark alternative implementations. (See this chapter in Advanced R.)
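A minimal sketch of that profile-then-benchmark workflow, using only base R (`Rprof()`, `summaryRprof()`, `system.time()`). The functions `slow_step()` and `fast_step()` are hypothetical stand-ins, not ggalluvial code. For a real plot `p`, the profiled call would presumably be `print(p)` or `ggplot2::ggplot_build(p)`, since a ggplot object does no rendering work until it is built or printed.

```r
# Hypothetical stand-ins for a slow implementation and a candidate replacement;
# neither is part of ggalluvial.
slow_step <- function(n) {
  out <- numeric(0)
  for (k in seq_len(n)) out <- c(out, sqrt(k))  # grows a vector on every iteration
  out
}
fast_step <- function(n) sqrt(seq_len(n))       # vectorized equivalent

n <- 2e4

# 1. Profile: sample the call stack while the slow code runs.
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)
res_slow <- slow_step(n)
Rprof(NULL)

# summaryRprof() errors if no samples were recorded, so fall back to NULL.
prof <- tryCatch(summaryRprof(prof_file), error = function(e) NULL)
if (!is.null(prof)) print(head(prof$by.self))   # functions ranked by self time

# 2. Benchmark: time the alternative and confirm it returns the same result.
res_fast <- fast_step(n)
t_slow <- system.time(slow_step(n))[["elapsed"]]
t_fast <- system.time(fast_step(n))[["elapsed"]]
cat("slow:", t_slow, "s; fast:", t_fast, "s\n")
```

Checking that both versions return the same result before comparing timings keeps the benchmark honest. Packages such as profvis or bench give richer output, but the base tools are enough for a first pass.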

Reproducible example (preferably using reprex::reprex())

(Need a suitable public dataset.)

cenuno commented 6 years ago

@corybrunson This package is awesome. Thank you for taking the time to build it! I would love to help out.

Could you tell me which scripts in your /ggalluvial/R folder are relevant when running the following lines of code?

library(ggplot2)
library(ggalluvial)

data(vaccinations)
levels(vaccinations$response) <- rev(levels(vaccinations$response))
# Note: newer ggalluvial releases take the stratum heights via `y = freq`
# rather than `weight = freq`.
ggplot(vaccinations,
       aes(x = survey, stratum = response, alluvium = subject,
           weight = freq,
           fill = response, label = response)) +
  geom_flow() +
  geom_stratum(alpha = .5) +
  geom_text(stat = "stratum", size = 3) +
  theme(legend.position = "none") +
  ggtitle("vaccination survey responses at three points in time")

In the meantime, I'm hoping to create a data set that contains 5 million rows and 3 columns to use in the reprex.

corybrunson commented 6 years ago

@cenuno thank you for saying so! I'd be very glad for the large-scale example. The code chunk you shared relies on functions defined in the files stat-flow.r, geom-flow.r, stat-stratum.r, and geom-stratum.r, and possibly indirectly some code in stat-utils.r, geom-utils.r, and lode-guidance-functions.r. (In general, a layer—usually stat_*() or geom_*()—invokes one stat and one geom, and the stats and geoms are roughly paired up in this package.)

cenuno commented 6 years ago

Sweet. I'll start investigating using the vaccinations data set just to get a sense of the workflow. It will probably take a while, but I want this to work with larger data sets, as I'm sure others do too.

universal commented 2 years ago

library(tidyverse)
library(ggalluvial)

set.seed(1)  # make the random status assignment reproducible

i <- 100     # number of individuals (alluvia)
waves <- 10  # number of waves (axes)
alluvial_test <- tibble(
  id = as.numeric(rep(1:i, each = waves)),
  wave = factor(rep(1:waves, i)),
  status = factor(sample(rep(c("A", "B", "C", "D"), each = i * waves / 4)),
                  levels = c("A", "B", "C", "D"))
)

p <- ggplot(data = alluvial_test,
            aes(x = wave, stratum = status, alluvium = id,
                fill = status, label = status))
p +
  geom_flow(stat = "alluvium", lode.guidance = "frontback", color = "darkgray") +
  geom_stratum()

Created on 2021-12-02 by the reprex package (v2.0.1)

Increasing i and waves quickly makes the plot very slow ;-) Anyway, for my purposes it would be enough to group by status and show only the transitions between the groups, not the individual alluvia. I'm currently thinking about how to regroup the data :-) but am drawing a blank...
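One possible way to do that regrouping, sketched in base R: pair each individual's row with their row in the next wave, then count how many individuals make each status-to-status transition. The names `long`, `pairs`, `transitions`, `wave_from`, `from`, `to` and `freq` are made up for this illustration, and it assumes one row per individual per wave. It only produces the aggregated counts; reshaping them into a form ggalluvial accepts is not shown.

```r
set.seed(1)
i <- 100
waves <- 10

# Same structure as the reprex above, built with base R only.
long <- data.frame(
  id     = rep(seq_len(i), each = waves),
  wave   = rep(seq_len(waves), times = i),
  status = factor(sample(rep(c("A", "B", "C", "D"), each = i * waves / 4)),
                  levels = c("A", "B", "C", "D"))
)

# Sort so that each individual's waves sit in consecutive rows.
long <- long[order(long$id, long$wave), ]
n <- nrow(long)

# Keep row pairs (k, k + 1) that belong to the same individual
# and to directly adjacent waves.
keep <- long$id[-n] == long$id[-1] & long$wave[-1] == long$wave[-n] + 1

pairs <- data.frame(
  wave_from = long$wave[-n][keep],    # wave in which the transition starts
  from      = long$status[-n][keep],  # status in that wave
  to        = long$status[-1][keep]   # status in the following wave
)

# Count individuals per (wave_from, from, to) and drop empty combinations.
transitions <- as.data.frame(table(pairs), responseName = "freq")
transitions <- transitions[transitions$freq > 0, ]
head(transitions)
```

With four statuses there are at most 16 transitions per pair of adjacent waves, so the size of `transitions` depends on the number of waves and statuses rather than on the number of individuals.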