corybrunson / ggalluvial

ggplot2 extension for alluvial plots
http://corybrunson.github.io/ggalluvial/
GNU General Public License v3.0
489 stars 35 forks source link

Attempt to order rectangles while minimizing crossing of flows #79

Open Alex7722 opened 3 years ago

Alex7722 commented 3 years ago

A first attempt to order rectangles while minimizing crossing of flows:

Data used to build the function can be found here

minimize_crossing <- function(alluv_dt = alluv_dt, stratum = new_Id_com, alluvium = Id, x = Window){

  #' This function 
  #' 
  #' @alluv_dt
  #' The dt used for the alluvial
  #' @stratum
  #' Stratum column
  #' @alluvium
  #' Alluvium column
  #' @x
  #' x column
  require(tidyverse)
  require(data.table)
  require(ggalluvial)
  require(tidygraph)
  require(ggplot2)
  require(forcats)

  new_Id_com <- deparse(substitute(stratum))
  Id <- deparse(substitute(alluvium))
  Window <- deparse(substitute(x))

  dt<-alluv_dt[order(Id,Window)][,.(new_Id_com, Id, Window)]

  dt[,tot_window_leiden:=.N,.(Window,new_Id_com)]

  dt[,Source:=new_Id_com,Id]
  dt[,Target:=shift(new_Id_com),Id]

  dt <- dt %>% rename(tot_window_leiden_Source = tot_window_leiden)
  dt[,tot_window_leiden_Target:=shift(tot_window_leiden_Source),Id]

  dt <- dt[Source > Target, c("Target", "Source") := list(Source, Target)] # exchanging
  dt <- dt[Source > Target, c("tot_window_leiden_Target", "tot_window_leiden_Source") := list(tot_window_leiden_Source, tot_window_leiden_Target)] # exchanging

  dt[,link_strength:=.N,.(Source,Target,Window)]

  dt <- dt[is.na(Target)==FALSE & Source!=Target]

  dt[,cosine_strength:=link_strength/sqrt(tot_window_leiden_Target*tot_window_leiden_Source)]
  dt[,max_cosine_strength:=max(cosine_strength),.(Source,Target)]

  dt<-dt[,.N,.(Source,Target,max_cosine_strength)][order(-N)]

  #Make the dt for naming
  edges_meta<-dt
  edges_meta[,Source:=as.character(Source)]
  edges_meta[,Target:=as.character(Target)]
  edges_meta[,from:=Source]
  edges_meta[,to:=Target]
  edges_meta[,weight:=max_cosine_strength]

  nodes_meta <-alluv_dt[,.N,new_Id_com]
  nodes_meta <-nodes_meta[,new_Id_com:=as.character(new_Id_com)]
  nodes_meta <-nodes_meta[,Id:=new_Id_com]

  tbl_meta<-tbl_graph(nodes = nodes_meta, edges = edges_meta, directed = FALSE, node_key = "new_Id_com")
  components <- tbl_meta %>% 
    activate(nodes) %>% 
    mutate(components_att = group_components(type = "weak")) %>% 
    as.data.table()

  components[,size_compo:=.N,components_att][order(-N)]
  components <- components[size_compo==1,components_att:=0]
  setnames(components, "components_att", paste0("components_att_","0"))
  components <- components[,.(new_Id_com, "components_att_0"= get("components_att_0"))]

  for (links_to_remove in unique(dt[order(max_cosine_strength)]$max_cosine_strength)) {
    dt<-dt[max_cosine_strength>links_to_remove]
    edges_meta<-dt
    edges_meta[,Source:=as.character(Source)]
    edges_meta[,Target:=as.character(Target)]
    edges_meta[,from:=Source]
    edges_meta[,to:=Target]
    edges_meta[,weight:=max_cosine_strength]

    nodes_meta <-alluv_dt[,.N,new_Id_com]
    nodes_meta <-nodes_meta[,new_Id_com:=as.character(new_Id_com)]
    nodes_meta <-nodes_meta[,Id:=new_Id_com]

    tbl_meta<-tbl_graph(nodes = nodes_meta, edges = edges_meta, directed = FALSE, node_key = "Id")

    components2 <- tbl_meta %>% 
      activate(nodes) %>% 
      mutate(components_att = group_components(type = "weak")) %>% 
      as.data.table()

    components2[,size_compo:=.N,components_att][order(-N)]
    components2 <- components2[size_compo==1,components_att:=0]
    name <- paste0("components_att_", links_to_remove)
    setnames(components2, "components_att", name)
    components2 <- components2[,.(new_Id_com, get(name))]
    setnames(components2, "V2", name)

    components <- merge(components, components2, all.x = TRUE, all.y = TRUE, by= "new_Id_com")

  }

  columns_to_paste <- names(components)
  columns_to_paste <- columns_to_paste[columns_to_paste != "new_Id_com"] 

  community_order <- components %>% unite(order, c(columns_to_paste), sep = " ", remove = FALSE)
  community_order <- community_order[,.(new_Id_com,order)][order(order)]

  alluv_dt_meta <-merge(alluv_dt,community_order, by="new_Id_com", all.x = TRUE)

  alluv_dt_meta$new_Id_com <- fct_reorder(alluv_dt_meta$new_Id_com, alluv_dt_meta$order,min, .desc = TRUE)
  return(alluv_dt_meta)
}
corybrunson commented 3 years ago

@Alex7722 thank you! As i mentioned by email, i'll come back to this when i have more bandwidth, but it would be fantastic to finally address this need in the package itself.

I wonder if you could edit the issue into a formal reproducible example and include the generation of the plot you shared by email? Reproducible examples can be relatively easily prepared using the reprex package, and this vignette offers some guidance. (Since the data set is large, it's OK with me if you include a readRDS() call rather than use dput() to reconstruct it from scratch.) If this turns out to be too time-consuming, then i can put one together from your function code when i come back to this issue.

corybrunson commented 3 years ago

Note: This would address #6.

corybrunson commented 3 years ago

I may not come back to this for another month or two, but i've come across two more points of reference: This IEEE paper formalizes the optimization problem, and the CRAN Task View on Optimization lists several packages that might be used to implement it.

ajwilk commented 3 years ago

Hi @corybrunson, do you have an update on this by chance? I saw that you posted an article on this a couple weeks ago, but I wasn't able to reproduce the behavior in that article with the development version or the optimization branch. I can share examples but figured I'd check and see if these changes were even pushed yet. Thanks!!

corybrunson commented 3 years ago

Hi @ajwilk, sorry, i haven't gotten started on this. The vignette you linked to has actually been around for a while and does not contain any optimization of this kind. Once something is implemented, though, i'll definitely include an example of it in the vignette! (Note: If the method from this issue is used for optimization, then the optimization branch will be rendered obsolete.)

Alex7722 commented 1 year ago

My and a coauthor just released a beta version of our package for dynamic network analysis. We cleaned up the function to minimize the crossing of flows in a way that is very flexible with some examples: https://github.com/agoutsmedt/networkflow/blob/master/R/minimize_crossing_alluvial.R

It seems to work well even for complex alluvials, however, right now it is for a particular kind of alluvial. An example: image

corybrunson commented 1 year ago

@Alex7722 nifty! Could you provide a side-by-side comparison of plots prepared using versus not using your function? That would provide solid justification for me to make {networkflow} an optional dependency for the next release. (Smaller examples would be included in documentation, but a large example might be suitable for a vignette.)

Alex7722 commented 1 year ago

Here is an example using as.factor(dt$stratum) vs the function: image

image

On the student example, the difference is not as great, but a little bit better (before/after): image image

As you can see, the function only orders stratums a particular way, but not alluvium inside stratum. That why the student example does not change much. The function works bests when there are a lot of stratums that have strong relationships with a few other stratums + the student example is a bit lucky in its randomness.

In the way we use it, it is better than nothing and good enough, but it might not be for different uses.

corybrunson commented 1 year ago

It is great to be able to describe the cases for which the technique works best. Do you mind my adapting the larger example into a vignette? Would you prefer to? And when do you expect networkflow to be on CRAN? (I don't think it can be included as a dependency until then.)

Alex7722 commented 1 year ago

We should be on CRAN in the coming months, but no specific timeline yet. However the big drawback of our function is that it's also dependent on quite a few packages, and most notably the latest development version of data.table(). Also, the function parameters are really suited for our package as it is really a middle-step function in a larger workflow for network analysis and its name doesn't make a lot of sense. Finally, the alluvial needs to (1) be structured as a long format and (2) be a data.table.

I made a minimal example by changing the student data with a bad-case scenario where "Sculpure" (btw a small typo there) go into "Digital Art" :

data(majors)
majors <- majors %>% as.data.table()
majors[(student == 6 | student == 9 | student == 12) & curriculum !="Sculpure", curriculum := "Digital Art"] 
majors$curriculum <- as.factor(majors$curriculum)

ggplot(majors,
       aes(x = semester, stratum = curriculum, alluvium = student,
           fill = curriculum, label = curriculum)) +
  scale_fill_brewer(type = "qual", palette = "Set2") +
  # geom_text(label = student) +
  geom_flow(stat = "alluvium", lode.guidance = "frontback",
            color = "darkgray") +
  geom_stratum() +
  theme(legend.position = "bottom") +
  ggtitle("student curricula across several semesters")

image

data(majors)
majors <- majors %>% as.data.table()
majors[(student == 6 | student == 9 | student == 12) & curriculum !="Sculpure", curriculum := "Digital Art"] 
majors <- networkflow::minimize_crossing_alluvial(majors, intertemporal_cluster_column = "curriculum", node_key = "student", window_column = "semester")
majors$curriculum <- forcats::fct_reorder(majors$curriculum, majors$minimize_crossing_order,min, .desc = TRUE)

ggplot(majors,
       aes(x = semester, stratum = curriculum, alluvium = student,
           fill = curriculum, label = curriculum)) +
  scale_fill_brewer(type = "qual", palette = "Set2") +
  # geom_text(label = student) +
  geom_flow(stat = "alluvium", lode.guidance = "frontback",
            color = "darkgray") +
  geom_stratum() +
  theme(legend.position = "bottom") +
  ggtitle("student curricula across several semesters")

image

Its even possible to play with the Iode.guidance (backfront) to get results close to optimum, but it would be better if a function also ordered the alluvium in an optimal way inside the stratums.

image

corybrunson commented 1 year ago

Ah, that's an interesting caveat. For comparison, {ggalluvial} doesn't Import: {ggrepel}, but that package is used in a vignette, so it's listed under Suggests: instead. I don't know whether it would make more sense to suggest a dependency-heavy package like {networkflow} or to simply leave it out but refer to it in the README. At least it's mentioned here. What do you think? When {networkflow} is on CRAN, we can revisit this.

Alex7722 commented 1 year ago

I will let you know once we are on CRAN.