bnosac / BTM

Biterm Topic Modelling for Short Text with R
Apache License 2.0

plot.BTM #4

Closed jwijffels closed 3 years ago

jwijffels commented 5 years ago

Example of a visualisation. Still need to define how to incorporate this in the package.

library(data.table)
library(igraph)
library(ggraph)
library(BTM)
library(udpipe)
data("brussels_reviews_anno", package = "udpipe")

## Select nouns of reviews in Dutch
x <- subset(brussels_reviews_anno, language == "nl")
x <- subset(x, xpos %in% c("NN", "NNP", "NNS"))
x <- x[, c("doc_id", "lemma")]

## Building the model
set.seed(321)
model <- BTM(x, k = 5, alpha = 1, beta = 0.01, iter = 1000, window = 5, trace = 50, background = TRUE)
model
cluster_terminology <- terms(model, type = "tokens")
cluster_biterms <- terms(model, type = "biterm")

## Visualise the topics (define plot.BTM, shown below, before running these calls)
plot(model, topic_nr = 1, title = "Background topic")
plot(model, topic_nr = 2)
plot(model, topic_nr = 3)
plot(model, topic_nr = 4)
plot(model, topic_nr = 5)

plot.BTM <- function(model, 
                     tokens = terms(model, type = "tokens", top_n = 5), 
                     biterms = terms(model, type = "biterm", top_n = 5),
                     topic_nr = 1, 
                     top_n = 25, 
                     title = sprintf("Biterm Topic Model Cluster %s", topic_nr),
                     subtitle = "cooccurrences of most important words", 
                     edge_colour = "orange",
                     node_colour = "darkgreen",
                     node_size = 4,
                     ...){
  stopifnot(requireNamespace("data.table"))
  stopifnot(requireNamespace("ggraph"))
  stopifnot(requireNamespace("igraph"))
  stopifnot(requireNamespace("ggplot2"))
  topic_nr <- as.integer(topic_nr)
  stopifnot(topic_nr > 0 & topic_nr <= length(tokens))

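  ## Keep the biterms of this topic in which at least one term is a top token of the topic
  cooc <- biterms$biterms
  cooc <- cooc[cooc$topic %in% topic_nr & 
                 (cooc$term1 %in% tokens[[topic_nr]]$token | 
                    cooc$term2 %in% tokens[[topic_nr]]$token), ]
  ## Count how often each term pair occurs (namespaced, so data.table need not be attached)
  cooc <- data.table::setDT(cooc)
  cooc <- cooc[, list(freq = .N), by = list(term1, term2)]
  cooc <- data.table::setDF(cooc)

  ## Fix the seed so that the "fr" layout is reproducible
  set.seed(123456789)
  ## Keep the first top_n term pairs in order of first appearance (not sorted by freq)
  wordnetwork <- head(cooc, top_n)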
  wordnetwork <- wordnetwork[, c("term1", "term2", "freq")]
  wordnetwork <- igraph::graph_from_data_frame(wordnetwork)
  g <- ggraph::ggraph(wordnetwork, layout = "fr") +
    ggraph::geom_edge_link(ggplot2::aes(width = freq, edge_alpha = freq), edge_colour = edge_colour) +
    ggraph::geom_node_text(ggplot2::aes(label = name), col = node_colour, size = node_size) +
    ggraph::theme_graph(...) +
    ggplot2::theme(legend.position = "none") +
    ggplot2::labs(title = title, subtitle = subtitle)
  g
}

[image: animation]

manuelbickel commented 5 years ago

A quick comment on this. I would recommend only setting up a function that generates the network data, so that users can define the plotting options themselves. There are so many context-dependent variations for plotting networks that a single function probably cannot cover them all, e.g. scaling vertex size by centrality metrics (betweenness, degree, etc.) or setting thresholds so that only the strongest edges are plotted. This is just a recommendation from my personal experience. It might still make sense to provide a plotting option for a simple standard case, but I think the output of the function should include the network data in addition to the plot.
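A minimal sketch of that idea, assuming the biterms come as a data frame with columns term1, term2 and topic (the structure the plot.BTM code above reads from biterms$biterms). The helper name btm_network_data, its arguments and the toy data are hypothetical and not part of BTM. It returns only edges and nodes, with degree as a simple centrality measure and a min_freq threshold for keeping the strongest edges, and leaves the plotting to the user.

```r
## Hypothetical helper, not part of the BTM package: returns network data only
btm_network_data <- function(biterms, topic_nr = 1, min_freq = 1) {
  stopifnot(is.data.frame(biterms),
            all(c("term1", "term2", "topic") %in% colnames(biterms)))
  x <- biterms[biterms$topic %in% topic_nr, c("term1", "term2")]
  stopifnot(nrow(x) > 0)
  x$term1 <- as.character(x$term1)
  x$term2 <- as.character(x$term2)
  ## Edge weight = number of times a term pair occurs in the topic
  ## (pairs are treated as ordered: (a, b) and (b, a) are counted separately)
  x$freq <- 1L
  edges <- aggregate(freq ~ term1 + term2, data = x, FUN = sum)
  ## Threshold: keep only the strongest edges, strongest first
  edges <- edges[edges$freq >= min_freq, ]
  edges <- edges[order(edges$freq, decreasing = TRUE), ]
  rownames(edges) <- NULL
  ## Degree = number of retained edges that touch a term
  degree <- table(c(edges$term1, edges$term2))
  nodes <- data.frame(name = as.character(names(degree)),
                      degree = as.integer(degree),
                      stringsAsFactors = FALSE)
  list(edges = edges, nodes = nodes)
}

## Made-up biterms, only for illustration
toy_biterms <- data.frame(
  term1 = c("room", "room", "room", "host",  "metro", "host"),
  term2 = c("bed",  "bed",  "host", "metro", "room",  "bed"),
  topic = c(1, 1, 1, 1, 1, 2),
  stringsAsFactors = FALSE
)
net <- btm_network_data(toy_biterms, topic_nr = 1, min_freq = 1)
net$edges
net$nodes

## Optional: hand the data to igraph if it is installed
if (requireNamespace("igraph", quietly = TRUE)) {
  g <- igraph::graph_from_data_frame(net$edges, directed = FALSE, vertices = net$nodes)
}
```

With the network data returned as plain data frames, vertex size can be mapped to the degree column or the threshold raised via min_freq without touching any plotting code.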

jwijffels commented 5 years ago

Thanks for the feedback. I agree, but mainly because I don't want to include the ggraph dependency chain in the package (the functions terms(model, type = "tokens", top_n = 5) and terms(model, type = "biterm", top_n = 5) already give you all the data you need for the visualisation). A separate package with plotting facilities for all kinds of topic models would probably be better. Which packages do you tend to use for visualising text networks?

manuelbickel commented 5 years ago

I also agree that it makes sense to keep it simple. I am not an outstanding expert in network science and simply use igraph (which is wrapped in ggraph). Some of my colleagues use the software solutions Gephi or Cytoscape, which are also interesting options depending on the purpose. So you are probably right with the idea of setting up an additional package while still providing examples here for BTM using that package...
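For handing the network to an external tool such as Gephi or Cytoscape, one option is to write the edge list to a plain CSV file. A minimal sketch with made-up edges follows. The column names Source, Target and Weight are an assumption about what the import dialog of the tool expects and may need adjusting, and the file name is hypothetical.

```r
## Made-up edge list with the columns the plot.BTM example builds (term1, term2, freq)
edges <- data.frame(
  term1 = c("room", "room", "host"),
  term2 = c("bed", "host", "metro"),
  freq  = c(2L, 1L, 1L),
  stringsAsFactors = FALSE
)

## Assumed column names for the import dialog of the external tool
export <- data.frame(Source = edges$term1,
                     Target = edges$term2,
                     Weight = edges$freq,
                     stringsAsFactors = FALSE)

## Write to a temporary file and read it back to check what was written
edge_file <- file.path(tempdir(), "btm_topic_edges.csv")
write.csv(export, edge_file, row.names = FALSE)
check <- read.csv(edge_file, stringsAsFactors = FALSE)
check
```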

jwijffels commented 3 years ago

I've added plot.BTM as part of the textplot package https://cran.r-project.org/package=textplot around April 2020, so closing.