PolMine / biglda

Tools for fast LDA topic modelling for big corpora

Use the more efficient output of `$printDenseDocumentTopics()` #12

Closed ablaette closed 1 year ago

ablaette commented 1 year ago

The `save_document_topics()` function uses `$printDocumentTopics()`, but `$printDenseDocumentTopics()` is much, much more efficient! The following snippet writes both outputs to disk for comparison:

Sys.setenv(MALLET_DIR="/opt/mallet/Mallet-202108")
library(biglda)

library(polmineR)
use("polmineR")
speeches <- polmineR::as.speeches("GERMAPARLMINI", s_attribute_name = "speaker", s_attribute_date = "date")
instance_list <- as.instance_list(speeches)

BTM <- BigTopicModel(n_topics = 25L, alpha_sum = 5.1, beta = 0.1)
BTM$addInstances(instance_list)
BTM$estimate()

file <- rJava::.jnew("java/io/File", path.expand("~/Lab/tmp/dense.tsv"))
file_writer <- rJava::.jnew("java/io/FileWriter", file)
print_writer <- rJava::new(rJava::J("java/io/PrintWriter"), file_writer)
BTM$printDenseDocumentTopics(print_writer)
print_writer$close()

file <- rJava::.jnew("java/io/File", path.expand("~/Lab/tmp/notdense.tsv"))
file_writer <- rJava::.jnew("java/io/FileWriter", file)
print_writer <- rJava::new(rJava::J("java/io/PrintWriter"), file_writer)
BTM$printDocumentTopics(print_writer)
print_writer$close()

a <- data.table::fread("~/Lab/tmp/dense.tsv")
b <- data.table::fread("~/Lab/tmp/notdense.tsv")

ablaette commented 1 year ago

`save_document_topics()` now uses `$printDenseDocumentTopics()`, which is indeed more efficient.
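
For readers wondering why the dense output is easier to consume on the R side, here is a minimal sketch on toy data. It does not call MALLET at all. The two column layouts are assumptions about what the two methods write (dense: one proportion per topic in topic order; non-dense: topic/proportion pairs sorted by proportion), and `read_dense()` and `read_pairs()` are hypothetical helpers, not part of biglda.

```r
# Toy data: 2 documents, 3 topics. Both layouts are ASSUMPTIONS about the
# MALLET output format, not taken from this thread.

# Assumed dense layout: doc index, doc name, one proportion per topic.
dense_lines <- c(
  "0\tdoc_a\t0.7\t0.2\t0.1",
  "1\tdoc_b\t0.1\t0.3\t0.6"
)

# Assumed non-dense layout: doc index, doc name, then (topic, proportion)
# pairs sorted by decreasing proportion.
pair_lines <- c(
  "0\tdoc_a\t0\t0.7\t1\t0.2\t2\t0.1",
  "1\tdoc_b\t2\t0.6\t1\t0.3\t0\t0.1"
)

# Hypothetical helper: the dense layout maps directly onto a matrix.
read_dense <- function(lines) {
  df <- read.delim(text = lines, header = FALSE, stringsAsFactors = FALSE)
  m <- as.matrix(df[, -(1:2)])
  dimnames(m) <- list(df[[2]], NULL)
  m
}

# Hypothetical helper: the pair layout has to be re-sorted row by row.
read_pairs <- function(lines, n_topics) {
  df <- read.delim(text = lines, header = FALSE, stringsAsFactors = FALSE)
  vals <- as.matrix(df[, -(1:2)])
  topic_cols <- seq(1, ncol(vals), by = 2)
  m <- matrix(0, nrow = nrow(df), ncol = n_topics,
              dimnames = list(df[[2]], NULL))
  for (i in seq_len(nrow(df))) {
    topics <- vals[i, topic_cols] + 1   # topic ids assumed to be 0-based
    m[i, topics] <- vals[i, topic_cols + 1]
  }
  m
}

dense_matrix <- read_dense(dense_lines)
pair_matrix <- read_pairs(pair_lines, n_topics = 3L)

# Both routes should yield the same document-topic matrix.
stopifnot(isTRUE(all.equal(dense_matrix, pair_matrix)))
```

If these layout assumptions hold, the dense file needs no per-row reshuffling after reading, which is one plausible reason it is cheaper to process.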