TommyJones / tidylda

Implements an algorithm for Latent Dirichlet Allocation using style conventions from the [tidyverse](https://style.tidyverse.org/) and [tidymodels](https://tidymodels.github.io/model-implementation-principles/index.html).

Cannot create lexicon with large matrices #47

Closed TommyJones closed 3 years ago

TommyJones commented 3 years ago

When attempting to fit a model on a TCM (term co-occurrence matrix) that is roughly 128k x 128k, I receive the following error:

Error in create_lexicon(Cd_in = Cd_start, Beta_in = beta_initial, dtm_in = dtm,  : 
  SpMat::init(): requested size is too large; suggest to enable ARMA_64BIT_WORD
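For scale, here is a rough sketch of why a matrix of this shape can trip the check. It assumes (this is not stated in the report) that Armadillo was built without ARMA_64BIT_WORD, so its index type is a 32-bit unsigned integer, and that the limit applies to rows times columns. The variable names are illustrative only.

```r
# Assumption: without ARMA_64BIT_WORD, Armadillo's index type (uword)
# is a 32-bit unsigned integer, so the largest representable count is 2^32 - 1.
uword_max <- 2^32 - 1

# Approximate vocabulary size from the report ("~128k").
n_terms <- 128e3

# Total cells (rows x columns) in the square TCM, computed as a double.
n_cells <- n_terms^2

# Does the requested size exceed what a 32-bit index can address?
exceeds_limit <- n_cells > uword_max
exceeds_limit                              # TRUE

# Largest square dimension that stays within the 32-bit limit.
max_square_dim <- floor(sqrt(uword_max))
max_square_dim                             # 65535
```

Under these assumptions, any square sparse matrix with more than 65,535 rows would hit the same limit regardless of how few non-zero entries it holds, which matches the error's suggestion to enable ARMA_64BIT_WORD.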

Code to reproduce this is below:

# load libraries
library(tidyverse)
library(textmineR)
library(tidylda)

# load raw data
sbir <- read_csv("https://data.www.sbir.gov/awarddatapublic/award_data.csv")

colnames(sbir) <- 
  colnames(sbir) %>%
  tolower() %>%
  str_replace_all(" +", "_")

# add unique identifier
sbir$sbir_id <- seq_len(nrow(sbir))

# pull out text columns
sbir_text <- sbir %>%
  select(
    sbir_id,
    award_title,
    abstract
  ) %>%
  mutate(
    award_title = str_conv(award_title, "UTF-8"),
    abstract = str_conv(abstract, "UTF-8")
  )

# Create a TCM with 10-degree skipgrams
# Creating titles and abstracts as separate documents so that titles are handled
# in isolation when constructing skipgrams
sbir_tcm <- CreateTcm(
  doc_vec = c(sbir_text$award_title, sbir_text$abstract),
  skipgram_window = 10, # arbitrary but standard
  stopword_vec = stopwords::stopwords("en"),
  verbose = TRUE
)

# make the TCM symmetric, a step that textmineR should have done
sbir_tcm <- sbir_tcm + t(sbir_tcm)

# vocabulary pruning
sbir_tf <- TermDocFreq(sbir_tcm) %>%
  as_tibble()

vocab_keep <- sbir_tf$term[sbir_tf$doc_freq > 20]

sbir_tcm <- sbir_tcm[vocab_keep, vocab_keep]

# train an LDA model off of it
sbir_embedding <- tidylda(
  data = sbir_tcm,
  k = 100, # arbitrary and arguably should be much bigger
  iterations = 200,
  burnin = 175,
  calc_likelihood = TRUE,
  calc_r2 = TRUE,
  verbose = TRUE
)

TommyJones commented 3 years ago

https://github.com/TommyJones/tidylda/commit/aa9a25c99c66260dbb37ce3ac4624aedda3d5d92