OscarKjell / text

Using Transformers from HuggingFace in R
https://r-text.org

textTopic has fatal error #173

Closed scm1210 closed 2 months ago

scm1210 commented 6 months ago

I'm trying to run textTopics to create a topic model in R 4.3.3 with the CRAN version of text (v. 1.2), using the following code:

topics_result <- textTopics(
  data = textData,
  variable_name = "Text",
  embedding_model = "miniLM",
  umap_model = "default",
  hdbscan_model = "default",
  vectorizer_model = "default",
  representation_model = "mmr",
  num_top_words = 10,
  n_gram_range = c(1, 3),
  stopwords = "english",
  min_df = 5,
  bm25_weighting = FALSE,
  reduce_frequent_words = TRUE,
  set_seed = 8,
  save_dir = "~/study1_results"  # save results
)

However, whenever I run the script, I hit a fatal error that crashes R, and the session is aborted without any explanatory message (for example, there is no "vector memory exhausted" error). The last output printed in the console is:

2024-04-17 10:59:23,493 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
OMP: Info #276: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.

Does anyone have insights into how to possibly resolve this issue?
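
The log above stops while BERTopic is fitting the dimensionality reduction step, right after an OpenMP message. One low-cost diagnostic, offered here as an assumption and not as a confirmed fix for this issue, is to limit native threading before Python is initialised and then rerun the failing call. `OMP_NUM_THREADS` is the standard OpenMP setting and `NUMBA_NUM_THREADS` is the equivalent for numba; whether either one affects this crash is untested here.

```r
# Diagnostic sketch (assumption, not a confirmed fix): restrict native
# threading, then rerun the failing textTopics() call.
# Run this in a fresh R session, before Python is initialised.
Sys.setenv(
  OMP_NUM_THREADS = "1",    # standard OpenMP thread limit
  NUMBA_NUM_THREADS = "1"   # numba thread limit
)

# Confirm the values that child libraries will see.
thread_settings <- Sys.getenv(c("OMP_NUM_THREADS", "NUMBA_NUM_THREADS"))
print(thread_settings)
```

If the crash disappears with a single thread, that points toward a threading conflict between native libraries; if it persists, the settings can simply be dropped.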

scm1210 commented 6 months ago

This seems to be the root of the issue:

*** caught segfault ***

 *** caught segfault ***
address 0x600, cause 'memory not mapped'
address 0x600, cause 'memory not mapped'

 *** caught segfault ***

 *** caught segfault ***
address 0x600, cause 'memory not mapped'
address 0x600, cause 'memory not mapped'

 *** caught segfault ***

 *** caught segfault ***
address 0x600, cause 'memory not mapped'
address 0x600, cause 'memory not mapped'

OscarKjell commented 6 months ago

Thanks for the feedback.

Are you able to run the test within the package? Please try running:

  # Load and prepare data
  data1 <- Language_based_assessment_data_8[c("satisfactiontexts", "swlstotal")]
  colnames(data1) <- c("text", "score")

  data2 <- Language_based_assessment_data_8[c("harmonytexts", "hilstotal")]
  colnames(data2) <- c("text", "score")

  data3 <- Language_based_assessment_data_3_100[1:2]
  colnames(data3) <- c("text", "score")

  data <- dplyr::bind_rows(data1, data2, data3)

  # Create a BERTopic model trained on data["text"]; see help(textTopics)
  bert_model <- textTopics(data = data,
                           variable_name = "text",
                           embedding_model = "distilroberta",
                           min_df = 2,
                           set_seed = 8,
                           save_dir = "./results")

scm1210 commented 6 months ago

Running this code produces the same result: a fatal error, and the R session is aborted.

*** caught segfault ***

 *** caught segfault ***

 *** caught segfault ***
address 0x600, cause 'memory not mapped'
address 0x600, cause 'memory not mapped'

 *** caught segfault ***
address 0x600, cause 'memory not mapped'

 *** caught segfault ***
address 0x600, cause 'memory not mapped'

 *** caught segfault ***
address 0x600, cause 'memory not mapped'
address 0x600, cause 'memory not mapped'

Traceback:
 1: py_call_impl(callable, call_args$unnamed, call_args$named)
 2: create_bertopic_model(data = data, data_var = variable_name,     embedding_model = embedding_model, umap_model = umap_model,     hdbscan_model = hdbscan_model, vectorizer_model = vectorizer_model,     representation_model = representation_model, top_n_words = num_top_words,     n_gram_range = n_gram_range, min_df = min_df, bm25_weighting = bm25_weighting,     reduce_frequent_words = reduce_frequent_words, stop_words = stopwords,     seed = set_seed, save_dir = save_dir)
 3: 
Traceback:

Traceback:

Traceback:

Traceback:

OscarKjell commented 6 months ago

After installing the GitHub version, have you rerun:

textrpp_install()
textrpp_initialize(save_profile = TRUE)
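
After reinstalling, it can help to confirm which Python environment R is actually bound to and whether the relevant Python modules import. The following is a minimal sketch using reticulate; the module list (`bertopic`, `umap`, `hdbscan`) is an assumption about what matters for this call, not something stated in the thread.

```r
# Check which Python environment reticulate uses and whether the Python
# modules assumed to be needed for topic modelling can be imported.
# Assumption: the module names below are illustrative; adjust as needed.
modules <- c("bertopic", "umap", "hdbscan")

# Fallback result used when reticulate or Python is unavailable.
not_checked <- stats::setNames(rep(NA, length(modules)), modules)

available <- if (requireNamespace("reticulate", quietly = TRUE)) {
  tryCatch({
    print(reticulate::py_config())  # shows the Python binary and version
    vapply(modules, reticulate::py_module_available, logical(1))
  }, error = function(e) not_checked)
} else {
  not_checked
}

print(available)
```

A `FALSE` or `NA` entry suggests the environment is incomplete or that R is pointing at a different Python installation than the one that was set up.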

scm1210 commented 6 months ago

Yes, I did.

I also did a hard uninstall of my conda environment and found a file permission issue on my end (not sure how that happened), so I addressed it using this.

I just ran

data1 <- Language_based_assessment_data_8[c("satisfactiontexts", "swlstotal")]
colnames(data1) <- c("text", "score")

data2 <- Language_based_assessment_data_8[c("harmonytexts", "hilstotal")]
colnames(data2) <- c("text", "score")

data3 <- Language_based_assessment_data_3_100[1:2]
colnames(data3) <- c("text", "score")

data <- dplyr::bind_rows(data1, data2, data3)

# Create a BERTopic model trained on data["text"]; see help(textTopics)
bert_model <- textTopics(data = data,
                         variable_name = "text",
                         embedding_model = "distilroberta",
                         min_df = 2,
                         set_seed = 8,
                         save_dir = "./results")

again, and I am still getting the fatal error.

OscarKjell commented 6 months ago

Hmm, which OS are you using?

The example above is tested on several operating systems; you can find the specifics here: https://github.com/OscarKjell/text/actions
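
To answer the OS question in a reproducible way, the environment details can be gathered in one go. This is a minimal sketch using only base R; the helper `pkg_version()` and the list name `env_report` are hypothetical names introduced for illustration.

```r
# Collect basic environment details to paste into a bug report.
# pkg_version() is a hypothetical helper: it returns NA when a package
# is not installed, so the report never fails on a missing package.
pkg_version <- function(pkg) {
  tryCatch(as.character(utils::packageVersion(pkg)),
           error = function(e) NA_character_)
}

sys <- Sys.info()

env_report <- list(
  os = sys[["sysname"]],
  os_release = sys[["release"]],
  machine = sys[["machine"]],       # CPU architecture, e.g. for Apple silicon vs Intel
  r_version = R.version.string,
  text_version = pkg_version("text"),
  reticulate_version = pkg_version("reticulate")
)

str(env_report)
```

The output of `sessionInfo()` gives a fuller picture and can be pasted alongside this summary.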