PolMine / duplicates


speed up polmineR:::.character_ngrams() #1

Closed ablaette closed 1 year ago

ablaette commented 1 year ago

Dropping characters this way is unnecessarily slow. What about stringr::str_remove() with fixed()?

.character_ngrams <- function(x, n, char){
  if (char[1] != ""){
    splitted <- unlist(strsplit(x, ""))
    splitted <- ifelse(splitted %in% char, splitted, NA)
    x <- paste(splitted[which(!is.na(splitted))], sep = "", collapse = "")
  }
  ngrams <- stringi::stri_sub(x, from = 1L:(nchar(x) - n + 1L), to = n:nchar(x))
  dt <- data.table(ngram = ngrams)[, .N, by = "ngram"]
  setnames(dt, old = "N", new = "count")
  dt
}
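
One possible way to cut the overhead, sketched here in base R only: filter the split characters by logical subsetting instead of the ifelse()/NA/which() round trip. The function name `character_ngrams_sketch`, the default `char = ""`, and the plain data.frame return value are assumptions made for illustration; this is not polmineR code and it has not been benchmarked against the original. The sketch also returns an empty table when fewer than `n` characters remain, a case the sequence `1L:(nchar(x) - n + 1L)` above does not guard against.

```r
# Hypothetical variant of .character_ngrams(); name, default for `char`
# and data.frame return type are illustrative assumptions.
character_ngrams_sketch <- function(x, n, char = "") {
  chars <- strsplit(x, "")[[1]]
  # Keep only the characters listed in `char` via logical subsetting.
  if (char[1] != "") chars <- chars[chars %in% char]
  x <- paste(chars, collapse = "")
  n_chars <- nchar(x)
  # Fewer than n characters left: no n-gram can be formed.
  if (n_chars < n) {
    return(data.frame(ngram = character(0), count = integer(0)))
  }
  # Vectorized extraction of all overlapping n-grams.
  starts <- seq_len(n_chars - n + 1L)
  ngrams <- substring(x, starts, starts + n - 1L)
  # Count occurrences of each distinct n-gram.
  counts <- table(ngrams)
  data.frame(
    ngram = names(counts),
    count = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

# Usage: bigrams of "a-b-a-b", keeping lowercase letters only.
character_ngrams_sketch("a-b-a-b", n = 2L, char = letters)
```
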

ablaette commented 1 year ago

I will not pursue this further. The efficient solution is to get rid of unwanted characters at the lexicon stage, which I implemented in the latest dev version of polmineR.
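
To illustrate why cleaning at the lexicon stage can pay off, here is a minimal base R sketch. It assumes "lexicon" means the vector of unique token types, so each type is cleaned once and the result is mapped back to every occurrence. The helper `clean_lexicon`, its `keep` argument, and the sample tokens are hypothetical; the actual implementation in the polmineR dev version is not shown in this thread and may differ.

```r
# Hypothetical helper: drop every character not listed in `keep`
# from each entry of a lexicon (a vector of unique token types).
clean_lexicon <- function(lexicon, keep = c(letters, LETTERS)) {
  vapply(
    strsplit(lexicon, ""),
    function(chars) paste(chars[chars %in% keep], collapse = ""),
    character(1),
    USE.NAMES = FALSE
  )
}

# Made-up token stream with repeated types.
tokens <- c("Hello,", "world!", "Hello,", "again.")

# Clean each unique type once ...
lexicon <- unique(tokens)
cleaned <- clean_lexicon(lexicon)

# ... then map every occurrence to its cleaned form by lookup.
tokens_clean <- cleaned[match(tokens, lexicon)]
```

With many repeated tokens, the character filtering runs once per type rather than once per occurrence, and the per-occurrence step reduces to a lookup.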