Hi Dmitriy,
This really is a minor issue, but depending on the use case it might improve the quality of results. Currently, the function passed as preprocessor to the iterator (itoken) is not automatically applied to the stopwords passed to create_vocabulary. Experienced users would probably think of applying prep_fun to their stopwords themselves, but less experienced users might stumble over stopwords that are not removed from the results (see the example below). Do you think it would make sense to provide a hint about this somewhere (e.g., in the vignette)?
library(text2vec)
# lowercase the text and strip all digits
prep_fun <- function(x) {gsub("\\d+", "", tolower(x), perl = TRUE)}
it_train = itoken("an irrelevant Name and a chemical compound with irrelevant abbreviation CO2"
,preprocessor = prep_fun
,progressbar = FALSE)
stopwords <- c("an", "Name", "and", "a", "with", "irrelevant", "abbreviation", "CO2")
#prep_fun NOT applied to stopwords
create_vocabulary(it_train, stopwords = stopwords)
#        term term_count doc_count
# 1:     name          1         1  <- undesired
# 2: compound          1         1
# 3: chemical          1         1
# 4:       co          1         1  <- undesired
#prep_fun IS applied to stopwords
create_vocabulary(it_train, stopwords = prep_fun(stopwords))
#        term term_count doc_count
# 1: compound          1         1
# 2: chemical          1         1
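
In case it helps, here is a minimal sketch of what such a hint could show. It uses base R only. The helper name normalize_stopwords is hypothetical (it is not part of text2vec) and simply runs the stopwords through the same preprocessor as the text, then drops duplicates and any entries that end up empty.

```r
# Same preprocessor as in the example above: lowercase and strip digits.
prep_fun <- function(x) gsub("\\d+", "", tolower(x), perl = TRUE)

# Hypothetical helper (not a text2vec function): apply the preprocessor
# to the stopwords so they match the preprocessed tokens.
normalize_stopwords <- function(stopwords, preprocessor) {
  out <- unique(preprocessor(stopwords))
  out[nzchar(out)]  # drop stopwords the preprocessor reduced to ""
}

stopwords <- c("an", "Name", "and", "a", "with", "irrelevant", "abbreviation", "CO2")
sw <- normalize_stopwords(stopwords, prep_fun)
print(sw)
```

The result could then be passed on as create_vocabulary(it_train, stopwords = sw), which should be equivalent to the second call above.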