dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

Just a thought: Include a notice that prep_fun is not applied to stopwords? #228

Closed manuelbickel closed 6 years ago

manuelbickel commented 6 years ago

Hi Dmitriy, this really is a minor issue, but depending on the use case it might improve the quality of results. Currently, the function passed as preprocessor to the iterator is not automatically applied to the stopwords passed to create_vocabulary. Experienced users would probably think of applying prep_fun to their stopwords themselves, but less experienced users might stumble over stopwords that are not removed from the results (see the example below). Do you think it would make sense to provide a hint about this somewhere (e.g., in the vignette)?

library(text2vec)

# Preprocessor: lowercase the text and strip digits
prep_fun <- function(x) gsub("\\d+", "", tolower(x), perl = TRUE)
it_train <- itoken("an irrelevant Name and a chemical compound with irrelevant abbreviation CO2",
                   preprocessor = prep_fun,
                   progressbar = FALSE)
stopwords <- c("an", "Name", "and", "a", "with", "irrelevant", "abbreviation", "CO2")

# prep_fun NOT applied to stopwords
create_vocabulary(it_train, stopwords = stopwords)
#        term term_count doc_count
# 1:     name          1         1  <- undesired
# 2: compound          1         1
# 3: chemical          1         1
# 4:       co          1         1  <- undesired

# prep_fun IS applied to stopwords
create_vocabulary(it_train, stopwords = prep_fun(stopwords))
#        term term_count doc_count
# 1: compound          1         1
# 2: chemical          1         1
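
Until something like this is documented, one possible workaround is a small wrapper that runs the stopwords through the same preprocessor before they reach create_vocabulary. The sketch below is only an illustration: prep_stopwords is a hypothetical helper, not part of text2vec, and it assumes the preprocessor is vectorized over a character vector (as prep_fun above is). It also drops stopwords that become empty after preprocessing and removes duplicates. It does not account for a custom tokenizer, which could change tokens further.

```r
# Same preprocessor as in the example above: lowercase and strip digits.
prep_fun <- function(x) gsub("\\d+", "", tolower(x), perl = TRUE)

# Hypothetical helper (NOT part of text2vec): apply the document
# preprocessor to the stopwords, then drop empty strings and duplicates.
prep_stopwords <- function(stopwords, preprocessor) {
  processed <- preprocessor(stopwords)
  unique(processed[nzchar(processed)])
}

stopwords <- c("an", "Name", "and", "a", "with", "irrelevant", "abbreviation", "CO2")

# Stopwords that the preprocessor changes; these are the ones that would
# silently fail to match if passed to create_vocabulary unprocessed.
changed <- stopwords[stopwords != prep_fun(stopwords)]

clean_stopwords <- prep_stopwords(stopwords, prep_fun)

# The text2vec part only runs if the package is installed.
if (requireNamespace("text2vec", quietly = TRUE)) {
  it_train <- text2vec::itoken(
    "an irrelevant Name and a chemical compound with irrelevant abbreviation CO2",
    preprocessor = prep_fun,
    progressbar = FALSE
  )
  vocab <- text2vec::create_vocabulary(it_train, stopwords = clean_stopwords)
  print(vocab)
}
```

Inspecting changed before building the vocabulary gives a quick check of whether the stopword list and the preprocessor disagree at all.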