tfidf fitting much slower than expected

bogedy commented 2 years ago

Hi! I came across this package because I have a dataset of ~2 million text sequences (each <500 chars long), and I wanted faster performance than sklearn's TfidfVectorizer while I experiment with different configurations. Sklearn's vectorizer is single-threaded and written in Python.

It takes about 5 minutes to fit and transform with sklearn in Python:

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(sublinear_tf=True,
                        norm='l2',
                        encoding='latin-1', ngram_range=(1, 2),
                        stop_words=None)

%time X = tfidf.fit_transform(dataset.text)

CPU times: user 4min 27s, sys: 14.1 s, total: 4min 41s
Wall time: 4min 49s

I can see in top that this is only using a single thread.

With text2vec (I hope I'm using it right! I tried to follow the example at http://text2vec.org/vectorization.html#tf-idf):

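library(data.table)  # fread(), setkey()
library(text2vec)    # itoken_parallel(), create_vocabulary(), TfIdf, ...

dt = fread('dataset.csv.tar.gz')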

setkey(dt, id)

prep_fun = tolower
tok_fun = word_tokenizer

my_iterator = itoken_parallel(dt$text,
                  preprocessor = prep_fun,
                  tokenizer = tok_fun,
                  ids = dt$id,
                  progressbar = TRUE)

t10 = Sys.time()
vocab = create_vocabulary(my_iterator, ngram=c(1L, 2L))
vectorizer = vocab_vectorizer(vocab)
dtm_train = create_dtm(my_iterator, vectorizer)

# define tfidf model
tfidf = TfIdf$new(norm = 'l2', sublinear_tf = TRUE)
# fit model to train data and transform train data with fitted model
dtm_train_tfidf = fit_transform(dtm_train, tfidf)
# tfidf modified by fit_transform() call!

paste('Time to build tfidf:', difftime(Sys.time(), t10, units = 'sec'))

I left it running on an AWS instance. I can see in top that 4 threads are running, but they have been going for much longer than 5 minutes, and I eventually had to kill the process. On a smaller subset of a few thousand articles it works fine.

Am I missing something? Or do I just lack patience? Thanks for your help.

dselivanov commented 2 years ago

Hi. Code looks fine. Can you try a single process (itoken() instead of itoken_parallel())?

bogedy commented 2 years ago

edit: the parallel one has been going for 2 hours now. Seems broken.

Just ran it. It took about 11 minutes on a single thread. Running the parallel again, more than 20 minutes so far and still going.

I forgot to add: when I run the parallel tokenizer, I get the following warnings every few seconds while it's running:

Warning message in selectChildren(jobs, timeout):
“cannot wait for child 30598 as it does not exist”
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 30731 as it does not exist”
Warning message in parallel::mccollect(jobs = jobs_in_progress, wait = FALSE):
“1 parallel job did not deliver a result”
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 31722 as it does not exist”
Warning message in parallel::mccollect(jobs = jobs_in_progress, wait = FALSE):
“1 parallel job did not deliver a result”
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 31721 as it does not exist”
Warning message in parallel::mccollect(jobs = jobs_in_progress, wait = FALSE):
“1 parallel job did not deliver a result”
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 31732 as it does not exist”
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 31736 as it does not exist”
Warning message in parallel::mccollect(jobs = jobs_in_progress, wait = FALSE):
“1 parallel job did not deliver a result”
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 32259 as it does not exist”
Warning message in parallel::mccollect(jobs = jobs_in_progress, wait = FALSE):
“1 parallel job did not deliver a result”
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 32368 as it does not exist”
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 32441 as it does not exist”

And earlier when I interrupted R early:

Warning message in selectChildren(jobs, timeout):
“cannot wait for child 17021 as it does not exist”
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 17049 as it does not exist”
as(<dgTMatrix>, "dgCMatrix") is deprecated since Matrix 1.5-0; do as(., "CsparseMatrix") instead

dselivanov commented 2 years ago

This means the workers (the processes that handle chunks of the input data) are dying for some reason and not delivering the results of their jobs. You will need to investigate why that happens.
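To illustrate the failure mode described above, here is a minimal Python sketch (an analogy only, not text2vec's internals) of a parent that hands chunks to child processes and records which children exit without delivering a result. The worker script, the "CRASH" marker, and the function name run_chunks are all hypothetical; the marker simply stands in for whatever is killing the real workers. The sketch runs one child at a time, so it is a way to narrow down which chunk fails rather than a parallel implementation.

```python
import subprocess
import sys

# Hypothetical worker script: counts whitespace-separated tokens in the chunk
# it reads from stdin. A chunk containing the marker "CRASH" makes it exit
# abruptly, standing in for a worker that dies before reporting back.
WORKER_SOURCE = """
import os, sys
text = sys.stdin.read()
if "CRASH" in text:
    os._exit(1)
print(len(text.lower().split()))
"""


def run_chunks(chunks):
    """Run each chunk in its own child process; return (results, lost)."""
    results = {}  # chunk index -> token count delivered by the child
    lost = []     # (chunk index, exit code) for children that delivered nothing
    for index, chunk in enumerate(chunks):
        proc = subprocess.run(
            [sys.executable, "-c", WORKER_SOURCE],
            input=chunk,
            capture_output=True,
            text=True,
        )
        if proc.returncode != 0:
            # The child died: keep its exit code instead of silently dropping it.
            lost.append((index, proc.returncode))
        else:
            results[index] = int(proc.stdout)
    return results, lost


chunks = ["The quick brown fox", "CRASH on this chunk", "jumps over the lazy dog"]
results, lost = run_chunks(chunks)
print("delivered:", results)
print("lost:", lost)
```

Recording the index and exit code of each lost chunk, rather than only noticing that a job did not deliver a result, gives a starting point for the investigation: the failing chunks can be rerun on their own, and the exit code hints at how the child ended (on POSIX systems, subprocess reports a child killed by a signal with a negative return code).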

dselivanov/text2vec, issue #335