Open bogedy opened 2 years ago
Hi. Code looks fine. Can you try single process (itoken() instead of itoken_parallel)?
edit: the parallel one has been going for 2 hours now. Seems broken.
Just ran it. It took about 11 minutes on a single thread. Running the parallel again, more than 20 minutes so far and still going.
I forgot to add: when I run the parallel tokenizer, I get the following warnings every few seconds while it's running:
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 30598 as it does not exist”
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 30731 as it does not exist”
Warning message in parallel::mccollect(jobs = jobs_in_progress, wait = FALSE):
“1 parallel job did not deliver a result”
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 31722 as it does not exist”
Warning message in parallel::mccollect(jobs = jobs_in_progress, wait = FALSE):
“1 parallel job did not deliver a result”
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 31721 as it does not exist”
Warning message in parallel::mccollect(jobs = jobs_in_progress, wait = FALSE):
“1 parallel job did not deliver a result”
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 31732 as it does not exist”
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 31736 as it does not exist”
Warning message in parallel::mccollect(jobs = jobs_in_progress, wait = FALSE):
“1 parallel job did not deliver a result”
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 32259 as it does not exist”
Warning message in parallel::mccollect(jobs = jobs_in_progress, wait = FALSE):
“1 parallel job did not deliver a result”
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 32368 as it does not exist”
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 32441 as it does not exist”
And earlier when I interrupted R early:
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 17021 as it does not exist”
Warning message in selectChildren(jobs, timeout):
“cannot wait for child 17049 as it does not exist”
as(<dgTMatrix>, "dgCMatrix") is deprecated since Matrix 1.5-0; do as(., "CsparseMatrix") instead
This means that workers (the processes that handle chunks of the input data) are dying for some reason and not delivering the results of their jobs. You will need to investigate why this happens.
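To illustrate the failure mode described here (a child process dies and its chunk never produces a result), a minimal sketch in Python might look like the following. It does not use text2vec or R's parallel package; the worker script, the "CRASH" marker, and the `run_chunk` helper are all hypothetical, and it assumes a POSIX system where a child killed by a signal reports a negative return code. Chunks run one at a time here to keep the example small.

```python
import signal
import subprocess
import sys

# Hypothetical worker: counts whitespace-separated tokens in its input, but
# kills itself when the input contains "CRASH" to simulate a dying worker
# (for example, one terminated by the operating system).
WORKER = r"""
import os, signal, sys
text = sys.stdin.read()
if "CRASH" in text:
    os.kill(os.getpid(), signal.SIGKILL)
print(len(text.split()))
"""


def run_chunk(chunk):
    """Process one chunk in a child process.

    Returns (token_count, returncode); token_count is None if the child
    exited abnormally and therefore delivered no result.
    """
    proc = subprocess.run(
        [sys.executable, "-c", WORKER],
        input="\n".join(chunk),
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        return None, proc.returncode
    return int(proc.stdout), 0


# Three toy chunks; the second one triggers the simulated crash.
chunks = [["a b c", "d e"], ["CRASH here"], ["f g"]]
results = [run_chunk(chunk) for chunk in chunks]

# Indices of chunks whose worker did not deliver a result.
failed = [i for i, (count, _) in enumerate(results) if count is None]
print("chunks without a result:", failed)
```

On POSIX, a negative return code of -N means the child was terminated by signal N, so recording the return code for each failed chunk is one way to start narrowing down why a worker disappeared.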
Hi! I came across this package because I have a dataset of ~2 million text sequences (each <500 chars long) and I wanted faster performance than sklearn's TF-IDF vectorizer while I experiment with different configurations. Sklearn's vectorizer is single-threaded and written in Python.
It takes about 5 minutes to vectorize and transform with sklearn in Python:
I can see on top that this is only using a single thread.
With text2vec (I hope I'm using it right! I tried to follow the example at http://text2vec.org/vectorization.html#tf-idf):
I've left it running on an AWS instance. I can see in top that 4 threads are running, but they've been going much, much longer than 5 minutes. I had to kill the process eventually. If I work on a smaller subset of a few thousand articles, it works fine.
Am I missing something? Or do I just lack patience? Thanks for your help.