dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

Questions about itoken_parallel for Windows. #322

Closed y-he2 closed 4 years ago

y-he2 commented 4 years ago

However, I'm still wondering what the main reason was that itoken_parallel was not recommended on Windows.

I saw you mention somewhere that with this kind of parallelism, the performance of "create_dtm" and similar tasks should be identical to the sequential version.

However, I noticed that while vectorizing huge datasets (more than 10M records) the CPU usage was only around 40%, so I suspect that an embarrassingly parallel approach would speed it up considerably.

dselivanov commented 4 years ago

It is hard to finely control the memory footprint, and we had many bug reports in edge cases. Handling those takes quite some time to maintain, which I can't afford. But nobody stops you from implementing such functionality yourself.
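
For readers who want to try the do-it-yourself route, here is a minimal sketch of one way it might look. It is not how itoken_parallel works internally and not a recommendation from the maintainer. It assumes a PSOCK cluster from the base `parallel` package (which also runs on Windows) and uses `hash_vectorizer` so that workers do not need a shared vocabulary. The toy `docs` vector, the `n_workers` value, and the helper `vectorize_chunk` are made up for illustration.

```r
library(text2vec)
library(parallel)
library(Matrix)  # provides rbind for sparse matrices

# Toy corpus (illustrative); in practice this is the large character vector.
docs <- c("the quick brown fox", "jumps over the lazy dog",
          "text2vec builds sparse matrices", "parallel workers split the corpus")
names(docs) <- paste0("doc", seq_along(docs))

n_workers <- 2L  # assumed worker count, tune to your machine

# Split the documents into contiguous chunks, one per worker.
idx <- splitIndices(length(docs), n_workers)
chunks <- lapply(idx, function(i) docs[i])

# Hypothetical helper: builds a DTM for one chunk inside a worker process.
vectorize_chunk <- function(chunk) {
  it <- text2vec::itoken(chunk,
                         preprocessor = tolower,
                         tokenizer = text2vec::word_tokenizer,
                         ids = names(chunk),
                         progressbar = FALSE)
  # Feature hashing maps tokens to the same columns in every worker,
  # so no vocabulary has to be shared between processes.
  text2vec::create_dtm(it, text2vec::hash_vectorizer(hash_size = 2^18))
}

cl <- makeCluster(n_workers)  # PSOCK cluster: separate R processes
dtm_parts <- parLapply(cl, chunks, vectorize_chunk)
stopCluster(cl)

# Stack the per-chunk matrices back into a single sparse DTM.
dtm <- do.call(rbind, unname(dtm_parts))
```

Note that each worker receives its own copy of its chunk and returns a full sparse matrix to the main process, so peak memory is higher than in the sequential version. That is in line with the memory-footprint concern mentioned above, and whether the speedup is worth it depends on the dataset and machine.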