dselivanov / text2vec

Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
http://text2vec.org

performance comparison with quanteda #23

Closed kbenoit closed 8 years ago

kbenoit commented 8 years ago

Hi - Interesting package, and I agree wholeheartedly with your API and performance qualms with tm(). That's why we started https://github.com/kbenoit/quanteda. My replication of your performance comparisons is very similar to what you reported. Here is quanteda:

> quantedaDtm <- quanteda::dfm(dt[['review']])
Creating a dfm from a character vector ...
   ... lowercasing
   ... tokenizing
   ... indexing documents: 25,000 documents
   ... indexing features: 100,605 feature types
   ... created a 25000 x 100605 sparse dfm
   ... complete. 
Elapsed time: 5.701 seconds.
> print(object.size(quantedaDtm), quote = FALSE, units = "Mb")
47.7 Mb

Note that this does everything in one pass, including lowercasing and tokenisation. There are methods defined for corpus management etc and dfm() methods for those objects as well, but this is the quickest way to go from the input text into a matrix representation.
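For readers following along today: quanteda's API has changed since this thread, and in recent releases `dfm()` expects a `tokens` object rather than a raw character vector. A rough modern equivalent of the one-pass call above might look like this (names and defaults per current quanteda, which may differ from the 2015 behaviour shown in the log):

```r
library(quanteda)

txt <- c(d1 = "Fast vectorization in R.",
         d2 = "Topic modeling and word embeddings.")

# Current releases split the one-pass dfm() call into explicit stages:
toks <- tokens(txt)               # tokenisation (stringi word boundaries by default)
m    <- dfm(toks, tolower = TRUE) # lowercasing + document-feature matrix
dim(m)                            # documents x feature types
```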

dselivanov commented 8 years ago

Thanks for the feedback. This benchmark is based on the old base-R regexp functions. It is now built on top of the stringi package and should be considerably faster. I'm on vacation till this weekend, so I can rerun the benchmark after I return.

dselivanov commented 8 years ago

@kbenoit, see this wiki page. If I missed something and the comparison doesn't look fair, please let me know.

kbenoit commented 8 years ago

Impressive! If you supply the argument what = "fastestword", it nearly halves the unigram tokenisation time, since this splits just on whitespace. The default tokeniser uses stringi's word-boundary functions.
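As a sketch of that suggestion (in current quanteda releases the function is `tokens()` rather than the 2015-era `tokenize()`, so treat this as illustrative):

```r
library(quanteda)

txt <- "Splitting on whitespace only is faster than full word-boundary detection."

toks_default <- tokens(txt)                       # stringi word-boundary rules
toks_fast    <- tokens(txt, what = "fastestword") # split on whitespace only
```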

For ngrams, quanteda is just slower because of the way ngrams are formed. tmlite's impressive performance here is a very good indication that there are big performance improvements to be made in how quanteda handles this.

Overall, dfm() is designed to be a Swiss Army knife for people who want to go from a text or corpus to a matrix in one step. There are also many lower-level functions that can be used to build up a dfm through more manual control. The idea is to build a package for experienced programmers (as you state in the description of the motivations behind tmlite) but also to provide something simple and robust enough that novices can use it too. My students tend to find tm's way of doing things hard to understand.
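The "more manual control" path might be sketched like this (function names per later quanteda releases, so an illustration rather than the exact 2015 API):

```r
library(quanteda)

txt <- c(d1 = "The quick brown fox.",
         d2 = "The lazy dog sleeps.")

# Build up the document-feature matrix step by step instead of one dfm() call:
m <- tokens(txt) |>
  tokens_tolower() |>
  tokens_remove(stopwords("en")) |>
  dfm()

featnames(m)  # surviving feature types after stopword removal
```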


dselivanov commented 8 years ago

Well, tmlite also relies on word-boundary tokenization, so the comparison seems fair in this case =)

kbenoit commented 8 years ago

Fair enough! I will take a (much) closer look soon and see if I can improve the performance of the quanteda code.

Would be very interested in any of your comments on quanteda, since it's aimed exactly (but not exclusively) at the needs you describe in the motivation wiki page.


dselivanov commented 8 years ago

@kbenoit, had I known about quanteda earlier, there is little chance I would have started text2vec =).

One drawback (for me): it is not possible to vectorize documents that don't fit into RAM. (This is a natural R constraint: almost all objects are immutable, so it is hard to grow a data structure efficiently.) Also, with C++ core classes/functions I have much more control over RAM usage and, potentially, shared-memory concurrency algorithms, which I think is very useful. As the vowpal wabbit authors put it:

There are two ways to have a fast learning algorithm: (a) start with a slow algorithm and speed it up, or (b) build an intrinsically fast learning algorithm. This project is about approach (b)...

It seems you are going with approach (a), and I'm going with (b).
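The out-of-core point can be illustrated with text2vec's iterator-based API (function names are from later text2vec releases, and the `reviews/` directory is hypothetical, so this is a sketch rather than the interface as it existed in 2015): documents are streamed from disk in chunks, so only one chunk needs to be in RAM at a time.

```r
library(text2vec)

files <- list.files("reviews/", full.names = TRUE)  # hypothetical corpus on disk

# First pass over the files: build the vocabulary chunk by chunk.
it <- itoken(ifiles(files), preprocessor = tolower, tokenizer = word_tokenizer)
v  <- create_vocabulary(it)

# Second pass: fill the sparse document-term matrix, again streaming.
it  <- itoken(ifiles(files), preprocessor = tolower, tokenizer = word_tokenizer)
dtm <- create_dtm(it, vocab_vectorizer(v))
```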

dselivanov commented 8 years ago

Closing this; benchmarks are provided in the wiki.

kbenoit commented 8 years ago

Worth checking out the latest GitHub master branch (currently 0.8.7-5). I got the following results:

> system.time(quantedaDtm <- dfm(dt[, review], ngrams = 1:2, verbose = FALSE))
   user  system elapsed 
 56.251   2.078  34.831

We just reimplemented ngrams/skipgrams as C++ functions. It would be interesting to compare the tokenisers head to head, without the dfm construction part.
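A head-to-head tokeniser comparison along those lines might look like this (a sketch with synthetic data; function names are per later releases of quanteda and stringi):

```r
library(microbenchmark)

# Synthetic stand-in for a real corpus:
txts <- rep("The quick brown fox jumps over the lazy dog.", 1000)

microbenchmark(
  quanteda = quanteda::tokens(txts),
  stringi  = stringi::stri_split_boundaries(txts, type = "word",
                                            skip_word_none = TRUE),
  times = 10
)
```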