Hi @jwijffels,
Your word2vec package looks great. I wanted to use word2vec to generate word vectors for my LSX package, so I wonder whether you have a plan to support a list of tokens as an input.
As far as I understand, I currently need to convert quanteda's tokens object (a list of token IDs) to character strings to pass it to word2vec(). I think it would be more efficient to feed texts to the C++ functions without writing them to temporary files first. If you like the idea, I think I can contribute.
Hi Kohei. Would love to have that as well. The main obstacle here is rewriting the C++ backend such that the text is passed directly instead of going through a file (https://github.com/bnosac/word2vec/issues/11). But that requires some non-trivial rewriting of the C++ code (at least for me). If you are up to the task, go ahead :) As long as no extra R packages are introduced which increase the dependency chain of this R package, I'd be happy to include changes you provide that don't write text to temporary files before building the model.
Glad that you like the idea. It needs a lot of work, but I think it is worth the effort. We should first try to replace the text files with a list of character vectors; second, try to support a list of integer vectors with a vocabulary vector (like quanteda's tokens object). I think we can do it without adding quanteda to the dependencies.
The following are the key parts that need changes:
1. Read words from std::vector<std::string>.
   https://github.com/bnosac/word2vec/blob/12b015e5c3f4b754a251ad9b6ea536d7ddafeca2/src/word2vec/lib/trainThread.cpp#L88-L112
2. Make the vocabulary object (token ID and frequency) from the list in C++ or R (see the sketch below).
   https://github.com/bnosac/word2vec/blob/12b015e5c3f4b754a251ad9b6ea536d7ddafeca2/src/word2vec/lib/vocabulary.hpp#L34-L52
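For the second step, something like the sketch below could build the vocabulary from a list of tokenised texts (the function name and the (ID, frequency) layout are only illustrative, not the library's actual API):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Sketch: count corpus frequencies over a list of tokenised texts and assign
// sequential IDs, i.e. the (token ID, frequency) records that vocabulary_t
// currently builds while scanning the training file.
std::unordered_map<std::string, std::pair<std::size_t, std::size_t>>
buildVocabulary(const std::vector<std::vector<std::string>> &texts) {
    std::unordered_map<std::string, std::pair<std::size_t, std::size_t>> vocab;
    for (const auto &text : texts) {
        for (const auto &token : text) {
            auto it = vocab.find(token);
            if (it == vocab.end()) {
                // new token: ID is the current vocabulary size, frequency starts at 1
                vocab.emplace(token, std::make_pair(vocab.size(), std::size_t(1)));
            } else {
                ++it->second.second;  // known token: increment the corpus frequency
            }
        }
    }
    return vocab;
}
```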
Yes, that's indeed what's needed. The vocabulary needs to be constructed from that token list and plugged into that Worddata, and the parallel training loop has to somehow receive the words. Go for it! Make it happen 😀 I need this as well to speed up what I had brewed together in the function sentencepiece::BPEembedder.
I started modifying the code to better understand how it works, and noticed that vocabulary_t::vocabulary_t() removes low-frequency words and stopwords. I would prefer to do this in R to keep the C++ code simpler. What do you think?
I think this is just a tiny detail among all the changes which are needed; we can always work around this. In general, I'm in favor of keeping the parameters of word2vec as they are, to avoid removing arguments which might break existing code.
You can always set the min_count argument of word2vec to 0 if you prefer doing this upfront in R, and the default for stopwords is already empty.
I am testing on this branch: https://github.com/koheiw/word2vec/commits/test-nextword. The library is complex because it does a lot of basic things in C++. I know how difficult it is to do tokenization and feature selection in C++...
I also tend to do tokenisation outside this library, and then paste the tokens back together with one specific separator character before running word2vec.
The dev-texts branch works without crashing, at least.
library(quanteda)
#> Package version: 4.0.0
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 6 of 6 threads used.
#> See https://quanteda.io for tutorials and examples.
library(word2vec)
corp <- data_corpus_inaugural %>%
corpus_reshape()
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE)
lis <- as.list(toks)
txt <- stringi::stri_c_list(lis, " ")
mod_lis <- word2vec(lis, dim = 50, iter = 5, min_count = 5,
verbose = TRUE, threads = 4)
predict(mod_lis, c("people", "American"), type = "nearest")
#> $people
#> term1 term2 similarity rank
#> 1 people Nations 0.9756919 1
#> 2 people institutions 0.9723659 2
#> 3 people pursuits 0.9704540 3
#> 4 people parts 0.9704391 4
#> 5 people relations 0.9698635 5
#> 6 people sovereign 0.9691449 6
#> 7 people sections 0.9678864 7
#> 8 people defend 0.9676723 8
#> 9 people against 0.9676545 9
#> 10 people into 0.9676042 10
#>
#> $American
#> term1 term2 similarity rank
#> 1 American righteousness 0.9931667 1
#> 2 American products 0.9915134 2
#> 3 American preservation 0.9904682 3
#> 4 American scientific 0.9904510 4
#> 5 American foundations 0.9897572 5
#> 6 American cultivate 0.9896089 6
#> 7 American industrial 0.9891499 7
#> 8 American amity 0.9889630 8
#> 9 American development 0.9888043 9
#> 10 American prosperity 0.9887399 10
I noticed that the progress bar exceeds 100% on the master branch. Do you know why? I am not sure how it works yet...
> mod_txt <- word2vec(txt, dim = 50, iter = 5, split = c("[ \n]", "\n"), min_count = 5,
+ verbose = TRUE, threads = 4)
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
*************************************************************************|
You got it to work on a tokenlist, magic :) and the embedding similarities look the same - even better :)
> library(udpipe)
> library(word2vec)
> data(brussels_reviews, package = "udpipe")
> x <- subset(brussels_reviews, language == "nl")
> x <- txt_clean_word2vec(x$feedback, ascii = TRUE, alpha = TRUE, tolower = TRUE, trim = TRUE)
> set.seed(123456789)
> model <- word2vec(x = strsplit(x, split = "[[:space:]]+"), type = "cbow", dim = 15, iter = 20, threads = 1)
> embedding <- as.matrix(model)
> embedding <- predict(model, c("bus", "toilet"), type = "embedding")
> lookslike <- predict(model, c("bus", "toilet"), type = "nearest", top_n = 5)
> lookslike
$bus
term1 term2 similarity rank
1 bus auto 0.9914712 1
2 bus voet 0.9914547 2
3 bus parkeren 0.9891535 3
4 bus etc 0.9868931 4
5 bus ben 0.9836714 5
$toilet
term1 term2 similarity rank
1 toilet prive 0.9925081 1
2 toilet werkte 0.9882500 2
3 toilet koelkast 0.9872757 3
4 toilet boven 0.9756840 4
5 toilet verdieping 0.9755659 5
> set.seed(123456789)
> model <- word2vec(x = x, type = "cbow", dim = 15, iter = 20, threads = 1)
> embedding <- as.matrix(model)
> embedding <- predict(model, c("bus", "toilet"), type = "embedding")
> lookslike <- predict(model, c("bus", "toilet"), type = "nearest", top_n = 5)
> lookslike
$bus
term1 term2 similarity rank
1 bus auto 0.9916762 1
2 bus voet 0.9896209 2
3 bus ben 0.9881520 3
4 bus dus 0.9867552 4
5 bus tram 0.9852284 5
$toilet
term1 term2 similarity rank
1 toilet prive 0.9896117 1
2 toilet koelkast 0.9802536 2
3 toilet vertoeven 0.9771889 3
4 toilet werkte 0.9745187 4
5 toilet douche 0.9705700 5
That verbose argument never worked with threads > 1; if I recall correctly, the printing was even the culprit of crashes, as it was not thread-safe. And the embeddings were only 100% reproducible when using threads = 1, probably due to parallel updating of the embedding space. Maybe we should check the examples in RcppProgress at https://github.com/kforner/rcpp_progress/tree/master/inst/examples to see how to keep the progress bar from going above 100%.
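The basic pattern from the RcppProgress examples looks roughly like this (a simplified sketch, not code from this package; do_one_iteration() is a placeholder):

```cpp
// [[Rcpp::depends(RcppProgress)]]
#include <Rcpp.h>
#include <progress.hpp>
#include <progress_bar.hpp>

// [[Rcpp::export]]
void demo_progress(int n, bool display_progress = true) {
    Progress p(n, display_progress);      // n = total number of increments
    for (int i = 0; i < n; i++) {
        if (Progress::check_abort())      // allow the user to interrupt from R
            return;
        // do_one_iteration();            // placeholder for one unit of work
        p.increment();                    // one tick per unit, relative to the declared total
    }
}
```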
Printing from multiple threads is often risky, but I will look into the original code and RcppProgress. So far, there is no performance gain from passing a token list due to the token-to-ID conversion in the library. The next step will be passing a list of token IDs.
I've pushed some continuous integration. Did you manage to make the embeddings reproducible between building the model from a file and from the token list?
> So far, there is no performance gain from passing a token list due to the token-to-ID conversion in the library.
Yes, the performance gain will merely come from removing the file operations.
> The next step will be passing a list of token IDs.
Does this also require modifications of the new C++ internals, or can we do this from R and keep the mapping between tokens and token IDs in R? FYI, for Byte Pair Encoding tokenisation these C++ wrappers can be used: https://cran.r-project.org/package=tokenizers.bpe / https://cran.r-project.org/package=sentencepiece.
My idea is to convert a list of tokens to a list of IDs in this way. If lis and voc are passed to the C++ code, it will be much simpler.
library(udpipe)
data(brussels_reviews_anno, package = "udpipe")
x <- subset(brussels_reviews_anno, language == "fr")
voc <- unique(x$token)
lis <- lapply(split(x$token, x$sentence_id), match, voc)
head(lis)
#> $`2345`
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 7 14 15 16 17 18 19 12 13 20 21 16
#> [26] 17 18 22 23 24 23 13 7 25 26 27 28 29 23 7 10 30 31 32 33 34 7 35 28 36
#> [51] 7 37 38 39 28 40 41 21
#>
#> $`2346`
#> [1] 42 23 43 44 45 46 47 48 49 50 47 51 52 53 54 55
#>
#> $`2347`
#> [1] 56 57 58 59 12 22 12 23 60 55
#>
#> $`2348`
#> [1] 61 58 62 53 63 27 64 28 65 12 34 66 45 67 68 7 69 28 70 53 71 23 45 72 73
#> [26] 29 7 37 55
#>
#> $`2349`
#> [1] 74 58 47 75 76 12 77 23 78 12 79 80 81 82 83 84 28 85 86 87 88 72 89 90 55
#>
#> $`2350`
#> [1] 91 92 93 94 23 95 49 96 58 81 40 46 55
Created on 2023-09-21 with reprex v2.0.2
The method for lists should look like this to support both lists of characters and lists of integers. If we ask users to tokenize their texts beforehand, we can drop word2vec.character().
word2vec.list <- function(x, vocabulary = NULL, ...) {
    v <- unique(unlist(x, use.names = FALSE))
    if (is.character(v)) {
        if (!is.null(vocabulary))
            stop("vocabulary is not used when x is a list of characters")
        x <- lapply(x, match, v) # fastmatch::fmatch is faster
    } else if (is.numeric(v)) {
        if (is.null(vocabulary) || min(v) < 0 || length(vocabulary) < max(v)) # 0 will be ignored in C++
            stop("vocabulary does not match the token indices")
        v <- vocabulary
    }
    model <- w2v_train(x, vocabulary = v, ...)
    return(model)
}
# udpipe
x <- subset(brussels_reviews_anno, language == "fr")
lis <- split(x$token, x$sentence_id)
word2vec(lis)
# quanteda
toks <- tokens(corpus(data_corpus_inaugural))
word2vec(toks, types(toks))
OK for me if you want to create the vocabulary upfront and pass it on to the C++ layer; it does simplify the construction on the C++ side for word2vec.list. But I do prefer to also keep the file-based approach next to it, as I have quite some processes which now use that, and I prefer a high level of backward compatibility when possible.
The most important part of the library, as far as changes to the input format are concerned, is this block. The code is executed by individual threads for the texts between range.first and range.second. Within the loop, sentence is constructed as a vector of pairs, each recording a token ID and its frequency in the corpus. With this sentence object, the word vectors are trained with the skip-gram or CBOW model.
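For a pre-tokenised input, the equivalent could look roughly like the sketch below (makeSentence() and the vocabulary layout are illustrative assumptions, not the library's actual code):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Sketch: turn one tokenised text into a "sentence", i.e. a vector of
// (token ID, corpus frequency) pairs, dropping out-of-vocabulary tokens.
std::vector<std::pair<std::size_t, std::size_t>>
makeSentence(const std::vector<std::string> &tokens,
             const std::unordered_map<std::string,
                                      std::pair<std::size_t, std::size_t>> &vocab) {
    std::vector<std::pair<std::size_t, std::size_t>> sentence;
    sentence.reserve(tokens.size());
    for (const auto &token : tokens) {
        auto it = vocab.find(token);
        if (it != vocab.end())
            sentence.push_back(it->second);   // (token ID, frequency)
    }
    return sentence;
}
```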
You may be surprised how simple the core of the library is. All the other functions and objects are for file access, multi-threading, tokenization, serialization, and feature selection. It is an impressive piece of work, but over-complicated. If we remove file access, do tokenization, serialization, and feature selection in R (or with other tools), and implement multi-threading using Intel TBB (RcppParallel), we can make a compact package that we can understand and maintain more easily. We could even enhance it.
If we use TBB, we only need to wrap the parallel code in tbb::parallel_for.
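For example, a rough sketch along these lines (trainSentence() and the texts container are placeholder names, not the library's actual code):

```cpp
#include <cstddef>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// Placeholder for the per-sentence skip-gram/CBOW update.
inline void trainSentence(const std::vector<int> &tokenIds) {
    (void)tokenIds;  // the actual gradient updates would go here
}

// Sketch: let TBB split the texts into chunks and train each chunk in parallel,
// instead of having each thread read its own byte range of the input file.
void trainAll(const std::vector<std::vector<int>> &texts) {
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, texts.size()),
        [&](const tbb::blocked_range<std::size_t> &range) {
            for (std::size_t i = range.begin(); i != range.end(); ++i)
                trainSentence(texts[i]);
        });
}
```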
I understand that backward compatibility is important, but it should be possible to produce near-identical results with a new library if tokenization is performed in the same way.
I have to give some courses on text analytics with R this week. I'll look into the code the week after that, so that we can integrate it and test whether I can make the embedding matrix completely reproducible with a toy problem where tokenisation is based only on a single space and sentences are split on a dot.
Once there is 100% reproducibility, we can certainly flesh out more of the library.
It's been a while since I looked into the details of the implementation. I thought the library implemented multithreading by reading in parallel from the file, and if I understand you correctly, you would prefer to use RcppParallel instead of multithreaded reading from the file. Your end goal would have as a consequence that building the model from a file-based dump of Wikipedia will no longer be possible; all texts will need to be loaded into R somehow? Or do you envision another iterator-style implementation similar to text2vec?
It is true that a file-based corpus is more memory-efficient than an in-memory corpus. Yet I have been trying to make tokens objects more efficient by keeping the data in C++ as much as possible (avoiding copying large objects between C++ and R) using the XPtr object. I would develop a file-mapped tokens object in the future if memory usage needs to be even lower. Intel TBB (via RcppParallel) can be used for multi-threading here too.
We may not be able to train the algorithm on a Wikipedia dump if it needs to be kept in memory, but I doubt the usefulness of such models in applied research. My approach has been to train word vectors on a subject-specific corpus and use them to perform higher-level analysis (e.g. domain-specific sentiment analysis).
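For the XPtr approach mentioned above, a minimal sketch of keeping the token IDs on the C++ side behind an external pointer could look like this (a hypothetical function, not quanteda's actual implementation):

```cpp
#include <Rcpp.h>
#include <vector>

// Sketch: copy a list of integer token vectors into native memory once and hand
// R only an external pointer, so later C++ calls can reuse the data without
// converting or copying it again.
// [[Rcpp::export]]
SEXP tokens_as_xptr(Rcpp::List texts) {
    auto *store = new std::vector<std::vector<int>>();
    store->reserve(texts.size());
    for (R_xlen_t i = 0; i < texts.size(); ++i) {
        Rcpp::IntegerVector ids(texts[i]);
        store->emplace_back(ids.begin(), ids.end());
    }
    // 'true' registers a finalizer so the vector is freed when R drops the pointer
    return Rcpp::XPtr<std::vector<std::vector<int>>>(store, true);
}
```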
I've had a look at the code changes, and if we incorporate the same logic on the vocabulary for the list-based and the file-based approach, as suggested in https://github.com/koheiw/word2vec/pull/2, I'm fine with the changes, as it gives the exact same embeddings for both approaches. We just need to update the documentation and that's it. We can then, in a later step, do all kinds of optimisations, if that is fine for you?
I've improved the documentation in that pull request. For me this is fine as is; the embeddings are the same for the two approaches with that additional pull request.
If I just remove train.R in the tests folder to avoid adding packages to Suggests, I think this is good to go to CRAN.
@koheiw I'll integrate this pull request, unless you have any remarks. https://github.com/bnosac/word2vec/pull/18
@jwijffels thanks for preparing a PR with nice examples. I thought the dev-texts branch needed more work, such as cleaning up old code, testing with different inputs, and fixing the progress bar. If you find the current version sufficiently clean and stable, you can merge it into the master branch.
I found bpe_encode() interesting, but its integer IDs are converted to characters. Also, I understand why you restore </s>, but it is strange to have a word vector for the tag in the output. We can ignore these problems for now, but we must address them in the next upgrade. Ultimately we need to abandon the built-in tokenizer to fix them.
We should also consider adding proper CI tests for the functions in the package.
> I thought the dev-texts branch needed more work, such as cleaning up old code, testing with different inputs, and fixing the progress bar. If you find the current version sufficiently clean and stable, you can merge it into the master branch.
I think the most important proof is that the embeddings are 100% the same for the file-based and the list-based approach, which is the case. I also checked that, without applying the vocabulary ordering indicated in the remarks of https://github.com/bnosac/word2vec/pull/18, the embeddings were still the same as with version 0.3.4 of the package. So that is OK.
I completely agree there should be more unit tests on the different settings, though. But I wonder what they should look like; eventually these numbers are only useful in further downstream processing.
> I found bpe_encode() interesting, but its integer IDs are converted to characters. Also, I understand why you restore </s>, but it is strange to have a word vector for the tag in the output. We can ignore these problems for now, but we must address them in the next upgrade. Ultimately we need to abandon the built-in tokenizer to fix them.
Yes, I know; I ignored these for now. Changing them would make the changes too complex to test at this point. Better to do it in separate steps.
Thanks again for all the work 👍
The unit tests would ensure that the word vectors from this version and the next version are the same, that methods like as.matrix() work as expected, that word2vec() returns word vectors of the expected dimensions, etc.
I can develop a version for a list of token IDs in the coming months. How quickly do you want to make these changes? It would be nice to have shared mid- to long-term milestones.
I've uploaded the current status of the package to CRAN. Hopefully it will land there without problems. I've created two more issues, one for more unit tests and one for discussions on future improvements.
Regarding the pace of change: I'm happy to incorporate any elements you see as improvements. I don't have any specific timing in mind for that. In general, the only requirements I have are to keep the interfaces as stable as possible for backward compatibility, avoid breaking changes, and limit R package dependencies so that someone else cannot break the package installation.
Changes are on CRAN now. Many thanks again. Looking forward to seeing further improvements. Closing this issue for now.