Hi @jwijffels,
Your word2vec package looks great. I wanted to use word2vec to generate word vectors for my LSX package, so I wonder whether you have a plan to support a list of tokens as an input.
As far as I understand, I currently need to convert quanteda's tokens object (a list of token IDs) to character strings to pass it to word2vec(). I think it would be more efficient to feed texts to the C++ functions without writing them to temporary files first. If you like the idea, I think I can contribute.
Hi Kohei. Would love to have that as well. The main obstacle here is rewriting the C++ backend such that the text is passed directly instead of going through a file (https://github.com/bnosac/word2vec/issues/11). But that requires some non-trivial rewriting of the C++ code (at least for me). If you are up to the task, go ahead :) As long as no extra R packages are introduced which increase the dependency chain of this R package, I'd be happy to include changes you provide that don't write text to temporary files before building the model.
Glad that you like the idea. It needs a lot of work, but I think it is worth the effort. We should first try to replace the text files with a list of character vectors; second, try to support a list of integer vectors with a vocabulary vector (like quanteda's tokens object). I think we can do it without adding quanteda to the dependencies.
The following are the key parts that need changes:
1. Read words from std::vector<std::string>.
   https://github.com/bnosac/word2vec/blob/12b015e5c3f4b754a251ad9b6ea536d7ddafeca2/src/word2vec/lib/trainThread.cpp#L88-L112
2. Make the vocabulary object (token ID and frequency) from the list in C++ or R (see the sketch below).
   https://github.com/bnosac/word2vec/blob/12b015e5c3f4b754a251ad9b6ea536d7ddafeca2/src/word2vec/lib/vocabulary.hpp#L34-L52
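For the second step, something like the sketch below could build the vocabulary from a list of tokenised texts (the function name and the (ID, frequency) layout are only illustrative, not the library's actual API):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Sketch: count corpus frequencies over a list of tokenised texts and assign
// sequential IDs, i.e. the (token ID, frequency) records that vocabulary_t
// currently builds while scanning the training file.
std::unordered_map<std::string, std::pair<std::size_t, std::size_t>>
buildVocabulary(const std::vector<std::vector<std::string>> &texts) {
    std::unordered_map<std::string, std::pair<std::size_t, std::size_t>> vocab;
    for (const auto &text : texts) {
        for (const auto &token : text) {
            auto it = vocab.find(token);
            if (it == vocab.end()) {
                // new token: ID is the current vocabulary size, frequency starts at 1
                vocab.emplace(token, std::make_pair(vocab.size(), std::size_t(1)));
            } else {
                ++it->second.second;  // known token: increment the corpus frequency
            }
        }
    }
    return vocab;
}
```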
Yes, that's indeed what's needed. The vocabulary needs to be constructed from that token list and plugged into that Worddata, and the parallel training loop has to somehow receive the words. Go for it! Make it happen 😀 I need this as well to speed up what I had brewed together in the function sentencepiece::BPEembedder.
I started modifying the code to better understand how it works, and noticed that vocabulary_t::vocabulary_t() removes low-frequency words and stopwords. I would prefer to do this in R to keep the C++ code simpler. What do you think?
I think this is just a tiny detail among all the changes which are needed; we can always work around this. In general, I'm in favor of keeping the parameters of word2vec as they are, to avoid removing arguments which might break existing code.
You can always set the min_count argument of word2vec to 0 if you prefer doing this upfront in R, and the default for stopwords is already empty.
I am testing on this branch: https://github.com/koheiw/word2vec/commits/test-nextword. The library is complex because it does a lot of basic things in C++. I know how difficult it is to do tokenization and feature selection in C++...
I also tend to do tokenisation outside this library, and then paste the tokens back together with one specific separator character before running word2vec.
The dev-texts branch works without crashing, at least.
library(quanteda)
#> Package version: 4.0.0
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 6 of 6 threads used.
#> See https://quanteda.io for tutorials and examples.
library(word2vec)
corp <- data_corpus_inaugural %>%
corpus_reshape()
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE)
lis <- as.list(toks)
txt <- stringi::stri_c_list(lis, " ")
mod_lis <- word2vec(lis, dim = 50, iter = 5, min_count = 5,
verbose = TRUE, threads = 4)
predict(mod_lis, c("people", "American"), type = "nearest")
#> $people
#> term1 term2 similarity rank
#> 1 people Nations 0.9756919 1
#> 2 people institutions 0.9723659 2
#> 3 people pursuits 0.9704540 3
#> 4 people parts 0.9704391 4
#> 5 people relations 0.9698635 5
#> 6 people sovereign 0.9691449 6
#> 7 people sections 0.9678864 7
#> 8 people defend 0.9676723 8
#> 9 people against 0.9676545 9
#> 10 people into 0.9676042 10
#>
#> $American
#> term1 term2 similarity rank
#> 1 American righteousness 0.9931667 1
#> 2 American products 0.9915134 2
#> 3 American preservation 0.9904682 3
#> 4 American scientific 0.9904510 4
#> 5 American foundations 0.9897572 5
#> 6 American cultivate 0.9896089 6
#> 7 American industrial 0.9891499 7
#> 8 American amity 0.9889630 8
#> 9 American development 0.9888043 9
#> 10 American prosperity 0.9887399 10
I noticed that the progress bar exceeds 100% on the master branch. Do you know why? I am not sure how it works yet...
> mod_txt <- word2vec(txt, dim = 50, iter = 5, split = c("[ \n]", "\n"), min_count = 5,
+ verbose = TRUE, threads = 4)
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
*************************************************************************|
You got it to work on a tokenlist, magic :) and the embedding similarities look the same - even better :)
> library(udpipe)
> library(word2vec)
> data(brussels_reviews, package = "udpipe")
> x <- subset(brussels_reviews, language == "nl")
> x <- txt_clean_word2vec(x$feedback, ascii = TRUE, alpha = TRUE, tolower = TRUE, trim = TRUE)
> set.seed(123456789)
> model <- word2vec(x = strsplit(x, split = "[[:space:]]+"), type = "cbow", dim = 15, iter = 20, threads = 1)
> embedding <- as.matrix(model)
> embedding <- predict(model, c("bus", "toilet"), type = "embedding")
> lookslike <- predict(model, c("bus", "toilet"), type = "nearest", top_n = 5)
> lookslike
$bus
term1 term2 similarity rank
1 bus auto 0.9914712 1
2 bus voet 0.9914547 2
3 bus parkeren 0.9891535 3
4 bus etc 0.9868931 4
5 bus ben 0.9836714 5
$toilet
term1 term2 similarity rank
1 toilet prive 0.9925081 1
2 toilet werkte 0.9882500 2
3 toilet koelkast 0.9872757 3
4 toilet boven 0.9756840 4
5 toilet verdieping 0.9755659 5
> set.seed(123456789)
> model <- word2vec(x = x, type = "cbow", dim = 15, iter = 20, threads = 1)
> embedding <- as.matrix(model)
> embedding <- predict(model, c("bus", "toilet"), type = "embedding")
> lookslike <- predict(model, c("bus", "toilet"), type = "nearest", top_n = 5)
> lookslike
$bus
term1 term2 similarity rank
1 bus auto 0.9916762 1
2 bus voet 0.9896209 2
3 bus ben 0.9881520 3
4 bus dus 0.9867552 4
5 bus tram 0.9852284 5
$toilet
term1 term2 similarity rank
1 toilet prive 0.9896117 1
2 toilet koelkast 0.9802536 2
3 toilet vertoeven 0.9771889 3
4 toilet werkte 0.9745187 4
5 toilet douche 0.9705700 5
That verbose argument never worked with threads > 1; if I recall correctly, the printing was even the culprit of crashes, as it was not thread-safe. And the embeddings were only 100% reproducible when using threads = 1, probably due to parallel updating of the embedding space. Maybe we should check the examples in RcppProgress at https://github.com/kforner/rcpp_progress/tree/master/inst/examples to see how to keep the progress bar from going above 100%.
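The basic pattern from the RcppProgress examples looks roughly like this (a simplified sketch, not code from this package; do_one_iteration() is a placeholder):

```cpp
// [[Rcpp::depends(RcppProgress)]]
#include <Rcpp.h>
#include <progress.hpp>
#include <progress_bar.hpp>

// [[Rcpp::export]]
void demo_progress(int n, bool display_progress = true) {
    Progress p(n, display_progress);      // n = total number of increments
    for (int i = 0; i < n; i++) {
        if (Progress::check_abort())      // allow the user to interrupt from R
            return;
        // do_one_iteration();            // placeholder for one unit of work
        p.increment();                    // one tick per unit, relative to the declared total
    }
}
```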
Printing from multiple threads is often risky, but I will look into the original code and RcppProgress. So far, there is no performance gain from passing a token list due to the token-to-ID conversion in the library. The next step will be passing a list of token IDs.
I've pushed some continuous integration. Did you manage to make the embeddings reproducible between building the model from a file and from the token list?
> So far, there is no performance gain from passing a token list due to the token-to-ID conversion in the library.
Yes, the performance gain will merely come from removing the file operations.
> The next step will be passing a list of token IDs.
Does this also require modifications of the new C++ internals, or can we do this from R and keep the mapping between tokens and token IDs in R? FYI, for Byte Pair Encoding tokenisation these C++ wrappers can be used: https://cran.r-project.org/package=tokenizers.bpe / https://cran.r-project.org/package=sentencepiece.
My idea is to convert a list of tokens to a list of IDs in this way. If lis and voc are passed to the C++ code, it will be much simpler.
library(udpipe)
data(brussels_reviews_anno, package = "udpipe")
x <- subset(brussels_reviews_anno, language == "fr")
voc <- unique(x$token)
lis <- lapply(split(x$token, x$sentence_id), match, voc)
head(lis)
#> $`2345`
#> [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 7 14 15 16 17 18 19 12 13 20 21 16
#> [26] 17 18 22 23 24 23 13 7 25 26 27 28 29 23 7 10 30 31 32 33 34 7 35 28 36
#> [51] 7 37 38 39 28 40 41 21
#>
#> $`2346`
#> [1] 42 23 43 44 45 46 47 48 49 50 47 51 52 53 54 55
#>
#> $`2347`
#> [1] 56 57 58 59 12 22 12 23 60 55
#>
#> $`2348`
#> [1] 61 58 62 53 63 27 64 28 65 12 34 66 45 67 68 7 69 28 70 53 71 23 45 72 73
#> [26] 29 7 37 55
#>
#> $`2349`
#> [1] 74 58 47 75 76 12 77 23 78 12 79 80 81 82 83 84 28 85 86 87 88 72 89 90 55
#>
#> $`2350`
#> [1] 91 92 93 94 23 95 49 96 58 81 40 46 55
Created on 2023-09-21 with reprex v2.0.2
The method for lists should look like this to support both lists of characters and lists of integers. If we ask users to tokenize their texts beforehand, we can drop word2vec.character().
word2vec.list <- function(x, vocabulary = NULL, ...) {
    v <- unique(unlist(x, use.names = FALSE))
    if (is.character(v)) {
        if (!is.null(vocabulary))
            stop("vocabulary is not used when x is a list of characters")
        x <- lapply(x, match, v) # fastmatch::fmatch is faster
    } else if (is.numeric(v)) {
        if (is.null(vocabulary) || min(v) < 0 || length(vocabulary) < max(v)) # 0 will be ignored in C++
            stop("vocabulary does not match the token indices")
        v <- vocabulary
    }
    model <- w2v_train(x, vocabulary = v, ...)
    return(model)
}
# udpipe
x <- subset(brussels_reviews_anno, language == "fr")
lis <- split(x$token, x$sentence_id)
word2vec(lis)
# quanteda
toks <- tokens(corpus(data_corpus_inaugural))
word2vec(toks, types(toks))
OK for me if you want to create the vocabulary upfront and pass it on to the C++ layer; it does simplify the construction on the C++ side for word2vec.list. But I do prefer to also keep the file-based approach next to it, as I have quite some processes which now use that, and I prefer a high level of backward compatibility when possible.
The most important part of the library, as far as changes to the input format are concerned, is this block. The code is executed by individual threads for the texts between range.first and range.second. Within the loop, sentence is constructed as a vector of pairs, each recording a token ID and its frequency in the corpus. With this sentence object, the word vectors are trained with the skip-gram or CBOW model.
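For a pre-tokenised input, the equivalent could look roughly like the sketch below (makeSentence() and the vocabulary layout are illustrative assumptions, not the library's actual code):

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Sketch: turn one tokenised text into a "sentence", i.e. a vector of
// (token ID, corpus frequency) pairs, dropping out-of-vocabulary tokens.
std::vector<std::pair<std::size_t, std::size_t>>
makeSentence(const std::vector<std::string> &tokens,
             const std::unordered_map<std::string,
                                      std::pair<std::size_t, std::size_t>> &vocab) {
    std::vector<std::pair<std::size_t, std::size_t>> sentence;
    sentence.reserve(tokens.size());
    for (const auto &token : tokens) {
        auto it = vocab.find(token);
        if (it != vocab.end())
            sentence.push_back(it->second);   // (token ID, frequency)
    }
    return sentence;
}
```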
You may be surprised how simple the core of the library is. All the other functions and objects are for file access, multi-threading, tokenization, serialization, and feature selection. It is an impressive piece of work, but over-complicated. If we remove file access, do tokenization, serialization, and feature selection in R (or with other tools), and implement multi-threading using Intel TBB (RcppParallel), we can make a compact package that we can understand and maintain more easily. We could even enhance it.
If we use TBB, we only need to wrap the parallel code in tbb::parallel_for.
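For example, a rough sketch along these lines (trainSentence() and the texts container are placeholder names, not the library's actual code):

```cpp
#include <cstddef>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// Placeholder for the per-sentence skip-gram/CBOW update.
inline void trainSentence(const std::vector<int> &tokenIds) {
    (void)tokenIds;  // the actual gradient updates would go here
}

// Sketch: let TBB split the texts into chunks and train each chunk in parallel,
// instead of having each thread read its own byte range of the input file.
void trainAll(const std::vector<std::vector<int>> &texts) {
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, texts.size()),
        [&](const tbb::blocked_range<std::size_t> &range) {
            for (std::size_t i = range.begin(); i != range.end(); ++i)
                trainSentence(texts[i]);
        });
}
```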
I understand that backward compatibility is important, but it should be possible to produce near-identical results with a new library if tokenization is performed in the same way.
I have to give some courses on text analytics with R this week. I'll look into the code the week after that, so that we can integrate it and test whether I can make the embedding matrix completely reproducible with a toy problem where tokenisation is based only on a single space and sentences are split on a dot.
Once there is 100% reproducibility, we can certainly flesh out more of the library.
It's been a while since I looked into the details of the implementation. I thought the library implemented multithreading by reading in parallel from the file, and if I understand you correctly, you would prefer to use RcppParallel instead of multithreaded reading from the file. Your end goal would have as a consequence that building the model from a file-based dump of Wikipedia will no longer be possible; all texts will need to be loaded into R somehow? Or do you envision another iterator-style implementation similar to text2vec?
It is true that a file-based corpus is more memory-efficient than an in-memory corpus. Yet I have been trying to make tokens objects more efficient by keeping the data in C++ as much as possible (avoiding copying large objects between C++ and R) using the XPtr object. I would develop a file-mapped tokens object in the future if memory usage needs to be even lower. Intel TBB (via RcppParallel) can be used for multi-threading here too.
We may not be able to train the algorithm on a Wikipedia dump if it needs to be kept in memory, but I doubt the usefulness of such models in applied research. My approach has been to train word vectors on a subject-specific corpus and use them to perform higher-level analysis (e.g. domain-specific sentiment analysis).
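For the XPtr approach mentioned above, a minimal sketch of keeping the token IDs on the C++ side behind an external pointer could look like this (a hypothetical function, not quanteda's actual implementation):

```cpp
#include <Rcpp.h>
#include <vector>

// Sketch: copy a list of integer token vectors into native memory once and hand
// R only an external pointer, so later C++ calls can reuse the data without
// converting or copying it again.
// [[Rcpp::export]]
SEXP tokens_as_xptr(Rcpp::List texts) {
    auto *store = new std::vector<std::vector<int>>();
    store->reserve(texts.size());
    for (R_xlen_t i = 0; i < texts.size(); ++i) {
        Rcpp::IntegerVector ids(texts[i]);
        store->emplace_back(ids.begin(), ids.end());
    }
    // 'true' registers a finalizer so the vector is freed when R drops the pointer
    return Rcpp::XPtr<std::vector<std::vector<int>>>(store, true);
}
```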
I've had a look at the code changes, and if we incorporate the same logic on the vocabulary for the list-based and the file-based approach, as suggested in https://github.com/koheiw/word2vec/pull/2, I'm fine with the changes, as it gives the exact same embeddings for both approaches. We just need to update the documentation and that's it. We can then, in a later step, do all kinds of optimisations, if that is fine for you?
I've improved the documentation in that pull request. For me this is fine as is; the embeddings are the same for the two approaches with that additional pull request.
If I just remove train.R in the tests folder to avoid adding packages to Suggests, I think this is good to go to CRAN.
@koheiw I'll integrate this pull request, unless you have any remarks. https://github.com/bnosac/word2vec/pull/18
@jwijffels thanks for preparing a PR with nice examples. I thought the dev-texts branch needed more work, such as cleaning up old code, testing with different inputs, and fixing the progress bar. If you find the current version sufficiently clean and stable, you can merge it into the master branch.
I found bpe_encode() interesting, but its integer IDs are converted to characters. Also, I understand why you restore </s>, but it is strange to have a word vector for the tag in the output. We can ignore these problems for now, but we must address them in the next upgrade. Ultimately we need to abandon the built-in tokenizer to fix them.
We should also consider adding proper CI tests for the functions in the package.
> I thought the dev-texts branch needed more work, such as cleaning up old code, testing with different inputs, and fixing the progress bar. If you find the current version sufficiently clean and stable, you can merge it into the master branch.
I think the most important proof is that the embeddings are 100% the same for the file-based and the list-based approach, which is the case. I also checked that, without applying the vocabulary ordering indicated in the remarks of https://github.com/bnosac/word2vec/pull/18, the embeddings were still the same as with version 0.3.4 of the package. So that is OK.
I completely agree there should be more unit tests on the different settings, though. But I wonder what they should look like; eventually these numbers are only useful in further downstream processing.
> I found bpe_encode() interesting, but its integer IDs are converted to characters. Also, I understand why you restore </s>, but it is strange to have a word vector for the tag in the output. We can ignore these problems for now, but we must address them in the next upgrade. Ultimately we need to abandon the built-in tokenizer to fix them.
Yes, I know; I ignored these for now. Changing them would make the changes too complex to test at this point. Better to do it in separate steps.
Thanks again for all the work 👍
The unit tests would ensure that the word vectors from this version and the next version are the same, that methods like as.matrix() work as expected, that word2vec() returns word vectors of the expected dimensions, etc.
I can develop a version for a list of token IDs in the coming months. How quickly do you want to make these changes? It would be nice to have shared mid- to long-term milestones.
I've uploaded the current status of the package to CRAN. Hopefully it will land there without problems. I've created two more issues, one for more unit tests and one for discussions on future improvements.
Regarding the pace of change: I'm happy to incorporate any elements you see as improvements. I don't have any specific timing in mind for that. In general, the only requirements I have are to keep the interfaces as stable as possible for backward compatibility, avoid breaking changes, and limit R package dependencies so that someone else cannot break the package installation.
Changes are on CRAN now. Many thanks again. Looking forward to seeing further improvements. Closing this issue for now.