Closed: jwijffels closed this issue 4 years ago.
Hi, I'm writing an R wrapper around sentencepiece and tried to load a few of the models and vocabularies provided here, but I found some inconsistencies. To make sure this is not related to the version of SentencePiece you used, could you let me know with which version / commit of SentencePiece the models were built? I'm building the R wrapper against sentencepiece release v0.1.84 from Oct 12, 2019.
Hi, I moved institutions and unfortunately cannot check this anymore. All I can say is that the version was from around the end of 2017. However, I'm not aware of any compatibility-breaking changes in sentencepiece and am using BPEmb with recent versions. What kind of inconsistencies did you find?
I'm using the R wrapper around sentencepiece that I created myself at https://github.com/bnosac/sentencepiece, with the following code to encode and decode some text. It looks like the model encodes correctly when using subwords (https://github.com/bnosac/sentencepiece/blob/master/src/rcpp_sentencepiece.cpp#L50) but not when using ids (implemented here: https://github.com/bnosac/sentencepiece/blob/master/src/rcpp_sentencepiece.cpp#L64).
It looks like the ids are getting mixed up... Hence my question about which version of sentencepiece you used (the R wrapper uses release v0.1.84 from Oct 12, 2019).
> library(sentencepiece)
> ## Get model
> download.file(url = "https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.model",
+ destfile = "english.model",
+ mode = "wb")
trying URL 'https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.model'
Content type '' length 400869 bytes (391 KB)
downloaded 391 KB
> model <- sentencepiece_load_model("english.model")
>
> ## Encode/Decode using words
> x <- sentencepiece_encode(model, "I am just testing out", type = "subwords")
> x
[[1]]
[1] "▁" "I" "▁am" "▁just" "▁testing" "▁out"
> sentencepiece_decode(model, x)
[1] "I am just testing out"
>
> ## Encode/Decode using ids
> x <- sentencepiece_encode(model, "I am just testing out", type = "ids")
> x
[[1]]
[1] 9912 0 425 1025 6083 371
> sentencepiece_decode(model, c(9912L, 0L, 425L, 1025L, 6083L, 371L))
[1] " ⁇ am just testing out"
>
> model$vocabulary[9910:9915, ]
id subword
9910 9909 <U+2581>politicians
9911 9910 eff
9912 9911 <U+2581>humid
9913 9912 <U+2581>
9914 9913 e
9915 9914 a
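To cross-check the ids against the vocabulary, something along these lines should work (a minimal sketch; it assumes model$vocabulary is a data.frame with columns id and subword as printed above, with 0-based ids that are offset from the 1-based row numbers):
## Map the ids returned by the encoder back to vocabulary entries;
## match() is needed because the 0-based ids do not line up with the row numbers
ids <- sentencepiece_encode(model, "I am just testing out", type = "ids")[[1]]
model$vocabulary[match(ids, model$vocabulary$id), ]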
Ah, you need to lowercase the input; all sentencepiece models in BPEmb are lowercase-only.
You should get this:
> x <- sentencepiece_encode(model, "i am just testing out", type = "subwords")
> x
[[1]]
[1] "▁i" "▁am" "▁just" "▁testing" "▁out"
And with ids:
> x <- sentencepiece_encode(model, "i am just testing out", type = "ids")
> x
[[1]]
[1] 386 425 1025 6083 371
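Putting it together, a minimal round-trip sketch with lowercasing (using only the wrapper functions already shown in this thread; I'm assuming decoding a plain integer vector of ids works the same way as in your c(...) call above):
## BPEmb models are trained on lowercased Wikipedia, so lowercase first;
## otherwise uppercase characters end up mapped to the unknown token
txt <- tolower("I am just testing out")
x <- sentencepiece_encode(model, txt, type = "ids")
sentencepiece_decode(model, x[[1]])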
:-) That was it. How silly of me. So all the models have been trained on lowercased Wikipedia. Many thanks for the input!