bheinzerling / bpemb

Pre-trained subword embeddings in 275 languages, based on Byte-Pair Encoding (BPE)
https://nlp.h-its.org/bpemb
MIT License

version of sentencepiece used #34

Closed: jwijffels closed this issue 4 years ago

jwijffels commented 4 years ago

Hi, I'm writing an R wrapper around sentencepiece and tried to load a few of the models and vocabularies provided here, but found some inconsistencies. To make sure this isn't related to the SentencePiece version you used, could you let me know which version/commit of SentencePiece the models were built with? I'm building the R wrapper against sentencepiece release v0.1.84 from Oct 12, 2019.

bheinzerling commented 4 years ago

Hi, I moved institutions and unfortunately cannot check this anymore. All I can say is that the version was from around the end of 2017. However, I'm not aware of any compatibility-breaking changes in sentencepiece and am using BPEmb with recent versions. What kind of inconsistencies did you find?

jwijffels commented 4 years ago

I'm using the R wrapper around sentencepiece that I created myself at https://github.com/bnosac/sentencepiece, with the code below to encode and decode some text. It looks like the model encodes correctly when using subwords (https://github.com/bnosac/sentencepiece/blob/master/src/rcpp_sentencepiece.cpp#L50) but not when using ids (implemented here: https://github.com/bnosac/sentencepiece/blob/master/src/rcpp_sentencepiece.cpp#L64).

It looks like the ids are getting mixed up, hence my question about which version of sentencepiece you used (the R wrapper is built against release v0.1.84 from Oct 12, 2019).

> library(sentencepiece)
> ## Get model
> download.file(url = "https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.model", 
+               destfile = "english.model", 
+               mode = "wb")
trying URL 'https://nlp.h-its.org/bpemb/en/en.wiki.bpe.vs10000.model'
Content type '' length 400869 bytes (391 KB)
downloaded 391 KB

> model <- sentencepiece_load_model("english.model")
> 
> ## Encode/Decode using words
> x <- sentencepiece_encode(model, "I am just testing out", type = "subwords")
> x
[[1]]
[1] "▁"        "I"        "▁am"      "▁just"    "▁testing" "▁out"    

> sentencepiece_decode(model, x)
[1] "I am just testing out"
> 
> ## Encode/Decode using ids
> x <- sentencepiece_encode(model, "I am just testing out", type = "ids")
> x
[[1]]
[1] 9912    0  425 1025 6083  371

> sentencepiece_decode(model, c(9912L, 0L, 425L, 1025L, 6083L, 371L))
[1] " ⁇  am just testing out"
> 
> model$vocabulary[9910:9915, ]
       id      subword
9910 9909 ▁politicians
9911 9910          eff
9912 9911       ▁humid
9913 9912            ▁
9914 9913            e
9915 9914            a
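
For reference when reading that table: sentencepiece ids are 0-based while R data frame rows are 1-based, so the subword for id i sits in row i + 1. The " ⁇ " in the decoded output above is, if I understand sentencepiece correctly, how the unknown token is rendered, i.e. the id 0 that the uppercase "I" was mapped to. A quick lookup by id, using the model$vocabulary data frame from above:

> ## sentencepiece ids are 0-based, R rows are 1-based:
> ## the subword for id 9912 lives in row 9913
> model$vocabulary[model$vocabulary$id == 9912, ]
       id subword
9913 9912       ▁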

bheinzerling commented 4 years ago

Ah, you need to lowercase the input; all sentencepiece models in BPEmb are lowercase-only.

You should get this:

> x <- sentencepiece_encode(model, "i am just testing out", type = "subwords")
> x
[[1]]
[1] "▁i"        "▁am"      "▁just"    "▁testing" "▁out" 

And with ids:

> x <- sentencepiece_encode(model, "i am just testing out", type = "ids")
> x
[[1]]
[1] 386  425 1025 6083  371
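
If you don't want to lowercase by hand every time, a small wrapper along these lines should do it (just a sketch; encode_lower is a hypothetical helper name, built on the sentencepiece_encode signature from your package):

> ## hypothetical helper: BPEmb models are trained on lowercased text,
> ## so normalise the input before encoding
> encode_lower <- function(model, text, type = "subwords") {
+   sentencepiece_encode(model, tolower(text), type = type)
+ }
> encode_lower(model, "I am just testing out", type = "ids")
[[1]]
[1] 386  425 1025 6083  371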

jwijffels commented 4 years ago

:-) That was it. How silly of me. So all the models have been trained on lowercased Wikipedia. Many thanks for the input!
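
And indeed, with lowercased input the full round trip via ids now works (sketching it with the same model and calls as above):

> x <- sentencepiece_encode(model, tolower("I am just testing out"), type = "ids")
> sentencepiece_decode(model, x[[1]])
[1] "i am just testing out"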