-
Hi,
I am trying to use the pretrained en-de model from http://data.statmt.org/rsennrich/wmt16_systems/ to translate an English sentence with this script:
```
# this sample script translates a test …
```
-
Thank you for the great work.
The tokenizer for multilingual models puts whitespace around Chinese characters (Kanji), but this treatment unintentionally breaks Japanese words consisting of…
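
The effect can be reproduced with a minimal sketch, assuming the tokenizer pads every code point in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) with spaces and then splits on whitespace. The names `is_cjk_ideograph`, `pad_cjk` and `tokenize` are hypothetical and not taken from the actual tokenizer, which may cover more Unicode ranges.

```python
def is_cjk_ideograph(ch: str) -> bool:
    # Assumption: only the basic CJK Unified Ideographs block is checked.
    return 0x4E00 <= ord(ch) <= 0x9FFF


def pad_cjk(text: str) -> str:
    # Surround each ideograph with spaces; kana and other characters pass through.
    return "".join(f" {ch} " if is_cjk_ideograph(ch) else ch for ch in text)


def tokenize(text: str) -> list[str]:
    # Whitespace split after padding, so every ideograph becomes its own token.
    return pad_cjk(text).split()


if __name__ == "__main__":
    # The two-kanji word 東京 is split into separate tokens.
    print(tokenize("東京に行く"))
```

Because kana are not padded, a word that mixes Kanji and kana ends up cut at every Kanji boundary rather than at word boundaries.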
-
![screen shot 2018-11-26 at 14 44 39](https://user-images.githubusercontent.com/28839356/49021146-e6375680-f189-11e8-8c70-0eb0b11a0428.png)
Running `bpemb_en.encode` only splits the words by …
-
Hi,
I used SentencePiece with the unigram algorithm to segment protein sequences.
The result is two-column data. I know the first column is the subword segmentation.
But what does …
-
This is regarding the pip package.
After training the unigram model using `sentencepiece.SentencePieceTrainer.Train(train_args)`, suppose I want to sample a subword segmentation for a sentence. I a…
-
It would be great if the word boundary character in sentencepiece could be chosen by users. For example, '@@' is commonly seen in other libraries, so supporting that would help make it easier to i…
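
Independently of library support, the conversion can be approximated by post-processing the pieces. The sketch below assumes the default SentencePiece convention, where a piece starting with `▁` (U+2581) begins a new word, and rewrites the sequence into the `@@` continuation style, where a piece ending in `@@` is joined to the next one. `pieces_to_bpe_style` is a hypothetical helper, not part of sentencepiece.

```python
WORD_BOUNDARY = "\u2581"  # "▁", the default SentencePiece word-boundary marker


def pieces_to_bpe_style(pieces: list[str], marker: str = "@@") -> str:
    """Rewrite "▁"-prefixed pieces as "@@"-suffixed continuation tokens."""
    out = []
    for i, piece in enumerate(pieces):
        # Drop the leading boundary marker, if any.
        text = piece[1:] if piece.startswith(WORD_BOUNDARY) else piece
        if not text:
            continue  # a lone "▁" carries no surface text
        # The word continues if a next piece exists and it does not start a new word.
        continues = i + 1 < len(pieces) and not pieces[i + 1].startswith(WORD_BOUNDARY)
        out.append(text + marker if continues else text)
    return " ".join(out)


if __name__ == "__main__":
    print(pieces_to_bpe_style(["▁Hello", "▁wor", "ld", "."]))
```

Note that this mapping treats the first piece as word-initial whether or not it carries `▁`, so it is only a rough bridge between the two conventions.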
-
I'm not sure if this is a bug or by design, but I am experiencing some weird segmentation behaviour when using **--user_defined_symbols** to train sentencepiece.
It seems that sentencepiece does …
-
I am training a transformer model on en-fr data. I have run it several times, but every time it appears to deadlock after finishing a batch. The log is as follows:
[2018-09-19 20:47:48] Training started
[2018-09-19 2…
-
Could you please tell me which features were used to train the averaged perceptron model that parses the OSM addresses?
-
Hi, I have been using `apply_bpe` from October 2016. I tested a recent copy of `apply_bpe` and the number of segments is significantly lower than before. I am using exactly the same settings and code…