-
I was just reading [Junyoung's paper](http://arxiv.org/abs/1603.06147) on using a character-level decoder. Although it's nice to see it works, I think the results are slightly misleading because the p…
-
# BPE as input tokens of the transformer model
The Transformer model proposed in "_Attention is all you need_" encodes its 4.5M-sentence input data into a small vocabulary generated by learning sha…
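As a minimal sketch of that workflow, assuming SentencePiece as the BPE implementation (the file names and the vocabulary size here are illustrative; the paper itself reports a shared source–target vocabulary of roughly 37k BPE tokens for EN-DE):

```python
import sentencepiece as spm

# Learn a joint BPE vocabulary on the training corpus
# (corpus path and vocab_size are assumptions for illustration).
spm.SentencePieceTrainer.train(
    input="train.en-de.txt",   # hypothetical concatenated source+target training text
    model_prefix="bpe_joint",
    vocab_size=32000,          # illustrative; the paper reports ~37k shared BPE tokens
    model_type="bpe",
)

# Encode raw sentences into the subword tokens the Transformer actually consumes.
sp = spm.SentencePieceProcessor(model_file="bpe_joint.model")
print(sp.encode("The cat sat on the mat.", out_type=str))
```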
-
## Motivation
There are multiple libraries that implement subword models in the compression-based space, including fastBPE, SentencePiece, YouTokenToMe, etc.
As far as I can tell there are f…
-
Language identification with fastText is great:
https://fasttext.cc/blog/2017/10/02/blog-post.html
But the training process is not clear, and I am wondering whether, for language identification, subwor…
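For what it's worth, here is a minimal sketch of how I understand the supervised training to work; the file name and hyperparameters are my own assumptions (not the settings behind the released lid.176.bin model), and `minn`/`maxn` are the character n-gram (subword) features I'm asking about:

```python
import fasttext

# Training file: one example per line, e.g. "__label__en this is an english sentence"
# (file name and hyperparameters are assumptions for illustration).
model = fasttext.train_supervised(
    input="langid_train.txt",
    minn=2, maxn=4,   # character n-gram features; set minn=maxn=0 to disable subwords
    dim=16,
    epoch=25,
)

labels, probs = model.predict("quel est le langage de cette phrase ?", k=3)
print(labels, probs)
```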
-
Hello,
I am currently trying to get a transformer going for segmentation of scripta continua languages. I noticed that decreasing the vocab_size increased the performance of the transformer in this …
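To make the comparison concrete, this is the kind of sweep I mean, sketched with SentencePiece on a hypothetical unsegmented corpus; smaller `vocab_size` values push the segmentation closer to the character level:

```python
import sentencepiece as spm

sentence = "thisisanunsegmentedsentence"   # stand-in for scriptio continua text

# Train BPE models of decreasing vocabulary size and compare granularity
# (corpus path and sizes are assumptions for illustration).
for size in (8000, 2000, 500):
    spm.SentencePieceTrainer.train(
        input="unsegmented_corpus.txt",
        model_prefix=f"bpe_{size}",
        vocab_size=size,
        model_type="bpe",
        character_coverage=1.0,
    )
    sp = spm.SentencePieceProcessor(model_file=f"bpe_{size}.model")
    print(size, sp.encode(sentence, out_type=str))
```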
-
I have trained a base transformer model using the sub-word segmentation approach of Sennrich et al. (https://github.com/rsennrich/subword-nmt). This requires me to set the subword_tokenizer in the new…
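For context, this is roughly how I learned and applied the BPE codes with the subword-nmt Python API before training; the paths and the merge count are my own choices, not anything prescribed by the toolkit:

```python
import codecs
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn 10k merge operations from the tokenized training text
# (paths and merge count are assumptions for illustration).
with codecs.open("train.tok.txt", encoding="utf-8") as infile, \
     codecs.open("codes.bpe", "w", encoding="utf-8") as outfile:
    learn_bpe(infile, outfile, num_symbols=10000)

# Apply the learned codes; subwords are joined with the "@@ " continuation marker.
with codecs.open("codes.bpe", encoding="utf-8") as codes:
    bpe = BPE(codes)
print(bpe.process_line("the quick brown fox jumps over the lowest fence"))
```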
-
LMVR modifies FlatCat and allows for an output lexicon size to be set. Since we used 3 different settings for BPE (2500, 5000, 7500), it could be worthwhile to investigate the settings for LMVR as we…
-
I am currently training a transformer model and have followed the MTM labs to apply BPE to my own corpus. However, I'm unsure of the effect that providing a pre-determined vocabulary has. Does it impa…
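My understanding (please correct me if this is wrong) is that, assuming the labs use subword-nmt, the pre-determined vocabulary acts as a filter at apply time: merges that would produce a subword below the frequency threshold are reverted to smaller units. A hedged sketch with the subword-nmt Python API, assuming a `vocab.txt` produced beforehand with `subword-nmt get-vocab`:

```python
import codecs
from subword_nmt.apply_bpe import BPE, read_vocabulary

# vocab.txt is assumed to hold "subword count" lines from `subword-nmt get-vocab`
# run on the BPE-segmented training corpus.
with codecs.open("vocab.txt", encoding="utf-8") as vocab_file:
    vocabulary = read_vocabulary(vocab_file, threshold=50)   # drop subwords seen < 50 times

with codecs.open("codes.bpe", encoding="utf-8") as codes:
    bpe = BPE(codes, vocab=vocabulary)   # rare merges are reverted to smaller units

print(bpe.process_line("an example sentence with raretokens"))
```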
-
For many languages, there are lots of unpaired words, but also lots of paired phrases.
-
Because of this, I think it is possible that the BPE vocabulary is so small that the training corpus is overly segmented, making model training and inference harder.
In our experiment, the scale of Chinese…
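One way to make "overly segmented" measurable is the average number of subword tokens per original word (fertility); a quick sketch, assuming subword-nmt style "@@" continuation markers (the threshold mentioned in the comment is just a rough rule of thumb):

```python
def fertility(segmented_lines):
    """Average BPE tokens per original word, assuming '@@ ' continuation markers."""
    subword_tokens = 0
    words = 0
    for line in segmented_lines:
        tokens = line.split()
        subword_tokens += len(tokens)
        # every token NOT ending in '@@' closes one original word
        words += sum(1 for t in tokens if not t.endswith("@@"))
    return subword_tokens / max(words, 1)

# A fertility close to 1.0 means words are mostly kept whole, while values
# well above ~1.5 may indicate that the vocabulary is too small for the corpus.
print(fertility(["这@@ 是@@ 一个@@ 例子", "another ex@@ ample sentence"]))
```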