Open AlexisTercero55 opened 5 months ago
The goal is to make the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units.
BPE allows for the representation of an open vocabulary through a fixed-size vocabulary of variable-length character sequences, making it a very suitable word segmentation strategy for neural network models.
Neural machine translation differs from phrase-based methods in that there are strong incentives to minimize the vocabulary size of neural models to increase time and space efficiency, and to allow for translation without back-off models. At the same time, we also want a compact representation of the text itself, since an increase in text length reduces efficiency and increases the distances over which neural models need to pass information.
A simple method to manipulate the trade-off between vocabulary size and text size is to use shortlists of unsegmented words, using subword units only for rare words. As an alternative, we propose a segmentation algorithm based on byte pair encoding (BPE), which lets us learn a vocabulary that provides a good compression rate of the text.
This means the model can work with almost any sequence, thanks to the BPE compression principle.
Especially for languages with productive word-formation processes such as agglutination and compounding, translation models require mechanisms that go below the word level.
Byte Pair Encoding (BPE) (Gage, 1994) is a simple data compression technique that iteratively replaces the most frequent pair of bytes in a sequence with a single, unused byte. We adapt this algorithm for word segmentation. Instead of merging frequent pairs of bytes, we merge characters or character sequences.
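The learning step described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's reference implementation; the function names (`learn_bpe`, `get_pair_stats`, `merge_pair`) and the toy vocabulary are my own, though the toy corpus mirrors the running example in the Sennrich et al. paper (words pre-split into characters, with `</w>` as an end-of-word marker):

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count the frequency of each adjacent symbol pair across the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = {}
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word, freq in vocab.items():
        merged[pattern.sub(''.join(pair), word)] = freq
    return merged

def learn_bpe(vocab, num_merges):
    """Greedily learn `num_merges` merge operations from a character-level vocab."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

# Toy word-frequency vocabulary; '</w>' marks end of word.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
merges, vocab = learn_bpe(vocab, 4)
# The first merges learned are ('e','s'), ('es','t'), ('est','</w>'), ('l','o').
```

Each merge adds one new symbol to the vocabulary, so the number of merge operations directly controls the final vocabulary size.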
At test time, we first split words into sequences of characters, then apply the learned operations to merge the characters into larger, known symbols. This is applicable to any word, and allows for open-vocabulary networks with fixed symbol vocabularies.
BPE as input tokens for the Transformer model
The Transformer model proposed in "Attention Is All You Need" encodes the 4.5M-sentence input data with a small vocabulary generated by learning shared subword units using Byte Pair Encoding. Specifically, the original Transformer uses a BPE variant optimized for word segmentation, proposed in 2016 by Rico Sennrich, Barry Haddow, and Alexandra Birch in the paper "Neural Machine Translation of Rare Words with Subword Units".