Open AlexisTercero55 opened 5 months ago
The goal is to make the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units.
BPE allows for the representation of an open vocabulary through a fixed-size vocabulary of variable-length character sequences, making it a very suitable word segmentation strategy for neural network models.
Neural machine translation differs from phrase-based methods in that there are strong incentives to minimize the vocabulary size of neural models to increase time and space efficiency, and to allow for translation without back-off models. At the same time, we also want a compact representation of the text itself, since an increase in text length reduces efficiency and increases the distances over which neural models need to pass information.
A simple method to manipulate the trade-off between vocabulary size and text size is to use shortlists of unsegmented words, using subword units only for rare words. As an alternative, we propose a segmentation algorithm based on byte pair encoding (BPE), which lets us learn a vocabulary that provides a good compression rate of the text.
This means the model can work with almost any sequence, thanks to the BPE compression principle.
Especially for languages with productive word-formation processes such as agglutination and compounding, translation models require mechanisms that go below the word level.
Byte Pair Encoding (BPE) (Gage, 1994) is a simple data compression technique that iteratively replaces the most frequent pair of bytes in a sequence with a single, unused byte. We adapt this algorithm for word segmentation. Instead of merging frequent pairs of bytes, we merge characters or character sequences.
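The learning step described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's reference implementation; the function names (`learn_bpe`, `get_pair_stats`, `merge_pair`) and the toy vocabulary are my own, though the toy corpus mirrors the running example in the Sennrich et al. paper (words pre-split into characters, with `</w>` as an end-of-word marker):

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count the frequency of each adjacent symbol pair across the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = {}
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word, freq in vocab.items():
        merged[pattern.sub(''.join(pair), word)] = freq
    return merged

def learn_bpe(vocab, num_merges):
    """Greedily learn `num_merges` merge operations from a character-level vocab."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

# Toy word-frequency vocabulary; '</w>' marks end of word.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
merges, vocab = learn_bpe(vocab, 4)
# The first merges learned are ('e','s'), ('es','t'), ('est','</w>'), ('l','o').
```

Each merge adds one new symbol to the vocabulary, so the number of merge operations directly controls the final vocabulary size.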
At test time, we first split words into sequences of characters, then apply the learned operations to merge the characters into larger, known symbols. This is applicable to any word, and allows for open-vocabulary networks with fixed symbol vocabularies.
BPE as input tokens for the Transformer model
The Transformer model proposed in "Attention Is All You Need" encodes the 4.5M-sentence input data with a small vocabulary generated by learning shared subword units using Byte Pair Encoding. Specifically, the original Transformer uses a BPE variant optimized for word segmentation, proposed in 2016 by Rico Sennrich, Barry Haddow, and Alexandra Birch in the paper "Neural Machine Translation of Rare Words with Subword Units".