The WMT 2014 English-German dataset contains two text files that together hold about 4.5M English-German sentence pairs.
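A minimal sketch of reading the two files into sentence pairs; the file names `train.en` and `train.de` are assumptions, not necessarily the archive's actual names:

```python
def load_parallel(en_path: str, de_path: str):
    """Yield (English, German) sentence pairs from two line-aligned text files."""
    # Hypothetical file names; the real WMT14 archive layout may differ.
    with open(en_path, encoding="utf-8") as f_en, open(de_path, encoding="utf-8") as f_de:
        for en, de in zip(f_en, f_de):
            yield en.strip(), de.strip()

pairs = load_parallel("train.en", "train.de")
print(next(pairs))  # first (English, German) pair
```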
The paper describes the preprocessing as follows:

> We tokenize and clean all datasets with the scripts in Moses and learn shared subword units using Byte Pair Encoding (BPE) (Sennrich et al., 2016b) using 32,000 merge operations for a final vocabulary size of approximately 37k. (Google Brain 2017)
>
> Sentences were encoded using byte-pair encoding.
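As a sketch of that step, here is shared-vocabulary BPE training with the Hugging Face `tokenizers` library rather than the original Moses/subword-nmt scripts; the file names and the direct `vocab_size=37000` cap (approximating the paper's 32,000 merges) are assumptions:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Approximate the paper's 32,000 merges / ~37k shared vocabulary
# by capping the final vocabulary size directly (an assumption).
trainer = BpeTrainer(vocab_size=37000, special_tokens=["[UNK]", "[PAD]"])

# Train on both language sides so source and target share one vocabulary.
tokenizer.train(["train.en", "train.de"], trainer)
tokenizer.save("bpe-shared-37k.json")
```

Training on both files at once is what makes the subword units shared between source and target.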
On the embedding layers, the paper notes:

> Similarly to other sequence transduction models, we use learned embeddings to convert the input tokens and output tokens to vectors of dimension $d_{\text{model}}$. (Google 2017)
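A minimal sketch of those learned embeddings, assuming PyTorch (the issue does not fix a framework); note the paper also multiplies embedding weights by $\sqrt{d_{\text{model}}}$:

```python
import math
import torch
import torch.nn as nn

class TokenEmbedding(nn.Module):
    """Map token ids to d_model-dimensional vectors, scaled as in the paper."""
    def __init__(self, vocab_size: int, d_model: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.d_model = d_model

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Scale by sqrt(d_model), per section 3.4 of the paper.
        return self.embed(tokens) * math.sqrt(self.d_model)

emb = TokenEmbedding(vocab_size=37000, d_model=512)
out = emb(torch.randint(0, 37000, (2, 5)))  # batch of 2, sequence length 5
print(out.shape)  # torch.Size([2, 5, 512])
```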
According to the "Attention Is All You Need" paper, the Python implementation of the Transformer architecture is placed in this repo/script.
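For reference, a minimal configuration sketch using PyTorch's built-in `nn.Transformer` with the paper's base hyperparameters (d_model=512, 8 heads, 6 encoder and 6 decoder layers, d_ff=2048, dropout 0.1); this is an illustration, not the repo's actual script:

```python
import torch
import torch.nn as nn

model = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=2048,
    dropout=0.1,
    batch_first=True,  # use (batch, seq, feature) tensor layout
)

src = torch.rand(2, 10, 512)  # encoder input: (batch, src_len, d_model)
tgt = torch.rand(2, 7, 512)   # decoder input: (batch, tgt_len, d_model)
print(model(src, tgt).shape)  # torch.Size([2, 7, 512])
```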
FYI @xavierVG
Requirements
Originally posted by @AlexisTercero55 in #11