Title
Transformer-XL: Unlocking Long-term Dependencies for Effective Language Modelling
URL
https://medium.com/saarthi-ai/a-quick-look-into-transformer-xl-57b3e84cdf99
Summary
Covers the background on transformers, how they work, and the gaps that Transformer-XL solves.
Key Points
Attention modules: unlike RNNs, which process tokens one at a time (in a forward or backward pass), a transformer takes in a group of tokens and learns the dependencies between all of them at once using three learned weight matrices, i.e. Query, Key, and Value, which together form an Attention Head (see the first sketch after these points).
Because the attention module processes all tokens concurrently, the model has no inherent notion of order; it therefore adds order information to the embeddings in a step called Positional Encoding, which lets the network learn each token's position (second sketch below).
Vanilla transformers could only take in fixed-length inputs of around 512 characters. Anything larger than 512 had to be split into segments that were each processed from scratch, with no information carried across segment boundaries, so dependencies longer than a segment were lost.
Transformer-XL: when the next segment is processed, each hidden layer receives two inputs, i.e. the output of the previous hidden layer for the current segment, and the cached output of the previous hidden layer for the previous segment. These are concatenated, which creates a recurrence across segments and lets the model build long-term dependencies that extend beyond a single segment (third sketch below).
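To make the Query/Key/Value point concrete, here is a minimal NumPy sketch of one attention head; the function name, shapes, and random weights are illustrative rather than taken from the article:

```python
import numpy as np

def attention_head(x, W_q, W_k, W_v):
    """Single attention head: relates all tokens of a segment to each other at once.

    x:             (seq_len, d_model) token embeddings for one segment
    W_q, W_k, W_v: (d_model, d_head) learned projection matrices
    """
    Q = x @ W_q                                   # queries
    K = x @ W_k                                   # keys
    V = x @ W_v                                   # values
    scores = Q @ K.T / np.sqrt(K.shape[-1])       # every token scores every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V                            # weighted mix of value vectors

# toy usage with random weights (in a real model these are learned)
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))                      # 8 tokens, model dim 16
W_q, W_k, W_v = (rng.normal(size=(16, 4)) for _ in range(3))
out = attention_head(x, W_q, W_k, W_v)            # (8, 4)
```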
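For the positional-encoding point, a sketch of the sinusoidal encoding from the original Transformer paper; it is added to the token embeddings before the first attention layer (the function name is an assumption, not the article's):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: a unique, order-preserving vector per position."""
    pos = np.arange(seq_len)[:, None]             # (seq_len, 1) positions
    i = np.arange(d_model)[None, :]               # (1, d_model) dimension indices
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])         # even dimensions: sine
    pe[:, 1::2] = np.cos(angles[:, 1::2])         # odd dimensions: cosine
    return pe

# embeddings = token_embeddings + positional_encoding(seq_len, d_model)
```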
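And a toy sketch of the Transformer-XL recurrence: each layer attends over the previous segment's cached hidden states concatenated with the current segment's, while queries come only from the current segment. The helper name and the stand-in layers are hypothetical; in the real model the cached states are also excluded from gradient computation:

```python
import numpy as np

def xl_segment_forward(segment, memories, layers):
    """Run one segment through the stack, reusing the last segment's hidden states.

    segment:  (cur_len, d) embeddings for the current segment
    memories: one (mem_len, d) array per layer -- that layer's input hidden
              states from the previous segment (empty before the first segment)
    layers:   callables layers[i](query, context) -> (cur_len, d)
    """
    h, new_memories = segment, []
    for layer, mem in zip(layers, memories):
        new_memories.append(h)                      # cache for the next segment
        context = np.concatenate([mem, h], axis=0)  # prev-segment + current states
        h = layer(h, context)                       # attend over extended context
    return h, new_memories

# toy usage: stream two 4-token segments through two stand-in "layers"
d = 8
layers = [lambda q, ctx: q + ctx.mean(axis=0)] * 2  # placeholders, not real attention
memories = [np.zeros((0, d))] * 2                   # no memory before segment one
for segment in np.random.default_rng(0).normal(size=(2, 4, d)):
    out, memories = xl_segment_forward(segment, memories, layers)
```

Without the memories, each segment would be processed in isolation, which is exactly the fixed-context limitation noted above; the concatenation is what lets information flow across segment boundaries.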
Citation
Repo link