NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Confusing GPT Terminology in NeMo Models tutorial #1638

Closed kshtzgupta closed 3 years ago

kshtzgupta commented 3 years ago

In the NeMo Models tutorial, the refactored minGPT has GPTEmbedding, GPTTransformerEncoder, and GPTDecoder. I think GPTTransformerEncoder and GPTDecoder are confusing names with respect to both the original GPT architecture and the Transformer architecture. In the original Transformer paper, the left stack of identical layers is called the encoder and the right stack is called the decoder, which is why "Transformer encoder" and "Transformer decoder" are generally used in the NLP community to describe two different and specific sub-architectures.

Subsequent architecture descriptions follow the original Transformer's encoder/decoder terminology. For example, the authors of BERT describe it as

BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. (2017).

Similarly, the original GPT paper describes the architecture of GPT as a

12-layer decoder-only transformer

So there is no "encoder" in GPT's architecture, which is why the tutorial's GPTTransformerEncoder would come across as confusing to a beginner, especially because the work it does is that of the decoder as defined in the GPT paper. Similarly, the tutorial calls the final layer (LayerNorm + Linear) in the architecture GPTDecoder. Again, with respect to the GPT paper this is a confusing name, because the work this module does is not actually that of the GPT decoder. Including the GPT architecture from the original paper for reference: (image: GPT architecture diagram)
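To make the terminology concern concrete, here is a minimal decoder-only sketch in plain PyTorch using names closer to the GPT paper's own vocabulary. All class and variable names here are hypothetical illustrations, not the tutorial's actual code: the stack of masked self-attention blocks is the "decoder-only transformer", and the final LayerNorm + Linear is a language-model head rather than a "decoder".

```python
import torch
import torch.nn as nn

class GPTStyleLM(nn.Module):
    """Hypothetical decoder-only LM: embedding -> masked self-attention blocks -> LM head."""

    def __init__(self, vocab_size: int, d_model: int, n_layers: int, n_heads: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        # The GPT paper's "12-layer decoder-only transformer": a stack of
        # self-attention blocks run with a causal mask (no cross-attention).
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Final LayerNorm + Linear: a language-model head, not a "decoder"
        # in the GPT paper's sense.
        self.lm_head = nn.Sequential(
            nn.LayerNorm(d_model), nn.Linear(d_model, vocab_size)
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embedding(tokens)
        seq_len = tokens.size(1)
        # Causal mask: True above the diagonal means "may not attend there",
        # so each position sees only itself and earlier positions.
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1
        )
        x = self.blocks(x, mask=mask)
        return self.lm_head(x)

model = GPTStyleLM(vocab_size=100, d_model=32, n_layers=2, n_heads=4)
logits = model(torch.randint(0, 100, (1, 8)))  # (batch=1, seq=8) -> (1, 8, 100)
```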

titu1994 commented 3 years ago

While not consistent with the paper, this is a tutorial to explain how to refactor a monolithic concrete model into an abstract NeMo model, composed of neural modules that represent the abstraction with neural types, and then to instantiate them with actual concrete modules (which can be anything that fits the neural types) as a NeMo model.

The components of a NeMo model don't need to map 1:1 to a paper's terminology. They are represented as an "encoder" that takes a certain neural type and returns some neural type, and a "decoder" that takes some neural type from the encoder and returns some other neural type. They don't refer only to the encoder and decoder blocks of the GPT architecture.

See the ASR models, which also follow an encoder/decoder architecture.

The encoder neural module can be a CNN-based QuartzNet, a self-attention-based Transformer decoder block, or a Conformer sandwich.

Similarly, the decoder neural module can be a basic linear layer, an RNN-based autoregressive decoder, or even an RNN-T transducer-based decoding scheme.
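The swap-in-anything abstraction described above can be sketched in plain PyTorch. This is a hypothetical illustration of the idea, not NeMo's actual API: the "encoder" is whatever maps inputs to hidden states, the "decoder" is whatever maps hidden states to outputs, and any pair whose shapes agree can be composed.

```python
import torch
import torch.nn as nn

class EncoderDecoderModel(nn.Module):
    """Composes any encoder with any decoder that agree on the hidden size.

    "Encoder" and "decoder" here name roles in the composition, not the
    encoder/decoder blocks of the original Transformer paper.
    """

    def __init__(self, encoder: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder
        self.decoder = decoder

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        hidden = self.encoder(x)     # e.g. a Transformer block stack
        return self.decoder(hidden)  # e.g. a LayerNorm + Linear projection

# Swap in concrete modules freely, as long as the shapes line up:
hidden_dim, vocab_size = 64, 100
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=4, batch_first=True),
    num_layers=2,
)
decoder = nn.Sequential(
    nn.LayerNorm(hidden_dim), nn.Linear(hidden_dim, vocab_size)
)

model = EncoderDecoderModel(encoder, decoder)
features = torch.randn(1, 10, hidden_dim)  # (batch, sequence, hidden)
logits = model(features)                   # (1, 10, vocab_size)
```

Replacing the Transformer encoder with a convolutional stack, or the linear head with an autoregressive decoder, requires no change to the composing class, which is the point of the abstraction.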

For now, to be clear, I will add a note there.