Description
Vision transformer (ViT) models are based on the transformer architecture introduced in the 2017 paper "Attention Is All You Need" and adapted to images in "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (2020). While CNNs rely on convolutional operations to extract spatial features from the input image, a ViT splits the image into fixed-size patches and uses self-attention to capture the relationships between those patches.
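For illustration, here is a minimal, self-contained PyTorch sketch of the patch-embedding-plus-self-attention idea described above. The hyperparameters (patch size, embedding dimension, depth) are arbitrary toy values, not a proposal for GDL defaults:

```python
import torch
import torch.nn as nn

class MinimalViT(nn.Module):
    """Toy ViT: split the image into patches, embed them, run self-attention."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3,
                 embed_dim=192, depth=4, num_heads=3, num_classes=10):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided conv is equivalent to flattening non-overlapping
        # patches and applying a shared linear projection.
        self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=embed_dim * 4,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x)                    # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)           # (B, N, D) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)                        # self-attention across patches
        return self.head(x[:, 0])                  # classify from the [CLS] token

model = MinimalViT()
out = model(torch.randn(2, 3, 224, 224))           # -> (2, 10)
```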
Rationale
Adding this new type of architecture to GDL would let us experiment with it, compare it against the existing CNN-based models, and look for performance gains.
Possible Implementation
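One possible route, sketched below under assumptions: that the timm library is acceptable as a dependency, that a recent timm version is used (where `forward_features` on ViT models returns the full token sequence including the class token), and that GDL models can be plain `torch.nn.Module` subclasses. The class name `ViTSegmenter` and the naive 1x1-conv decoder are placeholders for illustration, not a final design:

```python
import timm
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViTSegmenter(nn.Module):
    """Hypothetical ViT-based segmentation model for GDL experiments:
    a pretrained timm ViT backbone with a minimal per-patch decoder."""
    def __init__(self, num_classes, backbone="vit_base_patch16_224",
                 pretrained=True):
        super().__init__()
        # num_classes=0 strips timm's classification head.
        self.backbone = timm.create_model(backbone, pretrained=pretrained,
                                          num_classes=0)
        self.patch_size = 16  # must match the backbone's patch size
        self.decoder = nn.Conv2d(self.backbone.num_features, num_classes,
                                 kernel_size=1)

    def forward(self, x):
        # Input is assumed to be 224x224 to match the backbone above.
        B, _, H, W = x.shape
        tokens = self.backbone.forward_features(x)  # (B, 1+N, D), incl. [CLS]
        patch_tokens = tokens[:, 1:, :]             # drop the class token
        h, w = H // self.patch_size, W // self.patch_size
        feat = patch_tokens.transpose(1, 2).reshape(B, -1, h, w)
        logits = self.decoder(feat)                 # per-patch class logits
        return F.interpolate(logits, size=(H, W),
                             mode="bilinear", align_corners=False)

model = ViTSegmenter(num_classes=4, pretrained=False)
out = model(torch.randn(2, 3, 224, 224))            # -> (2, 4, 224, 224)
```

From there, the model could be exposed through GDL's model-selection configuration like the existing CNN architectures; pretrained timm checkpoints would also make comparisons against the current baselines more meaningful.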