Description
Vision transformer (ViT) models are based on the transformer architecture introduced in the 2017 paper "Attention Is All You Need" and adapted to images in "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (2020). While CNNs rely on convolutional operations to extract spatial features from the input image, a ViT splits the image into fixed-size patches and uses self-attention to capture the relationships between those patches.
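For illustration, here is a minimal, self-contained PyTorch sketch of the patch-embedding-plus-self-attention idea described above. The hyperparameters (patch size, embedding dimension, depth) are arbitrary toy values, not a proposal for GDL defaults:

```python
import torch
import torch.nn as nn

class MinimalViT(nn.Module):
    """Toy ViT: split the image into patches, embed them, run self-attention."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3,
                 embed_dim=192, depth=4, num_heads=3, num_classes=10):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided conv is equivalent to flattening non-overlapping
        # patches and applying a shared linear projection.
        self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, dim_feedforward=embed_dim * 4,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        x = self.patch_embed(x)                    # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)           # (B, N, D) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)                        # self-attention across patches
        return self.head(x[:, 0])                  # classify from the [CLS] token

model = MinimalViT()
out = model(torch.randn(2, 3, 224, 224))           # -> (2, 10)
```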
Rationale
Adding this new type of architecture to GDL would let us experiment with it, compare it against the existing CNN-based models, and look for performance gains.
Possible Implementation
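One possible route, sketched below under assumptions: that the timm library is acceptable as a dependency, that a recent timm version is used (where `forward_features` on ViT models returns the full token sequence including the class token), and that GDL models can be plain `torch.nn.Module` subclasses. The class name `ViTSegmenter` and the naive 1x1-conv decoder are placeholders for illustration, not a final design:

```python
import timm
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViTSegmenter(nn.Module):
    """Hypothetical ViT-based segmentation model for GDL experiments:
    a pretrained timm ViT backbone with a minimal per-patch decoder."""
    def __init__(self, num_classes, backbone="vit_base_patch16_224",
                 pretrained=True):
        super().__init__()
        # num_classes=0 strips timm's classification head.
        self.backbone = timm.create_model(backbone, pretrained=pretrained,
                                          num_classes=0)
        self.patch_size = 16  # must match the backbone's patch size
        self.decoder = nn.Conv2d(self.backbone.num_features, num_classes,
                                 kernel_size=1)

    def forward(self, x):
        # Input is assumed to be 224x224 to match the backbone above.
        B, _, H, W = x.shape
        tokens = self.backbone.forward_features(x)  # (B, 1+N, D), incl. [CLS]
        patch_tokens = tokens[:, 1:, :]             # drop the class token
        h, w = H // self.patch_size, W // self.patch_size
        feat = patch_tokens.transpose(1, 2).reshape(B, -1, h, w)
        logits = self.decoder(feat)                 # per-patch class logits
        return F.interpolate(logits, size=(H, W),
                             mode="bilinear", align_corners=False)

model = ViTSegmenter(num_classes=4, pretrained=False)
out = model(torch.randn(2, 3, 224, 224))            # -> (2, 4, 224, 224)
```

From there, the model could be exposed through GDL's model-selection configuration like the existing CNN architectures; pretrained timm checkpoints would also make comparisons against the current baselines more meaningful.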