model architecture - Githubissues

Great question! The decoder-only model is indeed more popular and used in most of the recent systems. From my personal view, the difference between encoder-decoder and decoder-only models is not that significant, considering the visual backbone. You can always consider the visual feature (ViT) as an encoder in the decoder-model. Part of our efforts is focused on training a generalized encoder that can take many different modalities. So, that is why we chose the encoder-decoder system in our project.

allenai / unified-io-2

model architecture #13