At the moment, we have a novel architecture that performs very well at language modelling. However, we don't know whether it will transfer to other domains as well as the transformer does. That's why it'd be interesting to test its versatility by training it on ImageNet.
This issue covers implementing the input projection for image tokens (as in ViT), building the necessary data pipelines, and evaluating the model on this new modality.
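As a starting point, here is a minimal sketch of the ViT-style input projection: images are split into non-overlapping patches, each patch is flattened, and a shared linear layer maps it to the model dimension. This is framework-agnostic NumPy for illustration only; the function names (`patchify`, `project_patches`) and the example patch size and model dimension are assumptions, not part of our codebase.

```python
import numpy as np

def patchify(images, patch_size=16):
    """Split a batch of images (B, H, W, C) into flattened non-overlapping
    patches of shape (B, num_patches, patch_size * patch_size * C)."""
    b, h, w, c = images.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image dims must divide evenly"
    ph, pw = h // patch_size, w // patch_size
    # Reshape into a grid of patches, then flatten each patch.
    patches = images.reshape(b, ph, patch_size, pw, patch_size, c)
    patches = patches.transpose(0, 1, 3, 2, 4, 5)  # (B, ph, pw, p, p, C)
    return patches.reshape(b, ph * pw, patch_size * patch_size * c)

def project_patches(patches, weight, bias):
    """Shared linear projection of flattened patches into model-dim tokens."""
    return patches @ weight + bias

# Example: 224x224 RGB images, patch size 16 -> 196 tokens of dim 768,
# projected into a hypothetical model dimension of 512.
images = np.random.rand(2, 224, 224, 3)
patches = patchify(images)                      # (2, 196, 768)
weight = np.random.randn(768, 512) * 0.02
bias = np.zeros(512)
tokens = project_patches(patches, weight, bias)  # (2, 196, 512)
```

The resulting token sequence can then be fed to the model exactly like a text-token sequence (plus learned position embeddings and, optionally, a class token, as in ViT).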