facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

[data2vec] The pre-trained vision model and code #4193

Open Yingdong-Hu opened 2 years ago

Yingdong-Hu commented 2 years ago

Hi! When will the pre-trained vision model and code of data2vec be released?

cashincashout commented 2 years ago

@alexeib Hi, I'm wondering whether the code of data2vec (vision part) will be released recently.

alexeib commented 2 years ago

Yes, I am working on it. The problem is that we implemented it in a different codebase (BEiT) and now need to port it over to fairseq.

If you want to get started ASAP, you could try taking the BEiT code (https://github.com/microsoft/unilm/tree/master/beit) and replacing the loss with what we describe in the paper (following the text/audio examples). You should be able to get good results quickly!
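
For reference, a minimal sketch of what that loss swap could look like, assuming an EMA teacher whose top-K layer outputs are averaged and regressed with a smooth L1 loss; the function and argument names below are illustrative, not the fairseq implementation:

```python
import torch
import torch.nn.functional as F

def data2vec_style_loss(student_layers, teacher_layers, masked_positions, top_k=6, beta=2.0):
    """Illustrative sketch (not the official code): regress the average of the
    teacher's top-K layer outputs at masked positions with a smooth L1 loss,
    as the paper describes for the text/audio variants.

    student_layers / teacher_layers: lists of [B, T, D] tensors, one per block.
    masked_positions: [B, T] boolean mask of the patches/timesteps that were masked.
    """
    with torch.no_grad():
        # Instance-normalize each of the teacher's last K layers, then average them.
        targets = [F.instance_norm(layer.transpose(1, 2)).transpose(1, 2)
                   for layer in teacher_layers[-top_k:]]
        targets = sum(targets) / len(targets)

    prediction = student_layers[-1]  # student output computed from the masked input
    return F.smooth_l1_loss(prediction[masked_positions],
                            targets[masked_positions], beta=beta)
```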

mattiasstahre commented 2 years ago

Great! I'm also looking for the vision part of the model :D

RobertoArayaDay commented 2 years ago

Me too :)

mritunjaymusale commented 2 years ago

Has anyone been able to implement it based on @alexeib's suggestion?

jsedoc commented 2 years ago

@alexeib is there an ETA on the update, or should we try to modify the BEiT code?

Echoing @mritunjaymusale's request: has anyone implemented the suggestion? We'd rather not duplicate work.

arxyzan commented 2 years ago

I'm currently working on it; you can review my implementation here. According to the paper, I think Data2Vec for vision is more than just changing BEiT's loss calculation, because BEiT uses two images: the masked 224x224 image as the input, and the visual tokens of a smaller 112x112 image (encoded by a discrete variational autoencoder like DALL-E's) as the targets. The task is for the model to predict these tokens from the masked pixel inputs. Based on my reading of the paper, Data2Vec uses only one image, tokenized by the dVAE model. My hypothesis is that the method would be as follows (a rough sketch follows below):

  1. Transform the input image (per the paper: random resized crop, horizontal flipping, color jittering).
  2. Split the image into a 14x14 grid of 16x16-pixel patches and tokenize it with DALL-E to get visual tokens (flatten the 14x14 grid into a tensor of 196 items).
  3. Apply masking to the tokens.

And the rest is like NLP because we're now dealing with discrete tokens.
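
To make the hypothesis concrete, here is a rough sketch; the visual tokenizer is kept abstract (in BEiT it is DALL-E's discrete VAE with an 8192-entry vocabulary), and all names are placeholders:

```python
import torch

def hypothesized_pipeline(images, visual_tokenizer, mask_ratio=0.6):
    """Rough sketch of the tokenize-then-mask idea described above.

    images: [B, 3, 224, 224], already augmented with random resized crop,
            horizontal flipping and color jittering.
    visual_tokenizer: callable mapping images to discrete tokens, e.g. a dVAE.
    """
    # A 224x224 image gives a 14x14 grid of 16x16-pixel patches -> 196 tokens.
    visual_tokens = visual_tokenizer(images)                    # [B, 196], ids in [0, 8192)

    # Pick the positions to mask; how a masked position should be represented
    # (a reserved token id vs. something else) is deliberately left open here.
    batch_size, num_patches = visual_tokens.shape
    mask = torch.rand(batch_size, num_patches) < mask_ratio     # True = masked

    return visual_tokens, mask
```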

@alexeib I'd be glad to know whether the above hypothesis is correct. In addition, what would be the mask token id here? I suppose all 8192 tokens are already reserved by the image tokenizer (DALL-E)! Can I manually set a random number within that range?

Best, Aryan

arxyzan commented 2 years ago

I just found this nice implementation from @Guillem96, who has worked exclusively on the vision part.

arxyzan commented 2 years ago

Alright, some of my assumptions above were wrong! The targets are contextualized, and there's no need for a tokenizer: the images are patched, masked, and fed to the network. I just completed the vision implementation in my repo. There may be some minor details left to do, which I'll apply later, but the main body of the work is done.
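
For anyone following along, here is a condensed sketch of that flow, assuming a ViT-style encoder that returns per-layer outputs and accepts a boolean patch mask (both of these are assumptions; the names are illustrative):

```python
import copy
import torch
import torch.nn.functional as F

class Data2VecVisionSketch(torch.nn.Module):
    """Illustrative sketch: the student sees the masked patches, an EMA copy of the
    same encoder (the teacher) sees the full image, and the student regresses the
    average of the teacher's last K layer outputs at the masked positions."""

    def __init__(self, vit_encoder, ema_decay=0.9998, top_k=6):
        super().__init__()
        self.student = vit_encoder
        self.teacher = copy.deepcopy(vit_encoder)   # EMA copy, never backpropagated
        for p in self.teacher.parameters():
            p.requires_grad = False
        self.ema_decay = ema_decay
        self.top_k = top_k

    @torch.no_grad()
    def ema_step(self):
        # Called after each optimizer step: teacher <- decay * teacher + (1 - decay) * student
        for ps, pt in zip(self.student.parameters(), self.teacher.parameters()):
            pt.mul_(self.ema_decay).add_(ps, alpha=1.0 - self.ema_decay)

    def forward(self, images, mask):
        # Both branches return per-layer outputs: lists of [B, num_patches, D] tensors.
        student_layers = self.student(images, mask=mask)        # masked input
        with torch.no_grad():
            teacher_layers = self.teacher(images, mask=None)    # full, unmasked input
            targets = sum(teacher_layers[-self.top_k:]) / self.top_k

        prediction = student_layers[-1]
        return F.smooth_l1_loss(prediction[mask], targets[mask])
```

The EMA schedule, the exact set of teacher layers, and the target normalization are the parts I'd expect to differ most from the official release.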

arxyzan commented 2 years ago

Official code and models have been released here: https://github.com/facebookresearch/data2vec_vision/tree/main/beit. I've reviewed the code briefly and can verify that most of the code in my repo and in @Guillem96's conforms to their implementation, but I'll go through it in depth to resolve any missing parts.