Yingdong-Hu opened 2 years ago
@alexeib Hi, I'm wondering whether the code for data2vec (the vision part) will be released soon.
Yes, I am working on it. The problem is that we implemented it in a different codebase (BEiT) and now need to port it over to fairseq.
If you want to get started ASAP, you could try taking the BEiT code (https://github.com/microsoft/unilm/tree/master/beit) and replacing the loss with what we describe in the paper (following the text/audio examples); you should be able to get good results quickly!
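For anyone attempting that swap before the port lands, here is a rough numpy sketch of the loss described in the paper: regress instance-normalized averages of the top-K teacher layer outputs with a Smooth L1 loss. The function names and the `beta` value are illustrative assumptions, not taken from any repo:

```python
import numpy as np

def build_targets(teacher_layers, top_k=8):
    """Average the last top_k teacher layer outputs; each layer is
    instance-normalized over the feature dimension before averaging,
    as described in the data2vec paper."""
    normed = []
    for h in teacher_layers[-top_k:]:
        mu = h.mean(axis=-1, keepdims=True)
        sd = h.std(axis=-1, keepdims=True) + 1e-6
        normed.append((h - mu) / sd)
    return np.mean(normed, axis=0)

def smooth_l1(pred, target, beta=2.0):
    """Smooth L1 regression loss between student predictions and
    teacher targets (beta is a tunable transition-point hyperparameter)."""
    diff = np.abs(pred - target)
    loss = np.where(diff < beta, 0.5 * diff ** 2 / beta, diff - 0.5 * beta)
    return loss.mean()
```

In practice the loss would be computed only at the masked positions, and `teacher_layers` would come from an EMA copy of the student network.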
Great! I'm also looking for the vision part of the model :D
Me too :)
Has anyone been able to implement it based on @alexeib's suggestion?
@alexeib is there an ETA on the update, or should we try to modify the BEiT code?
Echoing @mritunjaymusale's request: has anyone implemented the suggestion? We'd rather not duplicate the work.
I'm currently working on it; you can review my implementation here. According to the paper, I think data2vec for vision is more than just changing BEiT's loss calculation, because BEiT uses two views of each image: the masked 224x224 image as the input, and the visual tokens of a smaller 112x112 version (encoded by a discrete variational autoencoder, as in DALL-E) as the targets. The task is for the model to predict these tokens from the masked pixel inputs. Based on what I understand from the paper, data2vec uses only one image, tokenized by the dVAE model. My hypothesis is that the method tokenizes the image and then proceeds much like the NLP setup, since we're now dealing with discrete tokens.
@alexeib I'd be glad to know whether the above hypothesis is correct. In addition, what would be the mask token id here? I suppose all 8192 ids are reserved by the image tokenizer (DALL-E), so can I just pick an arbitrary id within that range?
Best, Aryan
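The hypothesized tokenize-then-mask pipeline above might be sketched as follows. The masking ratio and the choice of mask id are my assumptions, not from the paper; one common convention (as in BERT) is to use an id just outside the codebook and grow the embedding table by one row:

```python
import numpy as np

VOCAB_SIZE = 8192      # DALL-E dVAE codebook size
MASK_ID = VOCAB_SIZE   # assumed: an id outside the codebook, with one extra embedding row

def mask_visual_tokens(token_ids, mask_prob=0.6, seed=0):
    """BERT-style masking over the hypothesized dVAE visual tokens:
    replace a random subset of token ids with MASK_ID."""
    rng = np.random.default_rng(seed)
    ids = token_ids.copy()
    mask = rng.random(ids.shape) < mask_prob
    ids[mask] = MASK_ID
    return ids, mask

# a 14x14 grid of visual tokens (224x224 image, 16x16 patches)
tokens = np.random.default_rng(1).integers(0, VOCAB_SIZE, size=14 * 14)
masked, positions = mask_visual_tokens(tokens)
```

Under this hypothesis, the student would see `masked` and regress the teacher's contextualized representations at the `positions` that were masked.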
I just found this nice implementation from @Guillem96, who has worked exclusively on the vision part.
Alright, some of my assumptions above were wrong! The targets are contextualized, and there's no need for a tokenizer: the images are patched, masked, and fed to the network. I've just completed the vision implementation in my repo. There may be some minor details left, which I'll apply later, but the main body of work is done.
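That corrected picture can be sketched minimally: standard ViT-style patching of the raw image, plus an EMA teacher whose contextualized activations serve as regression targets. The helper names here are mine, and this is a sketch of the idea rather than the repo's implementation:

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches,
    as a ViT does before the linear projection; no tokenizer involved."""
    h, w, c = img.shape
    gh, gw = h // patch, w // patch
    x = img[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)

def ema_update(teacher, student, tau=0.999):
    """Teacher parameters track the student via exponential moving average;
    the teacher's activations on the unmasked image are the targets."""
    return {k: tau * teacher[k] + (1.0 - tau) * student[k] for k in teacher}
```

A subset of the patches would then be masked on the student side, and the student regresses the teacher's layer-averaged representations at those positions, as in the text/audio variants.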
Official code and models have been released here: https://github.com/facebookresearch/data2vec_vision/tree/main/beit
I've reviewed the code briefly and can verify that most of the code in my repo and in @Guillem96's conforms to their implementation, but I'll go through it more deeply to resolve any missing parts.
Hi! When will the pre-trained vision model and code of data2vec be released?