arxyzan / data2vec-pytorch

PyTorch implementation of "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language" from Meta AI
MIT License

Model weights? #4

Closed arxyzan closed 2 years ago

arxyzan commented 2 years ago

Although pretraining these models requires a lot of hardware resources and is practically impossible for an individual like me, it should be possible to port the weights from the HuggingFace models, which use the same encoders as fairseq (and this repo). Otherwise this repo would be useful only for educational purposes.

Obviously, this task must be carried out very carefully, but first its feasibility must be verified. Since this model only "slightly" outperforms previous SOTA models, getting even a single layer's weights wrong can ruin the whole thing!

Progress and issues regarding this task will be posted here.

arxyzan commented 2 years ago

I just transferred the weights for BEiT and RoBERTa, because the models used in HuggingFace's Data2VecModel are the same. For Wav2Vec2 that's not the case: there are some architectural differences and I still haven't come up with a solution. I'll push those weights to the HF hub and provide finetuning code for them in this repo.
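For readers who want a feel for what such a transfer involves, here is a minimal sketch, not the script actually used for this repo. It assumes the source and target encoders have identical layer shapes and differ only by a key prefix in the state dict. The `port_weights` helper, the `Wrapper` class, and the toy layer sizes are all hypothetical.

```python
import torch
from torch import nn


def port_weights(source_state, target, prefix=""):
    """Copy tensors from source_state into target after stripping `prefix`.

    Hypothetical helper: raises if any target parameter is left unfilled,
    since a single missing or mismatched layer would silently break the model.
    """
    target_state = target.state_dict()
    ported = {}
    for key, tensor in source_state.items():
        new_key = key[len(prefix):] if prefix and key.startswith(prefix) else key
        # Only accept keys that exist in the target with the same shape.
        if new_key in target_state and target_state[new_key].shape == tensor.shape:
            ported[new_key] = tensor.clone()
    missing = sorted(set(target_state) - set(ported))
    if missing:
        raise KeyError(f"target keys not covered by source: {missing}")
    target.load_state_dict(ported, strict=True)
    return target


class Wrapper(nn.Module):
    """Toy stand-in for a source model that nests its encoder under a prefix."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))


torch.manual_seed(0)
source = Wrapper()
target = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))
port_weights(source.state_dict(), target, prefix="encoder.")

# Sanity check: both models should now produce the same output.
x = torch.randn(3, 4)
same_output = torch.allclose(source.encoder(x), target(x))
```

Comparing the outputs of both models on the same input after the transfer is a cheap way to catch a wrongly mapped layer.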

arxyzan commented 2 years ago

In the end, I had to take the mean of the encoder.pos_conv_embed.layers[:].conv layers, assign it to a single layer (wav2vec2.encoder.pos_conv_embed.conv), and apply weight norm. It makes sense for finetuning and I don't think it would make much difference. Moreover, having identical weights and architecture would not be possible with wav2vec2 from HuggingFace, so I'll probably leave it as is.
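A rough sketch of that averaging step, on toy layers rather than the real checkpoints. It assumes all source conv layers share the target layer's shape. The layer sizes, the `average_convs` helper, and the `dim=2` choice for weight norm are illustrative assumptions, not taken from the actual conversion code.

```python
import torch
import torch.nn.functional as F
from torch import nn


def average_convs(convs):
    """Return (weight, bias) averaged element-wise over same-shaped Conv1d layers."""
    weight = torch.stack([c.weight.detach() for c in convs]).mean(dim=0)
    bias = torch.stack([c.bias.detach() for c in convs]).mean(dim=0)
    return weight, bias


torch.manual_seed(0)

# Toy stand-ins for the several positional conv layers (sizes are assumptions).
source_convs = [nn.Conv1d(8, 8, kernel_size=3, padding=1, groups=2) for _ in range(5)]
# Toy stand-in for the single positional conv layer on the target side.
target = nn.Conv1d(8, 8, kernel_size=3, padding=1, groups=2)

mean_w, mean_b = average_convs(source_convs)
with torch.no_grad():
    target.weight.copy_(mean_w)
    target.bias.copy_(mean_b)

# Reparameterize the weight as magnitude and direction; the effective weight
# right after this call is still the averaged weight.
target = nn.utils.weight_norm(target, name="weight", dim=2)

x = torch.randn(1, 8, 16)
out = target(x)
reference = F.conv1d(x, mean_w, mean_b, padding=1, groups=2)
```

Averaging is lossy: the single layer cannot reproduce what the stacked layers computed, which is why this is only a starting point for finetuning and not an exact port.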