arxyzan / data2vec-pytorch

PyTorch implementation of "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language" from Meta AI
MIT License
172 stars 26 forks

doubt about finetuning #3

Closed rafaelvp-db closed 2 years ago

rafaelvp-db commented 2 years ago

First of all, great work @AryanShekarlaban and @kabouzeid!

Quick question: if I want to finetune data2vec with a given backbone (e.g. wav2vec2), would freezing the feature extractor be enough, or should I also add an nn.Linear layer?

I see that by design trainer.py finetunes on TIMIT, but I've also seen in another issue that it's actually being trained from scratch (not sure if I'm missing something here).

Thanks!

arxyzan commented 2 years ago

Hello Rafael, I'm glad you find this repo useful.

To finetune Data2Vec's encoder for any modality, you just have to freeze the encoder itself (e.g. wav2vec2) and append head layers suited to your specific task. To be clear, Data2Vec is only needed for pretraining; finetuning is done directly on the encoder model. That means you take the encoder part of the trained Data2Vec model (which in your case would be a transformers.models.Wav2Vec2Model instance) and finetune it on a downstream task.

The trainer.py files do the pretraining part. Because the original dataset used to pretrain wav2vec2 is the 960-hour LibriSpeech corpus (80 GB or so!), I couldn't use it and went with TIMIT instead, but I recommend editing the audio/dataset.py file to use librispeech_asr from HuggingFace datasets.
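For concreteness, here is a minimal sketch of that recipe (not code from this repo): the checkpoint path, the mean-pooling, and the linear head are illustrative placeholders.

```python
# Minimal finetuning sketch, not code from this repo: the checkpoint path,
# the mean-pooling, and the linear head are illustrative placeholders.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

# Encoder recovered from a pretrained Data2Vec model (hypothetical path).
encoder = Wav2Vec2Model.from_pretrained("path/to/pretrained-encoder")

# Freeze the encoder; only the head below gets trained.
for param in encoder.parameters():
    param.requires_grad = False

class AudioClassifier(nn.Module):
    def __init__(self, encoder, num_classes):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(encoder.config.hidden_size, num_classes)

    def forward(self, input_values):
        # Mean-pool the encoder's hidden states over time, then classify.
        hidden = self.encoder(input_values).last_hidden_state
        return self.head(hidden.mean(dim=1))

model = AudioClassifier(encoder, num_classes=10)  # num_classes depends on your task
```

And if you want to swap TIMIT for LibriSpeech, loading librispeech_asr from HuggingFace datasets would look roughly like this (config and split names may need adjusting):

```python
from datasets import load_dataset

# e.g. the 100-hour clean training split; adjust config/split as needed
librispeech = load_dataset("librispeech_asr", "clean", split="train.100")
```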

IMPORTANT NOTE: When I started this project months ago, my goal was to make data2vec easily understandable and customizable, because models in fairseq are generally hard to understand and reproduce due to the nature of large-scale training. Data2Vec has since been implemented in HuggingFace, and the weights have been carefully ported from the fairseq version. The only issue is that the HF version does not take an arbitrary encoder as an argument for data2vec; they decided to use Wav2Vec2 for audio, BERT for text, and BEiT for vision, which might go against the original authors' intent (data2vec can be a wrapper around any encoder), and there is no training support in the HF version as of now. You can use it only for finetuning and inference (see this issue).
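For example, inference with the HF audio variant looks roughly like this, assuming the facebook/data2vec-audio-base-960h checkpoint (the silent waveform is only a placeholder input):

```python
# Rough inference sketch with the HuggingFace Data2Vec audio model.
# Assumes the "facebook/data2vec-audio-base-960h" checkpoint; the silent
# waveform below is only a placeholder input.
import numpy as np
import torch
from transformers import Wav2Vec2Processor, Data2VecAudioForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/data2vec-audio-base-960h")
model = Data2VecAudioForCTC.from_pretrained("facebook/data2vec-audio-base-960h")

speech = np.zeros(16_000, dtype=np.float32)  # placeholder: 1 s of silence at 16 kHz
inputs = processor(speech, sampling_rate=16_000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
```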

Best, Aryan