
wav2vec 2.0 for audio classification #3006

Closed SerK0 closed 2 years ago

SerK0 commented 3 years ago

What is your question?

I have a pretrained wav2vec 2.0 model and labeled audio files (0/1 targets). Is there a way to adapt the base fine-tuning pipeline so that it fine-tunes for an audio classification task?

Is it a good idea to just take the embeddings after the feature_extractor, or from the context layer, pass them through a projection layer, and train with a BCE loss?
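For concreteness, the pipeline I have in mind looks roughly like this. This is only a sketch: the checkpoint path is hypothetical, and the assumption that extract_features returns a dict with an "x" key holds only for recent fairseq versions (older ones return a tuple):

```python
import torch
import torch.nn as nn
import fairseq

cp_path = "/path/to/wav2vec_small.pt"  # hypothetical checkpoint path
models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([cp_path])
wav2vec = models[0]

class BinaryAudioClassifier(nn.Module):
    def __init__(self, encoder, embed_dim=768):  # 768 = context dim of the base model
        super().__init__()
        self.encoder = encoder
        self.proj = nn.Linear(embed_dim, 1)  # projection head -> one logit

    def forward(self, waveform):
        # waveform: (batch, samples) raw 16 kHz audio
        out = self.encoder.extract_features(waveform, padding_mask=None, mask=False)
        x = out["x"]                  # (batch, frames, embed_dim) context embeddings
        pooled = x.mean(dim=1)        # mean-pool over the time axis
        return self.proj(pooled).squeeze(-1)

clf = BinaryAudioClassifier(wav2vec)
criterion = nn.BCEWithLogitsLoss()
audio = torch.randn(2, 16000)         # two 1-second dummy clips
labels = torch.tensor([0.0, 1.0])
loss = criterion(clf(audio), labels)
loss.backward()
```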

jasonppy commented 3 years ago

I wanted to do the same thing, but couldn't find a way to fine-tune it for a classification task.

On the other hand, I tried averaging the contextual embeddings from wav2vec 2.0 (I got the embeddings by simply calling the forward method; hopefully that is the correct way to do it), but the results were pretty bad. The contextual embeddings from wav2vec 1.0 (obtained via the feature_aggregator method), however, did a much better job. I wonder whether what I did for wav2vec 2.0 is correct.
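Concretely, the two extraction routes look roughly like this. The checkpoint paths are hypothetical, and the wav2vec 2.0 call again assumes a fairseq version where extract_features returns a dict with an "x" key:

```python
import torch
import fairseq

wav = torch.randn(1, 16000)  # 1 second of dummy 16 kHz audio

# wav2vec 1.0: conv feature extractor followed by the conv aggregator
m1, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task(["/path/to/wav2vec_large.pt"])
m1 = m1[0].eval()
z = m1.feature_extractor(wav)      # latent features  (batch, dim, frames)
c = m1.feature_aggregator(z)       # context features (batch, dim, frames)
emb_v1 = c.mean(dim=-1)            # utterance-level embedding

# wav2vec 2.0: transformer context network
m2, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task(["/path/to/wav2vec_small.pt"])
m2 = m2[0].eval()
with torch.no_grad():
    out = m2.extract_features(wav, padding_mask=None, mask=False)
emb_v2 = out["x"].mean(dim=1)      # average contextual embeddings over time
```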

Thanks!

SerK0 commented 3 years ago

@jasonppy Hi! I currently cut the pretrained wav2vec after the CNN extractor, added a pooling layer and a projection head, and trained it with a BCE loss using different LR schedulers. I found the results useful.
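Roughly, the head looks like this (a simplified sketch of what I described, not my exact code; the feature dimension assumes the base model and the scheduler is just one example):

```python
import torch
import torch.nn as nn

class ExtractorClassifier(nn.Module):
    def __init__(self, w2v_model, feat_dim=512):  # 512 conv channels in the base model
        super().__init__()
        self.extractor = w2v_model.feature_extractor  # CNN front-end only, no transformer
        self.proj_head = nn.Linear(feat_dim, 1)

    def forward(self, waveform):
        feats = self.extractor(waveform)   # (batch, feat_dim, frames)
        pooled = feats.mean(dim=-1)        # temporal mean pooling
        return self.proj_head(pooled).squeeze(-1)

# Training-loop pieces, with one scheduler choice as an example:
# clf = ExtractorClassifier(wav2vec)   # wav2vec loaded as in the sketch above
# optimizer = torch.optim.Adam(clf.parameters(), lr=1e-4)
# scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
# loss = nn.BCEWithLogitsLoss()(clf(audio), labels)
```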

dimejimudele commented 3 years ago

What wav2vec (and its variants like wav2vec 2.0 and vq-wav2vec) learns is the discrete latent embedding (i.e., the discrete encoder output). Thus, as @SerK0 rightly puts it here, you (@jasonppy) need to cut off the pretrained model after the extractor and then add the layers needed for your specific task on top. The aggregator only served to train the wav2vec model in a self-supervised way, in order to optimize a contrastive loss function. For example, if you want to perform binary classification, you should use the extractor to embed your audio signals into a lower-frequency feature space and then pass the extractor output into your classifier.

Correction: I was wrong here. The context embedding is what wav2vec 2.0 uses.

ketan0 commented 3 years ago

Does it make sense to completely cut out the transformer, though (referring specifically to wav2vec 2.0)? Wouldn't it have learned some useful representations through the masked pretraining?

jasonppy commented 3 years ago

> What wav2vec (and its variants like wav2vec 2.0 and vq-wav2vec) learns is the discrete latent embedding (i.e., the discrete encoder output). Thus, as @SerK0 rightly puts it here, you (@jasonppy) need to cut off the pretrained model after the extractor and then add the layers needed for your specific task on top. The aggregator only served to train the wav2vec model in a self-supervised way, in order to optimize a contrastive loss function. For example, if you want to perform binary classification, you should use the extractor to embed your audio signals into a lower-frequency feature space and then pass the extractor output into your classifier.

Thanks for your reply. This makes sense to me. I wanted to use contextualized embeddings because that is what NLP folks usually do with pretrained language models like BERT, and I thought contextualized embeddings could also be useful in this case, although the training objective is very different.

JNaranjo-Alcazar commented 3 years ago

Hi everybody, I am new to the field of NLP. What I understand from @dimejimudele is that wav2vec 2.0 does not provide an embedding (denoted as z in the wav2vec example) that can be used to train an audio classifier. Am I right?

Can someone give more insight into the difference between latent representations and contextualized embeddings? Why can wav2vec 2.0 not be used as a feature extractor?

Thanks in advance!

heibaidaolx123 commented 3 years ago

Hi guys, I'm wondering if anyone has tried pretraining wav2vec 2.0 on AudioSet?

JNaranjo-Alcazar commented 3 years ago

I wonder if wav2vec 1.0 has also been trained on AudioSet...

stale[bot] commented 3 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

JNaranjo-Alcazar commented 3 years ago

bump

cxcxcxcx commented 2 years ago

Please see https://github.com/m3hrdadfi/soxan and https://arxiv.org/pdf/2012.06185.pdf

It seems that averaging the transformer outputs and then applying a few layers on top works well when fine-tuning.

Simply extracting features from the convolutional layers makes the model prone to overfitting.
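A rough sketch of that approach using the Hugging Face port of wav2vec 2.0 (the soxan repo linked above builds a similar head on top of the transformers library): Wav2Vec2ForSequenceClassification mean-pools the transformer outputs and fine-tunes a small classification head on top.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

name = "facebook/wav2vec2-base"    # base checkpoint; the classification head is newly initialized
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(name)
model = Wav2Vec2ForSequenceClassification.from_pretrained(name, num_labels=2)

audio = torch.randn(16000).numpy()  # 1 second of dummy 16 kHz audio
inputs = feature_extractor(audio, sampling_rate=16000, return_tensors="pt")
logits = model(**inputs).logits     # (1, num_labels)
loss = torch.nn.functional.cross_entropy(logits, torch.tensor([1]))
loss.backward()                     # fine-tunes the whole stack end to end
```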

stale[bot] commented 2 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale[bot] commented 2 years ago

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!