Paper: CCC-wav2vec 2.0: Clustering aided Cross Contrastive Self-Supervised Learning of Speech Representations. Published at IEEE SLT 2022 (arXiv link).
ccc-wav2vec 2.0 is a pre-training mechanism that uses clustering and an augmentation-based cross-contrastive loss as its self-supervised objective. Through the clustering module, we scale down the influence of those negative examples that are highly similar to the positive. The cross-contrastive loss is computed between the encoder output of the original sample and the quantizer output of its augmentation, and vice versa, bringing robustness to the pre-training strategy.
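A minimal sketch of the cross-contrastive idea in plain Python (function and variable names here are illustrative, not the repository's; the actual implementation operates on batched tensors inside fairseq):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: negative log-probability of the positive
    among the positive plus all negative candidates."""
    sims = [cosine(anchor, positive)] + [cosine(anchor, n) for n in negatives]
    logits = [s / temperature for s in sims]
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    return -math.log(exps[0] / sum(exps))

def cross_contrastive_loss(enc_orig, quant_orig, enc_aug, quant_aug,
                           negatives_orig, negatives_aug):
    """Cross-contrastive objective: the encoder output of the original
    sample is contrasted against the quantized targets of the
    augmentation, and vice versa."""
    loss_a = contrastive_loss(enc_orig, quant_aug, negatives_aug)
    loss_b = contrastive_loss(enc_aug, quant_orig, negatives_orig)
    return 0.5 * (loss_a + loss_b)
```

In the actual model the anchors are context-network outputs and the targets are quantizer outputs sampled over masked time steps; this sketch only conveys the crossing of the two views.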
Primary Contributions:
The ccc-wav2vec 2.0 BASE model pre-trained on LibriSpeech-960h has been evaluated on multiple downstream tasks from the SUPERB benchmark. The proposed method comprehensively outperforms the baseline wav2vec 2.0 BASE model across the array of downstream tasks presented in SUPERB.
The WERs specified are without the use of any language model.
| Model | Pre-training data | Fine-tuning data | Model Links | WER (test-clean) | WER (test-other) |
| --- | --- | --- | --- | --- | --- |
| wav2vec 2.0 Base | LibriSpeech-360h | No fine-tuning | fairseq, huggingface | --- | --- |
| wav2vec 2.0 Base | LibriSpeech-360h | LibriSpeech-100h | fairseq, huggingface | 12.8 | 31.7 |
| ccc-wav2vec 2.0 Base | LibriSpeech-360h | No fine-tuning | fairseq, huggingface | --- | --- |
| ccc-wav2vec 2.0 Base | LibriSpeech-360h | LibriSpeech-100h | fairseq, huggingface | 10.8 | 27.7 |
| ccc-wav2vec 2.0 Base | LibriSpeech-960h | No fine-tuning | fairseq, huggingface | --- | --- |
| ccc-wav2vec 2.0 Base | LibriSpeech-960h | LibriSpeech-100h | fairseq, huggingface | 5.5 | 12.4 |
| ccc-wav2vec 2.0 Base SUPERB | LibriSpeech-960h | No fine-tuning | fairseq SUPERB model, huggingface SUPERB model | --- | --- |
To set up the repository, clone it and install fairseq in editable mode:

```
git clone https://github.com/Speech-Lab-IITM/CCC-wav2vec-2.0
cd fairseq
pip install --editable ./
```
For faster training, install NVIDIA's apex library:

```
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./
```
For large datasets, install PyArrow: `pip install pyarrow`.

If you use Docker, make sure to increase the shared memory size, either with `--ipc=host` or `--shm-size`, as command-line options to `nvidia-docker run`.
For the augmentations to work, install torchaudio-augmentations:

```
git clone https://github.com/Speech-Lab-IITM/torchaudio-augmentations
cd torchaudio-augmentations
pip install --editable ./
```
The clustering module runs on GPU and requires fast-pytorch-kmeans to be installed: `pip install fast-pytorch-kmeans`.
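A toy illustration of the clustering module's role: negatives that fall into the same cluster as the positive have their similarity scaled down, reducing their influence on the contrastive loss. This is plain Python with a hypothetical `scale_factor` argument; the repository performs the clustering on GPU with fast-pytorch-kmeans:

```python
import math

def assign_cluster(x, centroids):
    """Nearest-centroid assignment (the role k-means clustering plays here)."""
    distances = [math.dist(x, c) for c in centroids]
    return distances.index(min(distances))

def downweight_negatives(pos, negatives, centroids, neg_sims, scale_factor=0.5):
    """Scale down the similarities of negatives that share the positive's
    cluster, so near-duplicates of the positive contribute less to the loss.
    `scale_factor` is an illustrative knob, not the repository's exact one."""
    pos_cluster = assign_cluster(pos, centroids)
    scaled = []
    for neg, sim in zip(negatives, neg_sims):
        if assign_cluster(neg, centroids) == pos_cluster:
            scaled.append(sim * scale_factor)  # same cluster: down-weight
        else:
            scaled.append(sim)                 # different cluster: keep as-is
    return scaled
```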
Parameters of interest:

- The `cc_weights` parameter can be modified in the `criterion` section of the pre-training configs.
- The `cluster_factor` and `scale_factor` parameters can be modified in the `model` section of the pre-training configs.
- The path to the MUSAN noise set is set through the `path_to_musan_noise_set` variable in the `__getitem__` method of the `raw_audio_dataset` file.
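A hypothetical sketch of where these knobs sit in a Hydra-style fairseq pre-training config; the section names follow the text above, but the exact file layout may differ and the values here are placeholders, not the paper's:

```yaml
# Illustrative fragment only; consult the configs shipped with the repository.
criterion:
  _name: wav2vec
  cc_weights: ...       # weights of the cross-contrastive loss terms

model:
  _name: wav2vec2
  cluster_factor: ...   # clustering hyperparameter (see the paper)
  scale_factor: ...     # scaling applied to down-weight similar negatives
```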