I'm attempting to reproduce some of the results from the Data2Vec 2.0 paper, specifically the audio task results, using the recommended commands from the Data2Vec 2.0 README. Concretely, I've downloaded the data2vec Base model (no fine-tuning), downloaded the Libri-Light 10h data, and run libri_labels.py to obtain labels (I sketch that step right after the config). The config I'm using for fine-tuning is largely based on the vox_10h.yaml recommended in the README, with a couple of changes; see my full config below:
```yaml
# @package _group_

common:
  fp16: true
  log_format: json
  log_interval: 50
  log_file: /h/myusername/fairseq/logs/log.json

checkpoint:
  save_interval: 10
  save_interval_updates: 10000
  keep_interval_updates: 1
  no_epoch_checkpoints: true
  best_checkpoint_metric: wer

task:
  _name: audio_finetuning
  data: ???
  normalize: true
  labels: ltr

dataset:
  num_workers: 2
  max_tokens: 1280000
  skip_invalid_size_inputs_valid_test: true
  validate_after_updates: 0
  validate_interval: 1
  valid_subset: valid

distributed_training:
  ddp_backend: legacy_ddp
  distributed_world_size: 4

criterion:
  _name: ctc
  zero_infinity: true

optimization:
  max_update: 20000
  lr: [0.0001]
  sentence_avg: true
  update_freq: [5]

optimizer:
  _name: adam
  adam_betas: (0.9,0.98)
  adam_eps: 1e-08

lr_scheduler:
  _name: tri_stage
  phase_ratio: [0.1, 0.4, 0.5]
  final_lr_scale: 0.05

model:
  _name: wav2vec_ctc
  w2v_path: ???
  apply_mask: true
  mask_prob: 0.75
  mask_channel_prob: 0.25
  mask_channel_length: 64
  layerdrop: 0.1
  activation_dropout: 0.1
  feature_grad_mult: 0.0
  freeze_finetune_updates: 10000
```
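For completeness, this is roughly how I generated the label files. I'm reconstructing the command from memory of the wav2vec README, so treat the flags and paths below as placeholders rather than my literal invocation:

```bash
# Sketch of the label-generation step (paths are placeholders).
# libri_labels.py lives under examples/wav2vec in the fairseq repo.
split=train
python examples/wav2vec/libri_labels.py \
  /path/to/manifests/$split.tsv \
  --output-dir /path/to/manifests \
  --output-name $split
```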
And for reference, here is the command I run to fine-tune:
```bash
python fairseq_cli/hydra_train.py -m \
  --config-dir examples/wav2vec/config/finetuning \
  --config-name vox_10h_noisyD2Vaudio \
  +trainer.tensorboard_logdir=/h/myusername/fairseq/logs/tb/ \
  task.data=/h/addisonw/fairseq/manifests/finetuning_data10h \
  model.w2v_path=/h/myusername/fairseq/pretrained_models/base_libri.pt \
  common.user_dir=examples/data2vec
```

(`model.w2v_path` points at the pre-trained Base model I downloaded.)
When I run this, fine-tuning proceeds and I see the training loss and various other metrics being logged. My primary question is about getting WER metrics. Looking into audio_finetuning.py and AudioFinetuningConfig, I see that eval_wer is documented as applying only to Seq2Seq models, and I believe CTC fine-tuning of data2vec would not qualify. How did the authors obtain the WER values for their audio experiments?
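(I realize WER numbers are often produced offline with the decoding script rather than during training. If I remember the wav2vec README correctly, viterbi evaluation looks roughly like the sketch below, with the flags written from memory and possibly slightly off, but what I'd really like is WER reported during validation while fine-tuning.)

```bash
# Rough sketch of offline viterbi decoding, based on my reading of the
# wav2vec README; flags and paths are from memory and may need adjusting.
subset=dev_other
python examples/speech_recognition/infer.py /path/to/manifests \
  --task audio_finetuning --nbest 1 \
  --path /path/to/finetuned_checkpoint.pt \
  --gen-subset $subset --results-path /path/to/results \
  --w2l-decoder viterbi --criterion ctc --labels ltr \
  --max-tokens 4000000 --post-process letter
```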
EDIT: I decided to just try adding eval_wer, and it does run. However, I'm now getting a constant validation WER of 100, which suggests a mismatch between the labels and the predictions, i.e. they mean something different. Can @alexeib or another Data2Vec 2.0 contributor confirm whether the numbers in the paper come from fine-tuning with CTC to predict phones or characters?
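For reference, this is what I added to the task section of the config above. The field names are my reading of AudioFinetuningConfig, and `eval_wer_post_process: letter` is my assumption given that the .ltr labels are characters with `|` as the word boundary:

```yaml
# Additions to the task block of the config above (my assumption of the
# relevant AudioFinetuningConfig fields; defaults may differ).
task:
  _name: audio_finetuning
  data: ???
  normalize: true
  labels: ltr
  eval_wer: true
  eval_wer_post_process: letter
```

My current guess is that the mismatch comes from the letter-level targets not being collapsed back into words before scoring, but I may be wrong, so any pointers on the intended setup would be appreciated.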