Outdated instructions to reproduce LID results with XLS-R

🐛 Bug

I am trying to reproduce the results of the Language Identification (LID) task with the XLS-R model on the VoxLingua107 dataset, but following the current instructions yields several errors.

More specifically, I can't run the first command, which according to the instructions is:

CUDA_VISIBLE_DEVICES=0 PYTHONPATH=. python3 examples/wav2vec/gen_audio_embedding.py \
    /fsx/data/VoxLingua107/manifest --path "/path/to/checkpoint.pt" \
    --task audio_classification --batch-size 90 --gen-subset test \
    --infer-manifest /fsx/data/VoxLingua107/manifest/test.tsv \
    --infer-xtimes 10 --infer-max-sample-size 160000 --output-path /tmp/tmp_voxling_infer.npz

For starters, the path to gen_audio_embedding.py should be examples/wav2vec/xlsr/scripts/gen_audio_embedding.py (and not examples/wav2vec/gen_audio_embedding.py).

Then, it seems that the audio_classification task no longer exists, so the script fails at this line: https://github.com/facebookresearch/fairseq/blob/main/examples/wav2vec/xlsr/scripts/gen_audio_embedding.py#L27

We can update the line to the following, but I am not sure whether this is correct:

from fairseq.tasks.audio_finetuning import LabelEncoder

After that, the task on the command line also has to change, so I changed it to audio_finetuning (but again, I am not sure whether this is right).

Even after these changes I still can't run the code. It fails with the following error:

> CUDA_VISIBLE_DEVICES=0 PYTHONPATH=. python3 gen_audio_embedding.py /fsx/data/VoxLingua107/manifest --path <path/to/xlsr_300m_voxlingua107_ft.pt> --task audio_finetuning --batch-size 90 --gen-subset test --infer-manifest /fsx/data/VoxLingua107/manifest/test.tsv --infer-manifest /fsx/data/VoxLingua107/manifest/test.tsv --infer-xtimes 10 --infer-max-sample-size 160000 --output-path /tmp/tmp_voxling_infer.npz
| loading model from ../../../../models/xlsr_300m_voxlingua107_ft.pt
Traceback (most recent call last):
  File "gen_audio_embedding.py", line 140, in <module>
    models, _model_args, task = checkpoint_utils.load_model_ensemble_and_task([args.path],
  File "/opt/conda/envs/fourierdev2/lib/python3.8/site-packages/fairseq/checkpoint_utils.py", line 436, in load_model_ensemble_and_task
    task = tasks.setup_task(cfg.task)
  File "/opt/conda/envs/fourierdev2/lib/python3.8/site-packages/fairseq/tasks/__init__.py", line 42, in setup_task
    assert (
AssertionError: Could not infer task type from {'_name': 'audio_classification' (...)
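
Judging by the traceback, the task name seems to be read from the configuration stored inside the checkpoint (cfg.task) rather than from the --task flag, which would explain why passing audio_finetuning makes no difference. Below is a minimal sketch for checking which task name a checkpoint records. The helper names (_lookup, stored_task_name) are my own and not part of fairseq, and the assumption that the loaded checkpoint is a dict with a "cfg" entry (or an older "args" namespace) is based on what I have seen in other fairseq checkpoints, not on the documentation.

```python
from argparse import Namespace
from collections.abc import Mapping
from typing import Any, Optional


def _lookup(obj: Any, key: str) -> Any:
    """Read `key` from a dict-like object or, failing that, as an attribute."""
    if isinstance(obj, Mapping):
        return obj.get(key)
    return getattr(obj, key, None)


def stored_task_name(state: Mapping) -> Optional[str]:
    """Return the task name recorded in an already-loaded checkpoint dict.

    Assumed layouts (hypothetical, not confirmed against the fairseq docs):
      - newer checkpoints: state["cfg"]["task"]["_name"]
      - older checkpoints: state["args"].task
    """
    task_cfg = _lookup(state.get("cfg"), "task")
    if isinstance(task_cfg, Namespace):
        # A legacy task config stored as a Namespace keeps its name in `.task`.
        return getattr(task_cfg, "task", None)
    if task_cfg is not None:
        return _lookup(task_cfg, "_name")
    return _lookup(state.get("args"), "task")


# Usage (needs torch and the downloaded checkpoint; the path is a placeholder,
# and unpickling may require fairseq to be importable):
#   import torch
#   state = torch.load("/path/to/xlsr_300m_voxlingua107_ft.pt", map_location="cpu")
#   print(stored_task_name(state))
```

Given the error message above, I would expect this to report audio_classification for the released checkpoint, but I have not confirmed that editing the stored name is enough to make the model load.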

Additionally, it is not clear where to obtain the manifest or the test.tsv files from the VoxLingua107 dataset. Could you please clarify?
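
In the meantime, my working assumption is that test.tsv follows the usual wav2vec manifest layout: the audio root directory on the first line, then one line per file with the relative path and the number of samples, separated by a tab. The sketch below builds such a file using only the standard library. The function name write_manifest is hypothetical, it assumes the audio is stored as .wav files, and it does not produce the label file that the task configuration seems to expect ('labels': 'label'), since I don't know its format.

```python
import wave
from pathlib import Path


def write_manifest(audio_root: str, tsv_path: str) -> int:
    """Write a wav2vec-style TSV manifest and return the number of files listed.

    Layout (assumed to match what the LID script expects):
      line 1:  absolute path of the audio root
      line 2+: <path relative to root>\t<number of frames>
    """
    root = Path(audio_root).resolve()
    rows = []
    for wav_path in sorted(root.rglob("*.wav")):
        # The wave module reads the frame count from the header only.
        with wave.open(str(wav_path), "rb") as wav_file:
            n_frames = wav_file.getnframes()
        rows.append(f"{wav_path.relative_to(root).as_posix()}\t{n_frames}")

    with open(tsv_path, "w", encoding="utf-8") as out:
        out.write(f"{root}\n")
        for row in rows:
            out.write(row + "\n")
    return len(rows)


# Usage (paths are placeholders):
#   write_manifest("/path/to/VoxLingua107/test_audio", "/path/to/manifest/test.tsv")
```

If the official manifest differs from this (for example in how the language labels are attached), a pointer to the correct format or to the original files would be very helpful.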

Thanks!

To Reproduce

Steps to reproduce the behavior:

1. Update fairseq to 0.12.2
2. Download the XLS-R 300M LID model: https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr_300m_voxlingua107_ft.pt
3. Run the following command:

> CUDA_VISIBLE_DEVICES=0 PYTHONPATH=. python3 gen_audio_embedding.py /fsx/data/VoxLingua107/manifest --path <path/to/xlsr_300m_voxlingua107_ft.pt> --task audio_finetuning --batch-size 90 --gen-subset test --infer-manifest /fsx/data/VoxLingua107/manifest/test.tsv --infer-manifest /fsx/data/VoxLingua107/manifest/test.tsv --infer-xtimes 10 --infer-max-sample-size 160000 --output-path /tmp/tmp_voxling_infer.npz

See error:

| loading model from ../../../../models/xlsr_300m_voxlingua107_ft.pt
Traceback (most recent call last):
  File "gen_audio_embedding.py", line 140, in <module>
    models, _model_args, task = checkpoint_utils.load_model_ensemble_and_task([args.path],
  File "/opt/conda/envs/fourierdev2/lib/python3.8/site-packages/fairseq/checkpoint_utils.py", line 436, in load_model_ensemble_and_task
    task = tasks.setup_task(cfg.task)
  File "/opt/conda/envs/fourierdev2/lib/python3.8/site-packages/fairseq/tasks/__init__.py", line 42, in setup_task
    assert (
AssertionError: Could not infer task type from {'_name': 'audio_classification', 'data': '/fsx/data/VoxLingua107/manifest/', 'labels': 'label', 'sample_rate': 16000, 'normalize': True, 'enable_padding': False, 'max_sample_size': 160000, 'min_sample_size': 16000, 'multiple_train_files': False, 'num_batch_buckets': 0, 'precompute_mask_indices': False, 'mask_length': 10, 'mask_prob': 0.5, 'mask_selection': 'static', 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 64, 'mask_channel_prob': 0.1, 'mask_channel_selection': 'static', 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_feature_layers': '[(512, 10, 5)] + [(512, 3, 2)] * 4 + [(512,2,2)] + [(512,2,2)]', 'encoder_embed_dim': 768, 'tpu': False}. Available argparse tasks: dict_keys(['translation_multi_simple_epoch', 'legacy_masked_lm', 'sentence_ranking', 'speech_to_text', 'text_to_speech', 'frm_text_to_speech', 'language_modeling', 'translation', 'hubert_pretraining', 'translation_from_pretrained_bart', 'denoising', 'translation_from_pretrained_xlm', 'sentence_prediction', 'sentence_prediction_adapters', 'speech_unit_modeling', 'multilingual_masked_lm', 'online_backtranslation', 'audio_pretraining', 'audio_finetuning', 'speech_to_speech', 'multilingual_denoising', 'simul_speech_to_text', 'simul_text_to_text', 'multilingual_translation', 'semisupervised_translation', 'cross_lingual_lm', 'multilingual_language_modeling', 'translation_lev', 'masked_lm', 'dummy_lm', 'dummy_masked_lm', 'dummy_mt']). Available hydra tasks: dict_keys(['language_modeling', 'translation', 'hubert_pretraining', 'translation_from_pretrained_xlm', 'sentence_prediction', 'sentence_prediction_adapters', 'speech_unit_modeling', 'audio_pretraining', 'audio_finetuning', 'simul_text_to_text', 'multilingual_language_modeling', 'translation_lev', 'masked_lm', 'dummy_lm', 'dummy_masked_lm'])

Expected behavior

The logits/embeddings from the XLS-R model for the VoxLingua107 dataset should be extracted and written to /tmp/tmp_voxling_infer.npz.

Environment

fairseq Version (e.g., 1.0 or main): 0.12.2
PyTorch Version (e.g., 1.0): 1.12.1+cu102
OS (e.g., Linux): x86_64 GNU/Linux
How you installed fairseq (pip, source): pip
Build command you used (if compiling from source):
Python version: 3.8.10
CUDA/cuDNN version: 11.2
GPU models and configuration: four V100s
Any other relevant information:
