X-LANCE / SLAM-LLM

Speech, Language, Audio, Music Processing with Large Language Model

Mismatch Issue in the EAT Checkpoint Dictionary for the AAC Inference Task #97

Closed: RookieJunChen closed this issue 2 months ago

RookieJunChen commented 3 months ago

System Info

Consistent with the official repository's environment requirements


🐛 Describe the bug

I encountered a bug during inference; the full error message is reproduced in the Error logs section below.


After pinpointing the issue, I found that the problem occurs at SLAM-LLM/src/slam_llm/models/encoder.py, line 77, in load:

EATEncoder, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([model_config.encoder_path])

Further analysis revealed that the mismatch described in the error log arises because checkpoint['cfg']['task'] in the pre-trained EAT checkpoint I downloaded from the link in this repository does not match any task registered by the fairseq code.
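
For reference, a minimal sketch of how the stored task config can be inspected (the checkpoint path below is a placeholder for wherever the downloaded file lives):

# Inspect the task config embedded in the EAT checkpoint.
# Placeholder path; point it at the downloaded checkpoint file.
import torch

ckpt = torch.load("/path/to/EAT_checkpoint.pt", map_location="cpu")
print(ckpt["cfg"]["task"]["_name"])  # prints: mae_image_classification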

How should I modify the dictionary values in the EAT checkpoint to ensure it runs correctly?

Error logs

AssertionError: Could not infer task type from {'_name': 'mae_image_classification', 'data': '/hpc_stor03/sjtu_home/wenxi.chen/mydata/audio/AS2M', 'multi_data': None, 'input_size': 224, 'local_cache_path': None, 'key': 'imgs', 'beit_transforms': False, 'target_transform': False, 'no_transform': False, 'rebuild_batches': True, 'precompute_mask_config': None, 'subsample': 1.0, 'seed': 1, 'dataset_type': 'imagefolder', 'audio_mae': True, 'h5_format': True, 'downsr_16hz': True, 'target_length': 1024, 'flexible_mask': False, 'esc50_eval': False, 'spcv2_eval': False, 'AS2M_finetune': True, 'spcv1_finetune': False, 'roll_aug': True, 'noise': False, 'weights_file': '/hpc_stor03/sjtu_home/wenxi.chen/mydata/audio/AS2M/weight_train_all.csv', 'num_samples': 200000, 'is_finetuning': False, 'label_descriptors': 'label_descriptors.csv', 'labels': 'lbl'}. Available argparse tasks: dict_keys(['sentence_prediction', 'hubert_pretraining', 'speech_unit_modeling', 'translation', 'online_backtranslation', 'language_modeling', 'speech_to_text', 'text_to_speech', 'cross_lingual_lm', 'translation_multi_simple_epoch', 'denoising', 'multilingual_denoising', 'multilingual_translation', 'legacy_masked_lm', 'masked_lm', 'sentence_prediction_adapters', 'sentence_ranking', 'translation_from_pretrained_bart', 'speech_to_speech', 'translation_from_pretrained_xlm', 'multilingual_masked_lm', 'frm_text_to_speech', 'audio_pretraining', 'audio_finetuning', 'multilingual_language_modeling', 'translation_lev', 'simul_speech_to_text', 'simul_text_to_text', 'semisupervised_translation', 'dummy_lm', 'dummy_masked_lm', 'dummy_mt']). Available hydra tasks: dict_keys(['sentence_prediction', 'hubert_pretraining', 'speech_unit_modeling', 'translation', 'language_modeling', 'masked_lm', 'sentence_prediction_adapters', 'translation_from_pretrained_xlm', 'audio_pretraining', 'audio_finetuning', 'multilingual_language_modeling', 'translation_lev', 'simul_text_to_text', 'dummy_lm', 'dummy_masked_lm'])

Expected behavior

An EAT checkpoint that correctly matches the inference script and code.

cwx-worst-one commented 3 months ago

Have you set up the relevant environment according to the EAT repository configuration? For example:

git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./    # editable install so fairseq picks up local modules
git clone https://github.com/cwx-worst-one/EAT    # clone EAT inside the fairseq directory
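
If the environment is set up, a quick standalone check, independent of SLAM-LLM, can confirm whether the failure is really a missing task registration (the checkpoint path is a placeholder):

# Try loading the checkpoint with plain fairseq, outside SLAM-LLM.
# If this raises the same "Could not infer task type" AssertionError,
# the EAT task is simply not registered in your fairseq environment.
from fairseq import checkpoint_utils

models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    ["/path/to/EAT_checkpoint.pt"]
)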
RookieJunChen commented 3 months ago

Yes, I have set up the environment. The current issue appears to be as I analyzed: checkpoint['cfg']['task'] in EAT's checkpoint names the task mae_image_classification (the full dict is in the error log above), but the fairseq used by SLAM-LLM's inference code lists mae_image_classification in neither its available argparse tasks nor its available hydra tasks, so the task type cannot be inferred.

cwx-worst-one commented 3 months ago

It looks like your model_config.encoder_fairseq_dir is not specified correctly. Did you set it in the /path/to/EAT format, as mentioned in aac_vicuna_lora.yaml and aac_config.py?
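
For context, fairseq can only infer a custom task such as mae_image_classification after the module that registers it has been imported. A rough sketch of that mechanism (paths are placeholders, and this illustrates the idea rather than reproducing SLAM-LLM's exact code):

# Import EAT as a fairseq user module so its custom task is registered,
# then load the checkpoint; user_dir is what encoder_fairseq_dir points to.
from argparse import Namespace
from fairseq import checkpoint_utils, utils

utils.import_user_module(Namespace(user_dir="/path/to/fairseq/EAT"))
models, cfg, task = checkpoint_utils.load_model_ensemble_and_task(
    ["/path/to/EAT_checkpoint.pt"]
)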

RookieJunChen commented 3 months ago

I noticed that the model_config.encoder_fairseq_dir parameter is not set in the inference_eat_audiocaps.sh script. Can I simply add it directly to the script, or do I also need to make adjustments in aac_vicuna_lora.yaml and aac_config.py? Additionally, does /path/to/EAT refer to the path of the EAT checkpoint file itself, or to the directory containing it?

cwx-worst-one commented 3 months ago

Yes, you can directly modify the inference_eat_audiocaps.sh script by adding the model_config.encoder_fairseq_dir parameter. /path/to/EAT refers to the path where you have cloned the EAT repository; it should be located within the fairseq directory, i.e., /path/to/fairseq/EAT.

RookieJunChen commented 3 months ago

We reconfigured the environment and tried again. Initially, we set up the environment according to the EAT repository. However, we discovered a conflict: the SLAM-LLM repository requires hydra-core>=1.3.2, while the EAT repository specifies fairseq==0.12.2, which demands hydra-core<1.1.0. Downgrading hydra-core to satisfy fairseq makes SLAM-LLM fail with TypeError: main() got an unexpected keyword argument 'version_base' (the version_base keyword only exists in hydra-core >= 1.2). Currently, there appears to be no good solution to this conflict, and we are unable to proceed with the AAC inference process.

cwx-worst-one commented 3 months ago

Yes, there is indeed a version conflict here. However, the higher version of hydra-core (1.3.2) is backward compatible with code written for the lower versions. To resolve the conflict, first install the fairseq library and then install SLAM-LLM, making sure the hydra-core version you end up with is 1.3.2.
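
Assuming that install order, a quick sanity check of the resolved versions (a sketch; the expected values follow from the pins above):

# Verify the environment after installing fairseq first, SLAM-LLM second.
import fairseq
import hydra

print(hydra.__version__)    # should be 1.3.2 for SLAM-LLM
print(fairseq.__version__)  # e.g. 0.12.2, the version pinned by EAT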

ddlBoJack commented 3 months ago

Has the problem been solved?

RookieJunChen commented 3 months ago

Sorry, we've been experiencing some problems due to insufficient server space and are still making adjustments. We will continue to update you on any new problems or progress, and I will close this issue once everything is resolved.