facebookresearch / av_hubert

A self-supervised learning framework for audio-visual speech

Finetuning Models for Visual Speech Recognition #63

Open david-gimeno opened 1 year ago

david-gimeno commented 1 year ago

Hello,

I was trying to load a finetuned model for the VSR task. I followed the instructions in the repository and the Jupyter notebook (below you can see that I tried to import modules from the avhubert path). Here is my script:

import fairseq
from argparse import Namespace
from fairseq import checkpoint_utils, options, tasks, utils
import hubert_pretraining, hubert

def load_model(ckpt_path, user_dir):
    utils.import_user_module(Namespace(user_dir=user_dir))
    models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
    #model = models[0]
    #print(model)

if __name__ == "__main__":
    user_dir = "../../av_hubert/avhubert/"
    ckpt_path = "./base_vox_433h.pt"
    load_model(ckpt_path, user_dir)

I ran the script and got the following error:

""" ... models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([ckpt_path]) File "/home/dgimeno/phd/av_hubert/fairseq/fairseq/checkpoint_utils.py", line 446, in load_model_ensemble_and_task model = task.build_model(cfg.model) File "/home/dgimeno/phd/av_hubert/fairseq/fairseq/tasks/fairseq_task.py", line 324, in build_model model = models.build_model(cfg, self) File "/home/dgimeno/phd/av_hubert/fairseq/fairseq/models/init.py", line 88, in build_model assert model is not None, ( AssertionError: Could not infer model type from {'_name': 'av_hubert_seq2seq', 'w2v_path': '/check ...

...

Available models: dict_keys(['wav2vec', 'wav2vec2', 'wav2vec_ctc', 'wav2vec_seq2seq', 'hubert', 'hubert_ctc', 'transformer_lm', 'av_hubert']) Requested model type: av_hubert_seq2seq """

How should I load a model for VSR? Thanks in advance,

David.

chevalierNoir commented 1 year ago

Hi,

To load a finetuned model, you should import hubert_asr: import hubert_asr.
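For reference, a minimal sketch of the corrected loading script (the checkpoint path and user_dir are the ones from the original post; it assumes the script is run from inside the avhubert directory so the local modules resolve):

from argparse import Namespace
from fairseq import checkpoint_utils, utils
import hubert_pretraining, hubert, hubert_asr  # hubert_asr registers the finetuned seq2seq model

def load_finetuned_model(ckpt_path, user_dir):
    # Register the avhubert user modules with fairseq before loading the checkpoint
    utils.import_user_module(Namespace(user_dir=user_dir))
    models, cfg, task = checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
    return models[0], cfg, task

if __name__ == "__main__":
    model, cfg, task = load_finetuned_model("./base_vox_433h.pt", "../../av_hubert/avhubert/")
    print(model)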

david-gimeno commented 1 year ago

Thanks so much for your reply; it solved the problem! Nonetheless, I have a new question :)

· First of all, let me explain my purpose. I am working with a different audiovisual database than LRS3-TED. I would like to use your pretrained AV-HuBERT model for Visual Speech Recognition and then fine-tune it on my database.

· According to the repository's instructions, I should use, if I am not mistaken, the following command:

fairseq-hydra-train --config-dir /path/to/conf/ --config-name conf-name \
  task.data=/path/to/data task.label_dir=/path/to/label \
  task.tokenizer_bpe_model=/path/to/tokenizer model.w2v_path=/path/to/checkpoint \
  hydra.run.dir=/path/to/experiment/finetune/ common.user_dir=`pwd`

· The configurations are provided in the repository, which is perfect. After inspecting the code, I managed to understand the ".tsv" and ".wrd" structure. So, the only question I have now is:

where is the tokenizer_bpe_model?

I am looking forward to your comments. Thanks in advance,

David.

chevalierNoir commented 1 year ago

Hi,

tokenizer_bpe_model is the sentencepiece model for subword units, which is generated by the function gen_vocab. It is trained on the training text (i.e., train.wrd). Here is the example for LRS3. Once you generate it, the *.model file is the tokenizer that you should pass to task.tokenizer_bpe_model.
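For illustration, a minimal sketch of training such a tokenizer with the sentencepiece library directly (paths and the vocabulary size of 1000 are placeholders; this should be roughly what gen_vocab does under the hood):

import sentencepiece as spm

# Train a unigram subword model on the training transcripts (train.wrd);
# paths and vocab_size here are placeholders for illustration.
spm.SentencePieceTrainer.train(
    input="/path/to/labels/train.wrd",
    model_prefix="/path/to/spm/spm_unigram1000",
    vocab_size=1000,
    model_type="unigram",
    character_coverage=1.0,
)
# The resulting spm_unigram1000.model is what you pass as task.tokenizer_bpe_model.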

david-gimeno commented 1 year ago

Before exploring the AV-HuBERT system with my own database, I wanted to see if I could reach similar performance on the LRS3 database.

The point is that I had already prepared the LRS3 database myself; I mean, I did not use your database preparation scripts. Then, I ran the infer_s2s.py script and everything works; I am getting around 34% WER. How is this possible if I have not defined the tokenizer? Or is it built on the fly?

On the other hand, I read in the paper "Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction" that AV-HuBERT was fine-tuned using a Transformer-based decoder (with a vocabulary of 1000 subword units) and a CTC layer (with a 46-character vocabulary). My question is whether your GitHub repository only provides scripts for S2S decoding, or whether the CTC layer is used as well.

chevalierNoir commented 1 year ago
  1. The tokenizer is saved in the fine-tuning checkpoint and will be loaded when you run inference.
  2. In our paper, we tried two types of fine-tuning: (a) CTC, (b) S2S, where S2S outperforms CTC. In the repo, we use S2S for finetuning given its better performance and simpler decoding process.
david-gimeno commented 1 year ago

Thank you for clarifying these aspects!! Now, I would like to ask how I can fine-tune this pre-trained model. I ran this command:

fairseq-hydra-train --config-dir ${PWD}/conf/finetune/ --config-name base_vox_433h.yaml \
  task.data=${PWD}/data/LRS3-TED/speaker-independent/ \
  task.label_dir=${PWD}/data/LRS3-TED/speaker-independent/ \
  task.tokenizer_bpe_model=/checkpoint/bshi/data/lrs3//lang/spm/spm_unigram1000.model \
  model.w2v_path=${PWD}/base_vox_433h.pt \
  hydra.run.dir=${PWD}/exp/LRS3-TED/ \
  common.user_dir=${PWD}

But I am getting the following error:

Traceback (most recent call last):
  File "/home/dgimeno/phd/av_hubert/fairseq/fairseq_cli/hydra_train.py", line 45, in hydra_main
    distributed_utils.call_main(cfg, pre_main)
  File "/home/dgimeno/phd/av_hubert/fairseq/fairseq/distributed/utils.py", line 369, in call_main
    main(cfg, **kwargs)
  File "/home/dgimeno/phd/av_hubert/fairseq/fairseq_cli/train.py", line 97, in main
    model = task.build_model(cfg.model)
  File "/home/dgimeno/phd/av_hubert/fairseq/fairseq/tasks/fairseq_task.py", line 324, in build_model
    model = models.build_model(cfg, self)
  File "/home/dgimeno/phd/av_hubert/fairseq/fairseq/models/__init__.py", line 96, in build_model
    return model.build_model(cfg, task)
  File "/home/dgimeno/phd/av_hubert/avhubert/hubert_asr.py", line 474, in build_model
    del state['model']['mask_emb']
KeyError: 'mask_emb'

The purpose is to use an audiovisual database other than LRS3 and adapt the model to that corpus, but for the moment I wanted to check that I was able to run the scripts.

I guess that this script expects an AV-HuBERT checkpoint from the previous step, I mean, the pre-trained AV-HuBERT. But what I want to do is fine-tune (on a new database) an already fine-tuned (on the LRS3 database) AV-HuBERT model for VSR. Would this be possible with the scripts you provide in the repository? The reason for this double fine-tuning process is that my new database has little data.

chevalierNoir commented 1 year ago

Fine-tuning a fine-tuned model cannot be done by simply changing the arguments of the finetuning command. However, it is doable with a few modifications to the code. Basically, you should load the fine-tuned model weights into your model at initialization by adding the following code here:

# Load the fine-tuned checkpoint onto CPU and copy its weights into the freshly built model
state = checkpoint_utils.load_checkpoint_to_cpu(cfg.ft_ckpt_path)
self.load_state_dict(state["model"])

where cfg.ft_ckpt_path is the path of the fine-tuned model, which you should add to the configuration:

ft_ckpt_path: str = field(default="", metadata={"help": "path to the finetuned checkpoint"})

Now, use the original command by appending +model.ft_ckpt_path=/path/to/finetune-checkpoint. Note that the dimensions of some weight matrices in your new model (e.g., the output linear projection) may differ from those in the fine-tuned checkpoint you use. Thus you should take care of such inconsistencies in the checkpoint loading above (i.e., self.load_state_dict(state["model"])).
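For example, the finetuning command from above would then look like this (all paths are placeholders):

fairseq-hydra-train --config-dir /path/to/conf/ --config-name conf-name \
  task.data=/path/to/data task.label_dir=/path/to/label \
  task.tokenizer_bpe_model=/path/to/tokenizer model.w2v_path=/path/to/checkpoint \
  hydra.run.dir=/path/to/experiment/finetune/ common.user_dir=`pwd` \
  +model.ft_ckpt_path=/path/to/finetune-checkpoint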

david-gimeno commented 1 year ago

I really appreciate everything you did for me. Thank you so much! As you indicated, the output linear projection layer had to be handled when loading the pre-trained parameters, since, depending on the database used (above all if you are dealing with another language, as in my case), the optimal vocabulary size of the tokenizer may be different. So, in order to solve this issue, and taking into account your previous response, I implemented the following code here:

# Load the fine-tuned checkpoint onto CPU
state = checkpoint_utils.load_checkpoint_to_cpu(cfg.ft_ckpt_path)

# Discard the decoder embedding and output projection, whose shapes depend on the tokenizer's vocabulary size
for key in state["model"].copy().keys():
    if key in ["decoder.embed_tokens.weight", "decoder.embed_out"]:
        state["model"].pop(key, None)

# strict=False loads the remaining weights and leaves the discarded ones randomly initialized
self.load_state_dict(state["model"], strict=False)
print("\nAVHubertSeq2Seq loaded from", cfg.ft_ckpt_path, "\n")

This allows us to load a fine-tuned VSR model based on AV-HuBERT. Thus, we are able to re-estimate this VSR model on a database other than the LRS3-TED corpus, since the loading process discards the output linear projection, whose dimensions depend on the vocabulary size defined by the tokenizer. In other words, we can define a tokenizer with a different vocabulary size for our new database.

Now, I only have to find out the best configuration to get state-of-the-art results on my database :) Time will tell. Thanks again. Best regards from Spain,

David.

nobel861017 commented 1 week ago

Do we still have to provide the model.w2v_path parameter in the command?