facebookresearch / av_hubert

A self-supervised learning framework for audio-visual speech

How to load a pre-trained AVHuBERT? (problems after following the instructions) #106

Open CCTN-BCI opened 9 months ago

CCTN-BCI commented 9 months ago

I followed the instructions in README.md to load a pre-trained model:

$ cd avhubert
$ python
>>> import fairseq
>>> import hubert_pretraining, hubert
>>> ckpt_path = "/path/to/the/checkpoint.pt"
>>> models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
>>> model = models[0]

The error is as follows:

omegaconf.errors.ConfigKeyError: Key 'input_modality' not in 'AVHubertPretrainingConfig'
    full_key: input_modality
    reference_type=Optional[AVHubertPretrainingConfig]
    object_type=AVHubertPretrainingConfig

How should I remove the argument 'input_modality' (or take whatever other steps are necessary)? Thank you very much!

I hit this problem on a freshly installed Ubuntu 22.04 with fairseq correctly installed.
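
For instance, would something like this be a sane way to strip the offending key from the saved config? (A rough sketch on my part; it assumes the checkpoint stores an omegaconf config under 'cfg', and the output path is just a placeholder.)

import torch
from omegaconf import open_dict

ckpt_path = "/path/to/the/checkpoint.pt"
ckpt = torch.load(ckpt_path, map_location="cpu")

# open_dict temporarily lifts omegaconf's struct mode so the key can be removed.
with open_dict(ckpt["cfg"]["task"]):
    ckpt["cfg"]["task"].pop("input_modality", None)

torch.save(ckpt, "/path/to/checkpoint_fixed.pt")  # placeholder output path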

chevalierNoir commented 9 months ago

Which checkpoint did you load? These commands should work for pre-trained checkpoints. For fine-tuned checkpoints, please refer to the colab notebook in the repo.

CCTN-BCI commented 9 months ago

> Which checkpoint did you load? These commands should work for pre-trained checkpoints. For fine-tuned checkpoints, please refer to the colab notebook in the repo.

I loaded the exact model used in the colab notebook: https://dl.fbaipublicfiles.com/avhubert/model/lrs3_vox/vsr/base_vox_433h.pt. I hit no errors before or during the "extract mouth ROI" step.

chevalierNoir commented 9 months ago

Note that the checkpoint you listed is fine-tuned, so it shouldn't be loaded with the Python commands pasted above.
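
If you're unsure which kind a checkpoint is, you can inspect the saved config directly. A quick sketch (assuming the checkpoint stores an omegaconf config under 'cfg'; the path is a placeholder):

import torch

ckpt = torch.load("/path/to/the/checkpoint.pt", map_location="cpu")
# The registered model/task names differ between pre-trained and fine-tuned checkpoints.
print(ckpt["cfg"].model._name)
print(ckpt["cfg"].task._name)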

Nisarg-MARZ commented 9 months ago

I'm facing the same issue and don't see a solution from following the colab notebook. I'm trying to load the fine-tuned AV-HuBERT model in my own project; I installed it into my Docker image. I'm unsure why the problem shows up when used this way but not in the colab notebook: I've copied the entire code block and still get the above error.

import cv2
import tempfile
from argparse import Namespace
import fairseq
from fairseq import checkpoint_utils, options, tasks, utils
from fairseq.dataclass.configs import GenerationConfig
from IPython.display import HTML  # colab-only; unused in this block

def predict(video_path, ckpt_path, user_dir):
  # Count frames so the manifest can record the video length and the matching
  # audio length (AV-HuBERT assumes 25 fps video and 16 kHz audio).
  num_frames = int(cv2.VideoCapture(video_path).get(cv2.CAP_PROP_FRAME_COUNT))
  # Write a throwaway data dir with a one-line tsv manifest and a dummy label.
  data_dir = tempfile.mkdtemp()
  tsv_cont = ["/\n", f"test-0\t{video_path}\t{None}\t{num_frames}\t{int(16_000*num_frames/25)}\n"]
  label_cont = ["DUMMY\n"]
  with open(f"{data_dir}/test.tsv", "w") as fo:
    fo.write("".join(tsv_cont))
  with open(f"{data_dir}/test.wrd", "w") as fo:
    fo.write("".join(label_cont))
  # Register the AV-HuBERT user modules (task/model) with fairseq.
  utils.import_user_module(Namespace(user_dir=user_dir))
  modalities = ["video"]  # video-only inference (lip reading)
  gen_subset = "test"
  gen_cfg = GenerationConfig(beam=20)
  models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
  models = [model.eval().cuda() for model in models]
  # Repoint the saved task config at the temporary manifest and rebuild the task.
  saved_cfg.task.modalities = modalities
  saved_cfg.task.data = data_dir
  saved_cfg.task.label_dir = data_dir
  task = tasks.setup_task(saved_cfg.task)
  task.load_dataset(gen_subset, task_cfg=saved_cfg.task)
  generator = task.build_generator(models, gen_cfg)

  def decode_fn(x):
      # Map token ids back to text, stripping pad and other special symbols.
      dictionary = task.target_dictionary
      symbols_ignore = generator.symbols_to_strip_from_output
      symbols_ignore.add(dictionary.pad())
      return task.datasets[gen_subset].label_processors[0].decode(x, symbols_ignore)

  itr = task.get_batch_iterator(dataset=task.dataset(gen_subset)).next_epoch_itr(shuffle=False)
  sample = next(itr)
  sample = utils.move_to_cuda(sample)
  hypos = task.inference_step(generator, models, sample)
  ref = decode_fn(sample['target'][0].int().cpu())  # decodes the dummy label
  hypo = hypos[0][0]['tokens'].int().cpu()
  hypo = decode_fn(hypo)
  return hypo

mouth_roi_path, ckpt_path = "/content/data/roi.mp4", "/content/data/finetune-model.pt"
user_dir = "/content/av_hubert/avhubert"
hypo = predict(mouth_roi_path, ckpt_path, user_dir)

Update: this error occurs even when I run from the av_hubert repo directly, and also with a pre-trained checkpoint instead of the fine-tuned one (e.g. https://dl.fbaipublicfiles.com/avhubert/model/lrs3_vox/clean-pretrain/large_vox_iter5.pt), though with a slightly different key:

omegaconf.errors.ConfigAttributeError: Key 'required_seq_len_multiple' not in 'AVHubertConfig'
    full_key: required_seq_len_multiple
    reference_type=Optional[AVHubertConfig]
    object_type=AVHubertConfig
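
Could this be a fairseq version mismatch? A quick check (a sketch; the attribute name is taken from the traceback above):

import fairseq
print(fairseq.__version__)

# Newer fairseq releases define this key on their own wav2vec2 config; if this
# prints True, the installed fairseq is likely newer than what AVHubertConfig expects.
from fairseq.models.wav2vec.wav2vec2 import Wav2Vec2Config
print(hasattr(Wav2Vec2Config, "required_seq_len_multiple"))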

chevalierNoir commented 9 months ago

> I'm facing the same issue and don't see a solution from following the colab notebook. [...]
Haven't checked the pasted code block yet but the colab notebook runs fine for me.

Xuan-MARZ commented 8 months ago

Solution:

pip install numpy==1.23.5
pip install git+https://github.com/facebookresearch/fairseq.git@afc77bd#egg=fairseq
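
With those versions pinned, the README loading snippet should work again. A quick sanity check (run from the avhubert/ directory so the user modules resolve; the path is a placeholder):

import fairseq
import hubert_pretraining, hubert  # registers the AV-HuBERT task/model with fairseq

ckpt_path = "/path/to/the/checkpoint.pt"
models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
print(type(models[0]))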