facebookresearch / av_hubert

A self-supervised learning framework for audio-visual speech

How to extract audio-visual features? #38

Open zuujhyt opened 2 years ago

zuujhyt commented 2 years ago

Hi, thank you for the work and the Colab. In the Colab, the following code snippet shows how to extract visual features:

```python
# Imports as used in the Colab; avhubert_utils is avhubert/utils.py,
# imported from inside the av_hubert/avhubert directory.
import torch
from argparse import Namespace
from fairseq import checkpoint_utils, utils
import utils as avhubert_utils

def extract_visual_feature(video_path, ckpt_path, user_dir, is_finetune_ckpt=False):
  utils.import_user_module(Namespace(user_dir=user_dir))
  models, saved_cfg, task = checkpoint_utils.load_model_ensemble_and_task([ckpt_path])
  # Scale pixels to [0, 1], center-crop, then standardize with the task's stats
  transform = avhubert_utils.Compose([
      avhubert_utils.Normalize(0.0, 255.0),
      avhubert_utils.CenterCrop((task.cfg.image_crop_size, task.cfg.image_crop_size)),
      avhubert_utils.Normalize(task.cfg.image_mean, task.cfg.image_std)])
  frames = avhubert_utils.load_video(video_path)
  print(f"Load video {video_path}: shape {frames.shape}")
  frames = transform(frames)
  print(f"Center crop video to: {frames.shape}")
  # Add batch and channel dims: [T, H, W] -> [B=1, C=1, T, H, W]
  frames = torch.FloatTensor(frames).unsqueeze(dim=0).unsqueeze(dim=0).cuda()
  model = models[0]
  if hasattr(models[0], 'decoder'):
    print("Checkpoint: fine-tuned")
    model = models[0].encoder.w2v_model
  else:
    print("Checkpoint: pre-trained w/o fine-tuning")
  model.cuda()
  model.eval()
  with torch.no_grad():
    # Specify output_layer if you want to extract feature of an intermediate layer
    feature, _ = model.extract_finetune(source={'video': frames, 'audio': None}, padding_mask=None, output_layer=None)
    feature = feature.squeeze(dim=0)
  print(f"Video feature shape: {feature.shape}")
  return feature
```
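
For illustration, a hypothetical call (the paths are placeholders; the video is expected to be a pre-cropped mouth-ROI clip, as produced earlier in the Colab):

```python
feature = extract_visual_feature(
    video_path="roi.mp4",                    # mouth-ROI clip (placeholder path)
    ckpt_path="finetune-model.pt",           # an AV-HuBERT checkpoint (placeholder path)
    user_dir="/path/to/av_hubert/avhubert")  # registers the avhubert user module
```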

I wonder how I can extract audio-visual features. Can you please give an example, or specifically, what should be fed into `source['audio']`? Is it a waveform normalized to [-1, 1], or some other spectral feature? Thank you.

chevalierNoir commented 2 years ago

Hi,

source['audio'] is the log filterbank, see here. It should also be normalized as done here when task.cfg.normalize=true, which I believe is the case for all the models we release. In addition, source['audio'] must have the same sequence length as source['video'] before being fed into the model, since we assume the audio and video are synchronized.
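
A minimal sketch of that recipe, mirroring the dataset loading code in the repo. The helper name `extract_audio_feature` is hypothetical; it assumes python_speech_features for the 26-dim log filterbank, 16 kHz mono audio read with scipy, the released models' stack_order_audio=4 (stacking four 100 Hz filterbank frames into one 104-dim feature at the 25 fps video rate), per-frame layer norm for task.cfg.normalize=true, and a final [B, F, T] layout:

```python
import numpy as np
import torch
import torch.nn.functional as F
from python_speech_features import logfbank
from scipy.io import wavfile

def extract_audio_feature(wav_path, stack_order=4):
    sample_rate, wav_data = wavfile.read(wav_path)  # 16 kHz mono assumed
    feats = logfbank(wav_data, samplerate=sample_rate).astype(np.float32)  # [T, 26] at 100 Hz
    # Pad, then stack every `stack_order` consecutive frames so the 100 Hz
    # audio matches the 25 fps video: [T, 26] -> [T/4, 104]
    if len(feats) % stack_order != 0:
        pad = np.zeros((stack_order - len(feats) % stack_order, feats.shape[1]),
                       dtype=feats.dtype)
        feats = np.concatenate([feats, pad], axis=0)
    feats = feats.reshape(-1, stack_order * feats.shape[1])
    feats = torch.from_numpy(feats)
    # Per-frame layer norm, matching task.cfg.normalize=true
    feats = F.layer_norm(feats, feats.shape[1:])
    return feats.unsqueeze(0).transpose(1, 2)  # [1, 104, T/4], i.e. [B, F, T]
```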

shakeel608 commented 1 year ago

Hello,

I am using your script to extract audio-visual features. After extracting the log filterbank with python_speech_features (shape (96, 26)), with video frames of shape (96, 88, 88), it throws the following error from hubert.py, line 327, in forward:

```
numpy.AxisError: axis 2 is out of bounds for array of dimension 2
```

when using the following command:

`feature, _ = model.extract_finetune(source={'video': frames, 'audio': audio_feat}, padding_mask=None, output_layer=None)`
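
For context, the shape in the traceback is consistent with the mismatch: the model consumes `source['audio']` as a 3-D tensor `[B, F, T]` with the 4-frame stacking described above, while a raw `(96, 26)` filterbank array is 2-D and unstacked. A hedged sketch of glue code, reusing the hypothetical `extract_audio_feature` helper above and trimming both streams to a common length before the forward pass:

```python
# Hypothetical glue code: stack/normalize the audio, then trim audio and
# video to the same number of time steps before calling extract_finetune.
audio_feats = extract_audio_feature("clip.wav")  # [1, 104, T_a] (sketch above)
# `frames` as prepared in extract_visual_feature: [1, 1, T_v, 88, 88]
T = min(audio_feats.shape[-1], frames.shape[2])
feature, _ = model.extract_finetune(
    source={'video': frames[:, :, :T], 'audio': audio_feats[..., :T].cuda()},
    padding_mask=None, output_layer=None)
```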