Sindhu-Hegde / gestsync

Official code for the paper "GestSync: Determining who is speaking without a talking head" published at BMVC 2023

Gesture Synchronisation

This code accompanies our paper "GestSync: Determining who is speaking without a talking head", published at BMVC 2023 (oral).
Authors: Sindhu Hegde, Andrew Zisserman


📝 Paper 📑 Project Page 🤗 Demo 🛠 Demo Video



Clone the repository

git clone

Install the required packages (it is recommended to create a new environment)

python -m venv env_gestsync
source env_gestsync/bin/activate
pip install -r requirements.txt

Activate the environment

source env_gestsync/bin/activate

Pretrained models

Download the trained models and save them in the checkpoints folder.

Model | Description | Link to the model
RGB model | Weights of the RGB-based GestSync model | Link


Predicting the audio-visual synchronisation offset

It is now possible to sync-correct any video based solely on gestures (no face needed)! Provide any video in which the speaker's gestures are visible, and use our network to predict the synchronisation offset and obtain the sync-corrected video as output.

python --checkpoint_path=<path_to_model> --video_path=<path_to_video>

The following demo videos are available for a quick test:

Video path | Actual offset
samples/sync_sample_1.mp4 | 0
samples/sync_sample_2.mp4 | 25
samples/sync_sample_3.mp4 | -15

Example run:

python --checkpoint_path=checkpoints/model_rgb.pth --video_path=samples/sync_sample_1.mp4

By default, all input and output files are saved in the results folder. The result directory, like several other options, can be set through command-line arguments. The input can be any video file with a single speaker and visible gestures. The code pre-processes the video (pre-processed files are saved in the results/input folder) and generates the sync-corrected video (result files are saved in the results/output folder).

The optional parameter num_avg_frames specifies the number of video frames over which the scores are averaged. The higher the number of averaged frames, the better the results. To obtain a more accurate offset prediction, give a longer video as input and set num_avg_frames to a higher value (for example, 100).

Example run:

python --checkpoint_path=checkpoints/model_rgb.pth --video_path=samples/sync_sample_2.mp4 --num_avg_frames=75
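
To illustrate why a larger num_avg_frames tends to give a steadier prediction, here is a minimal sketch of score averaging. It is not the repository's implementation: the function name predict_offset, the layout of scores (one row per video frame, one column per candidate offset, higher is better) and the candidate offset list are all assumptions made for illustration.

```python
from statistics import mean


def predict_offset(scores, offsets, num_avg_frames):
    """Average per-frame scores over the first num_avg_frames frames and
    return the candidate offset with the highest mean score.

    scores  : list of rows, one per video frame; each row holds one
              score per candidate offset (hypothetical layout).
    offsets : candidate offsets, same length as each row.
    """
    if not scores:
        raise ValueError("scores must contain at least one frame")
    window = scores[:max(1, num_avg_frames)]
    # zip(*window) walks the columns, i.e. one candidate offset at a time.
    mean_scores = [mean(column) for column in zip(*window)]
    best = max(range(len(offsets)), key=lambda i: mean_scores[i])
    return offsets[best]


# Toy data: most frames favour offset 0, but the first frame is noisy.
offsets = [-1, 0, 1]
scores = [
    [0.9, 0.2, 0.1],  # noisy frame favours -1
    [0.1, 0.8, 0.2],
    [0.2, 0.7, 0.1],
    [0.1, 0.9, 0.3],
]
print(predict_offset(scores, offsets, num_avg_frames=1))  # -1
print(predict_offset(scores, offsets, num_avg_frames=4))  # 0
```

With a single frame the noisy score wins; averaging over all four frames recovers the offset that most frames agree on.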

Predicting "who is speaking" in a multi-speaker scene

Our model can be used to predict "who is speaking" based on gestures in a multi-speaker video (no face needed). Provide any video with two or more speakers whose gestures are visible in the scene, and use our network to predict the active speaker and obtain an output video with a bounding box around the active speaker.

python --checkpoint_path=<path_to_model> --video_path=<path_to_video>

The following demo videos are available for a quick test: samples/asd_sample_1.mp4, samples/asd_sample_2.mp4

Example run:

python --checkpoint_path=checkpoints/model_rgb.pth --video_path=samples/asd_sample_1.mp4 --global_speaker=True

By default, all input and output files are saved in the results folder. The result directory, like several other options, can be set through command-line arguments. The input can be any video file with at least two speakers with visible gestures. The code pre-processes the video (pre-processed files are saved in the results/input folder) and generates the video with the bounding box of the active speaker (result files are saved in the results/output folder). The above command predicts a single active speaker for the entire input video.

To obtain more fine-grained per-frame results, set global_speaker=False and use num_avg_frames to specify the number of video frames over which the scores are averaged. The higher the number of averaged frames, the better the results. To obtain a more accurate prediction, give a longer video as input and set num_avg_frames to a higher value (for example, 100).

Example run:

python --checkpoint_path=checkpoints/model_rgb.pth --video_path=samples/asd_sample_2.mp4 --global_speaker=False --num_avg_frames=50
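
The difference between the global and the per-frame modes can be sketched as follows. This is an illustration only, not the repository's code: the function names, the scores[speaker][frame] layout and the centred sliding window are assumptions.

```python
def global_speaker(scores):
    """scores[s][t] is the score of speaker s at frame t (hypothetical).
    Returns the index of the speaker with the highest mean score."""
    means = [sum(track) / len(track) for track in scores]
    return max(range(len(means)), key=means.__getitem__)


def per_frame_speaker(scores, num_avg_frames):
    """Return one speaker index per frame, each chosen from scores
    averaged over a window of about num_avg_frames frames centred on
    that frame (windows are clipped at the video boundaries)."""
    num_frames = len(scores[0])
    half = num_avg_frames // 2
    result = []
    for t in range(num_frames):
        lo, hi = max(0, t - half), min(num_frames, t + half + 1)
        window_means = [sum(track[lo:hi]) / (hi - lo) for track in scores]
        result.append(max(range(len(scores)), key=window_means.__getitem__))
    return result


# Two speakers, six frames: speaker 0 talks first, then speaker 1.
scores = [
    [0.9, 0.8, 0.9, 0.2, 0.1, 0.2],
    [0.1, 0.2, 0.1, 0.7, 0.8, 0.7],
]
print(global_speaker(scores))        # 0
print(per_frame_speaker(scores, 3))  # [0, 0, 0, 1, 1, 1]
```

The global mode returns one label for the whole clip, while the per-frame mode can follow a change of speaker; a larger window smooths out noisy frames at the cost of reacting more slowly to such changes.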


Preprocess the data

Our model is trained using the LRS3 dataset. Adapting for other datasets might involve small modifications.

Dataset folder structure

├── all video files (mp4/avi)

Preprocess the dataset

Pre-processing the data involves two steps:

  1. Obtaining the video and audio crops (based on scene detection and person detection)
  2. Extracting the keypoints

cd preprocess
python --data_root=<dataset-path> --preprocessed_root=<path-to-save-the-preprocessed-data> --temp_dir=<path-to-save-intermediate-results> --metadata_root=<path-to-save-the-metadata>
python --data_path=<path-to-save-the-preprocessed-data> --result_path=<path-to-save-the-keypoints>

Additional options such as rank and nshard can be set to enable parallel processing if needed.
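
Options named rank and nshard usually split the file list into nshard chunks, with each worker processing only the chunk selected by its rank. The sketch below shows that common pattern; the shard function, the 0-based rank and the contiguous split are assumptions, and the repository's scripts may divide the work differently.

```python
def shard(items, rank, nshard):
    """Return the slice of items handled by worker `rank` out of
    `nshard` workers, using contiguous, near-equal chunks.
    Assumes ranks are numbered from 0 to nshard - 1."""
    if not 0 <= rank < nshard:
        raise ValueError("rank must satisfy 0 <= rank < nshard")
    start = len(items) * rank // nshard
    end = len(items) * (rank + 1) // nshard
    return items[start:end]


# Hypothetical file names, split across two workers.
videos = [f"video_{i}.mp4" for i in range(5)]
print([len(shard(videos, r, 2)) for r in range(2)])  # [2, 3]
```

Under this scheme, each worker would be launched with the same nshard and a different rank, so that together the workers cover every file exactly once.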

Preprocessed folder structure

The final folder structure for train and validation files after pre-processing is shown below.

preprocessed_root (path of the pre-processed videos)
├── train
│   ├── list of video-ids
│   │   ├── *.avi (extracted person track video for each scene)
│   │   ├── *.wav (extracted person track audio for each scene)
├── val
│   ├── list of video-ids
│   │   ├── *.avi (extracted person track video for each scene)
│   │   ├── *.wav (extracted person track audio for each scene)

result_path (path of the extracted keypoints)
├── train
│   ├── list of video-ids
│   │   ├── *.pkl (extracted keypoint file for each person track)
├── val
│   ├── list of video-ids
│   │   ├── *.pkl (extracted keypoint file for each person track)
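
Before training, it can help to check that every person track has its video, audio and keypoint file. The helper below is a small sketch for that purpose; pair_tracks is a hypothetical name, and it assumes that each keypoint file shares the file stem and video-id folder of its .avi track, which the layout above suggests but does not state.

```python
from pathlib import Path


def pair_tracks(preprocessed_root, result_path, split="train"):
    """Yield (video, audio, keypoints) paths for every person track in
    `split` that has all three files; incomplete tracks are skipped."""
    video_dir = Path(preprocessed_root) / split
    for video in sorted(video_dir.glob("*/*.avi")):
        audio = video.with_suffix(".wav")
        # Assumption: the keypoint file mirrors the video's relative path.
        kps = Path(result_path) / split / video.parent.name / (video.stem + ".pkl")
        if audio.exists() and kps.exists():
            yield video, audio, kps
```

Comparing the number of tuples returned for each split with the number of .avi files gives a quick indication of whether keypoint extraction finished for all tracks.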

Train the model

Navigate to the main directory: cd ..

The GestSync model can be trained using:

python --data_path_videos=<path-of-preprocessed-data> --data_root_kps=<path-of-extracted-keypoints> --checkpoint_dir=<path-to-save-the-trained-model>

Training can also be resumed. See python --help for more details. Additional, less commonly used parameters can be set in the file.

Licence and Citation

The software is licensed under the MIT License. Please cite the following paper if you have used this code:

  author       = "Hegde, Sindhu and Zisserman, Andrew",
  title        = "GestSync: Determining who is speaking without a talking head",
  booktitle    = "BMVC",
  year         = "2023",