This code is for our paper "GestSync: Determining who is speaking without a talking head", published at BMVC 2023 (oral).
Authors: Sindhu Hegde, Andrew Zisserman
| 📝 Paper | 📑 Project Page | 🤗 Demo | 🛠 Demo Video |
|---|---|---|---|
| Paper | Website | Demo | Video |
Clone the repository:

```bash
git clone https://github.com/Sindhu-Hegde/gestsync.git
```
Install the required packages (it is recommended to create a new environment):

```bash
python -m venv env_gestsync
source env_gestsync/bin/activate
pip install -r requirements.txt
```
Activate the environment:

```bash
source env_gestsync/bin/activate
```
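As an optional sanity check, you can verify that the environment imports correctly. This sketch assumes PyTorch is among the packages listed in `requirements.txt`:

```python
# Optional sanity check: confirm that the environment is working.
# Assumes PyTorch is among the packages listed in requirements.txt.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```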
Download the trained model and save it in the `checkpoints` folder.
| Model | Description | Link to the model |
|---|---|---|
| RGB model | Weights of the RGB-based GestSync model | Link |
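If you want to inspect the downloaded weights before running inference, here is a minimal sketch. It assumes `checkpoints/model_rgb.pth` is a standard PyTorch checkpoint file; the exact contents and key names may differ.

```python
# Minimal sketch: inspect the downloaded checkpoint before running inference.
# Assumes checkpoints/model_rgb.pth is a standard PyTorch checkpoint file;
# the exact contents and key names may differ.
import torch

ckpt = torch.load("checkpoints/model_rgb.pth", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys())[:10])  # top-level keys, e.g. a state dict
```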
It is now possible to sync-correct any video based solely on gestures (no face needed)! Give any video in which the speaker's gestures are visible, and use our network to predict the synchronisation offset and obtain the sync-corrected video as output.
```bash
python inference_syncoffset.py --checkpoint_path=<path_to_model> --video_path=<path_to_video>
```
The following demo videos are available for a quick test:
| Video path | Actual offset |
|---|---|
| samples/sync_sample_1.mp4 | 0 |
| samples/sync_sample_2.mp4 | 25 |
| samples/sync_sample_3.mp4 | -15 |
Example run:

```bash
python inference_syncoffset.py --checkpoint_path=checkpoints/model_rgb.pth --video_path=samples/sync_sample_1.mp4
```
All input and output files are saved (by default) in the `results` folder. The result directory, like several other options, can be specified via command-line arguments. The input file can be any video with a single speaker and visible gestures. The code pre-processes the video (pre-processed files are saved in the `results/input` folder) and generates the sync-corrected video (result files are saved in the `results/output` folder).
The optional parameter `num_avg_frames` specifies the number of video frames over which the scores are averaged. The higher the number of averaged frames, the better the results. To obtain a more accurate offset prediction, give a longer video as input and set `num_avg_frames` to a higher value (e.g. 100).
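To illustrate why averaging over more frames helps, here is a hypothetical sketch of the idea (not the repository's actual code): per-window similarity scores are computed for each candidate offset, averaged over `num_avg_frames` windows, and the offset with the highest average score is returned.

```python
# Hypothetical illustration of offset selection by score averaging
# (NOT the repository's actual code). `scores` is assumed to hold a
# similarity score per sliding window and per candidate offset.
import numpy as np

def predict_offset(scores: np.ndarray, offsets: np.ndarray, num_avg_frames: int) -> int:
    """Average the per-window scores over `num_avg_frames` windows and
    return the candidate offset with the highest average score."""
    avg = scores[:num_avg_frames].mean(axis=0)
    return int(offsets[np.argmax(avg)])

rng = np.random.default_rng(0)
offsets = np.arange(-30, 31)                      # candidate offsets in frames
scores = rng.random((200, len(offsets)))          # toy (num_windows, num_offsets) scores
scores[:, offsets.tolist().index(25)] += 0.3      # simulate a true offset of +25 frames
print(predict_offset(scores, offsets, num_avg_frames=100))  # more windows -> more stable argmax
```

With more windows in the average, the noise on each candidate's score shrinks, which is why longer inputs and a larger `num_avg_frames` give more reliable predictions.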
Example run:

```bash
python inference_syncoffset.py --checkpoint_path=checkpoints/model_rgb.pth --video_path=samples/sync_sample_2.mp4 --num_avg_frames=75
```
Our model can be used to predict "who is speaking" based on gestures in a multi-speaker video (no face needed). Give any video with two or more speakers whose gestures are visible in the scene, and use our network to predict the active speaker and obtain an output video with a bounding box around the active speaker, as shown below.
```bash
python inference_activespeaker.py --checkpoint_path=<path_to_model> --video_path=<path_to_video>
```
The following demo videos are available for a quick test: `samples/asd_sample_1.mp4`, `samples/asd_sample_2.mp4`
Example run:

```bash
python inference_activespeaker.py --checkpoint_path=checkpoints/model_rgb.pth --video_path=samples/asd_sample_1.mp4 --global_speaker=True
```
All input and output files are saved (by default) in the `results` folder. The result directory, like several other options, can be specified via command-line arguments. The input file can be any video with at least two speakers with visible gestures. The code pre-processes the video (pre-processed files are saved in the `results/input` folder) and generates the video with a bounding box on the active speaker (result files are saved in the `results/output` folder). The above command predicts a single active speaker for the entire input video.
To obtain more fine-grained per-frame results, set `--global_speaker=False` and specify `num_avg_frames`, the number of video frames over which the scores are averaged. The higher the number of averaged frames, the better the results. To obtain a more accurate prediction, give a longer video as input and set `num_avg_frames` to a higher value (e.g. 100).
Example run:

```bash
python inference_activespeaker.py --checkpoint_path=checkpoints/model_rgb.pth --video_path=samples/asd_sample_2.mp4 --global_speaker=False --num_avg_frames=50
```
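As an illustration of the difference between the two modes, here is a hypothetical sketch (not the repository's actual code): the global mode averages each speaker track's sync scores over the whole video before taking the argmax, whereas the per-frame mode smooths the scores over `num_avg_frames` windows and takes an argmax per window.

```python
# Hypothetical illustration of the global vs. per-frame active-speaker modes
# (NOT the repository's actual code). `scores` is assumed to hold a gesture-audio
# sync score per speaker track and per window.
import numpy as np

def global_speaker(scores: np.ndarray) -> int:
    """One active speaker for the whole video: the track with the highest mean score."""
    return int(np.argmax(scores.mean(axis=1)))

def per_frame_speaker(scores: np.ndarray, num_avg_frames: int) -> np.ndarray:
    """Active speaker per window, after smoothing each track's scores over `num_avg_frames`."""
    kernel = np.ones(num_avg_frames) / num_avg_frames
    smoothed = np.stack([np.convolve(s, kernel, mode="same") for s in scores])
    return np.argmax(smoothed, axis=0)

scores = np.array([[0.2, 0.8, 0.9, 0.1],    # toy scores for speaker track 0
                   [0.7, 0.3, 0.2, 0.6]])   # toy scores for speaker track 1
print(global_speaker(scores))               # -> 0 in this toy example
print(per_frame_speaker(scores, num_avg_frames=2))
```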
Our model is trained on the LRS3 dataset. Adapting the code to other datasets may require small modifications. The dataset directory is expected to be organised as follows:

```
data_root
├── all video files (mp4/avi)
```
Pre-processing the data involves two steps:
```bash
cd preprocess
python preprocess_videos.py --data_root=<dataset-path> --preprocessed_root=<path-to-save-the-preprocessed-data> --temp_dir=<path-to-save-intermediate-results> --metadata_root=<path-to-save-the-metadata>
python extract_kps.py --data_path=<path-to-save-the-preprocessed-data> --result_path=<path-to-save-the-keypoints>
```
Additional options such as `rank` and `nshard` can be set to enable parallel processing if needed (see the sketch below).
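For example, a hypothetical way to run keypoint extraction over several shards in parallel, assuming `--rank` is 0-indexed and `--nshard` is the total number of shards (check `python extract_kps.py --help` for the exact semantics, and replace the paths with your own):

```python
# Hypothetical sketch: launch keypoint extraction in 4 parallel shards.
# Assumes --rank is 0-indexed and --nshard is the total number of shards;
# check `python extract_kps.py --help` for the exact semantics.
import subprocess

NSHARD = 4
procs = [
    subprocess.Popen([
        "python", "extract_kps.py",
        "--data_path=preprocessed_data",   # hypothetical path to the pre-processed data
        "--result_path=keypoints",         # hypothetical path for the extracted keypoints
        f"--nshard={NSHARD}",
        f"--rank={rank}",
    ])
    for rank in range(NSHARD)
]
for p in procs:
    p.wait()
```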
The final folder structure for train and validation files after pre-processing is shown below.
```
preprocessed_root (path of the pre-processed videos)
├── train
|   ├── list of video-ids
|   │   ├── *.avi (extracted person track video for each scene)
|   |   ├── *.wav (extracted person track audio for each scene)
├── val
|   ├── list of video-ids
|   │   ├── *.avi (extracted person track video for each scene)
|   |   ├── *.wav (extracted person track audio for each scene)

result_path (path of the extracted keypoints)
├── train
|   ├── list of video-ids
|   │   ├── *.pkl (extracted keypoint file for each person track)
├── val
|   ├── list of video-ids
|   │   ├── *.pkl (extracted keypoint file for each person track)
```
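A small, hypothetical sanity check can confirm the two trees line up, i.e. that every extracted person-track video has a matching keypoint file. Replace the two paths with your own `preprocessed_root` and `result_path`:

```python
# Hypothetical sanity check: every extracted .avi person track should have a
# matching .pkl keypoint file. Replace the two paths with your own
# preprocessed_root and result_path.
from pathlib import Path

preprocessed_root = Path("preprocessed_data")
result_path = Path("keypoints")

for split in ("train", "val"):
    tracks = {p.relative_to(preprocessed_root / split).with_suffix("")
              for p in (preprocessed_root / split).rglob("*.avi")}
    kps = {p.relative_to(result_path / split).with_suffix("")
           for p in (result_path / split).rglob("*.pkl")}
    print(f"{split}: {len(tracks)} tracks, {len(kps)} keypoint files, "
          f"{len(tracks - kps)} tracks missing keypoints")
```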
Navigate back to the main directory: `cd ..`
The GestSync model can be trained using:

```bash
python train_sync_rgb.py --data_path_videos=<path-of-preprocessed-data> --data_root_kps=<path-of-extracted-keypoints> --checkpoint_dir=<path-to-save-the-trained-model>
```
Training can also be resumed from a saved checkpoint. Run `python train_sync_rgb.py --help` for more details. Additional, less commonly used parameters can be set in the `config.py` file.
The software is licensed under the MIT License. Please cite the following paper if you use this code:
```
@InProceedings{Hegde_BMVC_2023,
  author    = "Hegde, Sindhu and Zisserman, Andrew",
  title     = "GestSync: Determining who is speaking without a talking head",
  booktitle = "BMVC",
  year      = "2023",
}
```