
An Audio-Visual Speech Separation Model Inspired by Cortico-Thalamo-Cortical Circuits

Kai Li, Fenghua Xie, Hang Chen, Kexin Yuan, and Xiaolin Hu | Tsinghua University

PyTorch Implementation of CTCNet (TPAMI 2024): An Audio-Visual Speech Separation Model Inspired by Cortico-Thalamo-Cortical Circuits.
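For orientation, here is a minimal sketch of the core idea: an auditory and a visual subnetwork repeatedly exchange information through a shared fusion stage, loosely mirroring the recurrent cortico-thalamo-cortical loop the paper is inspired by. All module names, shapes, and the fusion rule below are illustrative assumptions, not the repository's actual implementation.

import torch
import torch.nn as nn

class CyclicAVFusion(nn.Module):
    """Illustrative audio-visual fusion loop (not the actual CTCNet code).

    An audio subnetwork and a video subnetwork each refine their own
    features, then exchange information through a shared fusion stage,
    repeated for a fixed number of cycles.
    """

    def __init__(self, a_dim=64, v_dim=64, cycles=3):
        super().__init__()
        self.cycles = cycles
        self.audio_block = nn.Conv1d(a_dim, a_dim, 3, padding=1)
        self.video_block = nn.Conv1d(v_dim, v_dim, 3, padding=1)
        # Fusion stage: concatenate both modalities, project back to each
        self.to_audio = nn.Conv1d(a_dim + v_dim, a_dim, 1)
        self.to_video = nn.Conv1d(a_dim + v_dim, v_dim, 1)

    def forward(self, a, v):
        # a: (batch, a_dim, T), v: (batch, v_dim, T); for this sketch,
        # assume both streams were resampled to the same length T.
        for _ in range(self.cycles):
            a = torch.relu(self.audio_block(a))
            v = torch.relu(self.video_block(v))
            fused = torch.cat([a, v], dim=1)
            a = a + torch.relu(self.to_audio(fused))
            v = v + torch.relu(self.to_video(fused))
        return a

# Example: fuse 64-channel audio and video features of length 100
# out = CyclicAVFusion()(torch.randn(1, 64, 100), torch.randn(1, 64, 100))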



Audio-visual demos

https://user-images.githubusercontent.com/33806018/208616615-dab6ab87-def1-405a-897e-a3c1decb790a.mp4

Key points

Quick Start

Datasets and Pretrained Models

We build multimodal speech separation datasets from the LRS2, LRS3, and VoxCeleb2 corpora. The Datasets/ folder in this repository contains the file lists and the code needed to construct them.
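For illustration, here is a minimal sketch of how one 2Mix sample is typically built: two single-speaker utterances are overlapped at a random relative SNR. The [-5, 5] dB range and the helper name are assumptions; the authoritative recipes are the scripts under Datasets/.

import numpy as np
import soundfile as sf

def make_2mix(wav1_path, wav2_path, out_path, snr_db=None, sr=16000):
    """Mix two single-speaker utterances into one 2-speaker mixture.

    Illustrative only -- the repository's Datasets/ scripts define the
    actual recipe. snr_db is the level of speaker 1 relative to speaker 2;
    drawn uniformly from [-5, 5] dB if not given (an assumption).
    """
    s1, sr1 = sf.read(wav1_path)
    s2, sr2 = sf.read(wav2_path)
    assert sr1 == sr2 == sr, "both utterances must share the sample rate"
    n = min(len(s1), len(s2))           # truncate to the shorter utterance
    s1, s2 = s1[:n], s2[:n]
    if snr_db is None:
        snr_db = np.random.uniform(-5, 5)
    # Scale speaker 1 so that 10*log10(P1/P2) equals snr_db
    p1, p2 = np.mean(s1 ** 2), np.mean(s2 ** 2)
    s1 = s1 * np.sqrt(p2 / p1 * 10 ** (snr_db / 10))
    mix = s1 + s2
    mix /= max(1.0, np.abs(mix).max())  # avoid clipping
    sf.write(out_path, mix, sr)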

The generated datasets (LRS2-2Mix, LRS3-2Mix, and VoxCeleb2-2Mix) can be downloaded at the links below.

| Datasets | Links | Pretrained Models |
| --- | --- | --- |
| LRS2-2Mix | Removed for copyright | Google Drive |
| LRS3-2Mix | Removed for copyright | Google Drive |
| VoxCeleb2-2Mix | Removed for copyright | Google Drive |

Video Pretrained Model

This pretrained model is a lip-reading network trained on video only, and it achieves 84% accuracy on the LRW dataset.

| Datasets | Links | Pretrained Models |
| --- | --- | --- |
| LRS2-2Mix | Removed for copyright | Google Drive |
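If you want to initialize the video branch from this checkpoint manually, a hedged sketch follows; the checkpoint layout, key prefixes, and the video_encoder module are assumptions, and in practice the pretrain entry in the training YAML handles this.

import torch

def load_video_pretrain(video_encoder, ckpt_path):
    """Load lip-reading pretrained weights into a video encoder.

    Illustrative only: the checkpoint layout and key prefixes here are
    assumptions about a typical PyTorch checkpoint, not this repo's format.
    """
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state = ckpt.get("model_state_dict", ckpt)
    # Drop the word-classification head used for LRW; separation only
    # needs the visual front-end features.
    state = {k: v for k, v in state.items() if not k.startswith("classifier")}
    missing, unexpected = video_encoder.load_state_dict(state, strict=False)
    return missing, unexpected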

Dependencies

Preprocess

python preprocess_lrs2.py --in_audio_dir audio/wav16k/min --in_mouth_dir mouths --out_dir data

Training Pipeline

Training on LRS2

python train.py -c local/lrs2_conf_64_64_3_adamw_1e-1_blocks16_pretrain.yml

Training on LRS3

python train.py -c local/lrs3_conf_64_64_3_adamw_1e-1_blocks16_pretrain.yml

Training on VoxCeleb2

python train.py -c local/vox2_conf_64_64_3_adamw_1e-1_blocks16_pretrain.yml
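The config filename encodes the main hyperparameters; one plausible reading (not documented in the repo) is 64_64_3 for the channel widths and fusion cycles, adamw with a 1e-1 learning rate for the optimizer, blocks16 for the number of separation blocks, and pretrain for initializing the video branch from the lip-reading checkpoint. To inspect any config before launching a run:

import yaml

# Print the sections of a training configuration before running train.py
with open("local/lrs2_conf_64_64_3_adamw_1e-1_blocks16_pretrain.yml") as f:
    conf = yaml.safe_load(f)

for section, params in conf.items():
    print(section, params)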

Testing Pipeline

python eval.py --test=local/data/tt --conf_dir=exp/lrs2_64_64_3_adamw_1e-1_blocks8_pretrain/conf.yml
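eval.py reports the usual separation metrics for these benchmarks. For reference, here is a minimal NumPy implementation of SI-SNR (scale-invariant signal-to-noise ratio), the headline metric on the *-2Mix test sets; this is the standard definition, not code from this repository.

import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a reference.

    Both signals are zero-meaned, the estimate is projected onto the
    reference, and the ratio of projection power to residual power is
    reported in dB (higher is better).
    """
    est = est - est.mean()
    ref = ref - ref.mean()
    proj = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    noise = est - proj
    return 10 * np.log10((np.dot(proj, proj) + eps) / (np.dot(noise, noise) + eps))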

Testing Your Own Videos

# Resample the input video to 25 fps (the frame rate the models expect)
ffmpeg -i ./test_videos/interview.mp4 -filter:v fps=25 ./test_videos/interview25fps.mp4
mv ./test_videos/interview25fps.mp4 ./test_videos/interview.mp4

# Detect and track one face per speaker (every 8th frame), cropping with a 1.5x margin
python ./utils/detectFaces.py --video_input_path ./test_videos/interview.mp4 --output_path ./test_videos/interview/ --number_of_speakers 2 --scalar_face_detection 1.5 --detect_every_N_frame 8

# Extract the audio track as 16 kHz mono WAV
ffmpeg -i ./test_videos/interview.mp4 -vn -ar 16000 -ac 1 -ab 192k -f wav ./test_videos/interview/interview.wav

# Crop grayscale mouth regions of interest from the tracked faces
python ./utils/crop_mouth_from_video.py --video-direc ./test_videos/interview/faces/ --landmark-direc ./test_videos/interview/landmark/ --save-direc ./test_videos/interview/mouthroi/ --convert-gray --filename-path ./test_videos/interview/filename_input/interview.csv

Acknowledgements

This implementation uses parts of the code from the Asteroid toolkit, as noted in our source code.

Citations

If you find this code useful in your research, please cite our work:

@article{li2024audio,
  title={An audio-visual speech separation model inspired by cortico-thalamo-cortical circuits},
  author={Li, Kai and Xie, Fenghua and Chen, Hang and Yuan, Kexin and Hu, Xiaolin},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2024},
  publisher={IEEE}
}