KylinYee / R2-Talker-code

R2-Talker: Realistic Real-Time Talking Head Synthesis with Hash Grid Landmarks Encoding and Progressive Multilayer Conditioning
MIT License

This is the official repository for the paper: R2-Talker: Realistic Real-Time Talking Head Synthesis with Hash Grid Landmarks Encoding and Progressive Multilayer Conditioning.

Project | ArXiv | Video


0.Supported Features

| Method | Driving Features | Audio Encoder |
| --- | --- | --- |
| R2-Talker | 3D Facial Landmarks | Hash grid encoder |
| RAD-NeRF | Audio Features | Audio Feature Extractor |
| Geneface+instant-ngp | 3D Facial Landmarks | Audio Feature Extractor |


Install dependency & Build extension (optional)

Tested on Ubuntu 22.04, PyTorch 1.12 and CUDA 11.6.

```bash
git clone
cd R2-Talker-code
```

### Install dependency

```bash
# for ubuntu, portaudio is needed for pyaudio to work.
sudo apt install portaudio19-dev

pip install -r requirements.txt
```

### Build extension (optional)

By default, we use `load` to build the extensions at runtime. This can be inconvenient, so we also provide a script to build each extension ahead of time:

```bash
# install all extension modules
bash scripts/
```

2.Data pre-processing

Preparation & Pre-processing Custom Training Video

### Preparation:

```bash
## install pytorch3d
pip install "git+"

## prepare face-parsing model
wget -O data_utils/face_parsing/79999_iter.pth

## prepare basel face model
# 1. download `01_MorphableModel.mat` from and put it under `data_utils/face_tracking/3DMM/`
# 2. download other necessary files from AD-NeRF's repository:
wget -O data_utils/face_tracking/3DMM/exp_info.npy
wget -O data_utils/face_tracking/3DMM/keys_info.npy
wget -O data_utils/face_tracking/3DMM/sub_mesh.obj
wget -O data_utils/face_tracking/3DMM/topology_info.npy
# 3. run
cd data_utils/face_tracking
python
cd ../..

## prepare ASR model
# if you want to use DeepSpeech as AD-NeRF does, you should install tensorflow 1.15 manually.
# otherwise, we also support Wav2Vec in PyTorch.
```

### Pre-processing Custom Training Video

* Put the training video under `data//.mp4`.

  The video **must be 25FPS, with all frames containing the talking person**. The resolution should be about 512x512, and the duration about 1-5 min.

  ```bash
  # an example training video from AD-NeRF
  mkdir -p data/obama
  wget -O data/obama/obama.mp4
  ```

* Run the script (may take hours depending on the video length):

  ```bash
  # run all steps
  python data_utils/ data//.mp4

  # if you want to run a specific step
  python data_utils/ data//.mp4 --task 1 # extract audio wave
  ```

* A 3D facial landmark generator will be added in the future. If you want to process custom data, please refer to Geneface to obtain `trainval_dataset.npy`, use our `` to extract the landmarks, and put the landmarks in `data//`.

* File structure after finishing all steps:

  ```bash
  ./data/
  ├──.mp4 # original video
  ├──ori_imgs # original images from video
  │  ├──0.jpg
  │  ├──0.lms # 2D landmarks
  │  ├──...
  ├──gt_imgs # ground truth images (static background)
  │  ├──0.jpg
  │  ├──...
  ├──parsing # semantic segmentation
  │  ├──0.png
  │  ├──...
  ├──torso_imgs # inpainted torso images
  │  ├──0.png
  │  ├──...
  ├──aud.wav # original audio
  ├──aud_eo.npy # audio features (wav2vec)
  ├──aud.npy # audio features (deepspeech)
  ├──bc.jpg # default background
  ├── # raw head tracking results
  ├──transforms_train.json # head poses (train split)
  ├──transforms_val.json # head poses (test split)
  ├──aud_idexp_train.npy # head landmarks (train split)
  ├──aud_idexp_val.npy # head landmarks (test split)
  ├──aud_idexp.npy # head landmarks
  ```

For your convenience, we provide some processed results of the Obama video here. You can also download more raw videos from Geneface here.
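Since pre-processing can take hours, it may help to confirm that the output folder matches the file structure listed above before starting training. The following is a minimal sketch, not part of this repository: `missing_entries` is a hypothetical helper, it only checks the entries whose names are spelled out above, and it assumes that having either `aud_eo.npy` or `aud.npy` is sufficient.

```python
from pathlib import Path

# Entries named in the file structure above. Entries whose names are not
# spelled out there (the original video, the raw tracking file) are skipped.
EXPECTED_DIRS = ["ori_imgs", "gt_imgs", "parsing", "torso_imgs"]
EXPECTED_FILES = [
    "aud.wav",
    "bc.jpg",
    "transforms_train.json",
    "transforms_val.json",
    "aud_idexp_train.npy",
    "aud_idexp_val.npy",
    "aud_idexp.npy",
]
# Assumption: one audio feature file is enough (wav2vec or deepspeech).
AUDIO_FEATURES = ["aud_eo.npy", "aud.npy"]


def missing_entries(data_dir):
    """Return the expected entries that are absent from data_dir."""
    root = Path(data_dir)
    missing = [d + "/" for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    if not any((root / f).is_file() for f in AUDIO_FEATURES):
        missing.append(" or ".join(AUDIO_FEATURES))
    return missing
```

For example, `missing_entries("data/obama")` returns an empty list when every checked entry is present, and otherwise the names that still need to be generated.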


Quick Start & Detailed Usage

### Quick Start

We have prepared the relevant materials here. Please download them and put them in a new `pretrained` folder.

* File structure after finishing all steps:

  ```bash
  ./pretrained
  ├──r2talker_Obama_idexp_torso.pth # pretrained model
  ├──test_eo.npy # driving audio features (wav2vec)
  ├──test_lm3ds.npy # driving audio features (landmarks)
  ├──test.wav # raw driving audio
  ├──bc.jpg # default background
  ├──transforms_val.json # head poses
  ├──test.mp4 # raw driving video
  ```

* Run inference:

  ```bash
  # save video to trail_test/results/*.mp4
  sh scripts/
  ```

### Detailed Usage

The first run takes some time to compile the CUDA extensions.

```bash
# step.1 train (head)
# by default, we load data from disk on the fly.
# we can also preload all data to CPU/GPU for faster training, but this is very memory-hungry for large datasets.
# `--preload 0`: load from disk (default, slower).
# `--preload 1`: load to CPU, requires ~70G CPU memory (slightly slower)
# `--preload 2`: load to GPU, requires ~24G GPU memory (fast)
python data/Obama/ --workspace trial_r2talker_Obama_idexp/ -O --iters 200000 --method r2talker --cond_type idexp

# step.2 train (finetune lips for another 50000 steps, run after the above command!)
python data/Obama/ --workspace trial_r2talker_Obama_idexp/ -O --finetune_lips --iters 250000 --method r2talker --cond_type idexp

# step.3 train (torso)
# .pth should be the latest checkpoint in trial_obama
python data/Obama/ --workspace trial_r2talker_Obama_idexp_torso/ -O --torso --iters 200000 --head_ckpt trial_r2talker_Obama_idexp/checkpoints/ngp_ep0035.pth --method r2talker --cond_type idexp
```

Check the `scripts` directory for more provided examples.
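Step 3 above needs the latest head checkpoint passed to `--head_ckpt`. As a rough sketch, the lookup could be automated as follows. `latest_checkpoint` is a hypothetical helper, not part of this repository, and it assumes that checkpoints follow the `ngp_ep<epoch>.pth` naming seen in the example (`ngp_ep0035.pth`) and that "latest" means the highest epoch number.

```python
import re
from pathlib import Path

# Assumed naming scheme, inferred from the single example ngp_ep0035.pth.
CKPT_PATTERN = re.compile(r"^ngp_ep(\d+)\.pth$")


def latest_checkpoint(checkpoint_dir):
    """Return the checkpoint path with the highest epoch number, or None."""
    best_path = None
    best_epoch = -1
    for path in Path(checkpoint_dir).glob("ngp_ep*.pth"):
        match = CKPT_PATTERN.match(path.name)
        if match is None:
            continue  # skip files that do not fit the assumed pattern
        epoch = int(match.group(1))
        if epoch > best_epoch:
            best_epoch = epoch
            best_path = path
    return best_path
```

Calling `latest_checkpoint("trial_r2talker_Obama_idexp/checkpoints")` would then give the path to pass as `--head_ckpt`. Comparing the parsed epoch numbers rather than sorting the file names keeps the result correct even if the zero-padding width changes.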


This code relies heavily on RAD-NeRF, GeneFace, and AD-NeRF. Thanks to the authors of these great projects.


  title={R2-Talker: Realistic Real-Time Talking Head Synthesis with Hash Grid Landmarks Encoding and Progressive Multilayer Conditioning},
  author={Zhiling Ye, Liangguo Zhang, Dingheng Zeng, Quan Lu, Ning Jiang},
  journal={arXiv preprint arXiv:2312.05572},