StyleDubber

This package contains the accompanying code for the following paper:

"StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing", which has appeared as long paper in the Findings of the ACL, 2024.

Illustration

πŸ“£ News

πŸ—’ TODOs

πŸ“Š Dataset

β”œβ”€β”€ Lip_Grid_Gray
β”‚   └── [GRID's lip-region images in gray-scale]
β”œβ”€β”€ Lip_Grid_Color
β”‚   └── [GRID's lip-region images in RGB]
β”œβ”€β”€ Grid_resample_ABS (GoogleDrive βœ…)
β”‚   └── [22050 Hz ground-truth audio files in .wav] (the original GRID audio is 25 kHz)
β”œβ”€β”€ Grid_lip_Feature
β”‚   └── [lip features extracted from ```Lip_Grid_Gray``` via Lipreading_using_Temporal_Convolutional_Networks]
β”œβ”€β”€ Grid_Face_Image
β”‚   └── [GRID's face-region images]
β”œβ”€β”€ Grid_dataset_Raw
β”‚   └── [GRID's raw data from the official website]
β”œβ”€β”€ Grad_eachframe
β”‚   └── [per-frame image files of the GRID dataset]
β”œβ”€β”€ Gird_FaceVAFeature
β”‚   └── [face features extracted from ```Grid_Face_Image``` via EmoFAN]
└── 0_Grid_Wav_22050_Abs_Feature (GoogleDrive βœ…)
    └── [all data features needed for training and inference on the GRID dataset]

Note: If you only want to train StyleDubber on the GRID dataset, you just need to download 0_Grid_Wav_22050_Abs_Feature (preprocessed data features) and Grid_resample_ABS (ground-truth waveforms used for testing). If you want to plot or visualize results, use the data for other tasks (lip reading, ASV, etc.), or redo the preprocessing yourself, download the remaining files as needed 😊. A minimal resampling sketch follows this note.
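If you prefer to resample the raw GRID audio yourself rather than downloading Grid_resample_ABS, the following is a minimal sketch using librosa; the file paths are placeholders, not from this repo, and the official preprocessing scripts may do this differently.

```python
# Minimal resampling sketch (assumed workflow, not the repo's official script):
# convert raw 25 kHz GRID audio to the 22050 Hz rate used by StyleDubber.
import librosa
import soundfile as sf

def resample_to_22050(src_wav: str, dst_wav: str) -> None:
    # librosa.load resamples to the requested rate on load.
    audio, sr = librosa.load(src_wav, sr=22050)
    sf.write(dst_wav, audio, sr)

# Example with placeholder paths:
# resample_to_22050("GRID_raw/s1/bbaf2n.wav", "Grid_resample_ABS/s1/bbaf2n.wav")
```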

β”œβ”€β”€ Phoneme_level_Feature (GoogleDrive βœ…)
β”‚   └── [all data features needed for training and inference on the V2C-Animation dataset]
└── GT_Wav (GoogleDrive βœ…)
    └── [22050 Hz ground-truth audio files in .wav]

Note: To train on V2C-Animation, you need to download Phoneme_level_Feature (preprocessed data features) and GT_Wav (ground-truth waveforms used for testing). The visual data used in intermediate processing steps (e.g., face and lip regions) can be obtained from HPMDubbing.

Quick Q&A: HPMDubbing also has pre-processed features. Are they the same? Can I use it to train StyleDubber?

No, you need to re-download to train StyleDubber. HPMDubbing needs frame frame-level feature with 220 hop length and 880 window length for the desired upsampling manner. StyleDubber currently only supports phoneme-level features and we adjust the hop length (256) and window length (1024) during pre-processing.
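For intuition, here is a minimal sketch of mel-spectrogram extraction under these STFT settings (hop length 256, window length 1024 at 22050 Hz). The FFT size and number of mel bins are assumed common defaults and are not confirmed from the official preprocessing scripts.

```python
# Sketch of mel-spectrogram extraction with StyleDubber's STFT settings.
import librosa
import numpy as np

def extract_mel(wav_path: str) -> np.ndarray:
    # Load and resample to 22050 Hz.
    audio, sr = librosa.load(wav_path, sr=22050)
    mel = librosa.feature.melspectrogram(
        y=audio,
        sr=sr,
        n_fft=1024,       # assumed to equal the window length
        hop_length=256,   # hop length noted above
        win_length=1024,  # window length noted above
        n_mels=80,        # assumed; a typical value for FastSpeech2-style models
    )
    # Log-compress to decibels; output shape is (n_mels, n_frames).
    return librosa.power_to_db(mel)
```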

πŸ’‘ Checkpoints

We provide pre-trained checkpoints for the GRID and V2C-Animation datasets below:

βš’οΈ Environment

Our Python version is 3.8.18 and our CUDA version is 11.5; other compatible versions may also work. Both training and inference are implemented with PyTorch on a GeForce RTX 4090 GPU.

conda create -n style_dubber python=3.8.18
conda activate style_dubber
pip install -r requirements.txt
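Before training, it can be worth confirming that PyTorch sees your GPU. A quick optional check (not part of the repo):

```python
# Optional sanity check: confirm PyTorch is installed and the GPU is visible.
import torch

print("PyTorch version:", torch.__version__)
print("CUDA available :", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU device     :", torch.cuda.get_device_name(0))
```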

πŸ”₯ Train Your Own Model

You need to replace the paths in preprocess_config (see "./ModelConfig_V2C/model_config/MovieAnimation/config_all.txt") with your own paths. To train on the V2C-Animation dataset (153 cartoon speakers), run:

python train_StyleDubber_V2C.py

You need to replace the paths in preprocess_config (see "./ModelConfig_GRID/model_config/GRID/config_all.txt") with your own paths. To train on the GRID dataset (33 real-world speakers), run:

python train_StyleDubber_GRID.py
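Since a wrong path only surfaces once training starts, a small optional pre-flight check like the following can save time. The directory names are placeholders for your own download locations, not values read from config_all.txt.

```python
# Hypothetical pre-flight check: verify the directories referenced in your
# preprocess_config exist before starting a long training run. Replace the
# placeholder paths with the ones you wrote into config_all.txt.
import os

required_dirs = [
    "/path/to/Phoneme_level_Feature",  # or 0_Grid_Wav_22050_Abs_Feature for GRID
    "/path/to/GT_Wav",                 # or Grid_resample_ABS for GRID
]

for d in required_dirs:
    status = "OK" if os.path.isdir(d) else "MISSING"
    print(f"[{status}] {d}")
```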

β­• Inference Wav

There are three kinds of dubbing settings in this paper. The first setting is the same as in V2C-Net (Chen et al., 2022a), which uses the target audio from the test set as the reference audio. However, this is impractical in real-world applications. Thus, we design two new and more reasonable settings: β€œDub 2.0” uses non-ground-truth audio of the same speaker as reference audio; β€œDub 3.0” uses the audio of unseen characters (from another dataset) as reference audio.

Illustration
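To make the three settings concrete, here is a purely illustrative sketch of how a reference utterance could be chosen under each setting; the function name, metadata structure, and random selection are hypothetical and do not reflect the actual evaluation scripts.

```python
# Illustrative reference-audio selection for the three dubbing settings.
# `utterances` maps speaker -> list of wav paths from the test set;
# `other_dataset` maps unseen speakers -> wav paths from another dataset.
import random

def pick_reference(setting: int, speaker: str, target_wav: str,
                   utterances: dict, other_dataset: dict) -> str:
    if setting == 1:
        # Setting 1 (as in V2C-Net): the target audio itself is the reference.
        return target_wav
    if setting == 2:
        # Dub 2.0: non-ground-truth audio from the same speaker.
        candidates = [w for w in utterances[speaker] if w != target_wav]
        return random.choice(candidates)
    # Dub 3.0: audio of an unseen character from another dataset.
    unseen_speaker = random.choice(list(other_dataset))
    return random.choice(other_dataset[unseen_speaker])
```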

Inference Setting1: V2C & GRID

python 0_evaluate_V2C_Setting1.py --restore_step <checkpoint_step>

or

python 0_evaluate_GRID_Setting1.py --restore_step <checkpoint_step>

Inference Setting2: V2C

python 0_evaluate_V2C_Setting2.py --restore_step <checkpoint_step>

Inference Setting3: V2C

python 0_evaluate_V2C_Setting3.py --restore_step <checkpoint_step>

πŸ€–οΈ Output Result

✏️ Citing

If you find our work useful, please consider citing:

@inproceedings{cong-etal-2024-styledubber,
    title = "{S}tyle{D}ubber: Towards Multi-Scale Style Learning for Movie Dubbing",
    author = "Cong, Gaoxiang  and
      Qi, Yuankai  and
      Li, Liang  and
      Beheshti, Amin  and
      Zhang, Zhedong  and
      Hengel, Anton  and
      Yang, Ming-Hsuan  and
      Yan, Chenggang  and
      Huang, Qingming",
    booktitle = "Findings of the Association for Computational Linguistics ACL 2024",
    month = aug,
    year = "2024",
    pages = "6767--6779",
}

πŸ™ Acknowledgments

We would like to thank the authors of previous related projects for generously sharing their code and insights: CDFSE_FastSpeech2, Multimodal Transformer, SMA, Meta-StyleSpeech, and FastSpeech2.