Audiovisual-Synthesis

Unsupervised Any-to-many Audiovisual Synthesis via Exemplar Autoencoders

Kangle Deng, Aayush Bansal, Deva Ramanan

project page / demo / arXiv

This repo provides a PyTorch implementation of our work.

Acknowledgements: This code borrows heavily from Auto-VC and Tacotron.

Summary Video

Demo Video

Dependencies

First, make sure ffmpeg is installed on your machine.

Then, run: pip install -r requirements.txt
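As a quick sanity check before training, the sketch below tests whether an ffmpeg executable is visible on your PATH. It is not part of this repo; the helper name `ffmpeg_available` is made up for illustration, and it only confirms that the binary can be found, not that its version is suitable.

```python
import shutil


def ffmpeg_available(binary: str = "ffmpeg") -> bool:
    """Return True if an executable named `binary` is found on PATH."""
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if ffmpeg_available():
        print("ffmpeg found")
    else:
        print("ffmpeg not found on PATH; install it before continuing")
```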

Data

We provide our CelebAudio Dataset at link.

Train

Voice Conversion

Check 'scripts/train_audio.sh' for an example of training a Voice-Conversion model. Make sure the directory 'logs' exists.

Generally, run:

python train_audio.py --data_path PATH_TO_TRAINING_DATA --experiment_name EXPERIMENT_NAME --save_freq SAVE_FREQ --test_path PATH_TO_TEST_AUDIO --batch_size BATCH_SIZE --save_dir PATH_TO_SAVE_MODEL

You can specify any audio data as PATH_TO_TRAINING_DATA, and a small clip of audio as PATH_TO_TEST_AUDIO. For example, the following script trains an audio model for Barack Obama and uses an input clip for testing every 2000 iterations. You can find the saved models and test results in the saving directory.

python train_audio.py --data_path datasets/celebaudio/BarackObama_01.wav --experiment_name VC_example_run --save_freq 2000 --test_path example/input_3_MartinLutherKing.wav  --batch_size 8 --save_dir ./saved_models/
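If you prefer to launch training from Python, the command above can be assembled as an argument list. The following is a minimal sketch and not part of this repo: `ensure_logs_dir` and `build_train_audio_cmd` are hypothetical helper names, and the defaults simply mirror the example command. Only flags shown in this README are used.

```python
import os
import sys


def ensure_logs_dir(path: str = "logs") -> str:
    """Create the log directory if it is missing and return its path."""
    os.makedirs(path, exist_ok=True)
    return path


def build_train_audio_cmd(data_path, experiment_name, test_path,
                          save_freq=2000, batch_size=8,
                          save_dir="./saved_models/"):
    """Assemble the train_audio.py command (hypothetical helper)."""
    return [
        sys.executable, "train_audio.py",
        "--data_path", data_path,
        "--experiment_name", experiment_name,
        "--save_freq", str(save_freq),
        "--test_path", test_path,
        "--batch_size", str(batch_size),
        "--save_dir", save_dir,
    ]


if __name__ == "__main__":
    cmd = build_train_audio_cmd(
        "datasets/celebaudio/BarackObama_01.wav",
        "VC_example_run",
        "example/input_3_MartinLutherKing.wav",
    )
    # Print the command instead of running it; pass `cmd` to
    # subprocess.run(cmd, check=True) after calling ensure_logs_dir().
    print(" ".join(cmd))
```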

Audiovisual Synthesis

Check 'scripts/train_audiovisual.sh' for an example of training an Audiovisual-Synthesis model. We usually train an audiovisual model based on a pretrained audio model.

1-stage generation -- video resolution: 256 * 256

Generally, run:

python train_audiovisual.py --video_path PATH_TO_TRAINING_DATA --experiment_name EXPERIMENT_NAME --save_freq SAVE_FREQ --test_path PATH_TO_TEST_AUDIO --batch_size BATCH_SIZE --save_dir PATH_TO_SAVE_MODEL --use_256 --load_model LOAD_MODEL_PATH

You can specify any audiovisual data as PATH_TO_TRAINING_DATA, and a small clip of audio as PATH_TO_TEST_AUDIO. The following script trains an audiovisual model based on a pre-trained Obama audio model and uses an input clip for testing every 2000 iterations. You can find the saved models and test results in the saving directory.

python train_audiovisual.py --video_path datasets/video/obama.mp4 --experiment_name Audiovisual_example_run --save_freq 2000 --test_path example/input_3_MartinLutherKing.wav --batch_size 8 --save_dir ./saved_models/ --use_256 --load_model ./saved_models/VC_example_run/Epoch600_Iter00030000.pkl
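The --load_model argument above points at a specific checkpoint file. If a run has saved several, a small helper can pick the most recent one. The sketch below assumes checkpoints are named like `Epoch600_Iter00030000.pkl`, a pattern inferred only from the example path in this README; `latest_checkpoint` is a hypothetical helper, not part of the repo.

```python
import re

# Assumed naming scheme, inferred from the example path in this README.
CKPT_RE = re.compile(r"Epoch(\d+)_Iter(\d+)\.pkl$")


def latest_checkpoint(filenames):
    """Return the name with the highest (iteration, epoch), or None."""
    best_name = None
    best_key = None
    for name in filenames:
        match = CKPT_RE.search(name)
        if match is None:
            continue  # skip files that do not look like checkpoints
        key = (int(match.group(2)), int(match.group(1)))
        if best_key is None or key > best_key:
            best_name, best_key = name, key
    return best_name


if __name__ == "__main__":
    names = ["Epoch200_Iter00010000.pkl", "Epoch600_Iter00030000.pkl"]
    print(latest_checkpoint(names))
```

In practice you could pass `os.listdir("./saved_models/VC_example_run")` as `filenames` and join the result with the directory path.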

2-stage generation -- video resolution: 512 * 512

If you want the video resolution to be 512 * 512, use the StackGAN-style 2-stage generation.

Generally, run:

python train_audiovisual.py --video_path PATH_TO_TRAINING_DATA --experiment_name EXPERIMENT_NAME --save_freq SAVE_FREQ --test_path PATH_TO_TEST_AUDIO --batch_size BATCH_SIZE --save_dir PATH_TO_SAVE_MODEL --residual --load_model LOAD_MODEL_PATH

Test

Voice Conversion

Check 'scripts/test_audio.sh' for an example of testing a Voice-Conversion model.

To convert a wavfile using a trained model, run:

python test_audio.py --model PATH_TO_MODEL --wav_path PATH_TO_INPUT --output_file PATH_TO_OUTPUT

You can specify any audio data as PATH_TO_INPUT. For example, the following script converts the input wavfile using a pre-trained audio model.

python test_audio.py --model ./saved_models/VC_example_run/Epoch600_Iter00030000.pkl --wav_path example/input_1_Trump.wav --output_file ./result.wav
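To convert several input clips with the same model, the command above can be generated once per file. This is a rough sketch, not part of the repo: `build_test_audio_cmds` and the `_converted.wav` output suffix are illustrative choices, and only the flags shown in this README are used.

```python
from pathlib import Path


def build_test_audio_cmds(model_path, wav_paths, output_dir):
    """Build one test_audio.py command per input wav (hypothetical helper)."""
    cmds = []
    for wav in wav_paths:
        # Output name: <input stem>_converted.wav inside output_dir.
        out = Path(output_dir) / (Path(wav).stem + "_converted.wav")
        cmds.append([
            "python", "test_audio.py",
            "--model", str(model_path),
            "--wav_path", str(wav),
            "--output_file", str(out),
        ])
    return cmds


if __name__ == "__main__":
    cmds = build_test_audio_cmds(
        "./saved_models/VC_example_run/Epoch600_Iter00030000.pkl",
        ["example/input_1_Trump.wav", "example/input_3_MartinLutherKing.wav"],
        "./results",
    )
    for cmd in cmds:
        print(" ".join(cmd))
```

Each list can then be passed to `subprocess.run(cmd, check=True)`; the output directory must exist beforehand.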

Audiovisual Synthesis

Check 'scripts/test_audiovisual.sh' for an example of testing an Audiovisual-Synthesis model.

1-stage generation -- video resolution: 256 * 256

python test_audiovisual.py --load_model PATH_TO_MODEL --wav_path PATH_TO_INPUT --output_file PATH_TO_OUTPUT --use_256

You can specify any audio data as PATH_TO_INPUT. For example, the following script converts the input wavfile using a pre-trained audiovisual model.

python test_audiovisual.py --load_model ./saved_models/Audiovisual_example_run/Epoch600_Iter00030000.pkl --wav_path example/input_1_Trump.wav  --output_file ./result.mp4 --use_256

2-stage generation -- video resolution: 512 * 512

python test_audiovisual.py --load_model PATH_TO_MODEL --wav_path PATH_TO_INPUT --output_file PATH_TO_OUTPUT --residual
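The two test commands differ only in the final flag: --use_256 for 256 * 256 output and --residual for 512 * 512 output. The sketch below captures that mapping in a small wrapper. The helper names are hypothetical, and the assumption that these are the only two supported resolutions comes from this README alone.

```python
def resolution_flag(resolution: int) -> str:
    """Map an output resolution to the flag used in this README."""
    flags = {256: "--use_256", 512: "--residual"}
    if resolution not in flags:
        raise ValueError(f"unsupported resolution: {resolution}")
    return flags[resolution]


def build_test_audiovisual_cmd(model_path, wav_path, output_file,
                               resolution=256):
    """Assemble the test_audiovisual.py command (hypothetical helper)."""
    return [
        "python", "test_audiovisual.py",
        "--load_model", model_path,
        "--wav_path", wav_path,
        "--output_file", output_file,
        resolution_flag(resolution),
    ]


if __name__ == "__main__":
    cmd = build_test_audiovisual_cmd(
        "./saved_models/Audiovisual_example_run/Epoch600_Iter00030000.pkl",
        "example/input_1_Trump.wav",
        "./result.mp4",
        resolution=256,
    )
    print(" ".join(cmd))
```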

Tensorboard (Optional)

This repo uses tensorboardX to visualize the training loss. You can also check test audio results in TensorBoard.

Start TensorBoard with tensorboard --logdir ./logs.
