Make It Move: Controllable Image-to-Video Generation with Text Descriptions


This repository contains the datasets and source code used in the CVPR 2022 paper "Make It Move: Controllable Image-to-Video Generation with Text Descriptions".

Update

[X] We improved MAGE with a more powerful autoencoder and a controller over the VAE. The code and models of the improved version, MAGE+, have been released on Google Drive.

[X] We proposed two no-reference evaluation metrics, action precision and referring expression precision, which evaluate the precision of fine-grained motions using a captioning-and-matching method. (We chose SwinBERT as the captioning model. Please download the model trained on CATER-GENs from Google Drive and put it under 'metrics/swinbert_cater'.)

$ docker run --gpus all --ipc=host --rm -it --mount src=/home/user/SwinBERT/,dst=/videocap,type=bind --mount src=/home/user/,dst=/home/user/,type=bind -w /videocap linjieli222/videocap_torch1.7:fairscale bash -c "source /videocap/setup.sh && bash"
$ python metrics/swinbert_cater/eval_precision_run_caption_VidSwinBert.py --do_lower_case --do_test --eval_model_dir ./metrics/swinbert_cater/ --test_video_fname /home/results/

$ python eval_precision.py --data-root /home/user/datasets/CATER-GEN-v1 --gen-caption /home/user/results/catergenv1_diverse/generated_captions.json --mode ambiguous
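
To illustrate the captioning-and-matching idea behind these metrics, here is a minimal sketch, not the logic of eval_precision.py. It assumes captions are stored as dictionaries keyed by video id and counts a video as correct only when its generated caption matches the reference after whitespace and case normalization. The function name `caption_precision`, the dictionary layout, and the exact-match rule are all assumptions made for illustration; the actual matching rule used by the repository is not described here.

```python
def caption_precision(generated, reference):
    """Fraction of shared video ids whose generated caption matches the reference.

    Both arguments are hypothetical {video_id: caption} dictionaries.
    Exact match after normalization is an illustrative stand-in for the
    real matching rule.
    """
    def normalize(text):
        # Lowercase and collapse runs of whitespace.
        return " ".join(text.lower().split())

    keys = [k for k in generated if k in reference]
    if not keys:
        return 0.0
    hits = sum(normalize(generated[k]) == normalize(reference[k]) for k in keys)
    return hits / len(keys)


if __name__ == "__main__":
    generated = {"v1": "The cone  slides to (1, 2)", "v2": "the sphere rotates"}
    reference = {"v1": "the cone slides to (1, 2)", "v2": "the cube rotates"}
    print(caption_precision(generated, reference))
```

Splitting each caption into an action part and a referring-expression part before matching would give two separate scores in the same way.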

Dataset Generation

Moving MNIST datasets

The scripts that generate the Moving MNIST datasets are adapted from Sync-DRAW. Run the following commands to generate Single Moving MNIST, Double Moving MNIST, and our Modified Double Moving MNIST, respectively.

$ python data/mnist_caption_single.py
$ python data/mnist_caption_double.py
$ python data/mnist_caption_double_modified.py
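
For readers unfamiliar with how a Moving MNIST clip is built, the following is a minimal sketch of the general idea: one digit patch is pasted onto a blank canvas at a position that shifts by a fixed step each frame. It is not taken from the scripts above; the function name `move_digit`, the 64x64 canvas size, and the clamping at the borders are assumptions made for illustration, and the captions the real scripts attach to each clip are not reproduced.

```python
import numpy as np


def move_digit(digit, num_frames=10, canvas=64, start=(0, 0), step=(2, 0)):
    """Return a (num_frames, canvas, canvas) clip of `digit` moving linearly.

    `start` and `step` are (row, column) pairs. All defaults are
    illustrative assumptions, not values from the repository.
    """
    h, w = digit.shape
    frames = np.zeros((num_frames, canvas, canvas), dtype=digit.dtype)
    y0, x0 = start
    dy, dx = step
    for t in range(num_frames):
        # Clamp the top-left corner so the digit stays inside the canvas.
        y = int(np.clip(y0 + t * dy, 0, canvas - h))
        x = int(np.clip(x0 + t * dx, 0, canvas - w))
        frames[t, y:y + h, x:x + w] = digit
    return frames


if __name__ == "__main__":
    # A block of ones stands in for a 28x28 MNIST digit.
    patch = np.ones((28, 28), dtype=np.uint8)
    clip = move_digit(patch, num_frames=5, step=(0, 3))
    print(clip.shape)
```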

CATER-GENs

Datasets Download

The original CATER-GEN-v1 and CATER-GEN-v2 used in our paper are provided at link1 and link2, respectively.

Create Your Own Datasets

Thanks to the authors of CATER and CLEVR for making their code available, you can also generate your own datasets as follows.

First, generate videos and metadata following the CATER guidelines. Adjust the hyper-parameters min_objects, max_objects, num_frames, num_images, width, and height as needed, and fix CAM_MOTION = False and start_frame = 0. Then generate the text descriptions by running:

$ python data/gen_cater_text_anno.py

MAGE

Our proposed baseline, MAGE, is trained in two stages. The first stage trains a VQ-VAE encoder and decoder. The second stage trains the remaining video generation model. The trained models are available on Google Drive.
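
As background for the first stage, the step that defines a VQ-VAE is vector quantization: each encoder feature vector is replaced by its nearest entry in a learned codebook. The snippet below is a minimal NumPy sketch of that lookup only. It is not the code in train_vqvae.py; the function name `quantize` and the array shapes are assumptions made for illustration.

```python
import numpy as np


def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry.

    features: (N, D) array of encoder outputs (assumed layout).
    codebook: (K, D) array of code vectors (assumed layout).
    Returns the (N,) code indices and the (N, D) quantized vectors.
    """
    # Squared Euclidean distance between every feature and every code.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]


if __name__ == "__main__":
    codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
    features = np.array([[0.1, -0.1], [3.5, 4.2], [0.9, 1.2]])
    indices, quantized = quantize(features, codebook)
    print(indices)
```

The discrete indices produced this way are what a second-stage model can operate on instead of raw pixels.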

Environment

Our code has been tested on Ubuntu 18.04. Before starting, please configure your Anaconda environment with:

$ conda create -n mage python=3.8
$ conda activate mage
$ pip install -r requirements.txt

Stage 1. VQ-VAE Training

$ python train_vqvae.py --dataset mnist --data-root /data/data_file --output-folder ./models/vqvae_model_file

Stage 2. MAGE Training

$ python main_mage.py --split train --config config/model.yaml --checkpoint-path ./models/MAGE/model_path

Sampling

$ python main_mage.py --split test --config config/model.yaml --checkpoint-path ./models/MAGE/model_path

Citation

If you find this repository useful in your research, please cite:

@InProceedings{hu2022mage,
    title={Make It Move: Controllable Image-to-Video Generation with Text Descriptions},
    author={Yaosi Hu and Chong Luo and Zhenzhong Chen},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2022}
}
