FoundationVision / OmniTokenizer

OmniTokenizer: one model and one weight for image-video joint tokenization.
MIT License

OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation

Official PyTorch implementation of the following paper:

OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation.

Junke Wang1,2, Yi Jiang3, Zehuan Yuan3, Binyue Peng3, Zuxuan Wu1,2, Yu-Gang Jiang1,2
1Shanghai Key Lab of Intell. Info. Processing, School of CS, Fudan University
2Shanghai Collaborative Innovation Center of Intelligent Visual Computing, 3Bytedance Inc.

We introduce OmniTokenizer, a joint image-video tokenizer which features the following properties:

Please refer to our project page for the reconstruction and generation results by OmniTokenizer.


Please set up the environment using the following commands:

pip3 install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url
pip3 install -r requirements.txt
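
After installation, it can be useful to confirm that the pinned versions were actually picked up. The following is a minimal sketch using only the Python standard library. The helper name `check_pins` is hypothetical and not part of this repository; the pins are copied from the install command above.

```python
from importlib import metadata

# Version pins copied from the install command above.
PINS = {"torch": "2.2.1", "torchvision": "0.17.1", "torchaudio": "2.2.1"}


def check_pins(pins=PINS):
    """Return {package: (expected, installed, ok)}; installed is None if missing."""
    report = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Wheels from a custom index may carry a local suffix such as "+cu121",
        # so compare only the part before "+".
        ok = installed is not None and installed.split("+")[0] == expected
        report[name] = (expected, installed, ok)
    return report


if __name__ == "__main__":
    for name, (expected, installed, ok) in check_pins().items():
        status = "ok" if ok else "MISMATCH"
        print(f"{name}: expected {expected}, found {installed} [{status}]")
```

A package that is not installed is reported with `installed` set to `None` rather than raising, so the check can be run before or after `pip3 install -r requirements.txt`.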

Then download the datasets from the official websites. You can also download the annotations processed by us and put them under ./annotations.

Model Zoo for VQVAE and VAE

We release both VQVAE and VAE versions of OmniTokenizer, pretrained on a wide range of image and video datasets:

| Type | Training Data | FID | FVD | ckpt |
|---|---|---|---|---|
| VQVAE | ImageNet | 1.28[^1] | - | imagenet_only.ckpt |
| VQVAE | CelebAHQ | 1.85 | - | celebahq.ckpt |
| VQVAE | FFHQ | 2.58 | - | ffhq.ckpt |
| VQVAE | ImageNet + UCF | 1.11 | 42.35 | imagenet_ucf.ckpt |
| VQVAE | ImageNet + K600 | 1.23 | 25.97 | imagenet_k600.ckpt |
| VQVAE | ImageNet + MiT | 1.26 | 19.87 | imagenet_mit.ckpt |
| VQVAE | ImageNet + Sthv2 | 1.21 | 20.30 | imagenet_sthv2.ckpt |
| VQVAE | CelebAHQ + UCF | 1.93 | 45.59 | celebahq_ucf.ckpt |
| VQVAE | CelebAHQ + K600 | 1.82 | 89.13 | celebahq_k600.ckpt |
| VQVAE | FFHQ + UCF | 1.91 | 57.93 | ffhq_ucf.ckpt |
| VQVAE | FFHQ + K600 | 2.69 | 87.58 | ffhq_k600.ckpt |
| VAE | ImageNet + UCF | 0.69 | 23.44 | imagenet_ucf_vae.ckpt |
| VAE | ImageNet + K600 | 0.78 | 13.02 | imagenet_k600_vae.ckpt |

[^1]: This model was trained without scaled_dot_product_attention; please comment out lines 446-460 in OmniTokenizer/modules/ to reproduce this result.

We recommend trying imagenet_k600.ckpt, as it is trained on large-scale image and video data.

You can easily incorporate OmniTokenizer into your language model or diffusion model with:

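from OmniTokenizer import OmniTokenizer_VQGAN

# Path to a downloaded checkpoint, e.g. imagenet_k600.ckpt from the model zoo above
vqgan_ckpt = "imagenet_k600.ckpt"
vqgan = OmniTokenizer_VQGAN.load_from_checkpoint(vqgan_ckpt, strict=False)

# tokens = vqgan.encode(img)
# recons = vqgan.decode(tokens)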
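
For readers unfamiliar with what encoding to tokens and decoding from tokens mean conceptually in a VQVAE-style tokenizer, here is a generic sketch of vector quantization: each latent vector is replaced by the index of its nearest codebook entry, and the inverse step looks the entries back up. This is a toy illustration in plain Python, not OmniTokenizer's implementation. The encoder and decoder networks are omitted, and the codebook, the function names `quantize` and `lookup`, and the 2-D latents are all made up for the example.

```python
def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry (squared L2)."""
    tokens = []
    for z in latents:
        dists = [sum((a - b) ** 2 for a, b in zip(z, e)) for e in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens


def lookup(tokens, codebook):
    """Inverse step: replace each token index by its codebook vector."""
    return [codebook[t] for t in tokens]


# Hypothetical 4-entry codebook of 2-D vectors (real codebooks are learned and much larger).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
latents = [(0.1, -0.2), (0.9, 0.8), (0.2, 0.7)]

tokens = quantize(latents, codebook)   # discrete token ids: [0, 3, 2]
recons = lookup(tokens, codebook)      # quantized latents that a decoder would consume
```

The discrete ids in `tokens` are what a language model would be trained on; a diffusion model would typically operate on continuous latents instead, which is where the VAE variant comes in.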

Tokenizer (VQVAE and VAE)

Training the VQVAE has two stages: image-only training at a fixed resolution, followed by image-video joint training at multiple resolutions. The VQVAE model is then fine-tuned with a KL loss to obtain the VAE model.
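
The KL loss used in the last step is not spelled out here. For a VAE with a diagonal Gaussian posterior and a standard normal prior, the usual closed form is 0.5 * sum(mu^2 + exp(logvar) - 1 - logvar); the sketch below computes it in plain Python. It assumes this standard form applies; the exact loss and weighting used in this repository may differ, and the function name `gaussian_kl` is hypothetical.

```python
import math


def gaussian_kl(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, 1) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))


# A posterior equal to the prior has zero KL; shifting the mean increases it.
print(gaussian_kl([0.0, 0.0], [0.0, 0.0]))  # 0.0
print(gaussian_kl([1.0, 0.0], [0.0, 0.0]))  # 0.5
```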

Please refer to scripts/recons/ for the training of OmniTokenizer. Explanation of the flags that you may want to change for different settings:

For the evaluation of OmniTokenizer, please refer to scripts/recons/, scripts/recons/, scripts/recons/

LM-based Visual Synthesis

Please refer to scripts/lm_train and scripts/lm_gen for the training and evaluation of the language model. We provide checkpoints for ImageNet [imagenet_class_lm.ckpt], UCF [ucf_class_lm.ckpt], and Kinetics-600 [k600_fp_lm.ckpt].

Diffusion-based Visual Synthesis

We adopt DiT and Latte for diffusion-based visual generation. Please refer to for the training and evaluation instructions.


Please refer to for how to evaluate the reconstruction or generation results.


Our code is partially built upon VQGAN and TATS. We also appreciate the wonderful tools provided by pytorch-fid and common_metrics_on_video_quality.


This project is licensed under the MIT license, as found in the LICENSE file.