This is the codebase for the paper
Xuantong Liu, Shaozhe Hao, Xianbiao Qi*, Tianyang Hu#, Jun Wang, Rong Xiao, Yuan Yao#
The Hong Kong University of Science and Technology, The University of Hong Kong, Intellifusion, Huawei Noah's Ark Lab
(*: Project leader; #: Corresponding authors)
[Project Page] [arXiv] [Colab]
We explore the design space of using language models for image generation, including the choice of image tokenizer (Binary Autoencoder or Vector-Quantized Autoencoder), the language modeling method (AutoRegressive or Masked Language Model), the vocabulary design based on BAE, and the sampling strategies. We achieve a strong baseline (1.54 FID on ImageNet 256 $\times$ 256) compared with language-model-based and diffusion-model-based image generation models. We also analyze the fundamental differences between image and language sequence generation and the learning behavior of language models on image generation, demonstrating the scaling law and the great potential of AR models across different domains.
We provide 4 BAE tokenizers with code dimensions 16, 20, and 24 (listed in the table below), each trained for 1,000,000 iterations with a batch size of 256. We also provide the checkpoints for all the generation models discussed in the paper. All the download links are provided.
You can simply install the environment from the environment.yml file:

```bash
conda env create -f environment.yml
conda activate ELM
```
You can download the checkpoints for the image tokenizers (BAE) and generation models from link.
| Code Dim | Bernoulli Sampling | Link | Size |
|---|---|---|---|
| 16 | β | link | 332MB |
| 16 | β | link | 332MB |
| 20 | β | link | 332MB |
| 24 | β | link | 332MB |
| Model | Link | Size |
|---|---|---|
| AR-L | [1-16] [2-8] [2-10] [2-12] | 1.25GB~1.77GB |
| AR-XL | [1-16] [2-8] [2-10] [2-12] | 2.95GB~3.6GB |
| AR-XXL | [1-16] [2-10] [2-12] | 5.49GB~6.25GB |
| AR-2B | [2-12] | 7.64GB |
| MLM-L | [1-16] | 1.51GB |
| MLM-XL | [1-16] | 3.27GB |
| MLM-XXL | [1-16] | 5.86GB |
If you want to generate samples with our pretrained models, run

```bash
bash inference.sh
```

You need to specify the checkpoint path in `--ckpt`. The default setting generates samples from 8 classes: [207, 360, 387, 974, 88, 979, 417, 279].
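For reference, a minimal sketch of what the command inside inference.sh might look like. The entry script name `sample.py` and the checkpoint path are placeholders (check the actual script invoked by inference.sh); only the `--ckpt` flag is taken from the description above:

```bash
# Hypothetical sketch -- replace sample.py and the checkpoint path with the actual
# entry script and the generation-model checkpoint you downloaded.
python sample.py \
    --ckpt ./checkpoints/AR-XL_2-10.pt
```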
If you want to generate images larger than 256 $\times$ 256, activate `--v_expand` (for vertical expansion) or `--h_expand` (for horizontal expansion) in inference.sh. `--overlap_width` sets the length of the preceding sequence used at each expansion step, `--expand_time` sets how many times to expand, and `--gen_num` specifies the number of generated samples. A sketch of such a command is shown below.
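This is a hedged sketch of an expansion run, assuming the same (hypothetical) entry script as above; the flag names come from the description above, while the numeric values are arbitrary examples:

```bash
# Hypothetical sketch: horizontally expand 256x256 generations.
# --overlap_width : length of the preceding sequence reused at each expansion step
# --expand_time   : how many times to expand
# --gen_num       : number of generated samples
python sample.py \
    --ckpt ./checkpoints/AR-XL_2-10.pt \
    --h_expand \
    --overlap_width 8 \
    --expand_time 2 \
    --gen_num 4
```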
If you want to train ELM-L with vocabulary 2-10 on 1 GPU node with 8 GPUs, just run

```bash
bash train.sh
```

You need to specify the ImageNet dataset path in `--data-path`. You can change the model size through `--model` (L, XL, XXL, and 2B), the modeling method through `--modeling` (ar or mlm), the number of sub-codes through `--token-each` (1, 2, 3, ...), and the dimension of each code through `--code-dim`.
Remember that the `codebook_size` should be equal to `token-each` $\times$ `code-dim`. Setting `--hm-dist` larger than 1 enables soft labels based on the Hamming distance; however, we found it of little use, and we have not utilized or discussed it in our paper. You are free to give it a try! A sketch of a full training command is given below.
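As a reference, here is a sketch of what a train.sh-style launch for ELM-L with the 2-10 vocabulary could look like. The torchrun launcher and the entry script name `train.py` are assumptions; the flag names follow the description above:

```bash
# Hypothetical sketch: ELM-L, AR modeling, vocabulary 2-10, single node with 8 GPUs.
# Note: token-each x code-dim = 2 x 10 = 20, matching the BAE tokenizer with code dimension 20.
torchrun --nproc_per_node=8 train.py \
    --data-path /path/to/imagenet/train \
    --model L \
    --modeling ar \
    --token-each 2 \
    --code-dim 10
```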
We train L/XL-sized models using 8 A800 GPUs, and XXL/2B-sized models using 32 A800 GPUs across 4 nodes.
For each model size, we report the 50k-FID without cfg, using the most suitable tokenizer, computed with pytorch_fid (see the example command after the table).
| Model | FID |
|---|---|
| L, 2-10 | 17.95 |
| XL, 2-10 | 13.70 |
| XXL, 2-12 | 11.41 |
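For example, with 50k generated samples saved as individual images, the FID against an ImageNet reference folder (or a precomputed .npz statistics file) can be computed with the pytorch_fid command-line tool; the paths below are placeholders:

```bash
# Compare 50k generated images with the ImageNet 256x256 reference set (folder or .npz statistics).
python -m pytorch_fid /path/to/generated_50k /path/to/imagenet_ref_or_stats.npz --device cuda:0
```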
The training loss of token-prediction-based image generation does not converge to a low value, yet the models still achieve strong image generation quality; the rationale behind this is discussed in our paper. We show the training loss curves of models of different sizes with the same tokenizer, where the scaling law is also evident.
However, we do not compare the training loss trends of models with different tokenizers (such as L with 1-16, 2-8, 2-10, ...): since different tokenizers lead to different vocabulary sizes, the losses are of different magnitudes and are not directly comparable.