dorarad / gansformer

Generative Adversarial Transformers
MIT License
1.32k stars 149 forks source link
attention compositionality gans generative-adversarial-networks image-generation scene-generation transformers

Python 3.7 PyTorch 1.8 TensorFlow 1.14 cuDNN 7.3.1 License CC BY-NC

GANformer: Generative Adversarial Transformers

Drew A. Hudson & C. Lawrence Zitnick

Check out our new PyTorch version and the GANformer2 paper!

Update (Feb 21, 2022): We updated the weight initialization of the PyTorch version to the intended scale, leading to a substantial improvement in the model's learning speed!

This is an implementation of the GANformer model, a novel and efficient type of transformer, explored for the task of image generation. The network employs a bipartite structure that enables long-range interactions across the image, while maintaining computation of linearly efficiency, that can readily scale to high-resolution synthesis. The model iteratively propagates information from a set of latent variables to the evolving visual features and vice versa, to support the refinement of each in light of the other and encourage the emergence of compositional representations of objects and scenes. In contrast to the classic transformer architecture, it utilizes multiplicative integration that allows flexible region-based modulation, and can thus be seen as a generalization of the successful StyleGAN network.

1st Paper: https://arxiv.org/pdf/2103.01209
2nd Paper: https://arxiv.org/abs/2111.08960
Contact: dorarad@cs.stanford.edu
Implementation: network.py (TF / Pytorch)

We now support both PyTorch and TF!

:white_check_mark: Uploading initial code and readme
:white_check_mark: Image sampling and visualization script
:white_check_mark: Code clean-up and refacotiring, adding documentation
:white_check_mark: Training and data-prepreation intructions
:white_check_mark: Pretrained networks for all datasets
:white_check_mark: Extra visualizations and evaluations
:white_check_mark: Providing models trained for longer
:white_check_mark: Releasing the PyTorch version
:white_check_mark: Releasing pre-trained models for high-resolutions (up to 1024 x 1024)
⬜️ Releasing the GANformer2 model (supporting layout generation and conditional layout2image generation)

If you experience any issues or have suggestions for improvements or extensions, feel free to contact me either thourgh the issues page or at dorarad@stanford.edu.

Bibtex

@article{hudson2021ganformer,
  title={Generative Adversarial Transformers},
  author={Hudson, Drew A and Zitnick, C. Lawrence},
  journal={Proceedings of the 38th International Conference on Machine Learning, {ICML} 2021},
  year={2021}
}

@article{hudson2021ganformer2,
  title={Compositional Transformers for Scene Generation},
  author={Hudson, Drew A and Zitnick, C. Lawrence},
  journal={Advances in Neural Information Processing Systems {NeurIPS} 2021},
  year={2021}
}

Sample Images

Using the pre-trained models (generated after training for 5-7x less steps than StyleGAN2 models! Training our models for longer will improve the image quality further):

Requirements

Quickstart & Overview

Our repository supports both Tensorflow (at the main directory) and Pytorch (at pytorch_version). The two implementations follow a close code and files structure, and share the same interface. To switch from the TF to Pytorch, simply enter into pytorch_version), and install the requirements. Please feel free to open an issue or contact for any questions or suggestions about the new implementation!

A minimal example of using a pre-trained GANformer can be found at generate.py (TF / Pytorch). When executed, the 10-lines program downloads a pre-trained modle and uses it to generate some images:

python generate.py --gpus 0 --model gdrive:bedrooms-snapshot.pkl --output-dir images --images-num 32

You can use --truncation-psi to control the generated images quality/diversity trade-off.
We recommend trying out different values in the range of 0.6-1.0.

Pretrained models and High resolutions

We provide pretrained models for resolution 256×256 for all datasets, as well as 1024×1024 for FFHQ and 1024×2048 for Cityscapes.

To generate images for the high-resolution models, run the following commands: (We reduce their batch-size to 1 so that they can load onto a single GPU)

python generate.py --gpus 0 --model gdrive:ffhq-snapshot-1024.pkl --output-dir ffhq_images --images-num 32 --batch-size 1
python generate.py --gpus 0 --model gdrive:cityscapes-snapshot-2048.pkl --output-dir cityscapes_images --images-num 32 --batch-size 1 --ratio 0.5 # 1024 x 2048 cityscapes currently supported in the TF version only

We can train and evaluate new or pretrained model both quantitatively and qualitative with run_network.py (TF / Pytorch).
The model architecutre can be found at network.py (TF / Pytorch). The training procedure is implemented at training_loop.py (TF / Pytorch).

Data preparation

We explored the GANformer model on 4 datasets for images and scenes: CLEVR, LSUN-Bedrooms, Cityscapes and FFHQ. The model can be trained on other datasets as well. We trained the model on 256x256 resolution. Higher resolutions are supported too. The model will automatically adapt to the resolution of the images in the dataset.

The prepare_data.py (TF / Pytorch) can either prepare the datasets from our catalog or create new datasets.

Default Datasets

To prepare the datasets from the catalog, run the following command:

python prepare_data.py --ffhq --cityscapes --clevr --bedrooms --max-images 100000

See table below for details about the datasets in the catalog.

Useful options:

Custom Datasets

You can also use the script to create new custom datasets. For instance:

python prepare_data.py --task <dataset-name> --images-dir <source-dir> --format png --ratio 0.7 --shards-num 5

The script supports several formats: png, jpg, npy, hdf5, tfds and lmdb.

Dataset Catalog

Dataset # Images Resolution Download Size TFrecords Size Gamma
FFHQ 70,000 256×256 13GB 13GB 10
CLEVR 100,015 256×256 18GB 15.5GB 40
Cityscapes 24,998 256×256 1.8GB 8GB 20
LSUN-Bedrooms 3,033,042 256×256 42.8GB Up to 480GB 100

Use --max-images to reduce the size of the tfrecord files.

Training

Models are trained by using the --train option. To fine-tune a pretrained GANformer model:

python run_network.py --train --gpus 0 --ganformer-default --expname clevr-pretrained --dataset clevr \
  --pretrained-pkl gdrive:clevr-snapshot.pkl

We provide pretrained models for bedrooms, cityscapes, clevr and ffhq.

To train a GANformer in its default configuration form scratch:

python run_network.py --train --gpus 0 --ganformer-default --expname clevr-scratch --dataset clevr --eval-images-num 10000

By defualt, models training is resumed from the latest snapshot. Use --restart to strat a new experiment, or --pretrained-pkl to select a particular snapshot to load.

For comparing to state-of-the-art, we compute metric scores using 50,000 sample imaegs. To expedite training though, we recommend setting --eval-images-num to a lower number. Note though that this can impact the precision of the metrics, so we recommend using a lower value during training, and increasing it back up in the final evaluation.

We support a large variety of command-line options to adjust the model, training, and evaluation. Run python run_network.py -h for the full list of options!

we recommend exploring different values for --gamma when training on new datasets. If you train on resolution >= 512 and observe OOM issues, consider reducing --batch-gpu to a lower value.

Logging

Baseline models

The codebase suppors multiple baselines in addition to the GANformer. For instance, to run a vanilla GAN model:

python run_network.py --train --gpus 0 --baseline GAN --expname clevr-gan --dataset clevr 

Evaluation

To evalute a model, use the --eval option:

python run_network.py --eval --gpus 0 --expname clevr-exp --dataset clevr

Add --pretrained-pkl gdrive:<dataset>-snapshot.pkl to evalute a pretrained model.

Below we provide the FID-50k scores for the GANformer (using the pretrained checkpoints above) as well as baseline models.
Note that these scores are different than the scores reported in the StyleGAN2 paper since they run experiments for up to 7x more training steps (5k-15k kimg-steps in our experiments over all models, which takes about 3-4 days with 4 GPUs, vs 50-70k kimg-steps in their experiments, which take over 90 GPU-days).

Note regarding Generator/Discriminator: Following ablation experiments, we observed that incorporating the simplex and duplex attention to the generator (rather than to both the generator and discriminator) improve the models' performance. Accordingly, we are releasing pretrained models that incorporate attention to the generator only, and we have updated the paper to reflect that!

Model CLEVR LSUN-Bedroom FFHQ Cityscapes
GAN 25.02 12.16 13.18 11.57
kGAN 28.28 69.9 61.14 51.08
SAGAN 26.04 14.06 16.21 12.81
StyleGAN2 16.05 11.53 16.21 8.35
GANformer 9.24 6.15 7.42 5.23

Model Change-log

Compared to the original GANformer depicted in the paper, this repository make several additional improvments that contributed to the performance:

Visualization

The code supports producing qualitative results and visualizations. For instance, to create attention maps for each layer:

python run_network.py --gpus 0 --vis --expname clevr-exp --dataset clevr --vis-layer-maps

Below you can see sample images and attention maps produced by the GANformer:

Command-line Options

In the following we list some of the most useful model options.

Training

Model (most useful)

Model (others)

Visualization

Run python run_network.py -h for the full options list.

Sample images (more examples)




CUDA / Installation

The model relies on custom TensorFlow/Pytorch ops that are compiled on the fly using NVCC.

To set up the environment e.g. for cuda-10.0:

export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

To test that your NVCC installation is working correctly, run:

nvcc test_nvcc.cu -o test_nvcc -run
| CPU says hello.
| GPU says hello.

In the pytorch version, if you get the following repeating message:
"Failed to build CUDA kernels for upfirdn2d. Falling back to slow reference implementation"
make sure your cuda and pytorch versions match. If you have multiple CUDA installed, consider using setting CUDA_HOME to the matching one. E.g.

export CUDA_HOME=/usr/local/cuda-10.1

Architecture Overview

The GANformer consists of two networks:

Generator: which produces the images (x) given randomly sampled latents (z). The latent z has a shape [batch_size, component_num, latent_dim], where component_num = 1 by default (Vanilla GAN, StyleGAN) but is > 1 for the GANformer model. We can define the latent components by splitting z along the second dimension to obtain z_1,...,z_k latent components. The generator likewise consists of two parts:

Discriminator: Receives and image and has to predict whether it is real or fake – originating from the dataset or the generator. The model perform multiple layers of convolution and downsampling on the image, reducing the representation's resolution gradually until making final prediction. Optionally, attention can be incorporated into the discriminator as well where it has multiple (k) aggregator variables, that use attention to adaptively collect information from the image while being processed. We observe small improvements in model performance when attention is used in the discriminator, although note that most of the gain in using attention based on our observations arises from the generator.

Codebase

This codebase builds on top of and extends the great StyleGAN2 and StyleGAN2-ADA repositories by Karras et al.

The GANformer model can also be seen as a generalization of StyleGAN: while StyleGAN has one global latent vector that control the style of all image features globally, the GANformer has k latent vectors, that cooperate through attention to control regions within the image, and thereby better modeling images of multi-object and compositional scenes.

Acknowledgement

I wish to thank Christopher D. Manning for the fruitful discussions and constructive feedback in developing the Bipartite Transformer, especially when explored within the language representation area, as well as for providing the kind financial support that allowed this work to happen! :sunflower:

If you have questions, comments or feedback, please feel free to contact me at dorarad@cs.stanford.edu, Thank you! :)