crowsonkb / k-diffusion

Karras et al. (2022) diffusion models for PyTorch
MIT License

k-diffusion

An implementation of Elucidating the Design Space of Diffusion-Based Generative Models (Karras et al., 2022) for PyTorch, with enhancements and additional features, such as improved sampling algorithms and transformer-based diffusion models.

Hourglass diffusion transformer

k-diffusion contains a new model type, image_transformer_v2, that uses ideas from Hourglass Transformer and DiT.

Requirements

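To use the new model type with neighborhood attention, you will need to install custom CUDA kernels (NATTEN). If you cannot install them, a shifted window attention variant that does not need custom kernels is described under "Config file".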

Also, you should make sure your PyTorch installation is capable of using torch.compile(). It will fall back to eager mode if torch.compile() is not available, but it will be slower and use more memory in training.
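
A quick way to see which path you are likely to get is to check whether your PyTorch build exposes torch.compile() at all. The snippet below is a minimal sketch, not part of k-diffusion: the helper name torch_compile_available is hypothetical, and the check only confirms that the attribute exists, not that a working compiler backend is available on your machine.

```python
import importlib.util


def torch_compile_available() -> bool:
    """Return True if PyTorch is importable and exposes torch.compile.

    Hypothetical helper for illustration; k-diffusion does not ship it.
    """
    if importlib.util.find_spec("torch") is None:
        return False  # PyTorch is not installed at all
    import torch

    # torch.compile was introduced in PyTorch 2.0; older builds lack the attribute.
    return hasattr(torch, "compile")


mode = "torch.compile" if torch_compile_available() else "eager"
print(f"Likely execution mode: {mode}")
```

If this reports eager mode, training should still run, only slower and with higher memory use as noted above.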

Usage

Demo

To train a 256x256 RGB model on Oxford Flowers without installing custom CUDA kernels, install Hugging Face Datasets:

pip install datasets

and run:

python train.py --config configs/config_oxford_flowers_shifted_window.json --name flowers_demo_001 --evaluate-n 0 --batch-size 32 --sample-n 36 --mixed-precision bf16

If you run out of memory, try adding --checkpointing or reducing the batch size. If you are using an older GPU (pre-Ampere), omit --mixed-precision bf16 to train in FP32. It is not recommended to train in FP16.

If you have NATTEN installed and working (preferred), you can train with neighborhood attention instead of shifted window attention by specifying --config configs/config_oxford_flowers.json.

Config file

In the "model" key of the config file:

  1. Set the "type" key to "image_transformer_v2".

  2. The base patch size is set by the "patch_size" key, like "patch_size": [4, 4].

  3. Model depth for each level of the hierarchy is specified by the "depths" config key, like "depths": [2, 2, 4]. This constructs a model with two transformer layers at the first level (4x4 patches), followed by two at the second level (8x8 patches), followed by four at the highest level (16x16 patches), followed by two more at the second level, followed by two more at the first level.

  4. Model width for each level of the hierarchy is specified by the "widths" config key, like "widths": [192, 384, 768]. The widths must be multiples of the attention head dimension.

  5. The self-attention mechanism for each level of the hierarchy is specified by the "self_attns" config key, like:

    "self_attns": [
        {"type": "neighborhood", "d_head": 64, "kernel_size": 7},
        {"type": "neighborhood", "d_head": 64, "kernel_size": 7},
        {"type": "global", "d_head": 64}
    ]

    If not specified, all levels of the hierarchy except for the highest use neighborhood attention with 64-dim heads and a 7x7 kernel. The highest level uses global attention with 64-dim heads. Because neighborhood attention is local, the token count at every level but the highest can be very large.

  6. As a fallback if you or your users cannot use NATTEN, you can also train a model with shifted window attention at the low levels of the hierarchy. Shifted window attention does not perform as well as neighborhood attention and is slower for both training and inference, but it does not require custom CUDA kernels. Specify it like:

    "self_attns": [
        {"type": "shifted-window", "d_head": 64, "window_size": 8},
        {"type": "shifted-window", "d_head": 64, "window_size": 8},
        {"type": "global", "d_head": 64}
    ]

    The window size at each level must evenly divide the image size at that level. Models trained with one attention type must be fine-tuned to be used with a different type.
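
The config rules above can be checked with a short script before starting a run. The following is a sketch only and not part of k-diffusion: the function describe_hierarchy and its image_size argument are hypothetical. It assumes that the patch size doubles at each level (as in the 4x4, 8x8, 16x16 example above) and that "the image size at that level" means the image side length divided by that level's patch size.

```python
def describe_hierarchy(model_cfg, image_size):
    """Summarize and sanity-check an image_transformer_v2-style "model" config.

    Hypothetical helper for illustration; k-diffusion does not ship it.
    Returns (levels, layer_order), where layer_order lists (level, depth)
    pairs in the order the transformer layers are applied.
    """
    depths = model_cfg["depths"]
    widths = model_cfg["widths"]
    self_attns = model_cfg["self_attns"]
    patch_h, patch_w = model_cfg["patch_size"]
    if not (len(depths) == len(widths) == len(self_attns)):
        raise ValueError("depths, widths and self_attns need one entry per level")

    levels = []
    for i, (depth, width, attn) in enumerate(zip(depths, widths, self_attns)):
        # Assumption: each level doubles the effective patch size.
        patch = (patch_h * 2**i, patch_w * 2**i)
        if width % attn["d_head"] != 0:
            raise ValueError(f"level {i}: width {width} is not a multiple of d_head")
        if image_size[0] % patch[0] or image_size[1] % patch[1]:
            raise ValueError(f"level {i}: patch {patch} does not divide the image")
        # Assumption: per-level size = image side length / effective patch side.
        grid = (image_size[0] // patch[0], image_size[1] // patch[1])
        if attn["type"] == "shifted-window":
            window = attn["window_size"]
            if grid[0] % window or grid[1] % window:
                raise ValueError(f"level {i}: window {window} does not divide {grid}")
        levels.append({"level": i, "patch": patch, "grid": grid,
                       "depth": depth, "width": width, "attn": attn["type"]})

    # Going down visits every level once; going back up revisits all but the highest.
    order = list(range(len(depths))) + list(range(len(depths) - 2, -1, -1))
    return levels, [(i, depths[i]) for i in order]


model_cfg = {
    "type": "image_transformer_v2",
    "patch_size": [4, 4],
    "depths": [2, 2, 4],
    "widths": [192, 384, 768],
    "self_attns": [
        {"type": "shifted-window", "d_head": 64, "window_size": 8},
        {"type": "shifted-window", "d_head": 64, "window_size": 8},
        {"type": "global", "d_head": 64},
    ],
}
levels, layer_order = describe_hierarchy(model_cfg, image_size=(256, 256))
for level in levels:
    print(level)
print(layer_order)
```

For the 256x256 example, layer_order reproduces the sequence described for "depths": [2, 2, 4]: two layers at the first level, two at the second, four at the highest, then two and two again on the way back up.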

Inference

TODO: write this section

Installation

k-diffusion can be installed via PyPI (pip install k-diffusion) but it will not include training and inference scripts, only library code that others can depend on. To run the training and inference scripts, clone this repository and run pip install -e <path to repository>.

Training

To train models:

$ ./train.py --config CONFIG_FILE --name RUN_NAME

For instance, to train a model on MNIST:

$ ./train.py --config configs/config_mnist_transformer.json --name RUN_NAME

The configuration file allows you to specify the dataset type. Currently supported types are "imagefolder" (finds all images in that folder and its subfolders, recursively), "cifar10" (CIFAR-10), "mnist" (MNIST), and "huggingface" (Hugging Face Datasets).
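
As a sketch of how a config could be checked against these types before launching a run, the snippet below validates the dataset type of a config dictionary. The helper check_dataset_type is hypothetical and not part of k-diffusion, and the "dataset" and "location" key names are assumptions about the config layout; consult the files in configs/ for the actual keys.

```python
# Dataset types named in this README.
SUPPORTED_DATASET_TYPES = {"imagefolder", "cifar10", "mnist", "huggingface"}


def check_dataset_type(config):
    """Return the dataset type, or raise ValueError if it is not supported.

    Hypothetical helper; assumes the type lives at config["dataset"]["type"].
    """
    dataset_type = config.get("dataset", {}).get("type")
    if dataset_type not in SUPPORTED_DATASET_TYPES:
        supported = ", ".join(sorted(SUPPORTED_DATASET_TYPES))
        raise ValueError(
            f"unsupported dataset type {dataset_type!r}; expected one of: {supported}"
        )
    return dataset_type


# Assumed layout: "location" points at the image folder for "imagefolder".
example_config = {"dataset": {"type": "imagefolder", "location": "/path/to/images"}}
dataset_type = check_dataset_type(example_config)
print(dataset_type)
```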

Multi-GPU and multi-node training is supported with Hugging Face Accelerate. You can configure Accelerate by running:

$ accelerate config

then launch training with:

$ accelerate launch train.py --config CONFIG_FILE --name RUN_NAME

Enhancements/additional features

To do