crowsonkb / k-diffusion

Karras et al. (2022) diffusion models for PyTorch
MIT License

Scaling from 128x128 to 256x256, 512x512, and 1024x1024? #95

Open tin-sely opened 5 months ago

tin-sely commented 5 months ago

hey,

loved your paper and thanks a bunch for providing the code!

i have a quick question: how do you scale and train the network (HDiT) for increased resolutions? i saw you mentioned here: https://github.com/crowsonkb/k-diffusion/issues/14#issuecomment-1199475244 that you first need to build the entire network and then skip layers, but i'm not sure whether this also applies to the new architecture?

many thanks!

tin-sely commented 5 months ago

it looks like it's not meant for progressive scaling? i guess the best option would be to train at a lower resolution and then copy the relevant weights into a higher-res network
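e.g. something like this (an untested sketch, not part of k-diffusion; it assumes the low- and high-res configs share parameter names, and only copies tensors whose shapes match):

```python
import torch

def copy_matching_weights(low_res_state, high_res_model):
    # copy every tensor whose name and shape match; anything new in the
    # high-res config (extra levels/blocks) keeps its fresh initialization
    high_res_state = high_res_model.state_dict()
    for name, param in low_res_state.items():
        if name in high_res_state and high_res_state[name].shape == param.shape:
            high_res_state[name].copy_(param)
    high_res_model.load_state_dict(high_res_state)
```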

another thing i was curious about was the inputs:

```python
def forward(self, x, sigma, aug_cond=None, class_cond=None, mapping_cond=None):
```

x, sigma, and class_cond are clear, but do you have any more details on aug_cond and mapping_cond?

madebyollin commented 5 months ago

@tin-sely I believe aug_cond is for non-leaky augmentations. When an input image is augmented during training, a description of how that image was augmented is also given to the generator (as aug_cond - augmentation conditioning), so that the generator eventually learns how to generate either augmented or non-augmented images depending on the value of the aug_cond input.
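Roughly this idea (a toy sketch with a single flip augmentation; the real pipeline encodes a richer set of transforms than this):

```python
import torch

def augment_with_cond(x):
    # randomly hflip each image and record the choice, so the model can
    # condition on the augmentation instead of absorbing it into p(image)
    flip = torch.rand(x.shape[0], device=x.device) < 0.5
    x_aug = torch.where(flip.view(-1, 1, 1, 1), x.flip(-1), x)
    aug_cond = flip.float().unsqueeze(1)  # (batch, 1), 1.0 = flipped
    return x_aug, aug_cond

# at sampling time, pass an all-zeros aug_cond to ask for un-augmented images
```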

I believe mapping_cond is an older name for aug_cond which is used in the non-transformer model configs (the ones that use KarrasAugmentWrapper, which takes the aug_cond tensor and gives it to the model as mapping_cond).
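Conceptually the wrapper just forwards the tensor under the other name, something like this (a sketch of the renaming only, not the actual implementation):

```python
import torch.nn as nn

class AugmentWrapperSketch(nn.Module):
    # sketch: accept aug_cond and hand it to the wrapped model as mapping_cond
    def __init__(self, inner_model):
        super().__init__()
        self.inner_model = inner_model

    def forward(self, x, sigma, aug_cond=None, **kwargs):
        return self.inner_model(x, sigma, mapping_cond=aug_cond, **kwargs)
```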

tin-sely commented 5 months ago

thanks a bunch @madebyollin! ✨

mnslarcher commented 5 months ago

My understanding is that you use aug_cond when you wish to provide the model with information about the augmentations using Fourier features:

https://github.com/crowsonkb/k-diffusion/blob/6ab5146d4a5ef63901326489f31f1d8e7dd36b48/k_diffusion/models/image_transformer_v2.py#L657
https://github.com/crowsonkb/k-diffusion/blob/6ab5146d4a5ef63901326489f31f1d8e7dd36b48/k_diffusion/models/image_transformer_v2.py#L658
https://github.com/crowsonkb/k-diffusion/blob/6ab5146d4a5ef63901326489f31f1d8e7dd36b48/k_diffusion/models/image_transformer_v2.py#L718

On the other hand, if you use mapping_cond, the condition will be fed directly into a linear layer, as shown here:

https://github.com/crowsonkb/k-diffusion/blob/6ab5146d4a5ef63901326489f31f1d8e7dd36b48/k_diffusion/models/image_transformer_v2.py#L660
https://github.com/crowsonkb/k-diffusion/blob/6ab5146d4a5ef63901326489f31f1d8e7dd36b48/k_diffusion/models/image_transformer_v2.py#L720

These embeddings are then both fed into the MappingNetwork:

https://github.com/crowsonkb/k-diffusion/blob/6ab5146d4a5ef63901326489f31f1d8e7dd36b48/k_diffusion/models/image_transformer_v2.py#L721
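To illustrate the two paths at the shape level (hypothetical dims and a simplified Fourier-features module, assuming the two embeddings are summed before the mapping network, which is how I read the linked lines):

```python
import math
import torch
import torch.nn as nn

class FourierFeaturesSketch(nn.Module):
    # simplified stand-in for k_diffusion's Fourier features: project the
    # condition through fixed random frequencies, then take cos and sin
    def __init__(self, in_features, out_features, std=1.0):
        super().__init__()
        self.register_buffer("freqs", torch.randn(out_features // 2, in_features) * std)

    def forward(self, x):
        f = 2 * math.pi * x @ self.freqs.T
        return torch.cat([f.cos(), f.sin()], dim=-1)

batch, width = 4, 256
aug_dim, mapping_cond_dim = 9, 128  # hypothetical sizes

# aug_cond path: Fourier features, then a linear projection
aug_emb = nn.Sequential(FourierFeaturesSketch(aug_dim, width), nn.Linear(width, width))
# mapping_cond path: a single linear layer, no Fourier features
mapping_cond_proj = nn.Linear(mapping_cond_dim, width, bias=False)

cond = aug_emb(torch.zeros(batch, aug_dim)) + mapping_cond_proj(torch.zeros(batch, mapping_cond_dim))
# cond would then go through the MappingNetwork
```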

But getting more clarity on this would definitely help!