Closed · Yoonho-Na closed this issue 1 year ago
Hi @Yoonho-Na, good question. This is inherited from LDM/Stable Diffusion's normalization procedure. During training, the input image latent (output from the VAE encoder) is multiplied by a factor of 0.18215, which is roughly the inverse standard deviation of image latents. After sampling a new latent, we need to remove that normalization before it's processed by the VAE decoder, which is why we divide by it. You can find more info about it here. Hope this helps!
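To make the round trip concrete, here is a minimal sketch of where the 0.18215 factor enters training and sampling. It assumes the diffusers AutoencoderKL used by this repo; the tensor shapes and variable names are illustrative only:

```python
import torch
from diffusers.models import AutoencoderKL

# Pretrained SD VAE, as used in this repo
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")

# --- Training side: encode images to latents and normalize them ---
# images: float tensor in [-1, 1]; a random batch stands in for real data here
images = torch.randn(1, 3, 256, 256)
with torch.no_grad():
    latents = vae.encode(images).latent_dist.sample()
latents = latents * 0.18215  # scale so latents have roughly unit variance; the diffusion model is trained on these

# --- Sampling side: undo the normalization before decoding ---
# sampled_latents: output of the diffusion sampler, on the same scale as the training latents
sampled_latents = torch.randn(1, 4, 32, 32)
with torch.no_grad():
    samples = vae.decode(sampled_latents / 0.18215).sample  # back to image space in [-1, 1]
```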
@wpeebles Thank you for your clear explanation. I have a few more questions.
I'm trying to train DiT on my own dataset. Since this repo doesn't provide a training script (I hope one will be available in the near future), I'm considering implementing DiT on top of the official LDM repo. What I remember from your paper is that DiT is basically LDM with a transformer backbone replacing the U-Net denoising network. So would it be OK to simply swap the U-Net in the LDM repo for your transformer block, or do you think many modifications would be needed?
I'm not really familiar with measuring model complexity in Gflops. Could you explain how to compute the Gflops numbers?
Hi @Yoonho-Na, good questions!
We just added a training script (train.py) in the latest commit today. To answer your question: the training hyperparameters we use for DiT are actually much more similar to ADM than LDM, so some changes would need to be made to the LDM training scripts to reproduce DiT training.
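For reference, a minimal sketch of the optimizer setup along the lines of what the paper describes (AdamW with a constant learning rate of 1e-4 and no weight decay); `DiT_XL_2` is the constructor from models.py in this repo, and the rest of the training loop (EMA, data loading, diffusion loss) is omitted:

```python
import torch
from models import DiT_XL_2

model = DiT_XL_2()  # latent-space DiT for 256x256 images (32x32x4 latents)
# Constant lr, no weight decay; this differs from typical LDM training recipes
opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0)
```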
We use pretty much the same methodology to count flops as the representation learning literature (technically they're MACs). You can find some examples of counting flops in other architectures here and here.
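As one example of how such a count can be done, here is a minimal sketch using fvcore's FlopCountAnalysis (which counts one "flop" per multiply-accumulate). It assumes the `DiT_XL_2` constructor from models.py and a forward signature of (x, t, y); adjust the shapes and class count if yours differ, and note that fvcore will warn about any ops it doesn't know how to count:

```python
import torch
from fvcore.nn import FlopCountAnalysis
from models import DiT_XL_2

model = DiT_XL_2()  # latent-space DiT for 256x256 images
model.eval()

x = torch.randn(1, 4, 32, 32)      # a single 32x32x4 latent
t = torch.randint(0, 1000, (1,))   # a diffusion timestep
y = torch.randint(0, 1000, (1,))   # a class label

with torch.no_grad():
    flops = FlopCountAnalysis(model, (x, t, y))
    print(f"Gflops: {flops.total() / 1e9:.2f}")
```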
Hope this helps!
I have a question: why do you divide by 0.18215 when sampling? Where does this number come from?
samples = vae.decode(samples / 0.18215).sample