CompVis / latent-diffusion

High-Resolution Image Synthesis with Latent Diffusion Models

How to generate an image from a noise vector with the KL-reg autoencoder #187

Open ryhhtn opened 1 year ago

ryhhtn commented 1 year ago

Thanks for sharing the code.

I tried to train the KL-reg autoencoder on custom datasets. Reconstruction improves with further training, but generating images from noise does not work no matter how long I train.

Can't torch.randn be used for generation?

https://github.com/CompVis/latent-diffusion/blob/a506df5756472e2ebaf9078affdde2c4f1502cd4/ldm/models/autoencoder.py#L400-L415
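Concretely, the naive sampling being asked about looks like this (a minimal sketch, assuming the kl-f8 config shipped in this repo; the checkpoint path is a placeholder):

    import torch
    from omegaconf import OmegaConf
    from ldm.util import instantiate_from_config

    # load a trained first-stage KL autoencoder
    config = OmegaConf.load("configs/autoencoder/autoencoder_kl_32x32x4.yaml")
    model = instantiate_from_config(config.model)
    model.load_state_dict(torch.load("last.ckpt", map_location="cpu")["state_dict"], strict=False)
    model.eval()

    # decode pure Gaussian noise as if it were a latent
    with torch.no_grad():
        z = torch.randn(1, 4, 32, 32)  # latent shape for a 256x256 image at f=8
        x = model.decode(z)            # typically yields noise-like garbage, not an image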

clearlyzero commented 11 months ago

Converges.

[image] The reconstruction effect looks... OK?

GuHuangAI commented 11 months ago

@ryhhtn @GuHuangAI Did you try to sample images from noise with the pretrained autoencoder checkpoints? I used the released pretrained kl-f4 checkpoint to sample but only got this: [image]

Hello, the latent distribution is not a normal distribution, so you cannot directly sample from Gaussian noise.
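You can check this yourself by encoding a few batches and inspecting the latent statistics (a quick sketch; model and batch come from your own setup):

    import torch

    with torch.no_grad():
        posterior = model.encode(batch)  # returns a DiagonalGaussianDistribution
        z = posterior.sample()
        # with the default kl_weight of 1e-6, the latent std is usually far from 1
        print(z.mean().item(), z.std().item())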

GuHuangAI commented 11 months ago

Converges.

[image] The reconstruction effect looks... OK?

It doesn't look OK; there is some noise in the image. Maybe you should continue the training process.

clearlyzero commented 11 months ago

Converges.

[image] The reconstruction effect looks... OK?

It doesn't look OK; there is some noise in the image. Maybe you should continue the training process.

Thank you. I also think it's due to insufficient training.

clearlyzero commented 11 months ago

Converges.

[image] The reconstruction effect looks... OK?

It doesn't look OK; there is some noise in the image. Maybe you should continue the training process.

Excuse me, did you normalize the data by subtracting the mean and dividing by the standard deviation during training, or did you simply divide by 255?

GuHuangAI commented 11 months ago

Converges.

[image] The reconstruction effect looks... OK?

It doesn't look OK; there is some noise in the image. Maybe you should continue the training process.

Excuse me, did you normalize the data by subtracting the mean and dividing by the standard deviation during training, or did you simply divide by 255?

I followed the transforms of this repo.
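For reference, the datasets in ldm/data do not standardize with a per-channel mean/std; they map images from [0, 255] to [-1, 1]:

    import numpy as np
    from PIL import Image

    # the normalization convention used throughout ldm/data
    image = np.array(Image.open("example.png").convert("RGB"))
    image = (image / 127.5 - 1.0).astype(np.float32)  # [0, 255] -> [-1, 1]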

shanshuo commented 11 months ago

@e4s2022 @lin-tianyu @wwq111111 @GuHuangAI Hi there, may I ask what your disc_loss looked like? My disc_loss always becomes 2.0 after several iterations (like after disc_start + 10000). Has anyone else encountered this problem? I think the reason might be that the VAE works too well: the discriminator then makes the same prediction whether the input is the original image or a reconstruction.

I have not met this problem. In my training, the disc_loss is around 1. You can also compare logits_real with logits_fake; if the two are very close, the VAE works well.

In my training, logits_real and logits_fake keep increasing after the discriminator loss kicks in. This means the VAE doesn't reconstruct well. How do you avoid this during training? [image]
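For context on these numbers: the disc_loss logged here is the hinge loss from the taming-transformers dependency (used by LPIPSWithDiscriminator). A self-contained sketch, with the arithmetic behind the two values discussed above:

    import torch
    import torch.nn.functional as F

    def hinge_d_loss(logits_real, logits_fake):
        # as in taming.modules.losses.vqperceptual
        loss_real = torch.mean(F.relu(1. - logits_real))
        loss_fake = torch.mean(F.relu(1. + logits_fake))
        return 0.5 * (loss_real + loss_fake)

    # a discriminator that outputs the same logit (~0) for real and fake inputs
    # settles near 0.5 * (1 + 1) = 1.0; a loss stuck at 2.0 instead matches
    # logits_real ~ -1 and logits_fake ~ +1, i.e. confidently wrong on both.
    print(hinge_d_loss(torch.zeros(4), torch.zeros(4)))  # tensor(1.)
    print(hinge_d_loss(-torch.ones(4), torch.ones(4)))   # tensor(2.)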

clearlyzero commented 10 months ago

Converges.

[image] The reconstruction effect looks... OK?

It doesn't look OK; there is some noise in the image. Maybe you should continue the training process.

Excuse me, did you normalize the data by subtracting the mean and dividing by the standard deviation during training, or did you simply divide by 255?

I followed the transforms of this repo.

Thank you for your reply. After training the autoencoder, does the second-stage diffusion model converge quickly for you? I have trained many times, and the loss is still around 0.4; it seems unable to converge.

    def p_losses(self, x_start, t, img_r, noise=None):
        # standard DDPM epsilon-prediction objective
        b, c, h, w = x_start.shape
        noise = default(noise, lambda: torch.randn_like(x_start))

        x_noisy = self.q_sample(x_start=x_start, t=t, noise=noise)  # forward diffusion to step t
        x_recon = self.denoise_fn(x_noisy, t, img_r)                # predict the added noise

        if self.loss_type == 'l1':
            loss = (noise - x_recon).abs().mean()
        elif self.loss_type == 'l2':
            loss = F.mse_loss(noise, x_recon)
        else:
            raise NotImplementedError()

        return loss

wj7486 commented 10 months ago

@ryhhtn hi,

I think it's not an easy task to generate an image by sampling a noise vector, even with a well-trained autoencoder, since the learned latent space may not be an ideal Gaussian space.

Let's say the training dataset has its own distribution P(x). After passing through the encoder, P(x) becomes another distribution P_{data}(x), since the NN can be viewed as a deterministic mapping. The decoder, on the contrary, performs the reverse mapping, i.e., mapping a sample from P_{data}(x) back to the data distribution P(x).

The way we generate an image is to randomly sample a point in latent space from a standard Gaussian distribution and feed it to the decoder. If this works, we implicitly assume that P_{data}(x) is also a standard Gaussian. I think that is why KL regularization is applied to the latent space.

However, in my training I found the kl_loss term is a bit large, as follows (I set the weight of the kl_loss term to the default 1e-6): [image]

In this setting, I trained an autoencoder that can faithfully reconstruct face images (~30 epochs on the CelebAHQ-Mask dataset):

Ground truth: [image]

Reconstruction: [image]

That being said, in my opinion, if you want to successfully sample noise in the latent space and get a nice synthetic image from the decoder, you need to increase the KL-reg weight. Alternatively, you can train an LDM on top of the well-trained autoencoder, since the LDM gradually maps the distribution P_{data}(x) to a Gaussian.

Hope this helps. I would appreciate it if you could post any updates from this discussion. Good luck.
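(For reference on the kl_loss discussed above: it is the closed-form KL of the diagonal Gaussian posterior against N(0, I); DiagonalGaussianDistribution.kl in ldm/modules/distributions computes the same expression. A self-contained sketch:)

    import torch

    def kl_to_standard_normal(mean, logvar):
        # KL( N(mean, exp(logvar)) || N(0, I) ), summed over the latent dimensions
        return 0.5 * torch.sum(mean ** 2 + logvar.exp() - 1.0 - logvar, dim=[1, 2, 3])

    mean, logvar = torch.zeros(1, 4, 32, 32), torch.zeros(1, 4, 32, 32)
    print(kl_to_standard_normal(mean, logvar))  # tensor([0.]); any other posterior gives KL > 0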

I would like to ask about the AutoencoderKL model I trained myself on the CelebA dataset. I am training a 128x128-resolution AutoencoderKL and using a scale factor. Is it normal for the scale factor to be approximately 0.44? I still cannot reach the FID reported in the paper when training the LDM with this AutoencoderKL. Looking forward to your reply, thank you.
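(For context: a factor of 0.44 computed the way scale_by_std does it would simply mean the latents have std around 2.3. A sketch paraphrasing LatentDiffusion.on_train_batch_start in ldm/models/diffusion/ddpm.py:)

    import torch

    # on the first training batch, while scale_factor is still 1.0, the factor
    # is set to the reciprocal of the latent standard deviation
    with torch.no_grad():
        encoder_posterior = model.encode_first_stage(x)  # x: first image batch
        z = model.get_first_stage_encoding(encoder_posterior).detach()
        scale_factor = 1. / z.flatten().std()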

wj7486 commented 9 months ago

@e4s2022 @lin-tianyu @wwq111111 @GuHuangAI Hi there, may I ask what your disc_loss looked like? My disc_loss always becomes 2.0 after several iterations (like after disc_start + 10000). Has anyone else encountered this problem? I think the reason might be that the VAE works too well: the discriminator then makes the same prediction whether the input is the original image or a reconstruction.

I have not met this problem. In my training, the disc_loss is around 1. You can also compare logits_real with logits_fake; if the two are very close, the VAE works well.

Can you share the loss curves from your autoencoder training? I cannot tell whether my training process is correct; perhaps I can compare against your curves to find my mistake. If you can share, my email address is 353605187@qq.com. Looking forward to your reply.

NITHISHM2410 commented 9 months ago

@GuHuangAI, hi

  • I agree. An autoencoder with KL-reg is somewhat similar to a VAE; in that view, generating an image with the pure autoencoder is just feeding a torch.randn() sample to the decoder.
  • Yes, I think the generation of the pure LDM places no constraint on the latent space of the AE. The reverse process of the diffusion model is responsible for mapping white noise to the latent space of the well-trained AE.
  • Letting the latent space of the AE deviate from a standard Gaussian distribution still works. You can refer to the original paper for some intuition about why the KL-reg is imposed.

The above comments are based on my own experience and understanding, NOT the authors'.

Hey, I just have a clarification. 1.) When training the KL autoencoder for a latent diffusion model, if the latent space of the autoencoder doesn't matter, why not train a simple pure autoencoder without KL? After all, an LDM can map a noise vector to any distribution when trained to do so.

2.) If KL regularization is important, why use such a low KL weight? This repo uses a KL weight of 1e-6, which is a very low value.

duongxuanluan commented 9 months ago

@Zjz999hbq Oh, maybe you did not use the scale factor to scale the latent x0. The scale factor can be set to roughly 0.2~0.4, and you can also calculate it from the first batch of data: s = 1 / latent_x0.std(), then use this factor to scale all the data.

You mean I should do this when training the UNet: posterior = Autoencoder.encode(x), latent_x0 = scale_factor * posterior.sample(); and at the sampling stage: xT -> ... -> x0, x0 = x0 / scale_factor, sample = Autoencoder.decode(x0)?

Yes.
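Putting that exchange into one sketch (the names autoencoder, scale_factor, and unet_sample_loop are illustrative, not this repo's API):

    import torch

    # training: diffuse scaled latents instead of raw ones
    posterior = autoencoder.encode(x)              # x: image batch in [-1, 1]
    latent_x0 = scale_factor * posterior.sample()  # bring latents to roughly unit std
    # ... train the UNet on latent_x0 ...

    # sampling: run the reverse process in latent space, undo the scaling, decode
    x0 = unet_sample_loop(shape=latent_x0.shape)   # hypothetical sampler, xT -> ... -> x0
    image = autoencoder.decode(x0 / scale_factor)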


Thanks for your nice suggestions. I tried scaling the data with 0.19 (calculated from the first batch), and the sampled images are indeed better, like this:

[image]

Although better, the images are still very blurry and I cannot obtain clear results. How can I improve them?

Hi, did you figure out how to improve this result? I was having a similar issue: the noisy generations indeed improved after using scale_to_std, but the results are still not satisfying. The diffusion training loss also converged to a nice value. Is your problem caused by the autoencoder training?

wtliao commented 6 months ago

I am convinced that it is correct for the KL loss to increase.

@lin-tianyu Hi, could you explain in more detail why the KL loss should increase? In my training it increases, but the KL divergence is part of the loss function, so in theory it should decrease, from my perspective. Could you help me understand the increasing KL loss? Thanks a lot!

lin-tianyu commented 6 months ago

@wtliao Hi, you can contact me via my email. Thanks.

wtliao commented 6 months ago

@e4s2022 @lin-tianyu @wwq111111 @GuHuangAI Hi there, may I ask what your disc_loss looked like? My disc_loss always becomes 2.0 after several iterations (like after disc_start + 10000). Has anyone else encountered this problem? I think the reason might be that the VAE works too well: the discriminator then makes the same prediction whether the input is the original image or a reconstruction.

I have not met this problem. In my training, the disc_loss is around 1. You can also compare logits_real with logits_fake; if the two are very close, the VAE works well.

Nice advice on the logits. I'd like to know whether your d-loss is very close to 1.000 or just vaguely around 0.9-1.1.

My disc_loss is 0.878, train/logits_fake = -0.805, train/logits_real = -0.631, val/logits_fake = -0.356, val/logits_real = 0.076. I doubt whether my loss is correct.

How good was your VAE's reconstruction? Mine was like: [image]

@keyu-tian Could you send me your test image? I want to compare mine to yours.

CIntellifusion commented 3 months ago

Hello, thanks for your valuable discussion. I have encountered exactly the same issue. Although it is hard to train a VAE whose latents can be sampled directly from randn noise, adjusting the proportion between model size and dataset size helps. However, I still haven't found the proper dataset and model sizes for face generation on 1k or 10k FFHQ images. My current results indicate that a smaller image size helps randn noise decode into at least an ambiguous face.