CompVis / latent-diffusion

High-Resolution Image Synthesis with Latent Diffusion Models
MIT License

How to generate an image from a noise vector with a KL-reg autoencoder #187

Open ryhhtn opened 1 year ago

ryhhtn commented 1 year ago

Thanks for sharing the code.

I tried to train a KL-reg autoencoder on a custom dataset. Reconstruction improves with further training, but generating images from noise never works, no matter how long I train.

Can't torch.randn be used for generation?

https://github.com/CompVis/latent-diffusion/blob/a506df5756472e2ebaf9078affdde2c4f1502cd4/ldm/models/autoencoder.py#L400-L415

ryhhtn commented 1 year ago

[images: inputs, reconstructions, and samples at global step 198000, epoch 32]

e4s2022 commented 1 year ago

@ryhhtn hi,

I don't think it's easy to sample a noise vector and generate an image, even with a well-trained autoencoder, because the learned latent space may not be close to an ideal Gaussian.

Say the training set has its own distribution, P(x). After passing through the encoder, P(x) becomes some latent distribution P_{latent}(z), since the network can be viewed as a deterministic mapping. The decoder performs the reverse mapping, taking a sample from P_{latent}(z) back to the data distribution P(x).

The way we generate an image is to randomly sample a point in latent space from a standard Gaussian and feed it to the decoder. For this to work, we implicitly assume P_{latent}(z) is itself a standard Gaussian, which I think is why KL regularization is applied to the latent space.
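
To make this concrete, here is a minimal sketch, assuming a trained AutoencoderKL `ae` as in this repo (the kl-f4 latent shape and the `real_images` batch in [-1, 1] are placeholders):

```python
import torch

ae.eval()
with torch.no_grad():
    # "Generation from noise" with the autoencoder alone:
    z = torch.randn(4, 3, 64, 64)        # latent shape for kl-f4 at 256x256
    fake = ae.decode(z)                  # plausible only if latents ~ N(0, I)

    # Compare against the statistics of real latent codes:
    posterior = ae.encode(real_images)   # DiagonalGaussianDistribution
    z_real = posterior.sample()
    print(z_real.mean().item(), z_real.std().item())
    # With kl_weight = 1e-6 these are typically far from (0, 1), which is
    # why decoding torch.randn(...) gives garbage.
```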

However, in my training I found the kl_loss term is fairly large, as shown below:

[image: kl_loss curve]

I set the weight of the kl_loss term to the default 1e-6.

With this setting, I trained an autoencoder that can faithfully reconstruct face images (~30 epochs on the CelebAHQ-Mask dataset):

Ground truth: [image]

Reconstruction: [image]

That being said, in my opinion, if you want to sample noise in the latent space and get a nice synthetic image from the decoder, you would need to increase the KL-reg weight. Alternatively, you can train an LDM on top of the well-trained autoencoder, since the LDM gradually maps P_{latent}(z) to a Gaussian distribution.
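
As a rough sketch of where that weight enters (simplified from my reading of ldm/modules/losses/contperceptual.py; the perceptual term and the learned logvar are omitted, so treat this as approximate):

```python
import torch

def first_stage_g_loss(inputs, reconstructions, posterior, g_loss,
                       kl_weight=1e-6, d_weight=0.0):
    """Sketch of the generator-side objective of the KL autoencoder."""
    rec_loss = torch.abs(inputs - reconstructions).sum() / inputs.shape[0]
    kl_loss = posterior.kl().sum() / inputs.shape[0]  # KL(q(z|x) || N(0, I))
    # With the default kl_weight = 1e-6 the KL term is nearly free to grow,
    # so the latent space is only loosely tied to a standard Gaussian.
    return rec_loss + kl_weight * kl_loss + d_weight * g_loss
```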

Hope this helps. I would appreciate it if you could post any updates from this discussion. Good luck.

Luochangjiang10 commented 1 year ago

@e4s2022 Hi, I'd like to know how to use the code for image reconstruction; I can't find it in the latent-diffusion code. Thank you.

GuHuangAI commented 1 year ago

@e4s2022 Thanks very much for your comment. I ran into the same problem: the pretrained autoencoder weights downloaded from this repo cannot generate a nice image when the input comes from torch.randn(). Your experiments are interesting. Can we say that if we want to generate good images with the autoencoder alone, we need to increase the KL-reg weight to make sure the latent space is a Gaussian distribution? On the other hand, generation with a pure LDM also starts from a Gaussian distribution, so if we combine the two, we don't need the latent space of the autoencoder to be Gaussian. That leads me to another question: why do we still train the autoencoder against a standard Gaussian? Could we let the latent space be a Gaussian whose mean and std are not (0, 1), and in that case could we increase the KL-reg weight?

e4s2022 commented 1 year ago

@GuHuangAI, hi

The above comments are based on my own experience and understanding, NOT the authors'.

GuHuangAI commented 1 year ago

@e4s2022 Thanks for your reply, it is very helpful. Have you tried letting the latent space of the AE be something other than a standard Gaussian? How would you choose the new distribution?

shencuifeng commented 1 year ago

Hi, is it normal that the kl_loss is increasing? Doesn't that mean the latent distribution is drifting away from a Gaussian?

lin-tianyu commented 1 year ago

@e4s2022 @shencuifeng While I'm training a KL-reg autoencoder, the KL loss is also increasing. However, after I change the pixel value range of images from [-1, 1] to [0, 1], the KL loss decreases properly. Any idea why?

[image: KL loss curves]

GuHuangAI commented 1 year ago

@lin-tianyu Did you change the KL loss weight? The goal is not to make the autoencoder's latent distribution exactly match the normal distribution.

lin-tianyu commented 1 year ago

@GuHuangAI No, I didn't change anything except the pixel value range, and the KL loss increases after several epochs of training. After carefully reading this issue and the latent-diffusion paper, I'm convinced it is correct for the KL loss to increase. Thanks a lot!

ustczhouyu commented 1 year ago

Hello, I would like to ask about the difference between unconditional and conditional LDM. After the model is trained, does unconditional sampling generate images randomly, rather than based on a given image? So if I want to generate a normal image from a flawed image (without any annotations at inference time), should I use a conditional LDM? @ryhhtn @GuHuangAI @lin-tianyu @yoloseesee @e4s2022

GuHuangAI commented 1 year ago

@ustczhouyu Yes. Conditional generation needs an additional input, such as a text prompt or an image prompt.

e4s2022 commented 1 year ago

After checking section D.1 in the appendix of the LDM paper, I found that the KL-reg term is applied to avoid an arbitrarily high-variance latent space, because the variance of the latent space was found to significantly affect the results of convolutional sampling.

clearlyzero commented 1 year ago

@lin-tianyu Hi, may I ask whether the losses were particularly large during the initial training of your autoencoder?

huangyehui commented 1 year ago

I am confused about how to use the autoencoder in this project. In the paper and the code, synthesis is split into two stages, but I cannot find the code that loads the autoencoder, and there seems to be only one pretrained model. Please give me some advice, thanks a lot.

GuHuangAI commented 1 year ago

@huangyehui Hi, the autoencoder is defined inside the LatentDiffusion class, so if you have trained a LatentDiffusion model, just load its weights. By the way, the repo provides both the autoencoder weights and the LDM weights; please check the README.md carefully.
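
If you want to load the first-stage model on its own, something like this sketch should work (the config and checkpoint paths are examples; point them at whichever autoencoder you downloaded, see the pretrained-models tables in the README):

```python
import torch
from omegaconf import OmegaConf
from ldm.util import instantiate_from_config

config = OmegaConf.load("configs/autoencoder/autoencoder_kl_32x32x4.yaml")
ae = instantiate_from_config(config.model)          # builds AutoencoderKL
state = torch.load("models/first_stage_models/kl-f8/model.ckpt",
                   map_location="cpu")["state_dict"]
ae.load_state_dict(state, strict=False)
ae.eval()
```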

Wang-Wenqing commented 1 year ago

@ryhhtn @GuHuangAI @e4s2022 @shencuifeng When I trained the autoencoder, I found the total loss is really huge, about 1e4 (this depends on the size of the reconstructed image, but it is really large), even after many epochs. The reconstructions are good, and the loss shows a downward trend, but it stays huge. I want to know if this is normal. Thanks!

GuHuangAI commented 1 year ago

@wwq111111 It's OK.

Wang-Wenqing commented 1 year ago

Thanks! And I have another question: the disc_loss is zero for a long time. Is that right? @GuHuangAI [image: disc_loss curve]

GuHuangAI commented 1 year ago

@wwq111111 There is a hyperparameter named "disc_start" which specifies the iteration at which discriminator training begins. The default value is 50000.

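For reference, the gating works roughly like this sketch of adopt_weight from the loss modules (simplified; check the actual code):

```python
def adopt_weight(weight, global_step, threshold=0, value=0.0):
    """The GAN term is multiplied by `value` (0) until `threshold` (disc_start)."""
    if global_step < threshold:
        weight = value
    return weight

print(adopt_weight(1.0, global_step=10_000, threshold=50_000))  # 0.0
print(adopt_weight(1.0, global_step=60_000, threshold=50_000))  # 1.0
# This is why disc_loss stays at zero before iteration 50,000.
```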

clearlyzero commented 1 year ago

@wwq111111 I also encountered the same problem.

clearlyzero commented 1 year ago

Like this: [image: total loss curve]

Wang-Wenqing commented 1 year ago

@GuHuangAI My training curve is similar to the image above. Since the total loss is still high even after more than 100 epochs, when should we stop training? Should we wait for the loss to drop to a lower level, or use some other criterion? Thanks!

GuHuangAI commented 1 year ago

@wwq111111 We run inference at intermediate iterations to evaluate performance. If you use the EMA policy, the last checkpoint is usually good enough.

Zjz999hbq commented 1 year ago

@GuHuangAI Hello, I have some questions about training AutoencoderKL. Recently I trained it on the LSUN-Church dataset (about 126,227 images), but I couldn't find how many epochs AutoencoderKL is supposed to be trained for. I set the hyperparameter 'disc_start' to 1, which means the discriminator starts training right away. I also set up the two loss-computation branches (the 'optimizer_idx == 0: ...' and 'optimizer_idx == 1: ...' branches in ldm/modules/losses/contperceptual.py) so that 1000 images are trained with optimizer_idx == 0, then the next 1000 with optimizer_idx == 1, and so on.

By the 8th epoch (each epoch being one pass over all 126,227 images), the reconstructions look like this: [image: 12621_rec_x]

I wonder why the reconstructions are low quality. Is this caused by the change of 'disc_start'? I hope to get your reply, thanks very much!

GuHuangAI commented 1 year ago

@Zjz999hbq Hello, I have two suggestions:

  1. Do not train the discriminator too early.
  2. In general, I think we should train the generator when optimizer_idx == 0 and the discriminator when optimizer_idx == 1.

By the way, I'm also working on a new project that includes autoencoder training; you will find my example code in my repo soon. Hope it helps you!

Zjz999hbq commented 1 year ago

@GuHuangAI Thanks for your timely reply! I will follow your suggestions and try again.

By the way, I also have some questions about training the latent diffusion model. After finishing the AutoencoderKL training, I began to train the diffusion UNet. The process is as follows. Given a batch of input images x, I feed them to the pretrained AutoencoderKL to get the latent x0, i.e. posterior = Autoencoder.encode(x); x0 = posterior.sample(). For the forward diffusion, I add noise to x0 to get xt (given a timestep t), use the UNet to predict the noise, and compute the loss. After training the UNet, I start from xT = torch.randn(shape of the latent space) and run the denoising process xT -> ... -> x0, then get the sampled image via sampled_image = Autoencoder.decode(x0). But this way I can't sample a reasonable result. Is this process correct? If not, which step is the problem? Thanks very much!
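
In code, my training step is roughly the following (condensed; `q_sample` stands for my forward-diffusion helper, and all names are illustrative):

```python
import torch
import torch.nn.functional as F

def unet_training_step(unet, ae, x, num_timesteps=1000):
    with torch.no_grad():
        z0 = ae.encode(x).sample()            # latent x0 (note: unscaled!)
    t = torch.randint(0, num_timesteps, (x.shape[0],), device=x.device)
    noise = torch.randn_like(z0)
    zt = q_sample(z0, t, noise)               # forward diffusion: x0 -> xt
    return F.mse_loss(unet(zt, t), noise)     # predict the added noise
```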

GuHuangAI commented 1 year ago

@Zjz999hbq It seems correct. Could you show some generated examples?

Zjz999hbq commented 1 year ago

Emm, the AutoencoderKL I mentioned above was trained on my own dataset; the images look like this: [image: 3996_x]

The sampled images look like this: [image: 0]

GuHuangAI commented 1 year ago

@Zjz999hbq Oh, maybe you did not use a scale factor to scale the latent x0. The scale factor can be set to about 0.2-0.4, or you can compute one from the first batch of data (from latent_x0.std()) and then use it to scale all the latents.

GuHuangAI commented 1 year ago

@Zjz999hbq Note that you must inverse-scale the latent x0 at the sampling stage: latent_x0 = latent_x0 / s.
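
Concretely, something like this sketch (all names are illustrative; it mirrors the scale_by_std logic in ldm/models/diffusion/ddpm.py, which computes 1 / z.flatten().std() on the first batch):

```python
import torch

# One-off: estimate the scale factor from the first batch of latents.
with torch.no_grad():
    z0 = ae.encode(first_batch).sample()
scale_factor = 1.0 / z0.flatten().std().item()    # typically lands around 0.2-0.4

# Training the UNet: diffuse scaled latents (roughly unit variance).
z0_scaled = scale_factor * ae.encode(x).sample()

# Sampling: inverse-scale before decoding.
z_hat = sample_ddpm(unet, shape=z0_scaled.shape)  # xT -> ... -> x0 (your loop)
image = ae.decode(z_hat / scale_factor)
```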

Zjz999hbq commented 1 year ago

You mean I should do this when training the UNet: posterior = Autoencoder.encode(x); latent_x0 = scale_factor * posterior.sample(). And at the sampling stage: xT -> ... -> x0; x0 = x0 / scale_factor; sample = Autoencoder.decode(x0)?

GuHuangAI commented 1 year ago

@Zjz999hbq Yes.

Zjz999hbq commented 1 year ago

Thanks for your nice suggestions. I tried scaling the data with 0.19 (calculated from the first batch), and the sampled images are indeed better. Like this: [image: 0]

Although better, the images are still very blurry and I cannot obtain clear results. How can I improve them?

GuHuangAI commented 1 year ago

@Zjz999hbq Did you retrain your autoencoder? And how many sampling steps did you use? Maybe there are some minor errors in your code. If you don't mind, you can contact me by email: huangai@nudt.edu.cn

Zjz999hbq commented 1 year ago

OK.

ray-lee-94 commented 1 year ago

@GuHuangAI Hello, do you mean to scale the latent when sampling from the autoencoder, or when sampling in stable-diffusion? And should the first batch come from the start of training, or from a well-trained model?

unlugi commented 1 year ago

Why do we need a scale_factor? We pretrain the AE to learn a good latent space on dataset A, then train diffusion on top of the pretrained AE with dataset A. What is the need for the scale_factor?

shanshuo commented 1 year ago

Check out the explanation here and the computation here.

keyu-tian commented 1 year ago

@e4s2022 @lin-tianyu @wwq111111 @GuHuangAI Hi there, may I ask what your disc_loss looked like? Mine always settles at 2.0 after several iterations (around disc_start + 10000). Has anyone else encountered this?

I think the reason might be that the VAE works too well: the discriminator then makes the same prediction whether the input is the original image or a reconstruction.

GuHuangAI commented 1 year ago

@keyu-tian I have not met this problem. In my training, the disc_loss is around 1. You can also compare logits_real with logits_fake: if the two are very close, the VAE works well.
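
For intuition, here is the hinge discriminator loss as used in the taming/ldm loss modules; plugging in a few values shows what a stuck disc_loss means:

```python
import torch
import torch.nn.functional as F

def hinge_d_loss(logits_real, logits_fake):
    return 0.5 * (torch.mean(F.relu(1.0 - logits_real)) +
                  torch.mean(F.relu(1.0 + logits_fake)))

# Undecided discriminator (both logits near 0) -> loss ~ 1.0:
print(hinge_d_loss(torch.zeros(8), torch.zeros(8)).item())  # 1.0
# Consistently fooled (logits_real ~ -1, logits_fake ~ +1) -> loss ~ 2.0:
print(hinge_d_loss(-torch.ones(8), torch.ones(8)).item())   # 2.0
```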

GuHuangAI commented 1 year ago

> Do you mean to scale the latent when sampling from the autoencoder, or when sampling in stable-diffusion? [...]

For the autoencoder samples.

keyu-tian commented 1 year ago

> In my training, the disc_loss is around 1. [...]

Nice advice on the logits. I'd like to know whether your d-loss is very close to 1.000, or just vaguely around 0.9-1.1?

wtliao commented 1 year ago

My disc_loss is 0.878, with train/logits_fake = -0.805, train/logits_real = -0.631, val/logits_fake = -0.356, val/logits_real = 0.076. I doubt whether my loss is correct.

GuHuangAI commented 1 year ago

@wtliao Not bad, but it can be further improved.

keyu-tian commented 1 year ago

@wtliao How good is your VAE's reconstruction? Mine looked like this: [image: output]

liangbingzhao commented 1 year ago

@ryhhtn @GuHuangAI Did you try sampling images from noise with the pretrained autoencoder checkpoints? I used the released pretrained kl-f4 checkpoint to sample, but only got this: [image]

zy-charon commented 11 months ago

> [quotes the earlier exchange between @Zjz999hbq and @GuHuangAI about training AutoencoderKL on LSUN-Church]

Hello, may I ask which file is used to set the total number of training epochs?

clearlyzero commented 11 months ago

I now have time to study this problem, and I found that the discriminator converges very quickly. [image: discriminator loss curve]