Open vavasthi opened 2 months ago
Hello @vavasthi,
For generating larger-resolution images you will need significant compute at your disposal. But assuming you have that, the actual change is only one key, the image size:

    dataset_params:
      im_size: 256    # changed to 1024
However, as of now the autoencoder has a downscale factor of 8, which means you would be training the LDM on 128x128 latents. If that's fine for you in terms of compute cost then great, but if not you would want to increase that factor to 16 so that your LDM training happens on 64x64 latents. For this you would need the changes below:
    autoencoder_params:
      down_channels: [64, 128, 256, 256]   # changed to [64, 128, 256, 256, 256]
      down_sample: [True, True, True]      # changed to [True, True, True, True]
      attn_down: [False, False, False]     # changed to [False, False, False, False]
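A quick sanity check of the downscale arithmetic, assuming (as in this repo's autoencoder) that each `True` entry in `down_sample` halves the spatial resolution, so the total downscale factor is 2 to the power of the number of `True` entries:

```python
# Each True in down_sample halves the spatial resolution, so the total
# downscale factor is 2 ** (number of True entries).
im_size = 1024
down_sample = [True, True, True, True]   # the proposed 4-level config
factor = 2 ** sum(down_sample)
latent_size = im_size // factor
print(factor, latent_size)  # 16 64
```

With three `True` entries you get the current factor of 8 (1024 -> 128); the fourth entry brings it to 16 (1024 -> 64).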
Thanks @explainingai-code. With the changes you suggested I was able to successfully train the model on 768x768 images. It still doesn't work for 1024px images; I currently have an RTX 4090 with 24 GB of VRAM, so I am limited by that. Just one more question: given that my images are all greyscale, is there any other config change that would help me reach 1024px?
I have already set im_channels to 1.
Is there any other setting that could reduce the memory requirements of the model?
If you haven't changed the batch sizes, can you try reducing them for both the autoencoder and the LDM, using the following:

    train_params:
      ldm_batch_size: 16         # changed to 4
      autoencoder_batch_size: 4  # changed to 1
      autoencoder_acc_steps: 4   # changed to 16
I would assume the batch-size reduction is only needed for the autoencoder stage, so maybe just change that and see. If the autoencoder trains successfully but the LDM fails, then reduce the LDM batch size as well.
What is the goal of setting autoencoder_acc_steps to something other than 1? If it's higher than 1, won't it keep X copies of the gradients for all weights and consume a lot of memory?
Hello @jpmcarvalho, autoencoder_acc_steps is just for gradient accumulation: it mimics training with a larger batch size even when your GPU memory cannot accommodate that batch size. It does not store extra gradient copies; each backward pass sums into the same gradient buffers, and the optimizer steps only once every acc_steps mini-batches. Since the autoencoder trains on larger image sizes, I added this support in the config.
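To make the memory point concrete, here is a minimal sketch of gradient accumulation; the model and training loop are hypothetical, with `acc_steps` mirroring the config's `autoencoder_acc_steps`:

```python
import torch

# Each backward() SUMS into the existing .grad buffers, so gradient memory
# stays at the single-batch level no matter how many steps are accumulated.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
acc_steps = 4
updates = 0

for step, batch in enumerate(torch.randn(16, 8, 4).unbind(0)):
    loss = model(batch).pow(2).mean()
    (loss / acc_steps).backward()      # accumulate scaled grads in-place
    if (step + 1) % acc_steps == 0:
        optimizer.step()               # one update per acc_steps batches
        optimizer.zero_grad()          # clear the single set of buffers
        updates += 1

print(updates)  # 16 mini-batches / 4 acc_steps = 4 optimizer updates
```

Dividing the loss by `acc_steps` makes the accumulated gradient match the average over the larger effective batch.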
Hello @explainingai-code, how can I train the VAE on smaller images, like 64x64 or 128x128? I tried changing just im_size, but the VAE generates very noisy images after training. Also the perceptual loss becomes negative. Maybe I can avoid using the VAE altogether.
Hello @Nikita-Sherstnev, the VAE should work on smaller sizes too. Here's the config I used for the mnist dataset (https://github.com/explainingai-code/StableDiffusion-PyTorch/blob/main/config/mnist.yaml); the only changes were the channels, im_size, and reducing the downscaling factor to only two. And yes, if your images are only 64x64 you can instead run diffusion on the images themselves using the other repo (https://github.com/explainingai-code/DDPM-Pytorch) rather than diffusion on latents.
However, since the perceptual loss is becoming negative, I think there may be some other issue, as that should not be the case: the lpips loss is just a scaled mean of squared differences between feature maps of two images, and the scaling factors are all positive, so it should never be negative at all. Is it possible that you missed loading the lpips model weights (https://github.com/explainingai-code/StableDiffusion-PyTorch/tree/main#setup), causing the scaling factors to be negative and hence the negative perceptual loss and bad VAE output? Could you please check whether the lpips weights are getting loaded correctly here.
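To illustrate why a correctly weighted perceptual loss cannot go negative, here is a hypothetical LPIPS-style distance (the function and tensors are made up for the sketch, not the repo's actual implementation): a positively weighted mean of squared feature-map differences per layer.

```python
import torch

# With non-negative per-channel scaling weights, every term is >= 0,
# so the total distance can never be negative; a negative value points
# to broken/unloaded weights.
def lpips_like(feats_a, feats_b, weights):
    total = torch.zeros(())
    for fa, fb, w in zip(feats_a, feats_b, weights):
        diff2 = (fa - fb) ** 2              # squared differences: >= 0
        total = total + (w * diff2).mean()  # w >= 0 keeps each term >= 0
    return total

feats_a = [torch.randn(1, 8, 16, 16) for _ in range(3)]
feats_b = [torch.randn(1, 8, 16, 16) for _ in range(3)]
weights = [torch.rand(1, 8, 1, 1) for _ in range(3)]  # non-negative scales

print(float(lpips_like(feats_a, feats_b, weights)) >= 0.0)  # True
```

If randomly initialized (possibly negative) weights are used instead, individual terms can flip sign, which matches the negative loss you observed.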
@explainingai-code Thank you for your answer! You were right, I had downloaded the wrong lpips weights; for some reason the model downloaded by the code did not match the model from the README. Anyway, the model still does not seem to train very well. My dataset is very small (64 images), so maybe that is the issue. I trained for about 120 epochs with batch size 8, with the discriminator turned on for the last 40 epochs. The discriminator does not seem to give any quality improvement. I would like to train the DDPM model itself, but I want it to be text-conditioned as well :)
I am working on a use case where I want to generate larger-resolution images, something like 1024x1024. How do I modify the configuration to do that?