explainingai-code / DDPM-Pytorch

This repo implements Denoising Diffusion Probabilistic Models (DDPM) in Pytorch

what parameter changes would I need to make sure it runs on our dataset? #2

Open Rushi117108 opened 10 months ago

Rushi117108 commented 10 months ago

I am running this code on a set of images but getting this error: "CUDA out of memory. Tried to allocate 150.06 GiB (GPU 0; 15.89 GiB total capacity; 720.18 MiB already allocated; 14.31 GiB free; 736.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation." I have updated the batch size and also resized the images to 224x224, but it is still giving me this CUDA error.

Can you please tell me what I should do?

Thanks

explainingai-code commented 10 months ago

Hello,

224x224 is still large for this model. Can you please try to follow the steps mentioned here and see if it works fine after that?
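For context, a minimal sketch of the kind of change that relieves this OOM: shrinking the training resolution (and, if needed, the batch size), since the UNet's activation memory grows roughly with the square of im_size. The transform pipeline and file name below are illustrative assumptions, not the repo's actual data loader.

```python
# Minimal sketch (assumed transform pipeline, not the repo's loader): resize images
# down before they reach the UNet; activation memory scales roughly with IM_SIZE**2.
import torchvision.transforms as T
from PIL import Image

IM_SIZE = 64        # instead of 224; also matches the im_size used later in this thread
BATCH_SIZE = 16     # lower further (8, 4, ...) if OOM persists

transform = T.Compose([
    T.Resize((IM_SIZE, IM_SIZE)),   # downsample the image
    T.ToTensor(),                   # float tensor in [0, 1], shape (C, IM_SIZE, IM_SIZE)
])

img = transform(Image.open("sample.png").convert("RGB"))  # "sample.png" is a placeholder
print(img.shape)  # torch.Size([3, 64, 64])
```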

Rushi117108 commented 10 months ago

Hi, thank you for the reply. It is running now. But if I have to run at 224 size, how can I do it? BTW, I am taking im_size = 64.

explainingai-code commented 10 months ago

With 224x224 images it would be difficult using the current code version, but you could try the following:

  1. Reduce the number of channels and layers significantly until a single GPU's memory is enough (but chances are it would not give good results).
  2. Right now the code does not support multi-GPU training, but feel free to make changes to have it run on multiple GPUs.
  3. Use a VAE/VQ-VAE to get 224x224 -> 64x64 latents, then train diffusion on these 64x64 latents on a single GPU (see the sketch after this list). During sampling, feed the generated 64x64 latents to the decoder of the VAE/VQ-VAE to get a 224x224 image. By the end of this month I will have a repo for Stable Diffusion that will allow you to do this.
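A minimal sketch of option 3, assuming a small stand-in autoencoder (the repo does not ship a VAE/VQ-VAE, and the class and layer sizes below are illustrative): encode images into smaller latents, train the DDPM on those latents, and decode generated latents back to pixel space at sampling time.

```python
# Sketch of the latent-diffusion idea from option 3. TinyAutoencoder is a stand-in
# for a pre-trained VAE/VQ-VAE; here 224x224 is compressed to 28x28 latents for
# simplicity (adjust strides, or resize to 256 first, if you want 64x64 latents).
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self, latent_channels=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1),   # 224 -> 112
            nn.SiLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),  # 112 -> 56
            nn.SiLU(),
            nn.Conv2d(64, latent_channels, 4, stride=2, padding=1),  # 56 -> 28
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 64, 4, stride=2, padding=1),  # 28 -> 56
            nn.SiLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),               # 56 -> 112
            nn.SiLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),                # 112 -> 224
        )

autoencoder = TinyAutoencoder().eval()       # assume this was trained separately on the dataset
images = torch.randn(2, 3, 224, 224)         # dummy batch standing in for real 224x224 images
with torch.no_grad():
    latents = autoencoder.encoder(images)    # (2, 4, 28, 28) -- run DDPM training on these
    # ... the usual DDPM noise-prediction training / sampling happens in latent space ...
    decoded = autoencoder.decoder(latents)   # (2, 3, 224, 224) -- back to pixel space
print(latents.shape, decoded.shape)
```

The payoff is that the diffusion UNet only ever sees the small latent tensors, so its memory no longer scales with the full 224x224 resolution.
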
Rushi117108 commented 10 months ago

Thank you for your response.

Rushi117108 commented 10 months ago

Hi,

I trained the model on a medical dataset, and after sampling the results are not as expected. Am I missing something? Please shed some light.

explainingai-code commented 10 months ago

When you say results are not as expected, do you mean the generated images are complete garbage, or are they just not of that high quality? Was the generation output improving throughout the training epochs? Also, is it possible to share the model config, a sample dataset image, and the generated output?

Rushi117108 commented 10 months ago

Hi, I am attaching the config settings, the generated output, and an input image. [attachments: config, output, image1_0.png]

Rushi117108 commented 10 months ago

The model is improving during training.

explainingai-code commented 10 months ago

A couple of things that I can think of. I see your images are grayscale; is there any specific reason to use 3 channels? Maybe try with im_channels: 1. Based on these images, I suspect the model needs to be trained more (I had used 40 epochs for MNIST itself), so maybe train for 100/200 epochs.

Can you see if this helps?
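A minimal sketch of that first suggestion, assuming a torchvision-style loading pipeline (not the repo's actual dataset class): convert the images to a single channel so they match im_channels: 1 in the config.

```python
# Sketch of the single-channel suggestion (assumed transform pipeline, not the
# repo's dataset code). The tensor's channel count must match im_channels in the config.
import torchvision.transforms as T
from PIL import Image

transform = T.Compose([
    T.Grayscale(num_output_channels=1),  # collapse RGB to one channel
    T.Resize((64, 64)),
    T.ToTensor(),                        # shape (1, 64, 64)
])

img = transform(Image.open("scan.png"))  # "scan.png" is a placeholder file name
print(img.shape)  # torch.Size([1, 64, 64]) -> set im_channels: 1 to match
```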

Rushi117108 commented 10 months ago

No, the images are not grayscale; they have 3 channels. But I will train for more epochs.

xiaoxiao079 commented 10 months ago

> Hi, I am attaching the config settings, the generated output, and an input image. [attachments: config, output, image1_0.png]

Hi there, how did you do this? My dataset also has 3 channels, and I did all the changes mentioned by @explainingai-code, but I got a size mismatch error. [attachment: error screenshot]

explainingai-code commented 10 months ago

Hi @xiaoxiao079, it looks from the error that the code is trying to load a checkpoint which was trained with a different configuration than the one you are currently using to train/infer. If this error comes up during training, there might already be a checkpoint with the same name but trained using a different configuration, which throws the error here - https://github.com/explainingai-code/DDPM-Pytorch/blob/main/tools/train_ddpm.py#L49 . If this error comes up during sampling, then the config you are using for sampling might be incorrect here - https://github.com/explainingai-code/DDPM-Pytorch/blob/main/tools/sample_ddpm.py#L73
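
For reference, a minimal sketch of how this size mismatch arises, under assumed names (the checkpoint path and the toy model below are illustrative, not the repo's code): load_state_dict fails when the checkpoint on disk was produced by a model built from a different config.

```python
# Sketch of the failure mode: loading a checkpoint saved from a model built with a
# different config raises a size-mismatch RuntimeError. Path and model are illustrative.
import os
import torch
import torch.nn as nn

ckpt_path = "default/ddpm_ckpt.pth"   # hypothetical path; use the task_name/ckpt_name from your config

model = nn.Conv2d(1, 32, 3)           # model built from the *current* config (e.g. 1 input channel)
if os.path.exists(ckpt_path):
    state = torch.load(ckpt_path, map_location="cpu")
    try:
        model.load_state_dict(state)  # RuntimeError: size mismatch if the shapes differ
    except RuntimeError as err:
        print("Incompatible checkpoint for the current config:", err)
        # Fix: delete/rename the stale checkpoint, or point the config at a fresh task_name
```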