explainingai-code / StableDiffusion-PyTorch

This repo implements a Stable Diffusion model in PyTorch with all the essential components.

Unable to run #12

Open mognc opened 3 months ago

mognc commented 3 months ago

Hello there, I am trying to run the text conditional part and have followed all the instructions, but at the end I am facing the following error (screenshot attached below): "Model checkpoint celebhq/ddpm_ckpt_text_cond_clip.pth not found"

[Screenshot 2024-03-28 144732: error message]
explainingai-code commented 3 months ago

Hello @mognc, when you ran train_ddpm_cond.py, which configuration file did you use? If it was config/celebhq_text_cond.yaml, then training would have created a checkpoint at celebhq/ddpm_ckpt_text_cond_clip.pth. Can you let me know which config you trained with, and if you changed any parameters, could you attach that config as well?

mognc commented 3 months ago

I didn't change any parameters, and yes, I used config/celebhq_text_cond.yaml.

explainingai-code commented 3 months ago

Okay, then can you check the name of the checkpoint file that was created in the celebhq folder?

mognc commented 3 months ago

Sorry, but there is no celebhq folder here.

[Screenshot 2024-03-28 161325: folder contents]
explainingai-code commented 3 months ago

But there should be a celebhq folder after you run the autoencoder. I am assuming you ran train_vqvae with the same config file, right?

mognc commented 3 months ago

Sorry for troubling you, I missed that block. It's downloading different models at the moment, and I hope my error will be resolved now. Thanks for helping; I am new to this image generation thing, so I make silly mistakes.

explainingai-code commented 3 months ago

No problem at all @mognc :) So basically you first train the autoencoder (train_vqvae.py), and then you can choose to train either the unconditional or the conditional diffusion model. Just make sure you use the same config file for both stages (autoencoder and LDM).

Will keep the issue open for now, and you can close it once you have successfully run the text conditional training. Feel free to comment here if you run into any further problems.

mognc commented 3 months ago

Unfortunately the error did not get resolved. I am sharing attachments showing the commands I used and the contents of my celebhq folder.

[Screenshots 2024-03-28 180307, 180248, 180203: commands used and celebhq folder contents]
explainingai-code commented 3 months ago

After running train_vqvae, did you run train_ddpm_cond? Did that fail?

mognc commented 3 months ago

After running the command "!python -m tools.train_vqvae --config config/celebhq_text_cond.yaml" it displayed that training was completed. Then I ran "!python -m tools.sample_ddpm_text_cond --config config/celebhq_text_cond.yaml", which failed. I have attached pics above for your reference.

explainingai-code commented 3 months ago

Yes, but train_vqvae is only Stage I. It only trains the autoencoder, not the diffusion model. Once the autoencoder is trained, you need to run train_ddpm_cond for Stage II training, that is, training the conditional latent diffusion model. Only after that is trained will you be able to generate images using the sample_ddpm_text_cond script.
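
For reference, the end-to-end order with this config looks like this (these are the same commands quoted elsewhere in this thread, run in sequence):

    # Stage I: train the VQ-VAE autoencoder
    python -m tools.train_vqvae --config config/celebhq_text_cond.yaml

    # Stage II: train the text conditional latent diffusion model
    python -m tools.train_ddpm_cond --config config/celebhq_text_cond.yaml

    # Sampling: requires the checkpoints produced by both stages
    python -m tools.sample_ddpm_text_cond --config config/celebhq_text_cond.yaml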

mognc commented 3 months ago

Hey there, my model got into a running state yesterday, but the output was not what I desired, and I am a bit confused about what changes I have to make to improve it. I have a dataset of 100 images with a file of 100 captions; the images are 600x800. I have made my dataset and yaml file similar to celebhq.yaml and the celeb_dataset.py file, and I have pointed the targets to my own files. I want to generate pics similar to my dataset, but with variations in them. I figured that to achieve this I should use the text conditional LDM training method and create a file similar to celebhq_text_cond.yaml, right? But now I am unsure what additional changes I have to make in the three new files I created, so could you point out the parameters I should change, like the number of epochs to train for, etc.? Also, I have enabled save latents in those files, as you mentioned in the readme, to speed up training.

explainingai-code commented 3 months ago

How many epochs/steps did you train the autoencoder for? And could you add some output examples from the autoencoder, and likewise for the LDM? That will help me understand which stage is not generating the desired output.

The first thing I would suggest is to use more images, maybe 2K to start with. Second, is there a reason you want to formulate this as a text conditional problem rather than a class conditional one? For text conditional problems you would be training additional cross-attention layers, and training will also be slower. So if you can achieve your goal by formulating this as a class conditional problem, I would suggest trying that first.

mognc commented 3 months ago

Well, I didn't change the epochs or samples. At the moment I don't have access to those outputs, as Colab erases all data after the session terminates, but the LDM stage was not producing correct output; it was just a blurry pic. I assumed that to add variations to a dataset through a prompt I would need text conditioning and not class conditioning. I might be wrong, as I am no expert, but the final goal is to add variations according to the user prompt. I have a dataset of walls, and the user will enter a prompt like "snow on walls" or "shadow on the walls", and the relevant pic will be generated.

explainingai-code commented 3 months ago

If you didn't change any parameters, then the autoencoder ran for only 20 epochs and the discriminator never even started, because the config starts the discriminator at 15000 steps. So you should train the autoencoder again anyway, and change the disc_start parameter to the number of steps after which you start seeing decent but blurry outputs from the autoencoder.

For the conditioning, if all you have are texts of the type '<obj> on walls', where obj can be one of K things, then you can use class conditioning with K classes rather than text conditioning.

mognc commented 3 months ago

Ok, I will change that parameter. And I will try class conditioning too, but I just want to make sure: my dataset is simple and doesn't include the types of variations I want, like snow or dust. This won't be a problem, right?

explainingai-code commented 3 months ago

Also, the LDM epochs are set to 100, but that was for the celebhq dataset with 30000 images. I would suggest that in the current setting with 100 images, you train the LDM (Stage 2) for up to 1000 epochs to validate the quality of the LDM outputs (more if you see that the quality is still improving).

explainingai-code commented 3 months ago

"but I just want to make sure like my dataset is simple it don't include type of variations I want like snow or dust" I didnt get this part. Could you clarify a bit ? Do you mean that you want the model to generate variations for which you don't have images ?

mognc commented 3 months ago

Yes, I don't have pics of the variations I want.

explainingai-code commented 3 months ago

But if the model has never seen what 'snow' looks like at any point during training, it will not be able to generate 'snow on walls', right?

mognc commented 3 months ago

Well, my friend used some pre-trained models and those were producing results, so I am not sure how this model works. Should I add simple pics of snow, dust, and the other variations and merge them with the walls dataset?

explainingai-code commented 3 months ago

Yes, a pre-trained model would work because it has seen what 'snow' looks like, but this model is trained from scratch. So I would suggest either using a pre-trained model and fine-tuning it with libraries like diffusers, or, if you want to train from scratch using this repo, adding those images to the training data.

mognc commented 3 months ago

Well, I will stick with this repo, add the variation pics along with captions, and merge all the datasets together. Thanks for clearing up all my confusion; I really appreciate your time.

mognc commented 3 months ago

I can't figure out this part of the guide, like what to change and where.

[screenshot of the readme section on dataset conditioning]
explainingai-code commented 3 months ago

This part of the readme is just saying that the dataset class must return a tuple of the image tensor and a dictionary of conditional inputs. For the class conditional case, the dictionary only needs one key, 'class', with the integer class of the item as its value. Example: https://github.com/explainingai-code/StableDiffusion-PyTorch/blob/main/dataset/mnist_dataset.py#L75-L77
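
As a minimal sketch of that contract (this is not the repo's actual code; WallsDataset, its constructor arguments, and the transform choices are made up for illustration), a class conditional dataset could look like this:

    from PIL import Image
    from torch.utils.data import Dataset
    import torchvision.transforms as T

    # Illustrative dataset: __getitem__ returns (image_tensor, cond_inputs),
    # where cond_inputs carries the integer class id under the 'class' key.
    class WallsDataset(Dataset):
        def __init__(self, image_paths, labels, im_size=256):
            # image_paths: list of image file paths
            # labels: list of integer class ids in [0, K)
            self.image_paths = image_paths
            self.labels = labels
            self.transform = T.Compose([
                T.Resize((im_size, im_size)),
                T.ToTensor(),  # HWC uint8 -> CHW float in [0, 1]
            ])

        def __len__(self):
            return len(self.image_paths)

        def __getitem__(self, index):
            im = Image.open(self.image_paths[index]).convert('RGB')
            im_tensor = 2 * self.transform(im) - 1  # scale to [-1, 1]
            # Dictionary of conditional inputs; for class conditioning the
            # only required key is 'class' with the integer class id.
            cond_inputs = {'class': self.labels[index]}
            return im_tensor, cond_inputs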

mognc commented 3 months ago

Hello again. I was unable to train my model class conditionally; I couldn't solve all the errors after changing some files. So I tried training it text conditionally, and here are some outputs. [image: current_autoencoder_sample_193] This is the final picture generated while training the autoencoder, at 500 epochs with disc_start at 100. [image: x0_996] This is the sample generated at x0_996. [image: x0_0] And this is the sample generated at x0_0. I trained the model for 1000 epochs; can this result be improved if I train for more epochs? And a final question: my checkpoint file ddpm_ckpt_text_cond_clip.pth is overwritten every time I run the "!python -m tools.train_ddpm_cond --config config/celebhq_text_cond.yaml" cell and a new file is saved, right?

mognc commented 3 months ago

This is my config file, which I edited:

    dataset_params:
      im_path: 'data/Cracks_data'
      im_channels: 3
      im_size: 256
      name: 'crack'

    diffusion_params:
      num_timesteps: 1000
      beta_start: 0.00085
      beta_end: 0.012

    ldm_params:
      down_channels: [256, 384, 512, 768]
      mid_channels: [768, 512]
      down_sample: [True, True, True]
      attn_down: [True, True, True]
      time_emb_dim: 512
      norm_channels: 32
      num_heads: 16
      conv_out_channels: 128
      num_down_layers: 2
      num_mid_layers: 2
      num_up_layers: 2
      condition_config:
        condition_types: ['text']
        text_condition_config:
          text_embed_model: 'clip'
          train_text_embed_model: False
          text_embed_dim: 512
          cond_drop_prob: 0.1

    autoencoder_params:
      z_channels: 3
      codebook_size: 8192
      down_channels: [64, 128, 256, 256]
      mid_channels: [256, 256]
      down_sample: [True, True, True]
      attn_down: [False, False, False]
      norm_channels: 32
      num_heads: 4
      num_down_layers: 2
      num_mid_layers: 2
      num_up_layers: 2

    train_params:
      seed: 1111
      task_name: 'crack'
      ldm_batch_size: 16
      autoencoder_batch_size: 4
      disc_start: 100
      disc_weight: 0.5
      codebook_weight: 1
      commitment_beta: 0.2
      perceptual_weight: 1
      kl_weight: 0.000005
      ldm_epochs: 1000
      autoencoder_epochs: 500
      num_samples: 1
      num_grid_rows: 1
      ldm_lr: 0.000005
      autoencoder_lr: 0.00001
      autoencoder_acc_steps: 4
      autoencoder_img_save_steps: 64
      save_latents: True
      cf_guidance_scale: 1.0
      vae_latent_dir_name: 'vae_latents'
      vqvae_latent_dir_name: 'vqvae_latents'
      ldm_ckpt_name: 'ddpm_ckpt_text_cond_clip.pth'
      vqvae_autoencoder_ckpt_name: 'vqvae_autoencoder_ckpt.pth'
      vae_autoencoder_ckpt_name: 'vae_autoencoder_ckpt.pth'
      vqvae_discriminator_ckpt_name: 'vqvae_discriminator_ckpt.pth'
      vae_discriminator_ckpt_name: 'vae_discriminator_ckpt.pth'

explainingai-code commented 3 months ago

I think it would benefit from training the autoencoder more. Specifically, two changes:

  1. autoencoder_epochs: 1000
  2. disc_start: 200 x (number of steps in one epoch)

Basically, train for longer, and start the discriminator only after your autoencoder generates the best reconstructions it can. disc_start is the number of steps after which the discriminator starts.
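
As a concrete worked example (assuming the 100-image dataset and the autoencoder_batch_size: 4 from the config above): one epoch is 100 / 4 = 25 steps, so disc_start would be roughly 200 x 25 = 5000 steps.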

explainingai-code commented 3 months ago

Yes, ddpm_ckpt_text_cond_clip.pth is overwritten every time you run the training.