CompVis / latent-diffusion

High-Resolution Image Synthesis with Latent Diffusion Models

autoencoder for LDM #7

Open seung-kim opened 2 years ago

seung-kim commented 2 years ago

Hi! Could you please add to the table which autoencoding models correspond to which LDMs? Maybe I am missing this information somewhere, but it is not clear which autoencoder goes with which model.

vvvm23 commented 2 years ago

@seung-kim I was struggling with this too. I ran scripts/download_first_stages.sh, which downloads all the autoencoders; each one ships with a config.yaml stating that the training data was ldm.data.openimages.FullOpenImagesTrain. So it seems they were all trained on the OpenImages dataset?
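
For anyone who wants to check this themselves, here is a minimal sketch (it assumes the configs downloaded by scripts/download_first_stages.sh end up under models/first_stage_models/<name>/config.yaml and follow the usual data.params.train.target layout, which may differ per config):

```python
# Sketch: print the training-data class declared in each downloaded first-stage config.
# Assumes configs live under models/first_stage_models/<name>/config.yaml.
from pathlib import Path

from omegaconf import OmegaConf

for cfg_path in sorted(Path("models/first_stage_models").glob("*/config.yaml")):
    cfg = OmegaConf.load(cfg_path)
    # The training dataset class is usually declared at data.params.train.target.
    target = OmegaConf.select(cfg, "data.params.train.target", default="<not specified>")
    print(f"{cfg_path.parent.name}: {target}")
```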

@ablattmann @rromb could you please confirm this and also add the information to the README?

Eudea commented 1 year ago

> @seung-kim I was struggling with this too. I ran scripts/download_first_stages.sh, which downloads all the autoencoders; each one ships with a config.yaml stating that the training data was ldm.data.openimages.FullOpenImagesTrain. So it seems they were all trained on the OpenImages dataset?
>
> @ablattmann @rromb could you please confirm this and also add the information to the README?

I have the same question. Have you figured it out yet?

keyu-tian commented 10 months ago

The class FullOpenImagesTrain does not exist in the repository; the file ldm/data/openimages.py seems to be missing. Could you check that, @rromb @ablattmann?
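
In the meantime, a hypothetical stand-in for that class (the folder layout and the exact preprocessing here are assumptions, not the authors' original code) could look roughly like this:

```python
# Hypothetical stand-in for the missing ldm/data/openimages.py.
# Reads images from a local OpenImages folder and returns the repo's usual
# {"image": HWC float32 in [-1, 1]} dict; preprocessing is a simplification.
import glob
import os

import numpy as np
from PIL import Image
from torch.utils.data import Dataset


class FullOpenImagesTrain(Dataset):
    def __init__(self, data_root="data/openimages/train", size=256):
        self.paths = sorted(glob.glob(os.path.join(data_root, "**", "*.jpg"), recursive=True))
        self.size = size

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        img = Image.open(self.paths[i]).convert("RGB")
        # Center-crop to a square, then resize to the target resolution.
        w, h = img.size
        s = min(w, h)
        img = img.crop(((w - s) // 2, (h - s) // 2, (w + s) // 2, (h + s) // 2))
        img = img.resize((self.size, self.size), Image.BICUBIC)
        arr = np.array(img).astype(np.float32) / 127.5 - 1.0  # [0, 255] -> [-1, 1]
        return {"image": arr}
```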

mia01 commented 10 months ago

Hi, did anyone figure this out?

keyu-tian commented 9 months ago

@mia01 @Eudea @vvvm23 @seung-kim I think I'm training VQVAEs successfully on OpenImages, using just a random-crop augmentation (resize to 384, then random crop to 256) and normalizing pixels from [0, 1] to [-1, 1]. For fine-tuning I use lr=4e-4 with batch_size=1024; for training from scratch I use lr=4e-6 with batch_size=1024. I use the Adam optimizer with betas=(0.5, 0.9), following https://github.com/CompVis/taming-transformers/blob/3ba01b241669f5ade541ce990f7650a3b8f65318/taming/models/vqgan.py#L128.
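
To make that concrete, a minimal sketch of the preprocessing and optimizer settings above (the dataset and model wiring is omitted; transforms.Resize(384) resizes the short side, which is one reading of "resize to 384"):

```python
# Sketch of the augmentation and optimizer settings described above.
# The model below is a placeholder and should be replaced by your VQVAE/VQGAN.
import torch
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(384),                               # short side -> 384
    transforms.RandomCrop(256),                           # random 256x256 crop
    transforms.ToTensor(),                                # pixels in [0, 1]
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),  # [0, 1] -> [-1, 1]
])

model = torch.nn.Conv2d(3, 3, 3, padding=1)  # placeholder module
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=4e-4,           # 4e-4 for fine-tuning, 4e-6 from scratch (per the comment above)
    betas=(0.5, 0.9),  # as in the taming-transformers VQGAN setup
)
```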

wtliao commented 5 months ago

> @mia01 @Eudea @vvvm23 @seung-kim I think I'm training VQVAEs successfully on OpenImages, using just a random-crop augmentation (resize to 384, then random crop to 256) and normalizing pixels from [0, 1] to [-1, 1]. For fine-tuning I use lr=4e-4 with batch_size=1024; for training from scratch I use lr=4e-6 with batch_size=1024. I use the Adam optimizer with betas=(0.5, 0.9), following https://github.com/CompVis/taming-transformers/blob/3ba01b241669f5ade541ce990f7650a3b8f65318/taming/models/vqgan.py#L128.

Hi @keyu-tian, I am curious about the distribution of short-side lengths of the images in OpenImages. The VAE is trained with the augmentation (resize to 384, then random crop to 256); does that mean all images are downsampled to 384?

bu135 commented 3 months ago

> @mia01 @Eudea @vvvm23 @seung-kim I think I'm training VQVAEs successfully on OpenImages, using just a random-crop augmentation (resize to 384, then random crop to 256) and normalizing pixels from [0, 1] to [-1, 1]. For fine-tuning I use lr=4e-4 with batch_size=1024; for training from scratch I use lr=4e-6 with batch_size=1024. I use the Adam optimizer with betas=(0.5, 0.9), following https://github.com/CompVis/taming-transformers/blob/3ba01b241669f5ade541ce990f7650a3b8f65318/taming/models/vqgan.py#L128.

Hi @keyu-tian. I'm curious whether you have run any experiments with a VAE instead of a VQGAN? I get the impression that the grid artifacts are hard to eliminate; should the discriminator loss weight be increased?