CompVis / latent-diffusion

High-Resolution Image Synthesis with Latent Diffusion Models

Details about training super resolution model #10

Open GioFic95 opened 2 years ago

GioFic95 commented 2 years ago

Hi @rromb, @ablattmann, @pesser, and thank you for making your great work publicly available.

Could you please supply the code for the class `ldm.data.openimages.SuperresOpenImagesAdvancedTrain/Validation`, needed to train your model for super-resolution as required in `bsr_sr/config.yaml` (see this line)? Otherwise, some more information about how to train the SR model with datasets not included in your repository would be very helpful.
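For reference, here is the kind of minimal stand-in I have in mind, as a hypothetical sketch only: the class name is made up, it uses plain bicubic downscaling instead of your BSR degradation pipeline, and the `image`/`LR_image` keys are assumed from the config and should be matched to whatever the model actually reads.

```python
# Hypothetical stand-in for the missing ldm.data.openimages classes.
# Assumes a flat folder of high-resolution PNGs and plain 4x bicubic
# downscaling (the real class presumably uses the BSR degradation pipeline).
import os
from glob import glob

import numpy as np
from PIL import Image
from torch.utils.data import Dataset


class SuperresFolderDataset(Dataset):
    def __init__(self, root, size=256, downscale_f=4):
        self.paths = sorted(glob(os.path.join(root, "*.png")))
        self.size = size
        self.lr_size = size // downscale_f

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        img = Image.open(self.paths[i]).convert("RGB")
        hr = img.resize((self.size, self.size), Image.BICUBIC)
        lr = hr.resize((self.lr_size, self.lr_size), Image.BICUBIC)
        # LDM-style datasets return HWC float arrays scaled to [-1, 1]
        to_arr = lambda x: np.array(x).astype(np.float32) / 127.5 - 1.0
        return {"image": to_arr(hr), "LR_image": to_arr(lr)}
```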

Thank you very much!

roimulia2 commented 2 years ago

Hey @GioFic95, did you happen to find out whether they posted a pre-trained model of their own? I can't find it.

GioFic95 commented 2 years ago

Hi @roimulia2, yes, the link to the pre-trained LDM for super-resolution is this one: https://ommer-lab.com/files/latent-diffusion/sr_bsr.zip. You can find it in this table in the readme, under the task "Super-resolution": https://github.com/CompVis/latent-diffusion#pretrained-ldms.

roimulia2 commented 2 years ago

> Hi @roimulia2, yes, the link to the pre-trained LDM for super-resolution is this one: https://ommer-lab.com/files/latent-diffusion/sr_bsr.zip.
>
> You can find it in this table in the readme, under the task "Super-resolution": https://github.com/CompVis/latent-diffusion#pretrained-ldms.

Sorry! I meant the inpainting pre-trained models. Are those available as well?

roimulia2 commented 2 years ago

@GioFic95 I replied above.

GioFic95 commented 2 years ago

@roimulia2 in the "Inpainting" section of the readme they provide a command and a link to the pretrained inpainting model too.

roimulia2 commented 2 years ago

@GioFic95 Does it make sense that the weights are 3.1 GB?

kaihe commented 2 years ago

@GioFic95 Refer to this line: the `second_stage_model` of the SR DDPM has no `encode` function, so the `cond_stage_key` image is still in image space, not latent space. Hence this line, `elif self.conditioning_key == 'concat': xc = torch.cat([x] + c_concat, dim=1)`, will throw "Sizes of tensors must match except in dimension 2".

Any chance `bsr_sr/config.yaml` is wrong?
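A toy reproduction of that mismatch (the shapes are illustrative, not taken from the repo): if the conditioning stays at image resolution while x is a latent, the channel-wise concat fails on the spatial dimensions.

```python
import torch

x = torch.randn(1, 3, 64, 64)                # latent of a 256x256 image (f=4)
c_image_space = torch.randn(1, 3, 256, 256)  # LR conditioning left in image space

try:
    torch.cat([x, c_image_space], dim=1)     # spatial sizes differ -> error
except RuntimeError as e:
    print(e)  # "Sizes of tensors must match except in dimension ..."

c_latent_size = torch.randn(1, 3, 64, 64)    # conditioning at latent resolution
print(torch.cat([x, c_latent_size], dim=1).shape)  # torch.Size([1, 6, 64, 64])
```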

kaihe commented 2 years ago

> @GioFic95 Refer to this line: the `second_stage_model` of the SR DDPM has no `encode` function, so the `cond_stage_key` image is still in image space, not latent space. Hence this line, `elif self.conditioning_key == 'concat': xc = torch.cat([x] + c_concat, dim=1)`, will throw "Sizes of tensors must match except in dimension 2".
>
> Any chance `bsr_sr/config.yaml` is wrong?

I finally figured it out: the config is right. According to section 4.4 of the paper, they "simply concatenate the low-resolution conditioning y and the inputs to the UNet, i.e. τθ is the identity." The low-resolution image must be exactly the same size as the latent space: for example, a 64x64x3 KL encoder can only upscale a 64x64 image, and a 32x32x4 KL encoder can only upscale a 32x32 image.

This is very different from SR3, which first upscales any low-resolution image to the target resolution and then concatenates the two in image space.

I did try upscaling the low-resolution image, encoding it with the second-stage model, and concatenating them in latent space like SR3, but the result is always random noise, like this: [image]
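To make the shape bookkeeping concrete, here is a toy sketch of the concat conditioning described in section 4.4 (τθ = identity); the tensors are stand-ins, not the repo's actual first stage.

```python
import torch

def ldm_sr_unet_input(z_hr, lr_image):
    """Concat conditioning per sec. 4.4: tau_theta is the identity, so the LR
    image is concatenated to the latent channel-wise, with no re-encoding."""
    assert lr_image.shape[-2:] == z_hr.shape[-2:], \
        "LR conditioning must already match the latent spatial size"
    return torch.cat([z_hr, lr_image], dim=1)

# f=4 first stage: a 256x256 HR image maps to a 64x64 latent, so the LR
# conditioning must itself be 64x64, exactly the observation above.
z = torch.randn(1, 3, 64, 64)   # stand-in for encode(hr_256)
lr = torch.randn(1, 3, 64, 64)  # 64x64 LR image, used as-is
print(ldm_sr_unet_input(z, lr).shape)  # torch.Size([1, 6, 64, 64])
```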

IceClear commented 2 years ago

@GioFic95 Hi~ Have you figured out where `ldm.data.openimages.SuperresOpenImagesAdvancedTrain/Validation` is and how to train on other datasets? I read the code pipeline and found it a bit complicated to train on my own dataset.

GioFic95 commented 2 years ago

@IceClear Hi, unfortunately I'm in the same situation as you.

IceClear commented 2 years ago

> @GioFic95 Refer to this line: the `second_stage_model` of the SR DDPM has no `encode` function, so the `cond_stage_key` image is still in image space, not latent space. Hence this line, `elif self.conditioning_key == 'concat': xc = torch.cat([x] + c_concat, dim=1)`, will throw "Sizes of tensors must match except in dimension 2". Any chance `bsr_sr/config.yaml` is wrong?
>
> I finally figured it out: the config is right. According to section 4.4 of the paper, they "simply concatenate the low-resolution conditioning y and the inputs to the UNet, i.e. τθ is the identity." The low-resolution image must be exactly the same size as the latent space: for example, a 64x64x3 KL encoder can only upscale a 64x64 image, and a 32x32x4 KL encoder can only upscale a 32x32 image.
>
> This is very different from SR3, which first upscales any low-resolution image to the target resolution and then concatenates the two in image space.
>
> I did try upscaling the low-resolution image, encoding it with the second-stage model, and concatenating them in latent space like SR3, but the result is always random noise.

In my view, this may not be true, since I have successfully generated a 120x120 image using the pre-trained model, whose size is 64x64 in the default config. I think the basic idea is that the latent code is generated based on the low-resolution input. Thus, just change the image size in the config to the desired size (it must be a multiple of 8) and you can obtain SR images accordingly. But the pre-trained model can only be applied for 4x upscaling, because it uses the f=4 VQ first stage. I am not sure I am right, but the generated image looks reasonable.

[image: bird]
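If that reading is right, sampling at a different size should just be a config tweak; a sketch against the bsr_sr config (the key path is assumed and may differ):

```python
from omegaconf import OmegaConf

config = OmegaConf.load("models/ldm/bsr_sr/config.yaml")
# The default sample size is 64 (latents for 256x256 outputs with f=4).
# Raise it to match a larger LR input; per the note above it should stay
# a multiple of 8, and the f=4 first stage fixes the factor at 4x.
config.model.params.image_size = 120
```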

jujaryu commented 1 year ago

I need `SuperresOpenImagesAdvancedTrain` too.

YunjinChen commented 1 year ago

@GioFic95 @kaihe @IceClear Hi, can you share your inference script for the LDM-BSR model? I ran into some problems reproducing the BSR results shown in the paper.
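For what it's worth, a rough inference sketch can be pieced together from the repo's public pieces (`ldm.util.instantiate_from_config` and `DDIMSampler` are in the repo; the checkpoint path, conditioning handling, and shapes below are assumptions based on the discussion above, not the authors' script):

```python
import torch
from omegaconf import OmegaConf
from ldm.util import instantiate_from_config
from ldm.models.diffusion.ddim import DDIMSampler

config = OmegaConf.load("models/ldm/bsr_sr/config.yaml")
model = instantiate_from_config(config.model)
ckpt = torch.load("models/ldm/bsr_sr/model.ckpt", map_location="cpu")
model.load_state_dict(ckpt["state_dict"], strict=False)
model = model.cuda().eval()

# LR input in [-1, 1], already at latent resolution (64x64 for the f=4 model);
# with concat conditioning and tau_theta = identity it is passed in directly.
lr = torch.randn(1, 3, 64, 64).cuda()  # placeholder for a real LR image

sampler = DDIMSampler(model)
with torch.no_grad():
    samples, _ = sampler.sample(S=200, batch_size=1, shape=lr.shape[1:],
                                conditioning=lr, verbose=False)
    sr = model.decode_first_stage(samples)  # pixel space, 4x the LR size
```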