CompVis / stable-diffusion

A latent text-to-image diffusion model
https://ommer-lab.com/research/latent-diffusion-models/

[Discuss] Super-resolution with stable diffusion is not yet solved #299

Open fakyras opened 1 year ago

fakyras commented 1 year ago

tl;dr - Super-resolution is not yet a solved problem. Latent diffusion models have huge potential here with slight modifications to the U-Net architecture and training schedule.

I have taken up the ambitious idea of upsampling one or two games from the '90s with the help of stable diffusion. In its current state, however, stable diffusion alone is not suitable for the task: upsampling existing low-res images (e.g. 64x64 or lower) to 512x512 has to stay much closer to the source, as judged by the human eye, than img2img does when it only starts from a sketch.

I propose building the super-resolution task into model training:

1) downsample an image by a factor X (some float value between 1 and 8)
2) upsample the result by the same factor X, back to the original size
3) feed the upsampled (degraded) image to the encoder and use the original image as the target for the decoder output
4) add the value X to the U-Net latent space as conditioning (X = 1 is the default, meaning no super-resolution)
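A minimal sketch of how the data side of these steps could look, in plain Python on a 2D list of pixel values. The helper names (`resize_nearest`, `make_training_pair`) are hypothetical, and nearest-neighbour resampling is an assumption made only to keep the example dependency-free; the proposal does not say which interpolation to use, and a real pipeline would more likely use bicubic or similar.

```python
def resize_nearest(img, new_h, new_w):
    """Resize a 2D list of pixel values with nearest-neighbour sampling."""
    h, w = len(img), len(img[0])
    return [
        [img[min(h - 1, r * h // new_h)][min(w - 1, c * w // new_w)]
         for c in range(new_w)]
        for r in range(new_h)
    ]


def make_training_pair(img, x):
    """Build one (degraded input, original target, factor) training triple.

    x is the super-resolution factor from the proposal, a float in [1, 8].
    """
    if not 1.0 <= x <= 8.0:
        raise ValueError("x is expected to lie in [1, 8]")
    h, w = len(img), len(img[0])
    low_h, low_w = max(1, round(h / x)), max(1, round(w / x))
    low = resize_nearest(img, low_h, low_w)   # step 1: downsample by x
    degraded = resize_nearest(low, h, w)      # step 2: upsample back to full size
    # Step 3: `degraded` would go to the encoder, `img` is the reconstruction target.
    # Step 4: `x` is returned so it can be passed to the U-Net as conditioning.
    return degraded, img, x
```

With `x = 1.0` the degraded image is identical to the original, which corresponds to the "no super-resolution" default.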

This way, the latent diffusion process might learn that super-resolution is expected. I believe this would benefit not only images intended for super-resolution but, in a way, all other image generation as well: you could pick X > 1 as a parameter to tune during generation and might get richer content in the image.
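For X to be tunable at generation time, the model has to receive it as a conditioning signal. The post does not say how X should be injected, so the following is only one possible sketch: it embeds the scalar with the sinusoidal form commonly used for timestep embeddings in diffusion U-Nets. The function name `scale_embedding` and the defaults `dim=8` and `max_period=10000.0` are assumptions for illustration.

```python
import math


def scale_embedding(x, dim=8, max_period=10000.0):
    """Sinusoidal embedding of the scalar super-resolution factor x.

    Returns a list of `dim` floats: cosines first, then sines, over
    geometrically spaced frequencies (assumed design, not from the post).
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.cos(x * f) for f in freqs] + [math.sin(x * f) for f in freqs]
```

In such a setup the resulting vector could, for example, be projected and summed with the timestep embedding, so that choosing X = 1 or X > 1 at sampling time changes the conditioning without touching the rest of the model.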

This would also benefit the 'divide and conquer' approach people have been using to generate very high res images so far.

abatedemey commented 1 year ago

Any updates on this?

shreshthsaini commented 1 year ago

Any tangible progress in this direction?