CompVis / latent-diffusion

High-Resolution Image Synthesis with Latent Diffusion Models

Why do higher resolution images have duplicate artifacts? #42

Open srelbo opened 2 years ago

srelbo commented 2 years ago

Hi @rromb @ablattmann

Thank you for sharing your work and documenting it well.

While generating higher resolution images, I am seeing duplicate artifacts. For example, here is a generated image of a plane.

[image: max_elbo_1]

Is there a way to generate just 1 object instead of multiples?

austensatterlee commented 2 years ago

Hopefully someone smarter than myself comes along with a more serious answer for you, but have you tried adding to the prompt something along the lines of "a single plane", or "one plane"?

serhio-k commented 2 years ago

> Hopefully someone smarter than myself comes along with a more serious answer for you, but have you tried adding to the prompt something along the lines of "a single plane", or "one plane"?

Unfortunately, it doesn't make any difference, and I suspect the answer will be very tricky. The results below are for the 'a single plane' prompt:

boopage commented 2 years ago

I'm blown away by how well this all works. I can't wait until we reach the point where we can generate 'live video' in a few years! That might be far off, but I feel there's a lot of momentum and progress at the moment. And then the next step, of course, is to feed input into the model via Neuralink and feed the output back into the brain... Ok, to the point :laughing:

I'm also really curious whether generation at resolutions >256px can be steered somehow so that it matches how 256px images compose and focus. Is this an inherent result of how LDM works that would require retraining, or can it be adjusted in some way?

Other examples: this 'merging of objects' happens with basically all text2image requests at resolutions >256px.

A single photo-realistic red apple - steps 50, scale 10, eta 0.2
- 256x256 (2x2 collage)
- 512x512 (2x2 collage)
- 768x768 (apple mayhem)
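
For reference, this is roughly how the collages above were generated (a minimal sketch adapted from scripts/txt2img.py; the config and checkpoint paths are assumed to follow the README, and only H and W change between the three resolutions):

```python
import torch
from omegaconf import OmegaConf
from ldm.util import instantiate_from_config
from ldm.models.diffusion.ddim import DDIMSampler

# Load the pretrained text2img model (paths as described in the README).
config = OmegaConf.load("configs/latent-diffusion/txt2img-1p4B-eval.yaml")
model = instantiate_from_config(config.model)
sd = torch.load("models/ldm/text2img-large/model.ckpt", map_location="cpu")["state_dict"]
model.load_state_dict(sd, strict=False)
model = model.cuda().eval()
sampler = DDIMSampler(model)

prompt, n_samples = "A single photo-realistic red apple", 4
H = W = 768  # 256 and 512 for the other collages; only this changes

with torch.no_grad(), model.ema_scope():
    uc = model.get_learned_conditioning(n_samples * [""])
    c = model.get_learned_conditioning(n_samples * [prompt])
    # The latent is sampled at H//8 x W//8 here, so larger canvases go well
    # beyond the latent size the model saw during 256x256 training, which is
    # presumably where the repetition shows up.
    samples, _ = sampler.sample(S=50, conditioning=c, batch_size=n_samples,
                                shape=[4, H // 8, W // 8], verbose=False,
                                unconditional_guidance_scale=10.0,
                                unconditional_conditioning=uc, eta=0.2)
    imgs = torch.clamp((model.decode_first_stage(samples) + 1.0) / 2.0, 0.0, 1.0)
```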
srelbo commented 2 years ago

@boopage Live video is already here -- https://arxiv.org/abs/2204.03458 😄

boopage commented 2 years ago

Wow, that's crazy, thanks for sharing https://video-diffusion.github.io/

That's going to slurp up lots of memory I suppose!

Tollanador commented 2 years ago

I think there is a kind of sliding window, or something similar, that is evaluated against the prompt, so a single generated image may have the prompt evaluated several times, once for each window-like chunk.

The strength setting of the text prompt (it goes by different names depending on the repo) comes into play. If it's too weak, those chunks are only loosely tied to the prompt and repetition can set in this way; that can actually be very good for certain art styles, such as making fractals (or the illusion of a fractal, anyway).

So try increasing the strength of the CLIP influence, and also try being a little more specific in your prompting. Perhaps using a larger CLIP model could also help.

Just fiddle and play and see what the settings do.
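
For what it's worth, I believe the knob in question is the unconditional guidance scale (the --scale argument in scripts/txt2img.py). A rough sketch of what it does at each denoising step (simplified classifier-free guidance, not this repo's exact code):

```python
import torch

def guided_eps(eps_uncond: torch.Tensor, eps_cond: torch.Tensor, scale: float) -> torch.Tensor:
    """Classifier-free guidance: blend the unconditional and prompt-conditioned
    noise predictions. scale=1.0 keeps just the conditional estimate; larger
    values push the sample harder toward the prompt at every step."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

Since this is applied at every sampling step, raising --scale ties each region of the canvas more tightly to the prompt, which is why a weak setting leaves room for loosely related repetitions.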

Tollanador commented 2 years ago

Oh, about the 'sliding window' I mentioned: I figured out what I meant by this. The CLIP model used to guide the image generation has a fixed input image size, e.g. 224 (I presume a square image, so 224x224).

When the generation canvas is larger than this CLIP input size, CLIP either needs to be fed a randomly chosen crop of the generated image, resized to 224x224, and run its image-to-text scoring on that; or it picks the centre of the image, crops to 224x224, and analyses that; or perhaps it breaks the image into chunks and scores each chunk separately.

So depending on how that rolls, the generated image will be guided by that and you'll potentially get repetitions.

To reduce the repeating artifacts, choose a CLIP model with a larger input size (which will use more of that oh-so-precious memory...) or reduce the generated image canvas so it is closer to the CLIP input size. The smaller the divergence between the CLIP input size and the canvas size, the less likely you are to get repetitions.

An alternative would be to investigate the code and come up with a method to mitigate the repetition problem while using a 'standard' CLIP input size (224, I think, for most of them). The following screenshot shows where you can find the CLIP input size: it's in the model_configs of the open_clip repo.

[screenshot: CLIP input size in the open_clip model_configs]
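
If you'd rather check those input sizes programmatically than read the screenshot, here is a small sketch (assuming a local checkout of the open_clip repo; the per-model JSON files live under src/open_clip/model_configs/):

```python
import json
from pathlib import Path

# Assumed path to a local clone of https://github.com/mlfoundations/open_clip
config_dir = Path("open_clip/src/open_clip/model_configs")

for cfg_path in sorted(config_dir.glob("*.json")):
    cfg = json.loads(cfg_path.read_text())
    image_size = cfg.get("vision_cfg", {}).get("image_size")
    print(f"{cfg_path.stem}: CLIP input resolution = {image_size}")
```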

edit: The latent diffusion image model has a great deal of potential imagery stored in it; at the moment, it's difficult to leverage it to its full capability due to the limitations of our image-to-text model, the CLIP models themselves. As these improve in efficacy, we will be able to generate better images from text prompts and do better edits of existing images from text prompts. Also keep in mind that the length of the input text is a limiting factor of the CLIP model, not of the latent diffusion model.
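
On that last point, the text-length limit is easy to see from the tokenizer (a sketch assuming the open_clip package is installed; the stock CLIP tokenizer uses a fixed 77-token context, so longer prompts are simply truncated):

```python
import open_clip

# Every prompt is padded or truncated to the tokenizer's fixed context length.
tokens = open_clip.tokenize(["a single photo-realistic red apple on a wooden table"])
print(tokens.shape)  # torch.Size([1, 77]) for the standard CLIP models
```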