Stability-AI / stablediffusion

High-Resolution Image Synthesis with Latent Diffusion Models

[algorithm suggestion] "Foveated" image representation when sampling? #378

Open Lex-DRL opened 1 month ago

Lex-DRL commented 1 month ago

> [!NOTE]
> Forgive me if what I'm suggesting is complete nonsense in the scope of SD. I'm an expert in graphics programming, but I have almost zero knowledge of the ML field.

It feels like I can't be the first one to think about it... but just in case, here it is.

The issue

Any SD model, no matter how big it is, has a fixed "context window". Growing that window has O(N^2) complexity in every aspect: training computations, runtime computations, model size, and VRAM requirements. That's what holds us back from making a model aware of its surroundings in a high-res image.
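To make that scaling concrete, here is a rough back-of-the-envelope sketch with my own illustrative numbers (assuming an 8x VAE downscale and global self-attention over all latent tokens, as in a plain latent-diffusion U-Net):

```python
# Illustrative only: how token count and (worst-case) attention cost grow with
# image side length, assuming an 8x VAE downscale and full self-attention.
for side_px in (512, 1024, 2048, 4096):
    latent_side = side_px // 8              # VAE compresses 8x per side
    tokens = latent_side ** 2               # tokens grow with the image *area*, O(N^2)
    attn_pairs = tokens ** 2                # pairwise attention on top of that
    print(f"{side_px:>4}px -> {tokens:>7} latent tokens, ~{attn_pairs:.1e} attention pairs")
```

Doubling the side length quadruples the token count, and full attention grows with the square of that again, which is why the window can't simply be enlarged.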

My understanding of why it's this way (might be wrong)

When SD generates an image, due to the number of neurons in the model, it "sees" only a "window" of the image at a time (encoded to latent space or whatever, but still only a part of the image), of the same size as the one used in the training dataset. Even if VRAM allows for an image of higher resolution, each single pixel is aware only of an area around it, within this "window radius". I.e., for SDXL, it "sees" only 512px to the left/right/top/bottom of the generated area (1024px in total).

My suggestion in just two pictures

(pictures: "full image" and "context window")

I've also posted the same idea, with more step-by-step pictures, here: https://github.com/ssitu/ComfyUI_UltimateSDUpscale/issues/86. Though it doesn't even touch latent space there ☝🏻.

The idea: nonlinear MIP-like pre-distortion (à la foveated rendering in VR)

In graphics, we have a re-e-eally old technique called mip-mapping to improve bandwidth when working with high-res images. Why not apply it here to encode the surroundings outside the "main model resolution"? I.e., let's give the model extra context around the "core window", with details averaged more and more aggressively the farther we go from that window.

So we don't just pass the image "as is" when encoding it to latent space. Instead, in addition to the image at its original resolution, we also encode its "latent mip levels". Later, when the actual sampling happens, the model can use those lower levels for the area outside the "main window": the farther away we go from the main window, the lower the "mip level" we use. Moreover, we only need those "lower mip-level pixels" (or whatever they're called in latent space) that aren't already covered by the higher-resolution area. That's why I put "mip levels" in quotes here: for every "context window", only a tiny fraction of each mip-map is needed.
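For what it's worth, here is a minimal PyTorch sketch of what the encoding side could look like. Everything in it is hypothetical: `foveated_context`, the default window/ring/level sizes, the plain 2x2 average pooling for the "latent mip levels", and the zero-masking of the already-covered interior are my own illustrative choices, not anything that exists in this repo or in ComfyUI.

```python
import torch
import torch.nn.functional as F

def foveated_context(latent, x0, y0, win=128, ring=8, levels=4):
    """Sketch: a 'core window' plus coarse context rings from latent mip levels.

    latent : (C, H, W) latent-space image
    x0, y0 : top-left corner of the full-resolution core window (latent px)
    win    : side of the core window (latent px)
    ring   : width of the context band kept at each coarser level (that level's px)
    """
    core = latent[:, y0:y0 + win, x0:x0 + win]

    rings = []
    level = latent
    cx, cy, w = x0, y0, win
    for _ in range(levels):
        # One "latent mip level": halve the resolution with a 2x2 average.
        level = F.avg_pool2d(level.unsqueeze(0), kernel_size=2).squeeze(0)
        cx, cy, w = cx // 2, cy // 2, w // 2          # window footprint at this level

        # Keep only a thin band around the window footprint; its interior is
        # already covered by the finer levels, so zero it out here.
        y1, y2 = max(cy - ring, 0), min(cy + w + ring, level.shape[1])
        x1, x2 = max(cx - ring, 0), min(cx + w + ring, level.shape[2])
        band = level[:, y1:y2, x1:x2].clone()
        band[:, cy - y1:cy - y1 + w, cx - x1:cx - x1 + w] = 0
        rings.append(band)

    return core, rings
```

A model would of course have to be trained to consume this layout (e.g., with the rings fed in as extra tokens or conditioning); the sketch only shows that preparing the data itself is cheap and local.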

In other words, we add an extra "boundary" of context around the main window, which contains the rest of the image non-linearly squashed. Yes, it increases the VRAM requirement compared to the current brute-force approach. But it increases it only with O(log N) complexity, which lets the model be aware of an image of effectively infinite size.
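A quick back-of-the-envelope loop (again with made-up numbers: a 128-latent-pixel core window and 8-pixel rings) shows why: each added level doubles how far the model can "see", while the number of extra tokens it costs keeps shrinking.

```python
# Illustrative: coverage grows exponentially per added mip ring, so the number
# of levels needed grows only logarithmically with the image size.
win, ring = 128, 8                        # core window side / ring width (latent px)
reach, extra_tokens = 0, 0
for k in range(1, 7):
    scale = 2 ** k                        # downscale factor of this mip level
    w_k = win // scale                    # window footprint at this level
    reach += ring * scale                 # extra full-res distance this ring covers
    extra_tokens += (w_k + 2 * ring) ** 2 - w_k ** 2   # band area minus covered interior
    print(f"level {k}: aware of a ~{(win + 2 * reach) * 8}px-wide image "
          f"for {extra_tokens} extra tokens (core window: {win * win})")
```

With six coarse levels this sketch covers a roughly 17000px-wide image for about a third of the core window's own token count; whether a model can actually make good use of such heavily averaged context is, of course, the open question.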

Those "extra boundary pixels" don't need to be preserved in latent-space image between sampling stages (for any reason other than caching) and can be generated on the fly from the "normal" latent image, as an internal pre-process within KSampler (in ComfyUI terms). But the model itself, obviously, needs to be trained on data encoded this way.

The compression ratio (when going from each row of extra-border pixels to the next) can be:

P.S.

This idea seems so obvious to me that I'm surprised a resolution restriction is imposed in SD at all. Maybe I'm just missing something very basic, some truly common knowledge in the ML field. But if the model was trained on images encoded this way (if it was familiar with those "special pixels on the border", even expecting them), then in terms of training cost the next SD version (3.5? 4.0?) would effectively be the same as fine-tuning SDXL on a 1088x1088 or 1152x1152 dataset. And yet the dataset itself would be able to: