Closed · williamberman closed this 11 months ago
yeah, these results are unsurprising. I've only made one image with the interposer myself, so can't say what to expect in general.
I'd say your interposed image fared better than mine did (mine lost a lot of dynamic range). you can see the desaturation that occurs in the interposer README:
I myself have experienced similar problems making tiny latent->RGB converters. for some reason it's hard to teach them to preserve saturation well (I struggled to make my small FFNs learn to reproduce deep reds). maybe it needs a different loss, or maybe I just needed something with a wider receptive field than a Linear layer.
https://twitter.com/Birchlabs/status/1640824768415842304
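for concreteness, here's a tiny sketch of the kind of converter I mean (illustrative, not my exact code; the kernel-size choice is the receptive-field point above):

```python
import torch
from torch import nn

# tiny latent->RGB preview head: a per-pixel projection from the 4 SD latent
# channels to 3 RGB channels. kernel_size=1 is the Linear-layer case;
# kernel_size=3 gives the wider receptive field mentioned above.
class LatentToRGB(nn.Module):
    def __init__(self, latent_channels: int = 4, kernel_size: int = 1):
        super().__init__()
        self.proj = nn.Conv2d(latent_channels, 3, kernel_size, padding=kernel_size // 2)

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        return self.proj(latents)

# training regresses against full VAE decodes, e.g.:
#   loss = F.mse_loss(head(latents), vae.decode(latents).sample)
# plain MSE is where the saturation loss (deep reds) showed up; a perceptual
# or hue-aware loss term is one possible fix.
```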
and yes, I think we're seeing that the diffusion decoder reduces the dynamic range of the interposed latents further still. happened on yours, just like it did with mine.
so: I don't assume any mistakes were made. these are the results I'd expect.
@Birch-san
Just to chime in, the interposer was never meant to be 100% accurate, it was mostly meant to be used at high-ish (0.4+) denoise as a stop-gap solution so one model could "composite" the image for another model (in this case, using the better prompt comprehension from SDXL).
> you can see the desaturation that occurs in the interposer README:
That's just the TAESD preview vs. the NovelAI VAE (the latter of which is known to produce dull colors like that). That aside, the color accuracy of the interposer in general is terrible, since there's no visual/perceptual loss during training (I don't have the hardware for it, nor the experience to pull it off).
The dataset isn't very diverse either, since I can't VAE encode/decode the samples on-the-fly. I think it was Flickr2K + DIV2K, with each image cropped into 5 and then flipped, for around 44K images total.
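The expansion was along these lines (paths and crop size here are illustrative, not the exact preprocessing code):

```python
from pathlib import Path
from PIL import Image
import torchvision.transforms as T
from torchvision.transforms import functional as TF

# five-crop (four corners + center), then mirror each crop --
# one reading of "cropped into 5 then flipped".
five_crop = T.FiveCrop(512)  # crop size is a guess

def expand(image: Image.Image) -> list[Image.Image]:
    crops = list(five_crop(image))
    return crops + [TF.hflip(c) for c in crops]

out = Path("data/crops"); out.mkdir(parents=True, exist_ok=True)
for i, path in enumerate(sorted(Path("data/flickr2k_div2k").glob("*.png"))):
    for j, sample in enumerate(expand(Image.open(path).convert("RGB"))):
        sample.save(out / f"{i:05d}_{j}.png")
```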
I remember testing a few different architectures and they performed basically the same (which makes me think the problem is with the training code). I'm happy to hear any suggestions on what to do, since I don't have much experience with ML stuff.
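For instance, the "different loss" idea from above might look something like this (the diffusers-style `vae.decode(...).sample` call, the models, and the 0.1 weighting are all placeholders, not tested code):

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips

# keep the latent-space MSE, but add a perceptual term on decoded images so
# errors that are small in latent space but visible in RGB (dull colors,
# lost saturation) actually get penalized.
perceptual = lpips.LPIPS(net="vgg").eval().requires_grad_(False)

def interposer_loss(interposer, sd_vae, sdxl_latents, target_sd_latents):
    pred = interposer(sdxl_latents)
    latent_mse = F.mse_loss(pred, target_sd_latents)
    # the target decode needs no gradients; the prediction decode does
    # (with the VAE weights frozen). decoding every step is the expensive
    # part -- doing it only every N steps is a cheaper variant.
    with torch.no_grad():
        target_rgb = sd_vae.decode(target_sd_latents).sample
    pred_rgb = sd_vae.decode(pred).sample
    return latent_mse + 0.1 * perceptual(pred_rgb, target_rgb).mean()
```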
@williamberman
As for adding it to diffusers: if there's interest, I can clean up the code to be much more general. I have a version that supports different scaling/channels/etc., allowing it to work with other latent spaces such as the Würstchen one. The quality also isn't great, but I think it's a better base than the current code. I uploaded a snapshot on a separate branch for anyone curious.
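Roughly, the generalized version parameterizes the things that differ between latent spaces, something like this (architecture, names, and the example channel/scale values are illustrative, not the actual branch code):

```python
import torch
from torch import nn
import torch.nn.functional as F

# interposer as a small conv net with configurable channel counts and an
# optional spatial rescale, so targeting a latent space with a different
# channel count or downscale factor is just configuration.
class Interposer(nn.Module):
    def __init__(self, ch_in: int = 4, ch_out: int = 4, scale: float = 1.0, hidden: int = 64):
        super().__init__()
        self.scale = scale
        self.net = nn.Sequential(
            nn.Conv2d(ch_in, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, ch_out, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.scale != 1.0:
            x = F.interpolate(x, scale_factor=self.scale, mode="nearest")
        return self.net(x)

# e.g. sdxl_to_sd15 = Interposer(4, 4)
# (channel counts / scale for other targets like Würstchen would be config values)
```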
Wow, super helpful discussion! No worries, I think we might hold off on adding it for now then, but will keep an eye on it for sure. Just trying to be a little extra careful about what gets added to the core library these days.
If you want to add it though under the community examples folder, happy to merge. And regardless will keep an eye on progress :)
Hello! We saw your tweet https://twitter.com/Birchlabs/status/1721709378691010884 and are looking at whether it makes sense to add the interposer model to diffusers. I'm getting mixed quality results: it looks to me like the comparisons are either inconclusive or the vanilla SDXL decoder gives the best decoding quality. I was wondering if you have any good examples and I just got unlucky, or if maybe I'm using the interposer wrong. Any thoughts appreciated!
Vanilla SDXL -> Interposed + Vanilla SD VAE -> Interposed + Consistency SD VAE
With 20 denoising steps:
With 50 denoising steps:
And here's the script I used:
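(The gist of it, as a minimal sketch rather than the verbatim script: the interposer is stubbed with `Identity` where the real weights would load, and the model IDs are the standard hub ones.)

```python
import torch
from diffusers import AutoencoderKL, ConsistencyDecoderVAE, StableDiffusionXLPipeline
from diffusers.image_processor import VaeImageProcessor

device, dtype = "cuda", torch.float16
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=dtype
).to(device)
sd_vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae", torch_dtype=dtype
).to(device)
consistency_vae = ConsistencyDecoderVAE.from_pretrained(
    "openai/consistency-decoder", torch_dtype=dtype
).to(device)
processor = VaeImageProcessor()

# stand-in: the real SDXL->SD1.5 latent interposer would be loaded here
interposer = torch.nn.Identity().to(device)

# generate SDXL latents once, then decode them three ways
latents = pipe(
    "a photo of an astronaut riding a horse",
    num_inference_steps=20,  # also ran with 50
    output_type="latent",
).images

def decode(vae, lat):
    # note: the fp16 SDXL VAE is prone to NaNs; the fp16-fix VAE can be
    # swapped in if that happens
    with torch.no_grad():
        image = vae.decode(lat.to(vae.dtype) / vae.config.scaling_factor).sample
    return processor.postprocess(image)[0]

decode(pipe.vae, latents).save("vanilla_sdxl.png")
sd15_latents = interposer(latents)
decode(sd_vae, sd15_latents).save("interposed_sd_vae.png")
decode(consistency_vae, sd15_latents).save("interposed_consistency_vae.png")
```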