comfyanonymous / ComfyUI

The most powerful and modular diffusion model GUI, api and backend with a graph/nodes interface.
https://www.comfy.org/
GNU General Public License v3.0

Please make Latent-Interposer a standard node #2793

Open ppbrown opened 8 months ago

ppbrown commented 8 months ago

https://github.com/city96/SD-Latent-Interposer allows the latent output from one sampler to be passed between SD and SDXL without having to transition through pixel space.

This gives a WAY better quality pipeline, yet it's such a low-level tool that it feels like it should be part of the basic toolset already. Please add it in?

Discussion and examples at https://www.reddit.com/r/comfyui/comments/1apwear/more_direct_converter_from_sd_latent_to_sdxl/
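(For readers unfamiliar with the idea: both SD1.5 and SDXL use 4-channel latents at the same spatial scale, and the interposer is a small learned network that maps one latent space to the other. The real architecture and weights are in city96's repo; the toy network below is only an illustrative stand-in for the concept.)

```python
# Illustrative stand-in for a latent interposer; the actual architecture
# is defined in city96/SD-Latent-Interposer. Both SD1.5 and SDXL use
# 4-channel latents at 1/8 image resolution, so a small conv net can be
# trained on paired latents to map between the two spaces.
import torch
import torch.nn as nn

class ToyInterposer(nn.Module):
    def __init__(self, channels: int = 4, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, channels, kernel_size=3, padding=1),
        )

    def forward(self, sd15_latent: torch.Tensor) -> torch.Tensor:
        # In/out: (batch, 4, H/8, W/8) latent tensors.
        return self.net(sd15_latent)

latent = torch.randn(1, 4, 64, 64)       # a 512x512 SD1.5 latent
sdxl_latent = ToyInterposer()(latent)    # same shape, SDXL latent space
```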

ltdrdata commented 8 months ago

The core repository never includes pretrained models and does not provide automatic download functionality. If this were included in the core rather than as a custom node, you might actually experience lower usability because you would have to manually download the model.
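(ltdrdata's point is about where the weights come from: a custom node is free to fetch its model at runtime, which core avoids. A sketch of that pattern with huggingface_hub; the repo id and filename below are illustrative guesses, not necessarily the node's actual values.)

```python
# Sketch of the runtime auto-download pattern a custom node can use but
# ComfyUI core does not. repo_id/filename are assumptions for
# illustration; check city96's repo for the actual hosted weights.
from huggingface_hub import hf_hub_download

weights_path = hf_hub_download(
    repo_id="city96/SD-Latent-Interposer",        # assumed HF repo id
    filename="sd15-to-xl_interposer.safetensors", # hypothetical filename
)
print(weights_path)  # cached under ~/.cache/huggingface by default
```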

ppbrown commented 8 months ago

errr... I'm confused. The node I'm suggesting has nothing to do with automatic download functionality?

blepping commented 8 months ago

@ppbrown

The node I'm suggesting has nothing to do with automatic download functionality?

it runs a model to convert the latents, so the node can't do anything without that model available. that means the model has to be downloaded from somewhere.

also there's a fair amount of misinformation in that reddit thread. the OP seems to be assuming that VAE decode/encode loses information but the latent interposer process doesn't, which isn't true at all. in fact, the interposer probably loses more information/introduces more artifacts (nothing against it, i've actually recommended that people try it). the big advantage it has going for it is that it's fast and doesn't use much memory, but it probably is lower quality than just VAE decode followed by VAE encode.
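(The lossiness claim is easy to check empirically. A minimal sketch with diffusers' AutoencoderKL, assuming the standard SD1.5 MSE-tuned VAE: encode an image, decode it back, and measure the pixel error. A nonzero result means the round trip discards information.)

```python
# Measure VAE round-trip loss: encode an image, decode it back, compare.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

image = torch.rand(1, 3, 512, 512) * 2 - 1   # stand-in image in [-1, 1]
with torch.no_grad():
    latent = vae.encode(image).latent_dist.mean
    recon = vae.decode(latent).sample

mse = torch.mean((image - recon) ** 2).item()
print(f"round-trip MSE: {mse:.5f}")  # nonzero: information was lost
```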

ppbrown commented 8 months ago

OP seems to be assuming that VAE decode/encode loses information

I don't think so. I think the problem is that the VAE decode ADDS information. If you're using an SD gen for composition, and following up with SDXL for style, then you want the SD gen to stay simple. Adding style information via VAE decode is counterproductive.

And more importantly than theory: I've run comparisons.

The latent -> VAE decode -> upscale -> VAE encode pipeline gives results that stick closer to the original SD style than the latent -> upscale_latent pipeline.
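(The two pipelines being compared, sketched with plain torch ops; ComfyUI's VAE Decode/Encode and latent upscale nodes wrap equivalent operations, and the VAE and latent here are stand-ins.)

```python
# Pipeline (a): decode to pixels, upscale there, re-encode.
# Pipeline (b): upscale the latent directly, no decode/encode round trip.
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()
latent = torch.randn(1, 4, 64, 64)  # stand-in SD1.5 latent

with torch.no_grad():
    # (a) latent -> VAE decode -> upscale in pixel space -> VAE encode
    pixels = vae.decode(latent).sample
    pixels_up = F.interpolate(pixels, scale_factor=2.0, mode="bicubic")
    latent_a = vae.encode(pixels_up).latent_dist.mean

    # (b) latent -> upscale directly in latent space
    latent_b = F.interpolate(latent, scale_factor=2.0, mode="bilinear")
```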

blepping commented 8 months ago

I think the problem is that the VAE decode ADDS information.

basically the same thing. or, another way to put it: the VAE encode/decode process causes changes, and the original information is lost in the process.

Adding style information via VAE decode is counterproductive.

there are a bunch of user-tuned SD1.5 VAEs, and some can affect stuff like color temperature quite a bit. you shouldn't see major style changes doing a decode with a neutral VAE. ComfyUI has a VAE loader node, so you can use whatever VAE you want; it doesn't have to be the one built into the model.

The latent -> VAE decode -> upscale -> VAE encode pipeline gives results that stick closer to the original SD style than the latent -> upscale_latent pipeline.

that may be true in some specific cases, especially if you're using (or your model has) a custom VAE that affects the style in a significant way. you bypass that when you use the latent interposer to convert to SDXL so in that case you may actually see a noticeable difference, but it's really not a given.

can't really argue with anecdotes though.

ppbrown commented 8 months ago

you shouldn't see major style changes doing a decode with a neutral VAE.

Fair point. But I don't know of any "neutral VAE"s. Can you suggest one for SD and one for SDXL that doesn't change anything and just converts from latent space to image space?

you shouldn't see major style changes

not "major", sure. But minor styles changes are important too. Especially when you are doing major style changes between the two models in principle (so small changes between inputs can get magnified)

And actually, if you are transitioning between, let's say, an anime back end and a realistic front end, it can make the difference between some background decoration coming through as something completely off the wall vs. coming through as something sane. So sometimes even semi-major changes matter. I just ran into that doing additional testing just now.

You can observe the same effects if you just do an SDXL to SDXL upscale, without this latent translator model. Would you like me to include specific examples in this ticket or something? Preferred method of inclusion?

blepping commented 8 months ago

Can you suggest one for SD and one for SDXL that doesn't change anything

like i mentioned, VAE is a lossy process, so if you take an image, VAE encode it, then VAE decode it, you won't get back the original. some information is lost, just like converting an image to JPG and then back to a non-lossy format. this also applies to the interposer: it's not exact either.

a neutral VAE would be one of the ones stability published initially. i'm pretty sure you basically only have one choice for SDXL; for sd15 there are EMA and MSE variants. an sd15 anime model probably includes its own VAE tuned for anime styles and may not be all that neutral.
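(For reference, those releases are on the Hub and loadable by name with diffusers, as sketched below; in ComfyUI you would instead drop the .safetensors file into models/vae and select it with the Load VAE node.)

```python
# The "neutral" Stability-published VAEs referenced above.
from diffusers import AutoencoderKL

vae_sd15_ema = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema")
vae_sd15_mse = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
vae_sdxl = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae")
```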

But minor style changes are important too.

i don't think you'll have a hard time convincing people (including myself) that more accuracy is better in this case, but there are already good reasons to suspect that the interposer approach wouldn't be more accurate (it's a personal project competing with much bigger models that had a lot more training).

more accuracy also doesn't necessarily mean better subjective results. i love spewing extra noise into my generations: assuming you run enough steps, noise turns into details. so it might be the artifacts/noise the interposer is potentially adding that you like. anyway, no justification is needed to want to use it, accurate or not: "i like the results" is a perfectly fine reason, but that doesn't mean it's better overall or more accurate.

Would you like me to include specific examples in this ticket or something?

i think a more concrete example would be good, but i am just a random anonymous jerk on the internet, so there isn't much of a payoff even if you convince me. also, the main reason it's not included is likely that it requires a model, so even if you could show it's somewhat better overall than VAE decode/encode, that requirement would probably still keep it out of core. so in a way, this part is a moot point.

why is it important for this to be included in the base install when it's easily accessible as a custom node? a lot of the time the developers of something like ComfyUI, which is basically infrastructure, want to keep it pretty lean and let extensions add/maintain the specific features.

ppbrown commented 8 months ago

for the record, I dug up the plain-jane sd1.5 VAE and used it explicitly as the decoder in my comparison process. Other than a slight degradation in speed, it generated the same transitional image byte-for-byte as the model's original VAE.

So the "just use a neutral VAE" approach is not a substitute for direct latent transfer.
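(One way to make the byte-for-byte comparison concrete: quantize both decoded outputs to the 8-bit images that would actually be saved, then compare exactly. A minimal sketch, assuming decoder output in [-1, 1], the diffusers convention; ComfyUI's image tensors are in [0, 1].)

```python
# Check whether two VAE decodes of the same latent are identical
# "byte-for-byte" after quantization to 8-bit pixels.
import torch

def to_uint8(img: torch.Tensor) -> torch.Tensor:
    # Map a [-1, 1] float image to the 8-bit values saved to disk.
    return ((img.clamp(-1, 1) + 1) * 127.5).round().to(torch.uint8)

def identical(img_a: torch.Tensor, img_b: torch.Tensor) -> bool:
    # True only if every 8-bit pixel value matches exactly.
    return torch.equal(to_uint8(img_a), to_uint8(img_b))
```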