comfyanonymous / ComfyUI

The most powerful and modular diffusion model GUI, api and backend with a graph/nodes interface.
https://www.comfy.org/
GNU General Public License v3.0
49.76k stars 5.24k forks

When performing VAE encoding on high-resolution images, there is a high probability that the original image will be changed and artifacts will be generated. #3964

Open leonary opened 2 months ago

leonary commented 2 months ago

Expected Behavior

Since upscaling is an important feature, removing these artifacts matters for preserving the original image content. The artifacts are difficult to remove at low denoising levels, which makes the result poor.

Actual Behavior

Original image: https://files.catbox.moe/uxrle2.png
After VAE encoding: https://files.catbox.moe/4f9mbh.png
Artifacts: [image]

Steps to Reproduce

Workflow: VAE_encode.json

If you encode any high-resolution image with the SD15 series VAE, you should be able to observe this problem. The SDXL series VAE does not have this problem.

There is a post on reddit discussing this issue: https://www.reddit.com/r/StableDiffusion/comments/1agd5pz/a_recent_post_went_viral_claiming_that_the_vae_is/

So this could be a problem with the VAE itself, but I'm curious whether there are any tricks in the code that can avoid this artifact. Or is there a way to fix the VAE in the SD15 series?

Debug Logs

No debug logs

Other

No response

leonary commented 2 months ago

I found a repository that lets you encode with the SDXL series VAE and decode with the SD15 series VAE, but it has problems with artifacts and color shift: https://github.com/city96/SD-Latent-Interposer

shawnington commented 2 months ago

VAE encoding is not lossless.

RandomGitUser321 commented 2 months ago

> Since upscaling is an important feature, it is very important to remove these artifacts to preserve the original image content. These artifacts are difficult to remove at low denoising levels, making the result poor.

That is 100% to be expected. As shawnington stated, a VAE encode<>decode round trip is not lossless. When an image is VAE encoded, it's in a very compressed state.

For example: a 1024x1024 image at 24 bits per pixel is exactly 3 MB uncompressed (that is, before any lossy or lossless file compression). Meanwhile, the VAE-encoded version of that image is a 128x128x4 latent at 16-bit float precision (it might be 32-bit, but I think the 32-bit modes are only for the calculations, I'd have to double check). This means the latent space version of the image is 0.125 MB in size, which is a 24:1 reduction. When you then decode with a VAE, you take that latent, with 1/24th of the data, and reinflate it back to the original size.
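The arithmetic above can be checked with a short sketch. It only counts raw bytes and says nothing about how much perceptual information survives. The function name `latent_compression` and its parameters are made up for illustration, and the 8x spatial downscale, 4 latent channels, and 2 bytes per value are the assumptions taken from the example.

```python
def latent_compression(width, height, latent_channels=4,
                       bytes_per_value=2, downscale=8):
    """Return (image_bytes, latent_bytes, ratio) for an 8-bit RGB image.

    Hypothetical helper: assumes the VAE shrinks each spatial side by
    `downscale` and stores `latent_channels` values per latent pixel.
    """
    image_bytes = width * height * 3  # 24 bits = 3 bytes per pixel
    latent_bytes = ((width // downscale) * (height // downscale)
                    * latent_channels * bytes_per_value)
    return image_bytes, latent_bytes, image_bytes / latent_bytes


MIB = 1024 * 1024
img, lat, ratio = latent_compression(1024, 1024)
print(f"image:  {img / MIB:.3f} MiB")   # image:  3.000 MiB
print(f"latent: {lat / MIB:.3f} MiB")   # latent: 0.125 MiB
print(f"ratio:  {ratio:.0f}:1")         # ratio:  24:1
```

If the latent were stored as 32-bit floats instead (`bytes_per_value=4`), the same call gives 0.25 MiB and 12:1. With 16 latent channels at 16-bit precision, and assuming the same 8x downscale, it gives 0.5 MiB and 6:1.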

So yeah, there will be losses and it won't get everything correct. This was one of the exciting things about the new 16-channel SD3 VAE: its encode<>decode quality.