A small neural network that provides interoperability between the latents generated by different Stable Diffusion models.
I wanted to see if it was possible to pass latents generated by the new SDXL model directly into SDv1.5 models without decoding and re-encoding them using a VAE first.
To install it, simply clone this repo to your `custom_nodes` folder using the following command:

```
git clone https://github.com/city96/SD-Latent-Interposer custom_nodes/SD-Latent-Interposer
```

Alternatively, you can download the `comfy_latent_interposer.py` file to your `ComfyUI/custom_nodes` folder. You may need to install huggingface-hub inside your venv:

```
pip install huggingface-hub
```
If you need the model weights for something else, they are hosted on HF under the same Apache2 license as the rest of the repo. The current files are in the "v4.0" subfolder.
Simply place it where you would normally place a VAE decode followed by a VAE encode. Set the denoise as appropriate to hide any artifacts while keeping the composition. See the image below.
Without the interposer, the two latent spaces are incompatible:
The node pulls the required files from Hugging Face Hub by default. If you have a flaky connection or prefer to use it completely offline, you can create a `models` folder and place the model files there; the custom node will prefer local files over HF when available. The path should be: `ComfyUI/custom_nodes/SD-Latent-Interposer/models`

Alternatively, just clone the entire HF repo to it:

```
git clone https://huggingface.co/city96/SD-Latent-Interposer custom_nodes/SD-Latent-Interposer/models
```
Model names:

| code | name |
|---|---|
| v1 | Stable Diffusion v1.x |
| xl | SDXL |
| v3 | Stable Diffusion 3 |
| ca | Stable Cascade (Stage A/B) |
Available models:

| From | to v1 | to xl | to v3 | to ca |
|---|---|---|---|---|
| v1 | - | v4.0 | v4.0 | No |
| xl | v4.0 | - | v4.0 | No |
| v3 | v4.0 | v4.0 | - | No |
| ca | v4.0 | v4.0 | v4.0 | - |
The training code initializes most training parameters from the provided config file. The dataset should be a single `.bin` file saved with `torch.save` for each latent version. The format should be `[batch, channels, height, width]`, with "batch" being as large as the dataset, i.e. 88000.
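As a sketch of that dataset format (the filename and batch size here are illustrative, not the repo's actual values), each latent version is one stacked tensor saved with `torch.save`:

```python
import torch

# Illustrative batch size; the real file holds the whole dataset, e.g. 88000.
num_samples = 256

# SDv1/SDXL latents have 4 channels; 64x64 corresponds to 512x512 images.
latents = torch.zeros(num_samples, 4, 64, 64)  # [batch, channels, height, width]

# One .bin file per latent version, loadable again with torch.load.
torch.save(latents, "latents_v1.bin")

loaded = torch.load("latents_v1.bin")
print(loaded.shape)
```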
The training code currently initializes two copies of the model, one in the target direction and one in the opposite direction. The losses are defined based on this:

- `p_loss` is the main criterion for the primary model.
- `b_loss` is the main criterion for the secondary one.
- `r_loss` is the output of the primary model passed back through the secondary model and checked against the source latent (basically a round trip through the two models).
- `h_loss` is the same as `r_loss` but for the secondary model.

All models were trained for 50000 steps with either batch size 128 (xl/v1) or 48 (cascade). The training was done locally on an RTX 3080 and a Tesla V100S.
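Under the (assumed) reading that both copies are simple latent-to-latent mappers trained with an elementwise criterion, the loss wiring could be sketched as follows. The modules, the MSE criterion, and the unweighted sum are placeholders for illustration, not the repo's actual architecture or weighting:

```python
import torch
import torch.nn as nn

# Placeholder interposers: 1x1 convs stand in for the real small conv net,
# just to show how the four losses relate the two directions.
primary = nn.Conv2d(4, 4, kernel_size=1)    # source -> target direction
secondary = nn.Conv2d(4, 4, kernel_size=1)  # target -> source direction
criterion = nn.MSELoss()  # assumed criterion for illustration

src = torch.randn(8, 4, 64, 64)  # source latents
tgt = torch.randn(8, 4, 64, 64)  # matching target latents

pred_tgt = primary(src)
pred_src = secondary(tgt)

p_loss = criterion(pred_tgt, tgt)             # primary model vs. target
b_loss = criterion(pred_src, src)             # secondary model vs. source
r_loss = criterion(secondary(pred_tgt), src)  # round trip src -> tgt -> src
h_loss = criterion(primary(pred_src), tgt)    # round trip tgt -> src -> tgt

loss = p_loss + b_loss + r_loss + h_loss  # combined; weighting is a guess
```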