Mikubill / sd-webui-controlnet

WebUI extension for ControlNet
GNU General Public License v3.0

[Question]: Why does the effect of diffusers not match yours, and how can I accelerate IPAdapter? #2958

Open xddun opened 1 week ago

xddun commented 1 week ago

Is there an existing issue for this?

What happened?

Thank you for this project.

The results from the IPAdapter pipeline I implemented myself with diffusers are much worse than the results from calling ControlNet's IPAdapter as you describe. I'm not sure why. Is there some special processing in this extension that I have missed?

Also, I'd like to know how to accelerate IPAdapter, for example with TensorRT or Stable-Fast. Since each face ID is different, can frameworks that require compilation (like TensorRT) still be used for acceleration? Calling ControlNet IPAdapter takes too long, and it seems the ControlNet cache isn't working: inference with diffusers takes only 5 seconds, while ControlNet IPAdapter takes 16 seconds. Do you have any good suggestions for speeding up ControlNet IPAdapter? I'd like the acceleration to apply when calling through the API, because my pipeline is fixed.
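
For reference, a minimal sketch of what such compilation could look like with torch.compile (the model path is a placeholder; since the face embedding is just an input tensor to the cross-attention layers, a compiled UNet should be reusable across different face IDs as long as input shapes stay fixed, and fullgraph=True may need to be dropped if the adapter hooks introduce graph breaks):

import torch
from diffusers import AutoPipelineForText2Image

# Placeholder path; any SDXL checkpoint works the same way here.
pipe = AutoPipelineForText2Image.from_pretrained(
    "/path/to/sdxl-checkpoint", torch_dtype=torch.float16
).to("cuda")

# Compile the UNet once; later calls reuse the compiled graph. Different
# face embeddings are only different tensor values, so they do not trigger
# recompilation -- only changed input shapes do.
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

# The first call pays the compilation warm-up; subsequent calls are faster.
_ = pipe(prompt="warm-up", num_inference_steps=2, width=1024, height=1024)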

Steps to reproduce the problem

diffusers code:

import time

import cv2
import numpy as np
import torch
from diffusers import AutoPipelineForText2Image, DDIMScheduler
from diffusers.utils import load_image
from insightface.app import FaceAnalysis
from insightface.utils import face_align
from transformers import CLIPVisionModelWithProjection

# Image encoder that produces the CLIP image embeddings
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "/ssd/xiedong/stable-fast/CLIP-ViT-H-14-laion2B-s32B-b79K",
    torch_dtype=torch.float16,
)

# DDIM noise scheduler for the SDXL pipeline
noise_scheduler = DDIMScheduler(
    num_train_timesteps=1000,
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
    clip_sample=False,
    set_alpha_to_one=False,
    steps_offset=1,
)

# Load the SDXL pipeline with the custom image encoder and scheduler
# (SDXL pipelines have no safety_checker component, so that kwarg is omitted)
pipeline = AutoPipelineForText2Image.from_pretrained(
    "/ssd/xiedong/stable-fast/portrait_sdxl1.0_finetune-000029",
    torch_dtype=torch.float16,
    image_encoder=image_encoder,
    scheduler=noise_scheduler,
).to("cuda")

tim1 = time.time()  # NOTE: this timer also covers one-time adapter/LoRA loading below

# Load the FaceID Plus v2 IP-Adapter weights (image encoder was loaded separately above)
pipeline.load_ip_adapter("/ssd/xiedong/stable-fast/IP-Adapter/IP-Adapter-FaceID",
                         subfolder=None,
                         weight_name="ip-adapter-faceid-plusv2_sdxl.bin",
                         image_encoder_folder=None)
pipeline.set_ip_adapter_scale(1)

# Load and fuse the companion FaceID LoRA at half strength
pipeline.load_lora_weights("/ssd/xiedong/stable-fast/IP-Adapter/IP-Adapter-FaceID",
                           weight_name="ip-adapter-faceid-plusv2_sdxl_lora.safetensors")
pipeline.fuse_lora(lora_scale=0.5)

image = load_image("./huge.jpg")
num_images = 1
ref_images_embeds = []
ip_adapter_images = []

# Face detection / embedding with insightface
app = FaceAnalysis(root="/ssd/xiedong/stable-fast/insightface", name="buffalo_l",
                   providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])
app.prepare(ctx_id=0, det_size=(640, 640))
# load_image returns an RGB PIL image; insightface expects BGR
image = cv2.cvtColor(np.asarray(image), cv2.COLOR_RGB2BGR)
faces = app.get(image)
# Aligned 224x224 crop feeds the CLIP image encoder (the "Plus" branch)
ip_adapter_images.append(face_align.norm_crop(image, landmark=faces[0].kps, image_size=224))

# The ArcFace identity embedding feeds the FaceID projection
image = torch.from_numpy(faces[0].normed_embedding)
ref_images_embeds.append(image.unsqueeze(0))
ref_images_embeds = torch.stack(ref_images_embeds, dim=0).unsqueeze(0)
# A zero embedding serves as the unconditional input for classifier-free guidance
neg_ref_images_embeds = torch.zeros_like(ref_images_embeds)
id_embeds = torch.cat([neg_ref_images_embeds, ref_images_embeds]).to(dtype=torch.float16, device="cuda")

# CLIP embeddings for the aligned crop (last arg True => CFG pair)
clip_embeds = \
    pipeline.prepare_ip_adapter_image_embeds([ip_adapter_images], None, torch.device("cuda"), num_images, True)[0]

print(f"clip_embeds shape: {clip_embeds.shape}")
print(f"id_embeds shape: {id_embeds.shape}")

# Hand the CLIP embeddings to the FaceID Plus projection layer directly
pipeline.unet.encoder_hid_proj.image_projection_layers[0].clip_embeds = clip_embeds.to(dtype=torch.float16)
pipeline.unet.encoder_hid_proj.image_projection_layers[0].shortcut = True  # True for Plus v2

generator = torch.Generator(device="cpu").manual_seed(42)
images = pipeline(
    prompt="In a snowy mountain range, the young man is dressed in winter attire, facing the camera with a determined gaze. He sports a thick wool coat, knit hat, and gloves to keep warm in the frigid temperatures. His eyes, piercing and resolute, reflect the strength and resolve needed to conquer the elements and the challenging terrain.",
    ip_adapter_image_embeds=[id_embeds],
    negative_prompt="paintings, sketches, worst quality, low quality, normal quality, lowres, blurry, text, logo, monochrome, grayscale, skin spots, acnes, skin blemishes, age spot, strabismus, wrong finger, bad anatomy, bad hands, error, missing fingers, cropped, jpeg artifacts, signature, watermark, username, dark skin, fused girls, fushion, bad feet, ugly, pregnant, vore, duplicate, morbid, mutilated, transexual, hermaphrodite, long neck, mutated hands, poorly drawn face, mutation, deformed, bad proportions, malformed limbs, extra limbs, cloned face, disfigured, gross proportions, missing arms, missing legs, extra arms, extra legs, plump, open mouth, tooth, teeth, nsfw,",
    num_inference_steps=30,
    num_images_per_prompt=1,
    width=1024,
    height=1024,
    generator=generator
).images
tim2 = time.time()
print(tim2 - tim1)
images[0].save("output1.png")
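
Note that tim1 above is taken before load_ip_adapter, fuse_lora, and the FaceAnalysis setup, so the printed time mixes one-time loading with per-image inference. A sketch of timing only the repeated per-request work, assuming the pipeline object above is kept alive between API calls (generate is a hypothetical helper, not a diffusers API):

def generate(prompt, id_embeds):
    # Per-request work only: the 30 UNet denoising steps.
    return pipeline(
        prompt=prompt,
        ip_adapter_image_embeds=[id_embeds],
        num_inference_steps=30,
        width=1024,
        height=1024,
    ).images[0]

t0 = time.time()
generate("a portrait in the snow", id_embeds)
print(f"per-call inference time: {time.time() - t0:.2f}s")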

What should have happened?

As mentioned above.

Commit where the problem happens

webui: controlnet:

What browsers do you use to access the UI?

No response

Command Line Arguments

As mentioned above.

List of enabled extensions

As mentioned above.

Console logs

As mentioned above.

Additional information

No response