TensorRT SDXL demo performs worse than diffusers alone and consumes a lot of memory

Description

The TensorRT demo and the diffusers-only run below use almost the same parameters.

Even with int8, the TensorRT demo does not save memory and is slower than diffusers with DeepCache. Is this expected? How can I reduce memory usage further?

TensorRT supports dynamic shapes, so why is max_batch_size limited to 4?
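For context on the batch-size question: a dynamic-shape engine is built against an optimization profile with minimum, optimal, and maximum input shapes, and memory is typically sized for the maximum. The sketch below only does the shape arithmetic for the UNet latent input. It assumes the usual SDXL layout (4 latent channels, 8x VAE downsampling, batch doubled by classifier-free guidance); the 512 lower bound and the function names are hypothetical, not taken from the demo.

```python
LATENT_CHANNELS = 4  # assumption: SDXL latents have 4 channels
VAE_SCALE = 8        # assumption: the VAE downsamples height and width by 8


def latent_shape(batch, height, width, guidance=True):
    """Shape of the UNet latent input for one denoising step."""
    # Classifier-free guidance runs conditional and unconditional passes together.
    effective_batch = 2 * batch if guidance else batch
    return (effective_batch, LATENT_CHANNELS, height // VAE_SCALE, width // VAE_SCALE)


def latent_profile(opt_batch, opt_hw, max_batch, min_hw, max_hw):
    """Min/opt/max latent shapes for a dynamic-shape build (illustrative only)."""
    if not 1 <= opt_batch <= max_batch:
        raise ValueError("opt_batch must lie between 1 and max_batch")
    return {
        "min": latent_shape(1, min_hw, min_hw),
        "opt": latent_shape(opt_batch, opt_hw, opt_hw),
        "max": latent_shape(max_batch, max_hw, max_hw),
    }


# Values from the command in this issue; min_hw=512 is a made-up lower bound.
profile = latent_profile(opt_batch=4, opt_hw=1024, max_batch=4, min_hw=512, max_hw=1024)
for name, shape in profile.items():
    print(name, shape)
```

Raising the maximum batch size grows the "max" shape linearly, which is one plausible reason a demo would cap it for 1024x1024 images.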

# --int8 is optional; omit it to build without int8 quantization
python3 demo_txt2img_xl.py "An astronaut riding a green horse" \
  --version=xl-1.0 \
  --framework-model-dir /xxx/stable-diffusion-xl-base-1.0 \
  --build-dynamic-shape \
  --timing-cache /xxx/stable-diffusion-xl-base-1.0/timing-cache \
  --engine-dir /xxx/trt_engine \
  --onnx-dir /xxx/onnx \
  --num-warmup-runs 1 \
  --int8 \
  -v \
  --onnx-opset 17 \
  --height 1024 \
  --width 1024 \
  --batch-size 4 \
  --denoising-steps 50

Using diffusers only

import torch
from diffusers import StableDiffusionXLPipeline

# download_config, save_image, and local_dir are my own helpers/variables (not shown here).

def deep_cache(pipe):
    # DeepCache: https://arxiv.org/abs/2312.00858
    from DeepCache import DeepCacheSDHelper
    helper = DeepCacheSDHelper(pipe=pipe)
    helper.set_params(
        cache_interval=3,   # refresh the cached features every 3 steps
        cache_branch_id=0,  # which skip branch of the UNet to cache
    )
    helper.enable()
    normal_optimization(pipe)

def normal_optimization(pipe):
    pipe.enable_xformers_memory_efficient_attention()
    # pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
    pipe.enable_vae_slicing()
    pipe.enable_vae_tiling()
    # Keeps only the active sub-model on the GPU; the others stay on the CPU.
    pipe.enable_model_cpu_offload()

def load_from_single(local_dir):
    pipe = StableDiffusionXLPipeline.from_single_file(
        f'{local_dir}/sd_xl_base_1.0.safetensors',
        config=download_config(local_dir),
        local_files_only=True,
        torch_dtype=torch.float16,
    ).to("cuda")

    # Batch of 5 identical prompts
    prompt = ["An astronaut riding a green horse"] * 5

    # images = tgate_with_dc(pipe, prompt)

    deep_cache(pipe)
    images = pipe(prompt=prompt).images

    save_image(images)

load_from_single(local_dir)
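To compare the memory of the two runs above on equal footing, one option is to sample device-wide usage with nvidia-smi, since PyTorch's allocator statistics would not see memory that TensorRT allocates itself. This is a rough sketch, not part of either demo; the helper names are hypothetical, and the number includes anything else running on the GPU.

```python
import subprocess
import threading

QUERY = ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"]


def parse_used_mib(output):
    """Parse nvidia-smi CSV output: one integer (MiB) per GPU, one per line."""
    return [int(line.strip()) for line in output.splitlines() if line.strip()]


def gpu_used_mib():
    """Current used memory in MiB for every visible GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_used_mib(result.stdout)


def peak_used_mib(fn, read=gpu_used_mib, gpu=0, interval=0.1):
    """Run fn() while polling read() in a background thread; return (result, peak MiB)."""
    peak = read()[gpu]
    stop = threading.Event()

    def poll():
        nonlocal peak
        while not stop.is_set():
            peak = max(peak, read()[gpu])
            stop.wait(interval)

    worker = threading.Thread(target=poll, daemon=True)
    worker.start()
    try:
        result = fn()
    finally:
        stop.set()
        worker.join()
    # One last sample in case fn() finished between polls.
    peak = max(peak, read()[gpu])
    return result, peak
```

With the diffusers script, this could wrap the generation call, for example `images, peak = peak_used_mib(lambda: pipe(prompt=prompt).images)`. Short spikes between polls can be missed, so the result is a lower bound on the true peak.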

Environment

TensorRT Version: 10.1

NVIDIA GPU: A100 40G

NVIDIA Driver Version: 555.42.02

CUDA Version: 12.5

CUDNN Version:

Operating System:

Python Version (if applicable): 3.11

Tensorflow Version (if applicable):

PyTorch Version (if applicable): 2.3

Baremetal or Container (if so, version):

NVIDIA/TensorRT issue #3981
