Disabling `pipe.vae.enable_tiling` leads to `RuntimeError: Calculated padded input size per channel`

System Info

Torch: 2.1.0
CUDA: 12.2
diffusers: 0.32.0.dev0

Information

[X] The official example scripts
[ ] My own modified scripts

Reproduction

Thanks for your contributions and efforts!

I am running inference on a single H100. When I turn off all of the diffusers optimizations:

# pipe.enable_sequential_cpu_offload()
# pipe.vae.enable_tiling()
# pipe.vae.enable_slicing()

or turn off only pipe.vae.enable_tiling(), I get the following error:

Traceback (most recent call last):
  File "/home/tiger/code/run.py", line 16, in <module>
    video = pipe(
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py", line 776, in __call__
    latents, image_latents = self.prepare_latents(
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py", line 381, in prepare_latents
    image_latents = [retrieve_latents(self.vae.encode(img.unsqueeze(0)), generator) for img in image]
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py", line 381, in <listcomp>
    image_latents = [retrieve_latents(self.vae.encode(img.unsqueeze(0)), generator) for img in image]
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 1232, in encode
    h = self._encode(x)
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 1204, in _encode
    x_intermediate, conv_cache = self.encoder(x_intermediate, conv_cache=conv_cache)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 807, in forward
    hidden_states, new_conv_cache[conv_cache_key] = down_block(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 439, in forward
    hidden_states, new_conv_cache[conv_cache_key] = resnet(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 304, in forward
    hidden_states, new_conv_cache["conv1"] = self.conv1(hidden_states, conv_cache=conv_cache.get("conv1"))
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 144, in forward
    output = self.conv(inputs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 62, in forward
    output_chunks.append(super().forward(input_chunk))
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/conv.py", line 610, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/conv.py", line 605, in _conv_forward
    return F.conv3d(
RuntimeError: Calculated padded input size per channel: (1 x 2402 x 2402). Kernel size: (3 x 3 x 3). Kernel size can't be greater than actual input size
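
For reference, here is a minimal sketch of the size check that, as far as I understand it, produces this message: a convolution needs the padded input to be at least as large as the (dilated) kernel in every dimension. The shapes below are hypothetical values I picked because they reproduce the numbers in the traceback (one frame, 2400 x 2400, spatial padding of 1, no temporal padding); they are not read from the model. Since the traceback passes through a chunked forward (`output_chunks.append(super().forward(input_chunk))`), one plausible reading is that a chunk ends up with a single frame, but I have not confirmed that.

```python
def padded_input_size(input_size, padding):
    """Size per channel after symmetric zero padding, per dimension."""
    return tuple(n + 2 * p for n, p in zip(input_size, padding))


def kernel_fits(input_size, kernel_size, padding, dilation=(1, 1, 1)):
    """Return True if the (dilated) kernel fits inside the padded input."""
    padded = padded_input_size(input_size, padding)
    effective = tuple(d * (k - 1) + 1 for k, d in zip(kernel_size, dilation))
    return all(p >= e for p, e in zip(padded, effective))


# Hypothetical shapes (frames, height, width), chosen to match the traceback.
single_frame = (1, 2400, 2400)
padding = (0, 1, 1)   # assumption: no temporal padding, spatial padding of 1
kernel = (3, 3, 3)

print(padded_input_size(single_frame, padding))        # (1, 2402, 2402)
print(kernel_fits(single_frame, kernel, padding))      # False: depth 1 < 3
print(kernel_fits((3, 2400, 2400), kernel, padding))   # True: depth 3 fits
```

Under these assumptions the failing dimension is the temporal one (1 frame against a kernel depth of 3), not the spatial size.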

If I enable only pipe.vae.enable_tiling() and leave pipe.enable_sequential_cpu_offload() and pipe.vae.enable_slicing() turned off, the code runs successfully and takes about 2h35m to generate a video:

# pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()
# pipe.vae.enable_slicing()

The full code is here:

import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image

prompt = "A little girl is riding a bicycle at high speed. Focused, detailed, realistic."
image = load_image(image="image.webp")  # 1024x1024
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    "THUDM/CogVideoX1.5-5B-I2V",
    torch_dtype=torch.bfloat16,
).to("cuda")

# pipe.enable_sequential_cpu_offload()
# pipe.vae.enable_tiling()
# pipe.vae.enable_slicing()

video = pipe(
    prompt=prompt,
    image=image,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

export_to_video(video, "output.mp4", fps=8)

Expected behavior

I would expect inference to run successfully without calling pipe.vae.enable_tiling().

THUDM / CogVideo

Disable `pipe.vae.enable_tiling` leads to `RuntimeError: Calculated padded input size per channel` #561
