or disable pipe.vae.enable_tiling(), there is an error:
Traceback (most recent call last):
  File "/home/tiger/code/run.py", line 16, in <module>
    video = pipe(
  File "/usr/local/lib/python3.9/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py", line 776, in __call__
    latents, image_latents = self.prepare_latents(
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py", line 381, in prepare_latents
    image_latents = [retrieve_latents(self.vae.encode(img.unsqueeze(0)), generator) for img in image]
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py", line 381, in <listcomp>
    image_latents = [retrieve_latents(self.vae.encode(img.unsqueeze(0)), generator) for img in image]
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
    return method(self, *args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 1232, in encode
    h = self._encode(x)
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 1204, in _encode
    x_intermediate, conv_cache = self.encoder(x_intermediate, conv_cache=conv_cache)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 807, in forward
    hidden_states, new_conv_cache[conv_cache_key] = down_block(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 439, in forward
    hidden_states, new_conv_cache[conv_cache_key] = resnet(
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 304, in forward
    hidden_states, new_conv_cache["conv1"] = self.conv1(hidden_states, conv_cache=conv_cache.get("conv1"))
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 144, in forward
    output = self.conv(inputs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/tiger/.local/lib/python3.9/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 62, in forward
    output_chunks.append(super().forward(input_chunk))
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/conv.py", line 610, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/conv.py", line 605, in _conv_forward
    return F.conv3d(
RuntimeError: Calculated padded input size per channel: (1 x 2402 x 2402). Kernel size: (3 x 3 x 3). Kernel size can't be greater than actual input size
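To make the failing check concrete: the message reports a padded input of depth 1 against a kernel of depth 3. The sketch below mirrors that size comparison in plain Python; it is not PyTorch's implementation. The 2400 x 2400 spatial size and the (0, 1, 1) padding are assumptions inferred from the reported 2402 x 2402, and `padded_size` and `kernel_fits` are hypothetical helper names.

```python
def padded_size(input_size, padding):
    """Size of each dimension after symmetric padding on both sides."""
    return tuple(n + 2 * p for n, p in zip(input_size, padding))


def kernel_fits(input_size, padding, kernel_size, dilation=(1, 1, 1)):
    """Return True when the effective kernel fits inside the padded input."""
    padded = padded_size(input_size, padding)
    # A dilated kernel of size k spans d * (k - 1) + 1 input positions.
    effective = tuple(d * (k - 1) + 1 for k, d in zip(kernel_size, dilation))
    return all(n >= k for n, k in zip(padded, effective))


# Assumed shapes (depth, height, width); padding only in height and width.
one_frame = (1, 2400, 2400)
three_frames = (3, 2400, 2400)

print(padded_size(one_frame, (0, 1, 1)))               # (1, 2402, 2402)
print(kernel_fits(one_frame, (0, 1, 1), (3, 3, 3)))    # False: depth 1 < 3
print(kernel_fits(three_frames, (0, 1, 1), (3, 3, 3))) # True
```

The last diffusers frame of the traceback is inside the loop over `input_chunk`, so the one-frame input may be a chunk produced when the convolution input is split. That reading is a guess from the traceback, not something I have confirmed.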
If I turn off cpu_offload() or pipe.vae.enable_slicing(), the code runs successfully, taking about 2h35m to generate a video.
System Info
Torch: 2.1.0, CUDA: 12.2, diffusers: 0.32.0.dev0
Information
Reproduction
Thanks for your contributions and efforts!
I am using a single H100 to run inference. When I turn off all the diffusers optimizations, or disable pipe.vae.enable_tiling(), there is an error (the traceback above). If I turn off cpu_offload() or pipe.vae.enable_slicing(), the code runs successfully, taking about 2h35m to generate a video.
The full code is here:
Expected behavior
Hopefully we can disable pipe.vae.enable_tiling() and still run successfully.
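To narrow down which switch triggers the failure, the three options mentioned in this report can be toggled independently. The helper below is a minimal sketch, not the code from my script: `apply_memory_options` is a hypothetical name, and it assumes that "cpu_offload()" corresponds to `enable_model_cpu_offload()` (it could equally be `enable_sequential_cpu_offload()`). It expects an already loaded pipeline object.

```python
def apply_memory_options(pipe, cpu_offload=True, vae_slicing=True, vae_tiling=True):
    """Enable only the requested memory optimizations on a loaded pipeline.

    Returns the names of the options that were enabled, for logging.
    """
    enabled = []
    if cpu_offload:
        # Assumption: model-level offload; swap in
        # enable_sequential_cpu_offload() if that is what the script uses.
        pipe.enable_model_cpu_offload()
        enabled.append("cpu_offload")
    if vae_slicing:
        pipe.vae.enable_slicing()
        enabled.append("vae_slicing")
    if vae_tiling:
        pipe.vae.enable_tiling()
        enabled.append("vae_tiling")
    return enabled
```

For example, `apply_memory_options(pipe, vae_tiling=False)` corresponds to the failing configuration described above, with offload and slicing kept and tiling disabled. Without offload, the caller is responsible for moving the pipeline to the GPU.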