chengzeyi / stable-fast

Best inference performance optimization framework for HuggingFace Diffusers on NVIDIA GPUs.

RuntimeError: no valid convolution algorithms available in CuDNN #77

Closed. HoiM closed this issue 9 months ago.

HoiM commented 9 months ago

The following error occurred when I called compile(pipe):


Traceback (most recent call last):
  File "svd_sf.py", line 49, in <module>
    frames = pipe(image, decode_chunk_size=7, num_frames=20).frames[0]
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/diffusers/pipelines/stable_video_diffusion/pipeline_stable_video_diffusion.py", line 499, in __call__
    noise_pred = self.unet(
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/sfast/cuda/graphs.py", line 40, in dynamic_graphed_callable
    cached_callable = simple_make_graphed_callable(
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/sfast/cuda/graphs.py", line 61, in simple_make_graphed_callable
    return make_graphed_callable(func,
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/sfast/cuda/graphs.py", line 90, in make_graphed_callable
    func(*tree_copy(example_inputs, detach=True),
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/sfast/jit/trace_helper.py", line 64, in wrapper
    return traced_module(*args, **kwargs)
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/sfast/jit/trace_helper.py", line 133, in forward
    outputs = self.module(*self.convert_inputs(args, kwargs))
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/path/to/my_dir/envs/torch2.1.0/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
graph(%1, %2, %3, %4, %5, %6, %7, %8, %9, %10, %11, %12, %13, %14, %15):
    %x = sfast::cudnn_convolution_bias_add(%1, %2, %3, %14, %15, %4, %5, %6, %7, %8, %9)
         ~~~~~ <--- HERE
    return (%x)
RuntimeError: no valid convolution algorithms available in CuDNN


I have cuDNN installed on my server; both torch.backends.cudnn.is_available() and torch.backends.cudnn.enabled return True.
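The check was roughly this (the version() call is just an extra detail worth reporting):

```python
import torch

# Sanity check that PyTorch can see cuDNN at all.
print(torch.backends.cudnn.is_available())  # True
print(torch.backends.cudnn.enabled)         # True
print(torch.backends.cudnn.version())       # cuDNN version number, e.g. 8700
```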


Update: I successfully ran your example from the README.md. Currently I'm trying to accelerate Stable Video Diffusion (SVD), which involves very large matmuls. Could that be the reason?
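For context, svd_sf.py is essentially the standard diffusers SVD example with stable-fast's compile added. The checkpoint name, input image, and import path below are illustrative stand-ins (the import path in particular differs between stable-fast versions), not the exact script:

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image
# Older stable-fast releases expose this module as
# sfast.compilers.stable_diffusion_pipeline_compiler instead.
from sfast.compilers.diffusion_pipeline_compiler import compile, CompilationConfig

# Illustrative checkpoint and input image; the real script may differ.
pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid",
    torch_dtype=torch.float16,
).to("cuda")

pipe = compile(pipe, CompilationConfig.Default())

image = load_image("input.png")
# This is the call (svd_sf.py, line 49) that fails with the CuDNN error above.
frames = pipe(image, decode_chunk_size=7, num_frames=20).frames[0]
```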

chengzeyi commented 9 months ago

@HoiM What's your GPU model? This exception sometimes occurs because of insufficient GPU VRAM.

HoiM commented 9 months ago

@chengzeyi I am using a V100 32G. I failed to convert the model with TensorRT due to insufficient GPU global memory; that's why I turned to this framework. By the way, the SVD model is indeed very large.

chengzeyi commented 9 months ago

> @chengzeyi I am using a V100 32G. I failed to convert the model with TensorRT due to insufficient GPU global memory; that's why I turned to this framework. By the way, the SVD model is indeed very large.

You can try tweaking the config; see the sketch below:

  1. First, try setting enable_cuda_graph = False.
  2. If that doesn't help, also try setting enable_cnn_optimization = False.
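In code, that looks roughly like this (the import path depends on your stable-fast version, and pipe is the already-loaded SVD pipeline):

```python
from sfast.compilers.diffusion_pipeline_compiler import compile, CompilationConfig

config = CompilationConfig.Default()
config.enable_cuda_graph = False        # first suggestion: disable CUDA graph capture
config.enable_cnn_optimization = False  # second suggestion: disable the CNN optimization pass
pipe = compile(pipe, config)            # pipe is the loaded StableVideoDiffusionPipeline
```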

HoiM commented 9 months ago

@chengzeyi Setting enable_cnn_optimization = False worked for me, and it still delivered an acceleration. Thank you for your excellent work!

Two more questions:

  1. How can it support dynamic input shapes? It seems the model gets compiled again when the input shape changes.
  2. Any other suggestions, especially on configs, for deploying on an A30 24G?
chengzeyi commented 9 months ago

@HoiM If you enable CUDA graph, the answer is yes: it recompiles for each new input shape, but the recompilation should be very fast. If you don't enable it, there will be fewer recompilations.
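If you know the set of shapes in advance, a warm-up pass along these lines (the shape values are only examples) makes those recompilations happen before real requests arrive:

```python
# Hypothetical warm-up: run the compiled pipeline once per (resolution, num_frames)
# combination you plan to serve, so any per-shape recompilation or CUDA graph
# capture happens up front rather than on the first real request.
warmup_shapes = [
    {"height": 576, "width": 1024, "num_frames": 14},
    {"height": 576, "width": 1024, "num_frames": 25},
]
for shape in warmup_shapes:
    pipe(image, decode_chunk_size=7, **shape)
```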

Also, if you want more support, can you share your whole script with us? I'd like to add an SVD example to this project, as many other people are waiting for it. We could use the script you provide to test and benchmark more specifically.

HoiM commented 9 months ago

@chengzeyi Thank you for your help!

The link to SVD is here. Changes in input shape result from the number of frames in the generated video and the video resolution.

chengzeyi commented 9 months ago

> @chengzeyi Thank you for your help!
>
> The link to SVD is here. Changes in input shape result from the number of frames in the generated video and the video resolution.

Now SVD is officially supported.