NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

error occurs when running stable diffusion demo on V100 16G #2826

Open HeKun-NVIDIA opened 1 year ago

HeKun-NVIDIA commented 1 year ago

Description

Hi, I tried running the Stable Diffusion demo on a V100 16G. The following error occurs:

python3 demo_txt2img.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN -v

Loading TensorRT engine: engine/vae.plan
[I] Loading bytes from engine/vae.plan
[E] 1: [defaultAllocator.cpp::allocate::21] Error Code 1: Cuda Runtime (out of memory)
[W] Requested amount of GPU memory (20401098752 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[E] 2: [executionContext.cpp::ExecutionContext::436] Error Code 2: OutOfMemory (no further information)
Traceback (most recent call last):
  File "demo_txt2img.py", line 83, in
    demo.loadResources(image_height, image_width, batch_size, args.seed)
  File "/workspace/mydata/tensorrt-sd/TensorRT/demo/Diffusion/stable_diffusion_pipeline.py", line 151, in loadResources
    self.engine[model_name].allocate_buffers(shape_dict=obj.get_shape_dict(batch_size, image_height, image_width), device=self.device)
  File "/workspace/mydata/tensorrt-sd/TensorRT/demo/Diffusion/utilities.py", line 234, in allocate_buffers
    self.context.set_binding_shape(idx, shape)
AttributeError: 'NoneType' object has no attribute 'set_binding_shape'
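The final AttributeError is a secondary symptom: when TensorRT fails to allocate the engine's activation memory, creating the execution context yields None, and the demo's wrapper then calls set_binding_shape on it. A minimal sketch of a defensive guard, where the Engine class is a simplified stand-in for the demo's utilities.Engine (not the actual implementation):

```python
# Simplified stand-in for the demo's utilities.Engine wrapper; the real
# class lives in demo/Diffusion/utilities.py.
class Engine:
    def __init__(self, engine_path):
        self.engine_path = engine_path
        self.context = None

    def activate(self, trt_engine):
        # create_execution_context() returns None when TensorRT cannot
        # allocate the engine's activation memory (the OOM logged above).
        self.context = trt_engine.create_execution_context()
        if self.context is None:
            raise RuntimeError(
                f"could not create execution context for {self.engine_path}; "
                "the GPU is likely out of memory")
```

Failing fast here would surface the OOM as the root cause instead of the later, confusing 'NoneType' error.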

Then I added the --build-static-batch flag at the end of the command. The program worked fine, but I got an all-black image.

python3 demo_txt2img.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN -v --build-static-batch

Does the SD demo not work on V100 now? Could you give me some suggestions?

Thank you!!

Environment

TensorRT Version: 8.6.0
NVIDIA GPU: V100 16G
NVIDIA Driver Version: 530.30
CUDA Version: 12.1
CUDNN Version: 8
Operating System: Ubuntu 18.04
Python Version (if applicable):
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version): Container

Relevant Files

[I] Saving engine to engine/vae.plan
Loading TensorRT engine: engine/clip.plan
[I] Loading bytes from engine/clip.plan
Loading TensorRT engine: engine/unet.plan
[I] Loading bytes from engine/unet.plan
[E] 1: [defaultAllocator.cpp::allocate::21] Error Code 1: Cuda Runtime (out of memory)
[W] Requested amount of GPU memory (13020918272 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[E] 2: [executionContext.cpp::ExecutionContext::436] Error Code 2: OutOfMemory (no further information)
Loading TensorRT engine: engine/vae.plan
[I] Loading bytes from engine/vae.plan
[E] 1: [defaultAllocator.cpp::allocate::21] Error Code 1: Cuda Runtime (out of memory)
[W] Requested amount of GPU memory (20401098752 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[E] 2: [executionContext.cpp::ExecutionContext::436] Error Code 2: OutOfMemory (no further information)
Traceback (most recent call last):
  File "demo_txt2img.py", line 83, in
    demo.loadResources(image_height, image_width, batch_size, args.seed)
  File "/workspace/mydata/tensorrt-sd/TensorRT/demo/Diffusion/stable_diffusion_pipeline.py", line 151, in loadResources
    self.engine[model_name].allocate_buffers(shape_dict=obj.get_shape_dict(batch_size, image_height, image_width), device=self.device)
  File "/workspace/mydata/tensorrt-sd/TensorRT/demo/Diffusion/utilities.py", line 234, in allocate_buffers
    self.context.set_binding_shape(idx, shape)
AttributeError: 'NoneType' object has no attribute 'set_binding_shape'

Vozf commented 1 year ago

It seems the 8.6 branch is barely usable. You can try the release/8.5 branch; I've got it working with a T4 GPU. This may be related: https://github.com/NVIDIA/TensorRT/issues/2784

zerollzeng commented 1 year ago

V100 is a Volta GPU, and I think we support Turing+.

dongjinxin123 commented 1 year ago

I use a T4 GPU and I have the same problem:

[I] Initializing StableDiffusion txt2img demo using TensorRT
512 512
get_path(version, inpaint=inpaint) runwayml/stable-diffusion-v1-5
Loading TensorRT engine: engine/clip.plan
jxdong engine load
[I] Loading bytes from engine/clip.plan
Loading TensorRT engine: engine/unet.plan
jxdong engine load
[I] Loading bytes from engine/unet.plan
Loading TensorRT engine: engine/vae.plan
jxdong engine load
[I] Loading bytes from engine/vae.plan
[E] 1: [defaultAllocator.cpp::allocate::21] Error Code 1: Cuda Runtime (out of memory)
[W] Requested amount of GPU memory (20401098752 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[E] 2: [executionContext.cpp::ExecutionContext::436] Error Code 2: OutOfMemory (no further information)
Allocate buffers model name: clip
Allocate buffers model name: unet
Allocate buffers model name: vae
Traceback (most recent call last):
  File "demo_txt2img.py", line 85, in
    demo.loadResources(image_height, image_width, batch_size, args.seed)
  File "/workspace/stable_diffusion_pipeline.py", line 152, in loadResources
    self.engine[model_name].allocate_buffers(shape_dict=obj.get_shape_dict(batch_size, image_height, image_width), device=self.device)
  File "/workspace/utilities.py", line 236, in allocate_buffers
    self.context.set_binding_shape(idx, shape)
AttributeError: 'NoneType' object has no attribute 'set_binding_shape'

dongjinxin123 commented 1 year ago

I just use the Docker image nvcr.io/nvidia/pytorch:23.02-py3.

TensorRT Version: 8.6.0
NVIDIA GPU: T4 16G
NVIDIA Driver Version: 450
CUDA Version: 12.1
CUDNN Version: 8
Operating System: Ubuntu 18.04
Python Version (if applicable):
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version): nvcr.io/nvidia/pytorch:23.02-py3

The vae.plan file is small, only about 300 MB on disk, but loading it requires 20 GB of GPU memory.
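The plan file mostly stores weights; the 20 GB request is activation (scratch) memory sized for the engine's maximum optimization profile. A rough back-of-the-envelope check, assuming the scratch requirement scales roughly linearly with the demo's default max_batch_size of 16 (my assumption, not a TensorRT guarantee):

```python
requested_bytes = 20_401_098_752   # from the [W] log line above
default_max_batch = 16             # the demo's default max_batch_size
target_max_batch = 4

# Naive linear estimate of the scratch memory needed at a smaller max batch
estimate = requested_bytes * target_max_batch / default_max_batch
print(f"~{estimate / 2**30:.1f} GiB")  # well under the 16 GB on a T4/V100
```

This is only an order-of-magnitude argument; the ~8 GB observed in a later comment suggests the scaling is not purely linear, but either way a smaller max batch brings the allocation within a 16 GB card.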

zhangvia commented 1 year ago

V100 is a Volta GPU, and I think we support Turing+.

I met this error too, but demo_txt2img.py runs successfully for me; it is demo_img2img.py that fails, and I don't know why. I ran both scripts on an A100.

zerollzeng commented 1 year ago

cc @nvpohanh

nvpohanh commented 1 year ago

Please document the repro steps so that we can reproduce and investigate the issue. Thanks

BoyuanJiang commented 1 year ago

Same error on a T4.

tianyu-sz commented 1 year ago

Same error on a T4 16G, following the README steps. From the error message, it seems the engine does not initialize correctly due to OOM.

tianyu-sz commented 1 year ago

An update: I got it working by hardcoding max_batch_size = 4 in demo_txt2img.py at line 55. The default is 16 if you use the default command, and no command-line argument is provided to set it for now.

So my suggestion is to set max_batch_size to 4 if you use a T4 16G or V100 16G. It takes about 8 GB of GPU memory during inference, not the 20 GB that causes the OOM.
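Until the demo exposes this setting, the hardcoded value could be surfaced as a CLI flag. A minimal sketch; note that the --max-batch-size flag is my own suggestion, not an existing demo option:

```python
import argparse

def parse_args(argv=None):
    # Hypothetical flag: demo_txt2img.py currently hardcodes max_batch_size.
    parser = argparse.ArgumentParser()
    parser.add_argument("prompt", nargs="*", help="text prompt")
    parser.add_argument("--max-batch-size", type=int, default=16,
                        help="lower to 4 on 16 GB GPUs (T4/V100) to avoid OOM")
    return parser.parse_args(argv)

args = parse_args(["a beautiful photograph", "--max-batch-size", "4"])
print(args.max_batch_size)
```

Building the engines with a smaller max batch size shrinks the maximum optimization profile, and with it the activation memory TensorRT tries to allocate at load time.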