NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

8.6.0 diffusion demo txt2img not working #2784

Open Vozf opened 1 year ago

Vozf commented 1 year ago

Description

I am following the instructions, and on the step python3 demo_txt2img.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN -v I encounter the following error:

[E] ModelImporter.cpp:726: While parsing node number 7 [LayerNormalization -> "/text_model/encoder/layers.0/layer_norm1/LayerNormalization_output_0"]:                    
[E] ModelImporter.cpp:727: --- Begin node ---                                                                                                                             
[E] ModelImporter.cpp:728: input: "/text_model/embeddings/Add_output_0"                                                                                                   
    input: "text_model.encoder.layers.0.layer_norm1.weight"                                                                                                               
    input: "text_model.encoder.layers.0.layer_norm1.bias"                                                                                                                 
    output: "/text_model/encoder/layers.0/layer_norm1/LayerNormalization_output_0"                                                                                        
    name: "/text_model/encoder/layers.0/layer_norm1/LayerNormalization"                                                                                                   
    op_type: "LayerNormalization"                                                                                                                                         
    attribute {                                                                                                                                                           
      name: "axis"                                                                                                                                                        
      i: -1                                                                                                                                                               
      type: INT                                                                                                                                                           
    }                                                                                                                                                                     
    attribute {                                                                                                                                                           
      name: "epsilon"                                                                                                                                                     
      f: 1e-05                                                                                                                                                            
      type: FLOAT                                                                                                                                                         
    }                                                                                                                                                                     
[E] ModelImporter.cpp:729: --- End node ---                                                                                                                               
[E] ModelImporter.cpp:732: ERROR: builtin_op_importers.cpp:5428 In function importFallbackPluginImporter:                                                                 
    [8] Assertion failed: creator && "Plugin not found, are the plugin name, version, and namespace correct?"                                                             
[E] In node 7 (importFallbackPluginImporter): UNSUPPORTED_NODE: Assertion failed: creator && "Plugin not found, are the plugin name, version, and namespace correct?"     
[!] Could not parse ONNX correctly

Environment

TensorRT Version:
NVIDIA GPU:
NVIDIA Driver Version:
CUDA Version:
CUDNN Version:
Operating System:
Python Version (if applicable):
Tensorflow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if so, version):

Relevant Files

Steps To Reproduce

rajeevsrao commented 1 year ago

@Vozf did you also upgrade to TensorRT 8.6.0? python3 -c 'import tensorrt as trt;print(trt.__version__)' should give you 8.6.0.
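
For reference, a minimal sketch of that check in Python, assuming the TensorRT wheel is importable. The parse failure above is expected on 8.5.x, since the ONNX parser only gained native LayerNormalization support in 8.6 and falls back to the plugin importer on older versions:

# Hedged sketch: verify the Python-visible TensorRT version before running the demo.
# An 8.5.x wheel cannot parse the LayerNormalization op and fails as shown above.
import tensorrt as trt

print(trt.__version__)
assert trt.__version__.startswith("8.6"), "upgrade TensorRT to 8.6.0 for this demo"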

Vozf commented 1 year ago

Yeah, you were right; I had 8.5.3. It seems this "Optional" step in the instructions isn't so optional. Unfortunately, I've upgraded and now get the following error:

[I] Saving engine to engine/clip.plan
Building TensorRT engine for onnx/unet.opt.onnx: engine/unet.plan                                                                                                         
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 1733934759
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:604] Reading dangerously large protocol message.  If the message turns out to be larger than 2147483647 bytes, parsing will be halted for security reasons.  To increase the limit (or to disable these warnings), see CodedInputStream::SetTotalBytesLimit() in google/protobuf/io/coded_stream.h.
[libprotobuf WARNING google/protobuf/io/coded_stream.cc:81] The total number of bytes read was 1733934759
[W] onnx2trt_utils.cpp:400: One or more weights outside the range of INT32 was clamped                                                                                    
[I]     Configuring with profiles: [Profile().add('sample', min=(2, 4, 64, 64), opt=(2, 4, 64, 64), max=(32, 4, 64, 64)).add('encoder_hidden_states', min=(2, 77, 1024), opt=(2, 77, 1024), max=(32, 77, 1024)).add('timestep', min=[1], opt=[1], max=[1])]
[I] Building engine with configuration:                                                                                                                                   
    Flags                  | [FP16]                                                                                                                                       
    Engine Capability      | EngineCapability.DEFAULT
    Memory Pools           | [WORKSPACE: 6934.31 MiB, TACTIC_DRAM: 11170.44 MiB]
    Tactic Sources         | []
    Profiling Verbosity    | ProfilingVerbosity.DETAILED
    Preview Features       | [DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]                                                                                              
[E] 10: Could not find any implementation for node {ForeignNode[/down_blocks.0/attentions.0/norm/Constant_1_output_0 + (Unnamed Layer* 1216) [Shuffle].../down_blocks.0/resnets.1/conv1/Cast]}.
[E] 10: [optimizer.cpp::computeCosts::3873] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[/down_blocks.0/attentions.0/norm/Constant_1_output_0 + (Unnamed Layer* 1216) [Shuffle].../down_blocks.0/resnets.1/conv1/Cast]}.)
[!] Invalid Engine. Please ensure the engine was built correctly                                                                                                          
Traceback (most recent call last):                                                                                                                                        
  File "demo_txt2img.py", line 76, in <module>                                                                                                                            
    demo.loadEngines(args.engine_dir, args.onnx_dir, args.onnx_opset,                                                                                                     
  File "/workspace/projects/pixomatic/TensorRT/demo/Diffusion/stable_diffusion_pipeline.py", line 290, in loadEngines                                                     
    engine.build(onnx_opt_path,                                                                                                                                           
  File "/workspace/projects/pixomatic/TensorRT/demo/Diffusion/utilities.py", line 206, in build                                                                           
    engine = engine_from_network(                                                                                                                                         
  File "<string>", line 3, in engine_from_network                                                                                                                         
  File "/usr/local/lib/python3.8/dist-packages/polygraphy/backend/base/loader.py", line 42, in __call__                                                                   
    return self.call_impl(*args, **kwargs)                                                                                                                                
  File "/usr/local/lib/python3.8/dist-packages/polygraphy/backend/trt/loader.py", line 530, in call_impl                                                                  
    return engine_from_bytes(super().call_impl)                                                                                                                           
  File "<string>", line 3, in engine_from_bytes                                                                                                                           
  File "/usr/local/lib/python3.8/dist-packages/polygraphy/backend/base/loader.py", line 42, in __call__                                                                   
    return self.call_impl(*args, **kwargs)                                                                                                                                
  File "/usr/local/lib/python3.8/dist-packages/polygraphy/backend/trt/loader.py", line 554, in call_impl                                                                  
    buffer, owns_buffer = util.invoke_if_callable(self._serialized_engine)                                                                                                
  File "/usr/local/lib/python3.8/dist-packages/polygraphy/util/util.py", line 661, in invoke_if_callable                                                                  
    ret = func(*args, **kwargs)                                                                                                                                           
  File "/usr/local/lib/python3.8/dist-packages/polygraphy/backend/trt/loader.py", line 488, in call_impl                                                                  
    G_LOGGER.critical("Invalid Engine. Please ensure the engine was built correctly")                                                                                     
  File "/usr/local/lib/python3.8/dist-packages/polygraphy/logger/logger.py", line 597, in critical                                                                        
    raise PolygraphyException(message) from None                                                                                                                          
polygraphy.exception.exception.PolygraphyException: Invalid Engine. Please ensure the engine was built correctly                                                          
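
To isolate the failing step, here is a minimal Polygraphy sketch that rebuilds only the UNet engine outside the demo pipeline; paths mirror the log above, and fp16 matches the configuration shown:

# Hedged sketch: rebuild just onnx/unet.opt.onnx as a TensorRT engine via
# Polygraphy's Python API, to reproduce the optimizer failure in isolation.
from polygraphy.backend.trt import (
    CreateConfig,
    engine_from_network,
    network_from_onnx_path,
    save_engine,
)

engine = engine_from_network(
    network_from_onnx_path("onnx/unet.opt.onnx"),
    config=CreateConfig(fp16=True),
)
save_engine(engine, "engine/unet.plan")
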
aredden commented 1 year ago

I had a similar error-

[I]     Configuring with profiles: [Profile().add('sample', min=(2, 4, 32, 32), opt=(2, 4, 80, 64), max=(2, 4, 192, 192)).add('encoder_hidden_states', min=(2, 77, 768), opt=(2, 77, 768), max=(2, 77, 768)).add('timestep', min=[1], opt=[1], max=[1])]
[I] Building engine with configuration:
    Flags                  | [FP16, REFIT]
    Engine Capability      | EngineCapability.DEFAULT
    Memory Pools           | [WORKSPACE: 19704.19 MiB, TACTIC_DRAM: 24217.31 MiB]
    Tactic Sources         | [CUBLAS, CUBLAS_LT, CUDNN, EDGE_MASK_CONVOLUTIONS, JIT_CONVOLUTIONS]
    Profiling Verbosity    | ProfilingVerbosity.DETAILED
    Preview Features       | [FASTER_DYNAMIC_SHAPES_0805, DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]
[E] 10: Could not find any implementation for node {ForeignNode[onnx::LayerNormalization_9032 + (Unnamed Layer* 1211) [Shuffle].../down_blocks.0/attentions.0/Reshape_1 + /down_blocks.0/attentions.0/Transpose_1]}.
[E] 10: [optimizer.cpp::computeCosts::3873] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[onnx::LayerNormalization_9032 + (Unnamed Layer* 1211) [Shuffle].../down_blocks.0/attentions.0/Reshape_1 + /down_blocks.0/attentions.0/Transpose_1]}.)
[!] Invalid Engine. Please ensure the engine was built correctly

I also noticed that it seems like the flash attention plugins were removed? Also, with this version, since no extra plugins are being added, instead of the UNet getting down to "UNet: final .. 1082 nodes, 2037 tensors, 3 inputs, 1 outputs" it gets to "4016 nodes, 6732 tensors, 3 inputs, 1 outputs". Is this a result of trying to get it functional for the other versions of Stable Diffusion? Sacrificing performance for flexibility?

rajeevsrao commented 1 year ago

@aredden @Vozf please share the python commands you used.

@aredden It looks like in your case REFIT is enabled?

Is this a result of trying to get it functional for the other versions of stable diffusion? Sacrificing performance for flexibility?

The increase in nodes is expected if we don't use plugins; however, they will be fused back into fMHA ops by the TensorRT optimizer. Plugins, as you note, are also not very flexible and support fewer SD versions and GPU targets than the TensorRT out-of-the-box solution.
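
One way to check whether that fusion actually happened is to dump the layer information of a built engine. A sketch using the TensorRT engine inspector API, assuming an existing engine/unet.plan built with detailed profiling verbosity (as the logs above show):

# Hedged sketch: inspect a built engine's layers to see whether the expanded
# attention subgraphs were fused back into larger kernels by the optimizer.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)
with open("engine/unet.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

inspector = engine.create_engine_inspector()
print(inspector.get_engine_information(trt.LayerInformationFormat.JSON))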

aredden commented 1 year ago

I was trying refit to see what it would be like; maybe that was incorrect usage? Also, my environment was exactly the one described in the stable-diffusion demo README.md via the docker container, following the requirements install to a T. @rajeevsrao One thing I noticed was that inside the container the TensorRT version is 8.5.3, whereas the optional TensorRT version I installed is 8.6.0. Maybe that caused some issue? GPU is a 4090, with CUDA 12.1 outside the container.
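
A mixed install like that can be detected at runtime. A sketch comparing the Python bindings against whichever libnvinfer the loader picks up; getInferLibVersion is the library's C entry point, though the exact soname may differ per install:

# Hedged sketch: compare the pip-installed Python bindings with the libnvinfer
# shared library actually loaded, to catch a mixed 8.5.3 / 8.6.0 environment.
import ctypes
import tensorrt as trt

print("python bindings:", trt.__version__)
lib = ctypes.CDLL("libnvinfer.so.8")
lib.getInferLibVersion.restype = ctypes.c_int
print("libnvinfer     :", lib.getInferLibVersion())  # e.g. 8600 for 8.6.0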

aredden commented 1 year ago

As for the command, I was using the script below. I added some arguments to modify the max latent dimensions for the UNet, and to have the PyTorch model pulled from a local custom diffusers checkpoint path, something I had been doing with the previous version of TensorRT.

#!/bin/sh
CUDA_VISIBLE_DEVICES=0 CUDA_MODULE_LOADING=LAZY python3 demo_txt2img.py \
    --negative-prompt "((Horribly blurred)), very ugly, (jpeg artifacts, blurry, gross), messy, warped, split, bad anatomy, malformed body, malformed, warped, fake, 3d, drawn, hideous, disgusting" \
    --denoising-steps 50 \
    --scheduler DPM \
    --width 512 \
    --height 640 \
    --engine-dir engine \
    --onnx-dir onnx \
    --force-onnx-export \
    --force-engine-build \
    --force-onnx-optimize \
    --build-preview-features \
    --build-enable-refit \
    --build-all-tactics \
    --build-static-batch \
    --build-dynamic-shape \
    --max-size 1536 \
    --model-path ./oranjipiratejaydos \
    -v \
    "(Stunningly beautiful detailed) lush futuristic (eutopian paradise cyberpunk cityscape) landscape, intricate, elegant, mountains and very high waterfall background, volumetric lighting"

edit: Interesting, the error seems to have gone away after I built a container from source with TensorRT 8.6 and CUDA 12 and used the basic demo script. It might be that I had some code errors, or something about larger dynamic shapes doesn't work very well? The error could also be related to having two different TensorRT binary versions in the previous container, which had 8.5.3 before updating to 8.6 via pip. Not sure.

edit 2: It compiles, but the output images are all black.

edit 3: The black images were the result of a faulty TensorRT-compiled CLIP model for some reason 🤔. I didn't change any code whatsoever, so I'm not sure why that would happen, but inference speed is about a third of what it was with 8.5.3.

edit 4: Alright, I built from source and that helped shave off quite a bit of time: from ~1600 ms per 50 UNet passes to about 1100 ms per 50. That is still considerably slower than 8.5.3, which gets closer to ~580 ms per 50.

Vozf commented 1 year ago

I'm following the diffusion README step by step. The error occurs on the step python3 demo_txt2img.py "a beautiful photograph of Mt. Fuji during cherry blossom" --hf-token=$HF_TOKEN -v

chavinlo commented 1 year ago

> I had a similar error- [...] Is this a result of trying to get it functional for the other versions of Stable Diffusion? Sacrificing performance for flexibility?

Had the same issue, Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[onnx::LayerNormalization

Tried everything: verifying torch, reinstalling dependencies, compiling the plugins. Nothing worked except adding the --build-preview-features flag.

There was some warning that mentioned that enabling this would prevent issues... so I guess that's it.
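
For context, a sketch of what that flag plausibly toggles under the hood, based on the preview-feature names in the build logs above (the demo's actual wiring lives in its utilities.py):

# Hedged sketch: enabling the FASTER_DYNAMIC_SHAPES_0805 preview feature on a
# TensorRT builder config, which is what the logs show when the flag is set.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()
config.set_preview_feature(trt.PreviewFeature.FASTER_DYNAMIC_SHAPES_0805, True)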

chavinlo commented 1 year ago

> Had the same issue, Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[onnx::LayerNormalization [...] nothing worked except adding the --build-preview-features flag.

[generated image: txt2img-fp16-a_beautifu-1-4725]

Can confirm it works, although yes, this is wayyyy slower than before.

chavinlo commented 1 year ago

> The increase in nodes is expected if we don't use plugins; however, they will be fused back into fMHA ops by the TensorRT optimizer. Plugins, as you note, are also not very flexible and support fewer SD versions and GPU targets than the TensorRT out-of-the-box solution.

@rajeevsrao is there a way to accelerate it in the current state? By "GPU target" do you mean compiling the plugins for specific architectures? Would that help?

Vozf commented 1 year ago

I've managed to run the demo until it was somehow killed at the unet stage, although I had to manually install torch 1.13, because torch 2.0 was being installed by default since torch isn't in the requirements file. torch==1.13 must be added to requirements; torch 2.0 results in an error from the start.
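
A minimal guard for that, assuming (per the report above) the demo only works with torch 1.x:

# Hedged sketch: fail fast on an incompatible torch before the demo starts.
import torch

major = int(torch.__version__.split(".")[0])
assert major < 2, f"torch {torch.__version__} installed; this demo needs torch==1.13"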

chavinlo commented 1 year ago

Man, I can't even replicate what I did yesterday. Damn, y'all really broke it this time.

skirsten commented 1 year ago

I am also getting the

Could not find any implementation for node {ForeignNode[down_blocks.0.attentions.0.transformer_blocks.0.norm1.weight + (Unnamed Layer* 1363) [Shuffle].../down_blocks.0/attentions.0/Reshape_1 + /down_blocks.0/attentions.0/Transpose_1]} [profile 1].

with this config on the normal text2img:

[I] Building engine with configuration:
    Flags                  | [FP16, REFIT]
    Engine Capability      | EngineCapability.DEFAULT
    Memory Pools           | [WORKSPACE: 20480.00 MiB, TACTIC_DRAM: 24259.69 MiB]
    Tactic Sources         | []
    Profiling Verbosity    | ProfilingVerbosity.DETAILED
    Optimization Profiles  | 2 profile(s)
    Preview Features       | [FASTER_DYNAMIC_SHAPES_0805, DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]

I was using the version_compatible build before, without refit, and everything was fine :disappointed:. It's a shame that I cannot build an engine with refit AND version_compatible to use with the lean runtime. I also tested the normal build, and that works too.

So it has to do with refit, which does not make any sense... Now I am stuck having to ship gigabytes of unused dependencies, and the build is magically failing for no apparent reason...
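
For reference, a sketch of the two builder flags being combined here. In the TensorRT 8.6 Python API both are ordinary BuilderFlag bits, so setting them together is syntactically legal; the failure above only surfaces later, during engine optimization:

# Hedged sketch: requesting REFIT together with VERSION_COMPATIBLE (lean
# runtime) on one builder config, the combination reported to fail above.
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.REFIT)
config.set_flag(trt.BuilderFlag.VERSION_COMPATIBLE)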