NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

load unet.onnx failure of TensorRT 8.6 when running engine_from_network on GPU rtx4090 #3246

Open zhangvia opened 1 year ago

zhangvia commented 1 year ago

Description

I use demo_txt2img.py in demo/Diffusion. It works with the default arguments, but when I enable dynamic shapes it fails with the error below:

[I]     Configuring with profiles: [Profile().add('sample', min=(2, 4, 32, 32), opt=(2, 4, 64, 64), max=(8, 4, 128, 128)).add('encoder_hidden_states', min=(2, 77, 768), opt=(2, 77, 768), max=(8, 77, 768)).add('timestep', min=[1], opt=[1], max=[1])]
[I] Building engine with configuration:
    Flags                  | [FP16]
    Engine Capability      | EngineCapability.DEFAULT
    Memory Pools           | [WORKSPACE: 16692.81 MiB, TACTIC_DRAM: 24217.31 MiB]
    Tactic Sources         | []
    Profiling Verbosity    | ProfilingVerbosity.DETAILED
    Preview Features       | [DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]
[W] Myelin graph with multiple dynamic values may have poor performance if they differ.
[E] 10: Could not find any implementation for node {ForeignNode[onnx::LayerNormalization_9097 + (Unnamed Layer* 1230) [Shuffle].../down_blocks.0/attentions.0/Reshape_1 + /down_blocks.0/attentions.0/Transpose_1]}.
[E] 10: [optimizer.cpp::computeCosts::3869] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[onnx::LayerNormalization_9097 + (Unnamed Layer* 1230) [Shuffle].../down_blocks.0/attentions.0/Reshape_1 + /down_blocks.0/attentions.0/Transpose_1]}.)
[!] Invalid Engine. Please ensure the engine was built correctly
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /media/74nvme/research/diffusers_tensorrt/convert/convert_diffusers_to_tensorrt.py:223 in        │
│ <module>                                                                                         │
│                                                                                                  │
│   220                                                                                            │
│   221 if __name__=="__main__":                                                                   │
│   222 │   args = parse()                                                                         │
│ ❱ 223 │   convert2Engines(args)                                                                  │
│   224                                                                                            │
│   225                                                                                            │
│   226                                                                                            │
│                                                                                                  │
│ /media/74nvme/research/diffusers_tensorrt/convert/convert_diffusers_to_tensorrt.py:205 in        │
│ convert2Engines                                                                                  │
│                                                                                                  │
│   202 │   │   │   except Exception as e:                                                         │
│   203 │   │   │   │   print(e)                                                                   │
│   204 │   │   │   if args.force_engine_build or not os.path.exists(engine.engine_path):          │
│ ❱ 205 │   │   │   │   engine.build(onnx_opt_path,                                                │
│   206 │   │   │   │   │   fp16=True,                                                             │
│   207 │   │   │   │   │   input_profile=obj.get_input_profile(                                   │
│   208 │   │   │   │   │   │   args.opt_batch_size, args.opt_image_height, args.opt_image_width   │
│                                                                                                  │
│ /media/74nvme/research/diffusers_tensorrt/convert/convert_diffusers_to_tensorrt.py:90 in build   │
│                                                                                                  │
│    87 │   │   if not enable_all_tactics:                                                         │
│    88 │   │   │   config_kwargs['tactic_sources'] = []                                           │
│    89 │   │                                                                                      │
│ ❱  90 │   │   engine = engine_from_network(                                                      │
│    91 │   │   │   network_from_onnx_path(onnx_path),                                             │
│    92 │   │   │   config=CreateConfig(fp16=fp16,                                                 │
│    93 │   │   │   │   refittable=enable_refit,                                                   │
│ in engine_from_network                                                                           │
│                                                                                                  │
│ /home/arc-zjy3501/anaconda3/envs/sd-webui/lib/python3.10/site-packages/polygraphy/backend/base/l │
│ oader.py:40 in __call__                                                                          │
│                                                                                                  │
│   37 │   │   Note: ``call_impl`` should *not* be called directly - use this function instead.    │
│   38 │   │   """                                                                                 │
│   39 │   │   __doc__ = self.call_impl.__doc__                                                    │
│ ❱ 40 │   │   return self.call_impl(*args, **kwargs)                                              │
│   41                                                                                             │
│                                                                                                  │
│ /home/arc-zjy3501/anaconda3/envs/sd-webui/lib/python3.10/site-packages/polygraphy/util/util.py:6 │
│ 94 in wrapped                                                                                    │
│                                                                                                  │
│    691 │   │   │   │   │   │   f"Calling '{func.__qualname__}()' directly is not recommended. P  │
│    692 │   │   │   │   │   )                                                                     │
│    693 │   │   │                                                                                 │
│ ❱  694 │   │   │   return func(*args, **kwargs)                                                  │
│    695 │   │                                                                                     │
│    696 │   │   return wrapped                                                                    │
│    697                                                                                           │
│                                                                                                  │
│ /home/arc-zjy3501/anaconda3/envs/sd-webui/lib/python3.10/site-packages/polygraphy/backend/trt/lo │
│ ader.py:617 in call_impl                                                                         │
│                                                                                                  │
│   614 │   │   """                                                                                │
│   615 │   │   # We do not invoke super().call_impl here because we would otherwise be responsi   │
│   616 │   │   # for freeing it's return values.                                                  │
│ ❱ 617 │   │   return engine_from_bytes(super().call_impl, runtime=self._runtime)                 │
│   618                                                                                            │
│   619                                                                                            │
│   620 @mod.export(funcify=True)                                                                  │
│ in engine_from_bytes                                                                             │
│                                                                                                  │
│ /home/arc-zjy3501/anaconda3/envs/sd-webui/lib/python3.10/site-packages/polygraphy/backend/base/l │
│ oader.py:40 in __call__                                                                          │
│                                                                                                  │
│   37 │   │   Note: ``call_impl`` should *not* be called directly - use this function instead.    │
│   38 │   │   """                                                                                 │
│   39 │   │   __doc__ = self.call_impl.__doc__                                                    │
│ ❱ 40 │   │   return self.call_impl(*args, **kwargs)                                              │
│   41                                                                                             │
│                                                                                                  │
│ /home/arc-zjy3501/anaconda3/envs/sd-webui/lib/python3.10/site-packages/polygraphy/util/util.py:6 │
│ 94 in wrapped                                                                                    │
│                                                                                                  │
│    691 │   │   │   │   │   │   f"Calling '{func.__qualname__}()' directly is not recommended. P  │
│    692 │   │   │   │   │   )                                                                     │
│    693 │   │   │                                                                                 │
│ ❱  694 │   │   │   return func(*args, **kwargs)                                                  │
│    695 │   │                                                                                     │
│    696 │   │   return wrapped                                                                    │
│    697                                                                                           │
│                                                                                                  │
│ /home/arc-zjy3501/anaconda3/envs/sd-webui/lib/python3.10/site-packages/polygraphy/backend/trt/lo │
│ ader.py:646 in call_impl                                                                         │
│                                                                                                  │
│   643 │   │   Returns:                                                                           │
│   644 │   │   │   trt.ICudaEngine: The deserialized engine.                                      │
│   645 │   │   """                                                                                │
│ ❱ 646 │   │   buffer, owns_buffer = util.invoke_if_callable(self._serialized_engine)             │
│   647 │   │   runtime, owns_runtime = util.invoke_if_callable(self._runtime)                     │
│   648 │   │                                                                                      │
│   649 │   │   trt.init_libnvinfer_plugins(trt_util.get_trt_logger(), "")                         │
│                                                                                                  │
│ /home/arc-zjy3501/anaconda3/envs/sd-webui/lib/python3.10/site-packages/polygraphy/util/util.py:6 │
│ 63 in invoke_if_callable                                                                         │
│                                                                                                  │
│    660 │   The second return value of this function indicates whether the argument was a callab  │
│    661 │   """                                                                                   │
│    662 │   if callable(func):                                                                    │
│ ❱  663 │   │   ret = func(*args, **kwargs)                                                       │
│    664 │   │   return ret, True                                                                  │
│    665 │   return func, False                                                                    │
│    666                                                                                           │
│                                                                                                  │
│ /home/arc-zjy3501/anaconda3/envs/sd-webui/lib/python3.10/site-packages/polygraphy/util/util.py:6 │
│ 94 in wrapped                                                                                    │
│                                                                                                  │
│    691 │   │   │   │   │   │   f"Calling '{func.__qualname__}()' directly is not recommended. P  │
│    692 │   │   │   │   │   )                                                                     │
│    693 │   │   │                                                                                 │
│ ❱  694 │   │   │   return func(*args, **kwargs)                                                  │
│    695 │   │                                                                                     │
│    696 │   │   return wrapped                                                                    │
│    697                                                                                           │
│                                                                                                  │
│ /home/arc-zjy3501/anaconda3/envs/sd-webui/lib/python3.10/site-packages/polygraphy/backend/trt/lo │
│ ader.py:550 in call_impl                                                                         │
│                                                                                                  │
│   547 │   │   │   end_time = time.time()                                                         │
│   548 │   │   │                                                                                  │
│   549 │   │   │   if not engine_bytes:                                                           │
│ ❱ 550 │   │   │   │   G_LOGGER.critical("Invalid Engine. Please ensure the engine was built co   │
│   551 │   │   │                                                                                  │
│   552 │   │   │   G_LOGGER.finish(f"Finished engine building in {end_time - start_time:.3f} se   │
│   553                                                                                            │
│                                                                                                  │
│ /home/arc-zjy3501/anaconda3/envs/sd-webui/lib/python3.10/site-packages/polygraphy/logger/logger. │
│ py:597 in critical                                                                               │
│                                                                                                  │
│   594 │   │   self.log(message, Logger.CRITICAL, stack_depth=3)                                  │
│   595 │   │   from polygraphy.exception import PolygraphyException                               │
│   596 │   │                                                                                      │
│ ❱ 597 │   │   raise PolygraphyException(message) from None                                       │
│   598 │                                                                                          │
│   599 │   def internal_error(self, message):                                                     │
│   600 │   │   from polygraphy import config                                                      │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
PolygraphyException: Invalid Engine. Please ensure the engine was built correctly
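
For reference, the failing build reduces to roughly the following Polygraphy call (a minimal sketch assembled from the log above; the ONNX path is a placeholder):

from polygraphy.backend.trt import (
    CreateConfig, Profile, engine_from_network, network_from_onnx_path
)

# Dynamic-shape optimization profile; values copied from the log above.
profile = (
    Profile()
    .add("sample", min=(2, 4, 32, 32), opt=(2, 4, 64, 64), max=(8, 4, 128, 128))
    .add("encoder_hidden_states", min=(2, 77, 768), opt=(2, 77, 768), max=(8, 77, 768))
    .add("timestep", min=(1,), opt=(1,), max=(1,))
)
engine = engine_from_network(
    network_from_onnx_path("unet.onnx"),  # placeholder path
    config=CreateConfig(fp16=True, profiles=[profile]),
)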

Environment

TensorRT Version: 8.6.1

NVIDIA GPU: RTX 4090

NVIDIA Driver Version: 525.89.02

CUDA Version: 11.6

CUDNN Version: 8.4.0

Operating System: Ubuntu 20.04

Python Version (if applicable): 3.10

Tensorflow Version (if applicable): none

PyTorch Version (if applicable): 1.13.1

Baremetal or Container (if so, version):

zhangvia commented 1 year ago

I successfully built the engine file on an A100, so I think insufficient CUDA memory may be the cause. But the dynamic-shape UNet engine's inference time is longer than with a static-shape build; does that make sense?

zerollzeng commented 1 year ago
> [E] 10: [optimizer.cpp::computeCosts::3869] Error Code 10: Internal Error (Could not find any implementation for node {ForeignNode[onnx::LayerNormalization_9097 + (Unnamed Layer* 1230) [Shuffle].../down_blocks.0/attentions.0/Reshape_1 + /down_blocks.0/attentions.0/Transpose_1]}.)

Yes, it's caused by insufficient memory.

> The dynamic-shape UNet engine's inference time is longer than with a static-shape build; does that make sense?

Yes, that's expected. Static-shape inference allows more aggressive optimization.

zhangvia commented 1 year ago
> Yes, it's caused by insufficient memory.
>
> Yes, that's expected. Static-shape inference allows more aggressive optimization.

I found that if I set refit to true during the build, VRAM usage is about twice that of a non-refit build. Why does that happen?

BowenFu commented 1 year ago
> I found that if I set refit to true during the build, VRAM usage is about twice that of a non-refit build. Why does that happen?

It is a known issue that refittable engines consume more memory than non-refittable engines. We will investigate and try to mitigate this.

zerollzeng commented 1 year ago

@zhangvia could you please provide a repro for the refit memory consumption issue? We can take a further look, thanks!

zhangvia commented 1 year ago

> @zhangvia could you please provide a repro for the refit memory consumption issue? We can take a further look, thanks!

I used trtexec to build Stable Diffusion v1-5 with dynamic shapes and refit, and tested it on an RTX 4090. Besides, do you have any ideas about using the refit interface to load LoRA weights dynamically into a TensorRT engine? LoRA can't be converted to ONNX by itself, since it is not an nn.Module and doesn't have a forward function, so I need to load the LoRA state_dict into the torch module and then modify the engine file using the refit interface. But unfortunately, not all torch weight names match the engine file: some weights are not named like down_blocks.2.attentions.1.transformer_blocks.0.ff.net.0.proj.bias but instead onnx::Mul_9286 in the engine file.

zerollzeng commented 1 year ago

> do you have any ideas about using the refit interface to load LoRA weights dynamically into a TensorRT engine? LoRA can't be converted to ONNX by itself, since it is not an nn.Module and doesn't have a forward function.

Unfortunately, you have to build the engine to use the refit feature.

zerollzeng commented 1 year ago

> I used trtexec to build Stable Diffusion v1-5 with dynamic shapes and refit

You are also using dynamic shapes? That will also consume more memory. Could you please try static shapes with refit? Thanks!
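
(With the Polygraphy API used in this thread, a static-shape refittable build amounts to pinning min/opt/max to one shape and passing refittable=True; a sketch, with a placeholder path:)

from polygraphy.backend.trt import (
    CreateConfig, Profile, engine_from_network, network_from_onnx_path
)

# Static shapes: min == opt == max for every input.
static = (
    Profile()
    .add("sample", min=(2, 4, 64, 64), opt=(2, 4, 64, 64), max=(2, 4, 64, 64))
    .add("timestep", min=(2,), opt=(2,), max=(2,))
    .add("encoder_hidden_states", min=(2, 77, 768), opt=(2, 77, 768), max=(2, 77, 768))
)
engine = engine_from_network(
    network_from_onnx_path("unet.onnx"),  # placeholder path
    config=CreateConfig(fp16=True, refittable=True, profiles=[static]),
)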

zhangvia commented 1 year ago

> Could you please try static shapes with refit?

I tried it, but it seems that refit costs more time: refitting the whole UNet takes nearly 5 seconds. And if I refit only the weights associated with the LoRA, I get an error like this:

 Error Code 4: Internal Error (missing 61 needed Weights. Call IRefitter::getMissing to get their layer names and roles or IR::getMissingWeights to get their weights names.)

And if I refit all the layers associated with the LoRA, I get this error:

Error Code 4: Internal Error (/down_blocks.0/attentions.0/norm/Constant_output_0: network does not have (or does not use) this weights as a refittable weights)

But when I refit all UNet weights there is no error, and it takes about 5 seconds.

That is still too much time, although it is far less than building the engine from the ONNX file.

zhangvia commented 1 year ago

> You are also using dynamic shapes? That will also consume more memory. Could you please try static shapes with refit? Thanks!

Sorry, I need the dynamic shape feature, so even if static shapes consume less VRAM, that doesn't help me. Besides, just turning on dynamic shapes doesn't cost nearly as much VRAM as the refit feature does.

BowenFu commented 1 year ago

> I tried it, but it seems that refit costs more time: refitting the whole UNet takes nearly 5 seconds, and refitting only the LoRA-related weights or layers fails with missing-weights / non-refittable-weights errors.

You do not need to refit all layers. Just call setNamedWeights on the weights you want to update. Then call getMissingWeights to get the missing weights list. Then provide these missing weights via setNamedWeights.
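
(A minimal sketch of that flow with the TensorRT Python API; the function name and the two weight dictionaries are hypothetical, while Refitter, set_named_weights, get_missing_weights, and refit_cuda_engine are the documented calls:)

import numpy as np
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def refit_with_lora(engine, lora_weights, base_weights):
    # engine: deserialized refittable ICudaEngine.
    # lora_weights / base_weights: hypothetical dicts of engine weight name -> numpy array.
    refitter = trt.Refitter(engine, TRT_LOGGER)
    # 1) Update only the weights you actually want to change.
    for name, arr in lora_weights.items():
        refitter.set_named_weights(name, trt.Weights(np.ascontiguousarray(arr)))
    # 2) TensorRT reports any coupled weights that must also be re-supplied.
    for name in refitter.get_missing_weights():
        refitter.set_named_weights(name, trt.Weights(np.ascontiguousarray(base_weights[name])))
    # 3) Apply the refit; returns False on failure.
    assert refitter.refit_cuda_engine()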

zhangvia commented 1 year ago

> You do not need to refit all layers. Just call setNamedWeights on the weights you want to update. Then call getMissingWeights to get the missing weights list. Then provide these missing weights via setNamedWeights.

Thank you for your advice; I tried it. But it seems the lower bound on the time to load LoRA weights via refit is about 3 seconds: besides the 264 weights that need to change for the LoRA, another 72 weights must be supplied when refitting the engine, so 336 weights need to be refit in total. Refitting one weight takes about 10 ms, so the total comes to roughly 3 seconds.

zhangvia commented 1 year ago

By the way, I can build the engine file with dynamic shapes on a 24 GB GPU using trtexec, but building with the Python interface engine_from_network causes the error. Does this make sense?

BowenFu commented 1 year ago

> By the way, I can build the engine file with dynamic shapes on a 24 GB GPU using trtexec, but building with the Python interface engine_from_network causes the error. Does this make sense?

Could you provide the detailed repro steps, including the Python script, so that we can investigate?

zhangvia commented 1 year ago

> Could you provide the detailed repro steps, including the Python script, so that we can investigate?

I use this repo: https://github.com/keddyjin/TensorRT_StableDiffusion_ControlNet. All the scripts are in that repo, and the weights are Stable Diffusion v1-5.

BowenFu commented 1 year ago

Can you provide your command?

Also note that TensorRT has demo diffusion (refer to https://github.com/NVIDIA/TensorRT/tree/release/8.6/demo/Diffusion).

zhangvia commented 1 year ago

> Can you provide your command?
>
> Also note that TensorRT has demo diffusion (refer to https://github.com/NVIDIA/TensorRT/tree/release/8.6/demo/Diffusion).

This repo, https://github.com/keddyjin/TensorRT_StableDiffusion_ControlNet, implements multi-ControlNet, and my command is:

python demo_txt2img_controlnet.py --prompt 'a girl' --force-onnx-export --force-onnx-optimize --force-engine-build --build-enable-refit --build-dynamic-shape

The max resolution in the dynamic shape profile was 1024, and I can't build the engine on the RTX 4090, but trtexec works fine. If I change the max resolution to 768, I can also build the engine on the RTX 4090.
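
(One knob worth isolating on the Python side is the builder's workspace memory pool, which differs between the two logs below; recent Polygraphy versions expose it directly. A hedged sketch, assuming a Polygraphy build with the memory_pool_limits parameter:)

import tensorrt as trt
from polygraphy.backend.trt import CreateConfig

# Cap the workspace pool explicitly (here 8 GiB) instead of letting the
# builder claim all device memory; the value is an experiment, not a fix.
config = CreateConfig(
    fp16=True,
    memory_pool_limits={trt.MemoryPoolType.WORKSPACE: 8 << 30},
)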

BowenFu commented 1 year ago

What is the command you used for trtexec? Can you provide the full log of both cases? There must be some difference in the builder flags/options.

zhangvia commented 1 year ago

> What is the command you used for trtexec? Can you provide the full log of both cases? There must be some difference in the builder flags/options.

Python interface:

[W] 'colored' module is not installed, will not use colors when logging. To enable colors, please install the 'colored' module: python3 -m pip install colored
[W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
[W] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[W] onnx2trt_utils.cpp:400: One or more weights outside the range of INT32 was clamped
[I]     Configuring with profiles: [Profile().add('sample', min=(1, 4, 32, 32), opt=(2, 4, 64, 64), max=(4, 4, 128, 128)).add('timestep', min=(1,), opt=(2,), max=(4,)).add('encoder_hidden_states', min=(1, 77, 768), opt=(2, 77, 768), max=(4, 77, 768))]
[I] Building engine with configuration:
    Flags                  | [FP16]
    Engine Capability      | EngineCapability.DEFAULT
    Memory Pools           | [WORKSPACE: 24217.31 MiB, TACTIC_DRAM: 24217.31 MiB]
    Tactic Sources         | [CUBLAS, CUBLAS_LT, CUDNN, EDGE_MASK_CONVOLUTIONS, JIT_CONVOLUTIONS]
    Profiling Verbosity    | ProfilingVerbosity.DETAILED
    Preview Features       | [FASTER_DYNAMIC_SHAPES_0805, DISABLE_EXTERNAL_TACTIC_SOURCES_FOR_CORE_0805]
[W] Detected layernorm nodes in FP16: Sub_485, Pow_487, ReduceMean_488, Add_490, Sqrt_491, Div_492, Mul_493, Add_494, [... several hundred more Sub/Pow/ReduceMean/Add/Sqrt/Div/Mul layernorm node names ...], Sub_8343, Pow_8345, ReduceMean_8346, Add_8348, Sqrt_8349, Div_8350, Mul_8351, Add_8352
[W] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
Segmentation fault (core dumped)

If I set the max batch size to 1, the max shapes become (2, 4, 128, 128), (2,), (2, 77, 768), and the engine builds successfully.
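
(Incidentally, both logs warn that CUDA lazy loading is disabled; enabling it is the documented way to reduce device memory usage during initialization, assuming the installed CUDA supports it. It must be set before CUDA is first initialized in the process:)

import os

# Documented CUDA environment variable referenced by the "Lazy Loading"
# warning in the logs above; set it before any CUDA work happens.
os.environ["CUDA_MODULE_LOADING"] = "LAZY"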

trtexec:


&&&& RUNNING TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=./model/unet/model.onnx --saveEngine=./model/unet/model.trt --minShapes=sample:1x4x32x32,timestep:1,encoder_hidden_states:1x77x768 --optShapes=sample:2x4x64x64,timestep:2,encoder_hidden_states:2x77x768 --maxShapes=sample:4x4x128x128,timestep:4,encoder_hidden_states:4x77x768 --explicitBatch --fp16
[09/01/2023-07:53:06] [W] --explicitBatch flag has been deprecated and has no effect!
[09/01/2023-07:53:06] [W] Explicit batch dim is automatically enabled if input model is ONNX or if dynamic shapes are provided when the engine is built.
[09/01/2023-07:53:06] [I] === Model Options ===
[09/01/2023-07:53:06] [I] Format: ONNX
[09/01/2023-07:53:06] [I] Model: ./model/unet/model.onnx
[09/01/2023-07:53:06] [I] Output:
[09/01/2023-07:53:06] [I] === Build Options ===
[09/01/2023-07:53:06] [I] Max batch: explicit batch
[09/01/2023-07:53:06] [I] Memory Pools: workspace: default, dlaSRAM: default, dlaLocalDRAM: default, dlaGlobalDRAM: default
[09/01/2023-07:53:06] [I] minTiming: 1
[09/01/2023-07:53:06] [I] avgTiming: 8
[09/01/2023-07:53:06] [I] Precision: FP32+FP16
[09/01/2023-07:53:06] [I] LayerPrecisions:
[09/01/2023-07:53:06] [I] Layer Device Types:
[09/01/2023-07:53:06] [I] Calibration:
[09/01/2023-07:53:06] [I] Refit: Disabled
[09/01/2023-07:53:06] [I] Version Compatible: Disabled
[09/01/2023-07:53:06] [I] TensorRT runtime: full
[09/01/2023-07:53:06] [I] Lean DLL Path:
[09/01/2023-07:53:06] [I] Tempfile Controls: { in_memory: allow, temporary: allow }
[09/01/2023-07:53:06] [I] Exclude Lean Runtime: Disabled
[09/01/2023-07:53:06] [I] Sparsity: Disabled
[09/01/2023-07:53:06] [I] Safe mode: Disabled
[09/01/2023-07:53:06] [I] Build DLA standalone loadable: Disabled
[09/01/2023-07:53:06] [I] Allow GPU fallback for DLA: Disabled
[09/01/2023-07:53:06] [I] DirectIO mode: Disabled
[09/01/2023-07:53:06] [I] Restricted mode: Disabled
[09/01/2023-07:53:06] [I] Skip inference: Disabled
[09/01/2023-07:53:06] [I] Save engine: ./model/unet/model.trt
[09/01/2023-07:53:06] [I] Load engine:
[09/01/2023-07:53:06] [I] Profiling verbosity: 0
[09/01/2023-07:53:06] [I] Tactic sources: Using default tactic sources
[09/01/2023-07:53:06] [I] timingCacheMode: local
[09/01/2023-07:53:06] [I] timingCacheFile:
[09/01/2023-07:53:06] [I] Heuristic: Disabled
[09/01/2023-07:53:06] [I] Preview Features: Use default preview flags.
[09/01/2023-07:53:06] [I] MaxAuxStreams: -1
[09/01/2023-07:53:06] [I] BuilderOptimizationLevel: -1
[09/01/2023-07:53:06] [I] Input(s)s format: fp32:CHW
[09/01/2023-07:53:06] [I] Output(s)s format: fp32:CHW
[09/01/2023-07:53:06] [I] Input build shape: sample=1x4x32x32+2x4x64x64+4x4x128x128
[09/01/2023-07:53:06] [I] Input build shape: timestep=1+2+4
[09/01/2023-07:53:06] [I] Input build shape: encoder_hidden_states=1x77x768+2x77x768+4x77x768
[09/01/2023-07:53:06] [I] Input calibration shapes: model
[09/01/2023-07:53:06] [I] === System Options ===
[09/01/2023-07:53:06] [I] Device: 0
[09/01/2023-07:53:06] [I] DLACore:
[09/01/2023-07:53:06] [I] Plugins:
[09/01/2023-07:53:06] [I] setPluginsToSerialize:
[09/01/2023-07:53:06] [I] dynamicPlugins:
[09/01/2023-07:53:06] [I] ignoreParsedPluginLibs: 0
[09/01/2023-07:53:06] [I]
[09/01/2023-07:53:06] [I] === Inference Options ===
[09/01/2023-07:53:06] [I] Batch: Explicit
[09/01/2023-07:53:06] [I] Input inference shape: encoder_hidden_states=2x77x768
[09/01/2023-07:53:06] [I] Input inference shape: timestep=2
[09/01/2023-07:53:06] [I] Input inference shape: sample=2x4x64x64
[09/01/2023-07:53:06] [I] Iterations: 10
[09/01/2023-07:53:06] [I] Duration: 3s (+ 200ms warm up)
[09/01/2023-07:53:06] [I] Sleep time: 0ms
[09/01/2023-07:53:06] [I] Idle time: 0ms
[09/01/2023-07:53:06] [I] Inference Streams: 1
[09/01/2023-07:53:06] [I] ExposeDMA: Disabled
[09/01/2023-07:53:06] [I] Data transfers: Enabled
[09/01/2023-07:53:06] [I] Spin-wait: Disabled
[09/01/2023-07:53:06] [I] Multithreading: Disabled
[09/01/2023-07:53:06] [I] CUDA Graph: Disabled
[09/01/2023-07:53:06] [I] Separate profiling: Disabled
[09/01/2023-07:53:06] [I] Time Deserialize: Disabled
[09/01/2023-07:53:06] [I] Time Refit: Disabled
[09/01/2023-07:53:06] [I] NVTX verbosity: 0
[09/01/2023-07:53:06] [I] Persistent Cache Ratio: 0
[09/01/2023-07:53:06] [I] Inputs:
[09/01/2023-07:53:06] [I] === Reporting Options ===
[09/01/2023-07:53:06] [I] Verbose: Disabled
[09/01/2023-07:53:06] [I] Averages: 10 inferences
[09/01/2023-07:53:06] [I] Percentiles: 90,95,99
[09/01/2023-07:53:06] [I] Dump refittable layers:Disabled
[09/01/2023-07:53:06] [I] Dump output: Disabled
[09/01/2023-07:53:06] [I] Profile: Disabled
[09/01/2023-07:53:06] [I] Export timing to JSON file:
[09/01/2023-07:53:06] [I] Export output to JSON file:
[09/01/2023-07:53:06] [I] Export profile to JSON file:
[09/01/2023-07:53:06] [I]
[09/01/2023-07:53:06] [I] === Device Information ===
[09/01/2023-07:53:06] [I] Selected Device: NVIDIA GeForce RTX 4090
[09/01/2023-07:53:06] [I] Compute Capability: 8.9
[09/01/2023-07:53:06] [I] SMs: 128
[09/01/2023-07:53:06] [I] Device Global Memory: 24217 MiB
[09/01/2023-07:53:06] [I] Shared Memory per SM: 100 KiB
[09/01/2023-07:53:06] [I] Memory Bus Width: 384 bits (ECC disabled)
[09/01/2023-07:53:06] [I] Application Compute Clock Rate: 2.52 GHz
[09/01/2023-07:53:06] [I] Application Memory Clock Rate: 10.501 GHz
[09/01/2023-07:53:06] [I]
[09/01/2023-07:53:06] [I] Note: The application clock rates do not reflect the actual clock rates that the GPU is currently running at.
[09/01/2023-07:53:06] [I]
[09/01/2023-07:53:06] [I] TensorRT version: 8.6.1
[09/01/2023-07:53:06] [I] Loading standard plugins
[09/01/2023-07:53:08] [I] [TRT] [MemUsageChange] Init CUDA: CPU +352, GPU +0, now: CPU 367, GPU 8204 (MiB)
[09/01/2023-07:53:15] [I] [TRT] [MemUsageChange] Init builder kernel library: CPU +1218, GPU +266, now: CPU 1665, GPU 8470 (MiB)
[09/01/2023-07:53:16] [W] [TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
[09/01/2023-07:53:16] [I] Start parsing network model.
[09/01/2023-07:53:16] [I] [TRT] ----------------------------------------------------------------
[09/01/2023-07:53:16] [I] [TRT] Input filename:   ./model/unet/model.onnx
[09/01/2023-07:53:16] [I] [TRT] ONNX IR version:  0.0.7
[09/01/2023-07:53:16] [I] [TRT] Opset version:    14
[09/01/2023-07:53:16] [I] [TRT] Producer name:    pytorch
[09/01/2023-07:53:16] [I] [TRT] Producer version: 1.12.1
[09/01/2023-07:53:16] [I] [TRT] Domain:
[09/01/2023-07:53:16] [I] [TRT] Model version:    0
[09/01/2023-07:53:16] [I] [TRT] Doc string:
[09/01/2023-07:53:16] [I] [TRT] ----------------------------------------------------------------
[09/01/2023-07:53:17] [W] [TRT] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
[09/01/2023-07:53:18] [W] [TRT] onnx2trt_utils.cpp:400: One or more weights outside the range of INT32 was clamped
[09/01/2023-07:53:22] [I] Finished parsing network model. Parse time: 6.5019
[09/01/2023-07:53:23] [W] [TRT] Detected layernorm nodes in FP16: Sub_297, Pow_299, ReduceMean_300, Add_302, Sqrt_303, Div_304, Mul_305, Add_306, [... several hundred more Sub/Pow/ReduceMean/Add/Sqrt/Div/Mul layernorm node names ...], Sub_8343, Pow_8345, ReduceMean_8346, Add_8348, Sqrt_8349, Div_8350, Mul_8351, Add_8352
[09/01/2023-07:53:23] [W] [TRT] Running layernorm after self-attention in FP16 may cause overflow. Exporting the model to the latest available ONNX opset (later than opset 17) to use the INormalizationLayer, or forcing layernorm layers to run in FP32 precision can help with preserving accuracy.
[09/01/2023-07:53:26] [I] [TRT] Graph optimization time: 3.52873 seconds.
[09/01/2023-07:53:26] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 6039, GPU 10198 (MiB)
[09/01/2023-07:53:26] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +10, now: CPU 6039, GPU 10208 (MiB)
[09/01/2023-07:53:26] [I] [TRT] Local timing cache in use. Profiling results in this builder pass will not be stored.
[09/01/2023-08:06:17] [I] [TRT] Detected 3 inputs and 1 output network tensors.
[09/01/2023-08:06:26] [I] [TRT] Total Host Persistent Memory: 617680
[09/01/2023-08:06:26] [I] [TRT] Total Device Persistent Memory: 34816
[09/01/2023-08:06:26] [I] [TRT] Total Scratch Memory: 755105792
[09/01/2023-08:06:26] [I] [TRT] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 1269 MiB, GPU 8223 MiB
[09/01/2023-08:06:26] [I] [TRT] [BlockAssignment] Started assigning block shifts. This will take 676 steps to complete.
[09/01/2023-08:06:27] [I] [TRT] [BlockAssignment] Algorithm ShiftNTopDown took 101.951ms to assign 18 blocks to 676 nodes requiring 1246422528 bytes.
[09/01/2023-08:06:27] [I] [TRT] Total Activation Memory: 1246422528
[09/01/2023-08:06:28] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 7621, GPU 10578 (MiB)
[09/01/2023-08:06:28] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 7621, GPU 10586 (MiB)
[09/01/2023-08:06:28] [W] [TRT] TensorRT encountered issues when converting weights between types and that could affect accuracy.
[09/01/2023-08:06:28] [W] [TRT] If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
[09/01/2023-08:06:28] [W] [TRT] Check verbose logs for the list of affected weights.
[09/01/2023-08:06:28] [W] [TRT] - 274 weights are affected by this issue: Detected subnormal FP16 values.
[09/01/2023-08:06:28] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +49, GPU +1640, now: CPU 49, GPU 1640 (MiB)
[09/01/2023-08:06:32] [I] Engine built in 805.385 sec.
[09/01/2023-08:06:33] [I] [TRT] Loaded engine size: 1653 MiB
[09/01/2023-08:06:33] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 4918, GPU 9296 (MiB)
[09/01/2023-08:06:33] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +0, GPU +8, now: CPU 4918, GPU 9304 (MiB)
[09/01/2023-08:06:33] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +1639, now: CPU 0, GPU 1639 (MiB)
[09/01/2023-08:06:33] [I] Engine deserialized in 0.547253 sec.
[09/01/2023-08:06:33] [I] [TRT] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +8, now: CPU 4918, GPU 9296 (MiB)
[09/01/2023-08:06:34] [I] [TRT] [MemUsageChange] Init cuDNN: CPU +1, GPU +8, now: CPU 4919, GPU 9304 (MiB)
[09/01/2023-08:06:34] [I] [TRT] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +1189, now: CPU 0, GPU 2828 (MiB)
[09/01/2023-08:06:34] [W] [TRT] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage and speed up TensorRT initialization. See "Lazy Loading" section of CUDA documentation https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#lazy-loading
[09/01/2023-08:06:34] [I] Setting persistentCacheLimit to 0 bytes.
[09/01/2023-08:06:34] [I] Using random values for input sample
[09/01/2023-08:06:34] [I] Input binding for sample with dimensions 2x4x64x64 is created.
[09/01/2023-08:06:34] [I] Using random values for input timestep
[09/01/2023-08:06:34] [I] Input binding for timestep with dimensions 2 is created.
[09/01/2023-08:06:34] [I] Using random values for input encoder_hidden_states
[09/01/2023-08:06:35] [I] Input binding for encoder_hidden_states with dimensions 2x77x768 is created.
[09/01/2023-08:06:35] [I] Output binding for out_sample with dimensions 2x4x64x64 is created.
[09/01/2023-08:06:35] [I] Starting inference
-------------inference cost time: 37.263 ms-------------
-------------inference cost time: 5.035 ms-------------
-------------inference cost time: 9.320 ms-------------
-------------inference cost time: 13.871 ms-------------
-------------inference cost time: 13.954 ms-------------
-------------inference cost time: 13.891 ms-------------
-------------inference cost time: 14.017 ms-------------
-------------inference cost time: 13.988 ms-------------
-------------inference cost time: 14.289 ms-------------
-------------inference cost time: 13.852 ms-------------
-------------inference cost time: 13.947 ms-------------
-------------inference cost time: 14.047 ms-------------
-------------inference cost time: 18.893 ms-------------
-------------inference cost time: 51.331 ms-------------
-------------inference cost time: 7.285 ms-------------
-------------inference cost time: 41.668 ms-------------
-------------inference cost time: 6.699 ms-------------
-------------inference cost time: 30.149 ms-------------
-------------inference cost time: 11.842 ms-------------
-------------inference cost time: 10.373 ms-------------
-------------inference cost time: 5.012 ms-------------
-------------inference cost time: 14.222 ms-------------
-------------inference cost time: 13.920 ms-------------
-------------inference cost time: 14.165 ms-------------
-------------inference cost time: 14.115 ms-------------
-------------inference cost time: 13.933 ms-------------
-------------inference cost time: 14.080 ms-------------
-------------inference cost time: 13.900 ms-------------
-------------inference cost time: 15.078 ms-------------
-------------inference cost time: 23.884 ms-------------
-------------inference cost time: 6.487 ms-------------
-------------inference cost time: 56.820 ms-------------
-------------inference cost time: 7.842 ms-------------
-------------inference cost time: 7.494 ms-------------
-------------inference cost time: 22.214 ms-------------
-------------inference cost time: 6.465 ms-------------
-------------inference cost time: 19.936 ms-------------
-------------inference cost time: 5.641 ms-------------
-------------inference cost time: 13.959 ms-------------
-------------inference cost time: 14.069 ms-------------
-------------inference cost time: 14.118 ms-------------
-------------inference cost time: 14.181 ms-------------
-------------inference cost time: 14.089 ms-------------
-------------inference cost time: 14.051 ms-------------
-------------inference cost time: 14.026 ms-------------
-------------inference cost time: 13.965 ms-------------
-------------inference cost time: 14.068 ms-------------
-------------inference cost time: 14.708 ms-------------
-------------inference cost time: 28.187 ms-------------
-------------inference cost time: 6.637 ms-------------
-------------inference cost time: 45.277 ms-------------
-------------inference cost time: 10.676 ms-------------
-------------inference cost time: 4.920 ms-------------
-------------inference cost time: 32.805 ms-------------
-------------inference cost time: 5.432 ms-------------
-------------inference cost time: 8.640 ms-------------
-------------inference cost time: 14.267 ms-------------
-------------inference cost time: 14.294 ms-------------
-------------inference cost time: 13.851 ms-------------
-------------inference cost time: 13.955 ms-------------
-------------inference cost time: 14.202 ms-------------
-------------inference cost time: 14.130 ms-------------
-------------inference cost time: 14.117 ms-------------
-------------inference cost time: 14.040 ms-------------
-------------inference cost time: 14.023 ms-------------
-------------inference cost time: 59.882 ms-------------
-------------inference cost time: 13.916 ms-------------
-------------inference cost time: 6.175 ms-------------
-------------inference cost time: 71.233 ms-------------
-------------inference cost time: 10.232 ms-------------
-------------inference cost time: 4.239 ms-------------
-------------inference cost time: 13.336 ms-------------
-------------inference cost time: 20.645 ms-------------
-------------inference cost time: 7.671 ms-------------
-------------inference cost time: 13.916 ms-------------
-------------inference cost time: 14.030 ms-------------
-------------inference cost time: 14.114 ms-------------
-------------inference cost time: 14.050 ms-------------
-------------inference cost time: 14.185 ms-------------
-------------inference cost time: 14.321 ms-------------
-------------inference cost time: 69.480 ms-------------
-------------inference cost time: 4.701 ms-------------
-------------inference cost time: 92.745 ms-------------
-------------inference cost time: 4.869 ms-------------
-------------inference cost time: 9.172 ms-------------
-------------inference cost time: 14.244 ms-------------
-------------inference cost time: 13.979 ms-------------
-------------inference cost time: 14.042 ms-------------
-------------inference cost time: 13.908 ms-------------
-------------inference cost time: 14.076 ms-------------
-------------inference cost time: 14.104 ms-------------
-------------inference cost time: 14.051 ms-------------
-------------inference cost time: 14.048 ms-------------
-------------inference cost time: 14.068 ms-------------
-------------inference cost time: 14.204 ms-------------
-------------inference cost time: 49.923 ms-------------
-------------inference cost time: 10.582 ms-------------
-------------inference cost time: 5.773 ms-------------
-------------inference cost time: 11.457 ms-------------
-------------inference cost time: 55.813 ms-------------
-------------inference cost time: 4.793 ms-------------
-------------inference cost time: 9.329 ms-------------
-------------inference cost time: 14.231 ms-------------
-------------inference cost time: 14.076 ms-------------
-------------inference cost time: 14.104 ms-------------
-------------inference cost time: 14.262 ms-------------
-------------inference cost time: 14.127 ms-------------
-------------inference cost time: 14.109 ms-------------
-------------inference cost time: 14.192 ms-------------
-------------inference cost time: 14.104 ms-------------
-------------inference cost time: 20.511 ms-------------
-------------inference cost time: 68.824 ms-------------
-------------inference cost time: 5.216 ms-------------
-------------inference cost time: 52.319 ms-------------
-------------inference cost time: 4.815 ms-------------
-------------inference cost time: 9.319 ms-------------
-------------inference cost time: 14.108 ms-------------
-------------inference cost time: 14.069 ms-------------
-------------inference cost time: 14.208 ms-------------
-------------inference cost time: 14.114 ms-------------
-------------inference cost time: 14.049 ms-------------
-------------inference cost time: 14.122 ms-------------
-------------inference cost time: 14.082 ms-------------
-------------inference cost time: 14.372 ms-------------
-------------inference cost time: 14.042 ms-------------
-------------inference cost time: 67.934 ms-------------
-------------inference cost time: 4.772 ms-------------
-------------inference cost time: 29.267 ms-------------
-------------inference cost time: 4.797 ms-------------
-------------inference cost time: 20.915 ms-------------
-------------inference cost time: 10.234 ms-------------
-------------inference cost time: 5.390 ms-------------
-------------inference cost time: 14.339 ms-------------
-------------inference cost time: 14.166 ms-------------
-------------inference cost time: 14.084 ms-------------
-------------inference cost time: 13.919 ms-------------
-------------inference cost time: 14.244 ms-------------
-------------inference cost time: 14.050 ms-------------
-------------inference cost time: 13.991 ms-------------
-------------inference cost time: 14.108 ms-------------
-------------inference cost time: 14.077 ms-------------
-------------inference cost time: 32.097 ms-------------
-------------inference cost time: 10.298 ms-------------
-------------inference cost time: 4.440 ms-------------
-------------inference cost time: 60.966 ms-------------
-------------inference cost time: 4.719 ms-------------
-------------inference cost time: 9.405 ms-------------
-------------inference cost time: 14.150 ms-------------
-------------inference cost time: 14.014 ms-------------
-------------inference cost time: 14.087 ms-------------
-------------inference cost time: 14.220 ms-------------
-------------inference cost time: 14.238 ms-------------
-------------inference cost time: 14.109 ms-------------
-------------inference cost time: 14.098 ms-------------
-------------inference cost time: 13.981 ms-------------
-------------inference cost time: 15.080 ms-------------
-------------inference cost time: 68.014 ms-------------
-------------inference cost time: 4.402 ms-------------
-------------inference cost time: 10.553 ms-------------
-------------inference cost time: 40.676 ms-------------
-------------inference cost time: 4.708 ms-------------
-------------inference cost time: 9.449 ms-------------
-------------inference cost time: 14.137 ms-------------
-------------inference cost time: 14.079 ms-------------
-------------inference cost time: 14.114 ms-------------
-------------inference cost time: 13.969 ms-------------
-------------inference cost time: 14.163 ms-------------
-------------inference cost time: 14.113 ms-------------
-------------inference cost time: 13.941 ms-------------
[09/01/2023-08:06:38] [I] Warmup completed 13 queries over 200 ms
[09/01/2023-08:06:38] [I] Timing trace has 170 queries over 2.98332 s
[09/01/2023-08:06:38] [I]
[09/01/2023-08:06:38] [I] === Trace details ===
[09/01/2023-08:06:38] [I] Trace averages of 10 runs:
[09/01/2023-08:06:38] [I] Average on 10 runs - GPU latency: 14.1545 ms - Host latency: 14.1943 ms (enqueue 10.898 ms)
[09/01/2023-08:06:38] [I] Average on 10 runs - GPU latency: 13.9409 ms - Host latency: 13.9857 ms (enqueue 10.1216 ms)
[09/01/2023-08:06:38] [I] Average on 10 runs - GPU latency: 13.9804 ms - Host latency: 14.0214 ms (enqueue 12.2831 ms)
[09/01/2023-08:06:38] [I] Average on 10 runs - GPU latency: 13.9074 ms - Host latency: 13.9475 ms (enqueue 9.2467 ms)
[09/01/2023-08:06:38] [I] Average on 10 runs - GPU latency: 14.1037 ms - Host latency: 14.145 ms (enqueue 13.492 ms)
[09/01/2023-08:06:38] [I] Average on 10 runs - GPU latency: 13.93 ms - Host latency: 13.9709 ms (enqueue 9.43281 ms)
[09/01/2023-08:06:38] [I] Average on 10 runs - GPU latency: 14.2338 ms - Host latency: 14.2755 ms (enqueue 12.4118 ms)
[09/01/2023-08:06:38] [I] Average on 10 runs - GPU latency: 14.023 ms - Host latency: 14.0633 ms (enqueue 12.2721 ms)
[09/01/2023-08:06:38] [I] Average on 10 runs - GPU latency: 13.9879 ms - Host latency: 14.0284 ms (enqueue 10.7828 ms)
[09/01/2023-08:06:38] [I] Average on 10 runs - GPU latency: 14.0705 ms - Host latency: 14.1091 ms (enqueue 12.2525 ms)
[09/01/2023-08:06:38] [I] Average on 10 runs - GPU latency: 14.1374 ms - Host latency: 14.1792 ms (enqueue 12.3047 ms)
[09/01/2023-08:06:38] [I] Average on 10 runs - GPU latency: 14.0569 ms - Host latency: 14.0979 ms (enqueue 10.2635 ms)
[09/01/2023-08:06:38] [I] Average on 10 runs - GPU latency: 14.0456 ms - Host latency: 14.0863 ms (enqueue 11.2908 ms)
[09/01/2023-08:06:38] [I] Average on 10 runs - GPU latency: 14.0231 ms - Host latency: 14.0681 ms (enqueue 11.8683 ms)
[09/01/2023-08:06:38] [I] Average on 10 runs - GPU latency: 13.9529 ms - Host latency: 13.9922 ms (enqueue 10.8329 ms)
[09/01/2023-08:06:38] [I] Average on 10 runs - GPU latency: 14.1435 ms - Host latency: 14.1862 ms (enqueue 12.2632 ms)
[09/01/2023-08:06:38] [I] Average on 10 runs - GPU latency: 14.0698 ms - Host latency: 14.1088 ms (enqueue 12.1247 ms)
[09/01/2023-08:06:38] [I]
[09/01/2023-08:06:38] [I] === Performance summary ===
[09/01/2023-08:06:38] [I] Throughput: 56.9835 qps
[09/01/2023-08:06:38] [I] Latency: min = 13.2425 ms, max = 15.9032 ms, mean = 14.0859 ms, median = 14.1096 ms, percentile(90%) = 14.2161 ms, percentile(95%) = 14.2773 ms, percentile(99%) = 14.8491 ms
[09/01/2023-08:06:38] [I] Enqueue Time: min = 4.19189 ms, max = 20.5472 ms, mean = 11.4201 ms, median = 13.4706 ms, percentile(90%) = 13.6362 ms, percentile(95%) = 13.7087 ms, percentile(99%) = 14.073 ms
[09/01/2023-08:06:38] [I] H2D Latency: min = 0.0231934 ms, max = 0.065918 ms, mean = 0.0333104 ms, median = 0.0319214 ms, percentile(90%) = 0.0402832 ms, percentile(95%) = 0.0482788 ms, percentile(99%) = 0.0585938 ms
[09/01/2023-08:06:38] [I] GPU Compute Time: min = 13.2025 ms, max = 15.8372 ms, mean = 14.0448 ms, median = 14.0706 ms, percentile(90%) = 14.1782 ms, percentile(95%) = 14.2358 ms, percentile(99%) = 14.8029 ms
[09/01/2023-08:06:38] [I] D2H Latency: min = 0.00561523 ms, max = 0.0107422 ms, mean = 0.00778037 ms, median = 0.00764465 ms, percentile(90%) = 0.00958252 ms, percentile(95%) = 0.00994873 ms, percentile(99%) = 0.010498 ms
[09/01/2023-08:06:38] [I] Total Host Walltime: 2.98332 s
[09/01/2023-08:06:38] [I] Total GPU Compute Time: 2.38761 s
[09/01/2023-08:06:38] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[09/01/2023-08:06:38] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[09/01/2023-08:06:38] [W] * GPU compute time is unstable, with coefficient of variance = 1.70098%.
[09/01/2023-08:06:38] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[09/01/2023-08:06:38] [I] Explanations of the performance metrics are printed in the verbose logs.
[09/01/2023-08:06:38] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8601] # trtexec --onnx=./model/unet/model.onnx --saveEngine=./model/unet/model.trt --minShapes=sample:1x4x32x32,timestep:1,encoder_hidden_states:1x77x768 --optShapes=sample:2x4x64x64,timestep:2,encoder_hidden_states:2x77x768 --maxShapes=sample:4x4x128x128,timestep:4,encoder_hidden_states:4x77x768 --explicitBatch --fp16

With the same dynamic shapes, trtexec builds the engine successfully. However, the UNet engine built by trtexec seems to produce solid black images. I'm still working on it; it may be a mistake on my side.
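
One way to narrow this down is to compare the trtexec-built engine against ONNX Runtime on identical inputs, for example with Polygraphy's `Comparator`. The sketch below is only an illustration: the file paths, the opt-shape feed, and the input dtypes are assumptions taken from the trtexec command above and may need adjusting to match the actual export.

```python
# Minimal sketch (paths, dtypes, and shapes are assumptions) for checking
# whether the trtexec-built UNet engine diverges from ONNX Runtime, which
# would explain the solid black images.
import numpy as np
from polygraphy.backend.common import BytesFromPath
from polygraphy.backend.onnxrt import OnnxrtRunner, SessionFromOnnx
from polygraphy.backend.trt import EngineFromBytes, TrtRunner
from polygraphy.comparator import Comparator

# One identical feed for both runners, at the opt shape from the trtexec
# command. The timestep dtype may differ depending on how the UNet was
# exported; adjust it to match the engine's bindings.
feed = {
    "sample": np.random.randn(2, 4, 64, 64).astype(np.float32),
    "timestep": np.ones((2,), dtype=np.float32),
    "encoder_hidden_states": np.random.randn(2, 77, 768).astype(np.float32),
}

results = Comparator.run(
    [
        TrtRunner(EngineFromBytes(BytesFromPath("./model/unet/model.trt"))),
        OnnxrtRunner(SessionFromOnnx("./model/unet/model.onnx")),
    ],
    data_loader=[feed],  # a single iteration with identical inputs
)
# Passes if outputs match within default tolerances; large FP16 divergence
# here would point at the engine rather than the pipeline code.
assert bool(Comparator.compare_accuracy(results))
```

If the comparison passes, the black images are more likely caused by pre-/post-processing or dtype handling in the pipeline than by the engine itself.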

BowenFu commented 1 year ago

@zhangvia From the log, a possible difference is `Memory Pools | [WORKSPACE: 24217.31 MiB, TACTIC_DRAM: 24217.31 MiB]`. Please consider reducing the workspace value in the script.

Also, please make sure the Python script is not holding other engines in memory while building the UNet engine, so that GPU memory consumption is similar to the trtexec case.
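
A minimal sketch of what that could look like with the Polygraphy build path; the 8 GiB workspace cap and the file paths are placeholder assumptions, not values from this thread:

```python
# Sketch: cap the builder workspace so the Python build has memory headroom
# similar to the trtexec run. The 8 GiB limit and the paths are placeholder
# assumptions; tune them for your GPU.
import tensorrt as trt
from polygraphy.backend.trt import (
    CreateConfig,
    Profile,
    engine_from_network,
    network_from_onnx_path,
    save_engine,
)

# Same dynamic-shape profile as the trtexec command above.
profile = (
    Profile()
    .add("sample", min=(1, 4, 32, 32), opt=(2, 4, 64, 64), max=(4, 4, 128, 128))
    .add("timestep", min=(1,), opt=(2,), max=(4,))
    .add("encoder_hidden_states", min=(1, 77, 768), opt=(2, 77, 768), max=(4, 77, 768))
)

config = CreateConfig(
    fp16=True,
    profiles=[profile],
    # Cap the workspace pool explicitly instead of using the default.
    memory_pool_limits={trt.MemoryPoolType.WORKSPACE: 8 << 30},  # 8 GiB
)
engine = engine_from_network(
    network_from_onnx_path("./model/unet/model.onnx"), config=config
)
save_engine(engine, "./model/unet/model.trt")
```

If the build then succeeds at the same shapes trtexec handled, the original failure was likely a memory-headroom issue rather than a genuinely missing kernel implementation.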

chengzeyi commented 10 months ago

@zhangvia Hi, friend! I know you are suffering great pain using TRT with diffusers.

So why not try my fully open-source alternative, stable-fast? It's on par with TRT in inference speed, faster than torch.compile and AITemplate, and it's highly dynamic and flexible, supporting all SD models, LoRA, and ControlNet out of the box!