NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Could not find any implementation for node #2791

Closed pjh4993 closed 1 year ago

pjh4993 commented 1 year ago

Description

I'm currently running some experiments on how TensorRT fuses layers when different options are given to PyTorch's ONNX exporter. During these tests, I hit the following internal error.

Internal Error (Could not find any implementation for node Conv_14 + PWN(PWN(Sigmoid_15), Mul_16).)
[03/21/2023-07:58:45] [E] Error[2]: [builder.cpp::buildSerializedNetwork::751] Error Code 2: Internal Error (Assertion engine != nullptr failed. )

As you can see in the attached code, the model is just simple residual convolution blocks with SiLU activation; no other foreign nodes are used.

Interestingly, trtexec does not fail with small batch sizes (1-4), but batch sizes of 16 and above trigger the error. A minimal sketch of the pattern under test follows.
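
For reference, here is an illustrative stand-in for the attached model (the real model, TestC3, is in the attachment; the names and shapes here are only examples). SiLU is x * sigmoid(x), which exports to ONNX as Sigmoid + Mul — exactly the pattern TensorRT tries to fuse into Conv_14 + PWN(PWN(Sigmoid_15), Mul_16):

import torch
import torch.nn as nn

# Illustrative residual conv block with SiLU activation, exported to
# ONNX the same way as in test.py (static shapes, opset 12).
class ResidualSiLU(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.act = nn.SiLU()  # x * sigmoid(x) -> Sigmoid + Mul in ONNX

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.act(self.conv(x))

model = ResidualSiLU(64).eval().cuda()
im = torch.randn(16, 64, 32, 32).cuda()  # batch >= 16 reproduces the error here
torch.onnx.export(model, im, "test.onnx", opset_version=12)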

Environment

TensorRT Version: 8.5.3.1
NVIDIA GPU: RTX-3090
NVIDIA Driver Version: 470.161.03
CUDA Version: 11.4
CUDNN Version: 8
Operating System: Ubuntu 20.04
Python Version (if applicable): 3.8.10
Tensorflow Version (if applicable): -
PyTorch Version (if applicable): 1.12.1+cu113
Baremetal or Container (if so, version): -

Relevant Files

Codes

Steps To Reproduce

./run.sh in zipped attachment

zerollzeng commented 1 year ago
$ bash run.sh
Building the engine:
trtexec --verbose --nvtxMode=verbose --buildOnly --workspace=16384 --onnx=test.onnx --saveEngine=output/test.onnx.engine --timingCacheFile=./timing.cache --int8

Successfully built the engine.

Engine building metadata: generated output file output/test.onnx.engine.build.metadata.json
Profiling the engine:
trtexec --verbose --noDataTransfers --useCudaGraph --separateProfileRun --useSpinWait --nvtxMode=verbose --loadEngine=output/test.onnx.engine --exportTimes=output/test.onnx.engine.timing.json --exportProfile=output/test.onnx.engine.profile.json --exportLayerInfo=output/test.onnx.engine.graph.json --timingCacheFile=./timing.cache --int8
WARNING:root:Could not lock clocks (Insufficient Permissions).
        Try running as root or locking the clocks from the commandline:
                sudo nvidia-smi --lock-gpu-clocks=1620,1620
                sudo nvidia-smi --applications-clocks=6501,1620
WARNING:root:Could not unlock clocks (Insufficient Permissions).
        Try running as root or unlocking the clocks from the commandline:
                sudo nvidia-smi --reset-gpu-clocks
                sudo nvidia-smi --reset-applications-clocks

Successfully profiled the engine.

Profiling metadata: generated output file output/test.onnx.engine.profile.metadata.json
Can't generate plan SVG graph because some package is not installed
Artifcats directory: output

I couldn't reproduce the issue on my side with TRT 8.6 on an RTX 8000.

zerollzeng commented 1 year ago

Can you reproduce the issue with just trtexec --verbose --nvtxMode=verbose --buildOnly --workspace=16384 --onnx=test.onnx --saveEngine=output/test.onnx.engine --timingCacheFile=./timing.cache --int8? That would simplify the reproduction. Thanks!

pjh4993 commented 1 year ago

I'm sorry for the messy code. Currently the batch size of the optimized engine is determined at test.py:60:

# Test.py code
59: model = TestC3(3, 64).cuda()
60: im = torch.randn(4, 3, 64, 64).cuda() # You need to change here to test with different batch size
61: model(im)
62: file = Path("test.onnx")
63: file, onnx_model = export_onnx(model, im, file=file, opset=12, simplify=False, dynamic=False)

I'll attach a URL for the trtexec output files with batch sizes 4 and 32. Each directory contains the logs generated by trtexec, named "test.onnx.engine.build.log": Log files

I can still reproduce the same error in my environment.

ttyio commented 1 year ago

@zerollzeng do we have an internal issue to track this? Thanks!

zerollzeng commented 1 year ago

No, since I cannot reproduce it on my side. @pjh4993, can you reproduce it with trtexec?

ttyio commented 1 year ago

Closing since there has been no activity for more than 3 weeks. Thanks!

sachinlodhi commented 8 months ago

I am trying to export a .pt model to a .engine file using the following command:

yolo export model=weights2.pt format=engine half=True device=0 workspace=12

This works fine on Google Colab and everything else is good, but when I run the same command on my local machine I get the following error:

Ultralytics YOLOv8.1.1 🚀 Python-3.10.12 torch-2.1.2+cu121 CUDA:0 (NVIDIA GeForce MX130, 2003MiB)
Model summary (fused): 168 layers, 3006038 parameters, 0 gradients, 8.1 GFLOPs

PyTorch: starting from 'weights2.pt' with input shape (1, 3, 640, 640) BCHW and output shape(s) (1, 6, 8400) (6.0 MB)

ONNX: starting export with onnx 1.15.0 opset 17...
ONNX: export success ✅ 0.8s, saved as 'weights2.onnx' (5.9 MB)

TensorRT: starting export with TensorRT 8.6.1...
[01/14/2024-15:42:26] [TRT] [I] [MemUsageChange] Init CUDA: CPU +2, GPU +0, now: CPU 319, GPU 197 (MiB)
[01/14/2024-15:42:39] [TRT] [I] [MemUsageChange] Init builder kernel library: CPU +128, GPU +0, now: CPU 523, GPU 197 (MiB)
[01/14/2024-15:42:39] [TRT] [I] ----------------------------------------------------------------
[01/14/2024-15:42:39] [TRT] [I] Input filename:   weights2.onnx
[01/14/2024-15:42:39] [TRT] [I] ONNX IR version:  0.0.8
[01/14/2024-15:42:39] [TRT] [I] Opset version:    17
[01/14/2024-15:42:39] [TRT] [I] Producer name:    pytorch
[01/14/2024-15:42:39] [TRT] [I] Producer version: 2.1.2
[01/14/2024-15:42:39] [TRT] [I] Domain:           
[01/14/2024-15:42:39] [TRT] [I] Model version:    0
[01/14/2024-15:42:39] [TRT] [I] Doc string:       
[01/14/2024-15:42:39] [TRT] [I] ----------------------------------------------------------------
[01/14/2024-15:42:39] [TRT] [W] onnx2trt_utils.cpp:374: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.
TensorRT: input "images" with shape(1, 3, 640, 640) DataType.HALF
TensorRT: output "output0" with shape(1, 6, 8400) DataType.HALF
TensorRT: building FP16 engine as weights2.engine
[01/14/2024-15:42:39] [TRT] [I] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[01/14/2024-15:42:39] [TRT] [I] Graph optimization time: 0.0451716 seconds.
[01/14/2024-15:42:39] [TRT] [I] BuilderFlag::kTF32 is set but hardware does not support TF32. Disabling TF32.
[01/14/2024-15:42:39] [TRT] [I] Local timing cache in use. Profiling results in this builder pass will not be stored.
[01/14/2024-15:42:39] [TRT] [E] 10: Could not find any implementation for node /model.0/conv/Conv.
[01/14/2024-15:42:40] [TRT] [E] 10: [optimizer.cpp::computeCosts::3869] Error Code 10: Internal Error (Could not find any implementation for node /model.0/conv/Conv.)
TensorRT: export failure ❌ 14.2s: __enter__
Traceback (most recent call last):
  File "/home/sachin/anaconda3/envs/rover/bin/yolo", line 8, in <module>
    sys.exit(entrypoint())
  File "/home/sachin/anaconda3/envs/rover/lib/python3.10/site-packages/ultralytics/cfg/__init__.py", line 567, in entrypoint
    getattr(model, mode)(**overrides)  # default args from model
  File "/home/sachin/anaconda3/envs/rover/lib/python3.10/site-packages/ultralytics/engine/model.py", line 347, in export
    return Exporter(overrides=args, _callbacks=self.callbacks)(model=self.model)
  File "/home/sachin/anaconda3/envs/rover/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/sachin/anaconda3/envs/rover/lib/python3.10/site-packages/ultralytics/engine/exporter.py", line 275, in __call__
    f[1], _ = self.export_engine()
  File "/home/sachin/anaconda3/envs/rover/lib/python3.10/site-packages/ultralytics/engine/exporter.py", line 136, in outer_func
    raise e
  File "/home/sachin/anaconda3/envs/rover/lib/python3.10/site-packages/ultralytics/engine/exporter.py", line 131, in outer_func
    f, model = inner_func(*args, **kwargs)
  File "/home/sachin/anaconda3/envs/rover/lib/python3.10/site-packages/ultralytics/engine/exporter.py", line 686, in export_engine
    with builder.build_engine(network, config) as engine, open(f, "wb") as t:
AttributeError: __enter__
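
If I read the traceback right, the AttributeError: __enter__ is only a symptom: build_engine() apparently returns None when the build fails, so the with statement has nothing to enter, and the real failure is the "Could not find any implementation" error above. A minimal guard around that call (a sketch assuming the TensorRT 8.x Python API, not Ultralytics' actual code) would surface the real error directly:

import tensorrt as trt

# Hypothetical guard around the failing call in exporter.py: in TensorRT 8.x,
# builder.build_engine() returns None on failure, and "with None as engine"
# then raises AttributeError: __enter__. Raising here keeps the real cause
# (the [TRT] [E] messages above) front and center.
def build_engine_or_raise(
    builder: trt.Builder,
    network: trt.INetworkDefinition,
    config: trt.IBuilderConfig,
) -> trt.ICudaEngine:
    engine = builder.build_engine(network, config)
    if engine is None:
        raise RuntimeError("TensorRT engine build failed; see the [TRT] log above")
    return engine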

My system info:

CUDA:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

GPU:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.146.02             Driver Version: 535.146.02   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce MX130           Off | 00000000:01:00.0 Off |                  N/A |
| N/A   45C    P8              N/A / 200W |      4MiB /  2048MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2015      G   /usr/lib/xorg/Xorg                            2MiB |
+---------------------------------------------------------------------------------------+

Running the following code:

from numba import cuda
cuda.detect()

gives the following output:

Found 1 CUDA devices
id 0    b'NVIDIA GeForce MX130'                              [SUPPORTED]
                      Compute Capability: 5.0
                           PCI Device ID: 0
                              PCI Bus ID: 1
                                    UUID: GPU-8712d22e-d80a-b062-6994-3a5a16e961ab
                                Watchdog: Enabled
             FP32/FP64 Performance Ratio: 32
Summary:
        1/1 devices are supported

Any lead or help is appreciated.