NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

DCHECK(kind_ == value_kind_symbolic_) failed #4177

Open tsaizhenling opened 2 weeks ago

tsaizhenling commented 2 weeks ago

Description

ONNX-to-TensorRT conversion fails for a model with a dynamic batch dimension. The engine build aborts with `DCHECK(kind_ == value_kind_symbolic_) failed`.

Environment

TensorRT Version: 8.5.2.2

NVIDIA GPU: Xavier NX

CUDA Version: 12.2

Operating System: Ubuntu 20.04

Python Version (if applicable): 3.12.4

Relevant Files

Model link: https://drive.google.com/file/d/13l9CUXUJOiHfth-ryRFtxuq7vlpm1Kur/view?usp=sharing

Steps To Reproduce

import numpy as np
from onnxruntime import InferenceSession  # missing in the original snippet
from polygraphy.backend.trt import (
    CreateConfig,
    EngineFromNetwork,
    NetworkFromOnnxPath,
    SaveEngine,
    TrtRunner,
    Profile,
)

onnx_model = "parseq_recognizer_fix_dynamicbatch.onnx"

# Three optimization profiles: batch 1 only, batch 1 to 10, and batch 10 only.
profiles = [
    Profile().add("input", min=(1, 3, 32, 128), opt=(1, 3, 32, 128), max=(1, 3, 32, 128)),
    Profile().add("input", min=(1, 3, 32, 128), opt=(4, 3, 32, 128), max=(10, 3, 32, 128)),
    Profile().add("input", min=(10, 3, 32, 128), opt=(10, 3, 32, 128), max=(10, 3, 32, 128)),
]


def main():
    inp_data = np.ones(shape=(1, 3, 32, 128), dtype=np.float32)

    # Reference run with ONNX Runtime; this part works.
    rsess = InferenceSession(onnx_model, providers=["CUDAExecutionProvider"])
    pred = rsess.run(None, {"input": inp_data})
    print(pred)

    # Build the TensorRT engine with all three profiles; the DCHECK fires here.
    build_engine = EngineFromNetwork(
        NetworkFromOnnxPath(onnx_model), config=CreateConfig(fp16=False, profiles=profiles)
    )
    build_engine = SaveEngine(build_engine, path="parseq_test.engine")
    with TrtRunner(build_engine) as runner:
        outputs = runner.infer(feed_dict={"input": inp_data})
        print(outputs)


if __name__ == "__main__":
    main()
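
For anyone reproducing this, it can help to rule out the profiles themselves before blaming the builder. The following is a minimal sketch in plain Python (no TensorRT needed) that checks the three (min, opt, max) triples from the script for elementwise consistency and reports which profiles cover a given input shape. The helper names `profile_is_consistent`, `shape_fits`, and `matching_profiles` are made up for this illustration and are not part of Polygraphy.

```python
from typing import List, Sequence, Tuple

Shape = Tuple[int, ...]

# (min, opt, max) triples copied from the reproduction script.
PROFILES: List[Tuple[Shape, Shape, Shape]] = [
    ((1, 3, 32, 128), (1, 3, 32, 128), (1, 3, 32, 128)),
    ((1, 3, 32, 128), (4, 3, 32, 128), (10, 3, 32, 128)),
    ((10, 3, 32, 128), (10, 3, 32, 128), (10, 3, 32, 128)),
]


def profile_is_consistent(min_shape: Shape, opt_shape: Shape, max_shape: Shape) -> bool:
    """Return True if min <= opt <= max holds in every dimension."""
    if not (len(min_shape) == len(opt_shape) == len(max_shape)):
        return False
    return all(lo <= mid <= hi for lo, mid, hi in zip(min_shape, opt_shape, max_shape))


def shape_fits(shape: Shape, min_shape: Shape, max_shape: Shape) -> bool:
    """Return True if shape lies within [min_shape, max_shape] in every dimension."""
    if len(shape) != len(min_shape) or len(shape) != len(max_shape):
        return False
    return all(lo <= dim <= hi for lo, dim, hi in zip(min_shape, shape, max_shape))


def matching_profiles(shape: Shape, profiles: Sequence = PROFILES) -> List[int]:
    """Return the indices of all profiles whose range covers the given shape."""
    return [i for i, (lo, _opt, hi) in enumerate(profiles) if shape_fits(shape, lo, hi)]


if __name__ == "__main__":
    print(all(profile_is_consistent(*p) for p in PROFILES))
    print(matching_profiles((1, 3, 32, 128)))
```

All three profiles pass this check, and the (1, 3, 32, 128) test input is covered by profiles 0 and 1, so the profile definitions look valid and the abort seems to come from the builder rather than from the script.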

error:

[W] onnx2trt_utils.cpp:375: Your ONNX model has been generated with INT64 weights, while TensorRT does not natively support INT64. Attempting to cast down to INT32.                                                                 
[W] onnx2trt_utils.cpp:403: One or more weights outside the range of INT32 was clamped                                                                                                                                               
[W] Tensor DataType is determined at build time for tensors not marked as input or output.                                                                                                                                           
[I] Configuring with profiles:[                                                                                                                                                                                                      
        Profile 0:                                                                                                                                                                                                                   
            {input [min=(1, 3, 32, 128), opt=(1, 3, 32, 128), max=(1, 3, 32, 128)]},                                                                                                                                                 
        Profile 1:                                                                                                                                                                                                                   
            {input [min=(1, 3, 32, 128), opt=(4, 3, 32, 128), max=(10, 3, 32, 128)]},                                                                                                                                                
        Profile 2:                                                                                                                                                                                                                   
            {input [min=(10, 3, 32, 128), opt=(10, 3, 32, 128), max=(10, 3, 32, 128)]}                                                                                                                                               
    ]                                                                                                                                                                                                                                
[I] Building engine with configuration:                                                                                                                                                                                              
    Flags                  | []                                                                                                                                                                                                      
    Engine Capability      | EngineCapability.DEFAULT                                                                                                                                                                                
    Memory Pools           | [WORKSPACE: 6854.28 MiB]                                                                                                                                                                                
    Tactic Sources         | [CUBLAS, CUBLAS_LT, CUDNN, EDGE_MASK_CONVOLUTIONS, JIT_CONVOLUTIONS]                                                                                                                                    
    Profiling Verbosity    | ProfilingVerbosity.DETAILED                                                                                                                                                                             
    Optimization Profiles  | 3 profile(s)                                                                                                                                                                                            
[W] DLA requests all profiles have same min, max, and opt value. All dla layers are falling back to GPU                                                                                                                              
[W] Using PreviewFeature::kFASTER_DYNAMIC_SHAPES_0805 can help improve performance and resolve potential functional issues.                                                                                                          
value.h:413: DCHECK(kind_ == value_kind_symbolic_) failed.                                                                                                                                                                           
Aborted (core dumped)

lix19937 commented 2 weeks ago

This is probably a bug in that release; try TensorRT v8.6. @tsaizhenling
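
A quick way to confirm which side of that version boundary an environment is on is to compare the version string numerically. This is only a sketch: it assumes a purely numeric dotted version such as the "8.5.2.2" from the report (in a real environment the string would typically come from `tensorrt.__version__`), and the helper names `parse_version` and `at_least` are made up for this example.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn a dotted numeric version like '8.5.2.2' into (8, 5, 2, 2)."""
    return tuple(int(part) for part in version.split("."))


def at_least(version: str, minimum: Tuple[int, ...]) -> bool:
    """Return True if version is >= minimum, compared on the leading components."""
    return parse_version(version)[: len(minimum)] >= minimum


if __name__ == "__main__":
    # Version strings taken from this thread; replace with tensorrt.__version__ locally.
    for candidate in ("8.5.2.2", "8.6.1"):
        print(candidate, at_least(candidate, (8, 6)))
```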

lix19937 commented 2 weeks ago

On my machine the build succeeds. trtexec output:


[10/03/2024-10:47:59] [I] === Performance summary ===
[10/03/2024-10:47:59] [I] Throughput: 130.298 qps
[10/03/2024-10:47:59] [I] Latency: min = 5.38672 ms, max = 13.3664 ms, mean = 7.60628 ms, median = 7.16092 ms, percentile(90%) = 10.2996 ms, percentile(95%) = 10.8698 ms, percentile(99%) = 12.9675 ms
[10/03/2024-10:47:59] [I] Enqueue Time: min = 2.72876 ms, max = 15.0201 ms, mean = 7.63062 ms, median = 7.86694 ms, percentile(90%) = 10.1792 ms, percentile(95%) = 11.7217 ms, percentile(99%) = 14.1848 ms
[10/03/2024-10:47:59] [I] H2D Latency: min = 0.0065918 ms, max = 0.0698242 ms, mean = 0.0122212 ms, median = 0.00814819 ms, percentile(90%) = 0.0253906 ms, percentile(95%) = 0.0270996 ms, percentile(99%) = 0.0515137 ms
[10/03/2024-10:47:59] [I] GPU Compute Time: min = 5.34595 ms, max = 13.3388 ms, mean = 7.58404 ms, median = 7.14139 ms, percentile(90%) = 10.2881 ms, percentile(95%) = 10.8175 ms, percentile(99%) = 12.9485 ms
[10/03/2024-10:47:59] [I] D2H Latency: min = 0.00390625 ms, max = 0.110596 ms, mean = 0.0100141 ms, median = 0.00439453 ms, percentile(90%) = 0.0254517 ms, percentile(95%) = 0.0319824 ms, percentile(99%) = 0.0994873 ms
[10/03/2024-10:47:59] [I] Total Host Walltime: 3.00082 s
[10/03/2024-10:47:59] [I] Total GPU Compute Time: 2.96536 s
[10/03/2024-10:47:59] [W] * Throughput may be bound by Enqueue Time rather than GPU Compute and the GPU may be under-utilized.
[10/03/2024-10:47:59] [W]   If not already in use, --useCudaGraph (utilize CUDA graphs where possible) may increase the throughput.
[10/03/2024-10:47:59] [W] * GPU compute time is unstable, with coefficient of variance = 26.7909%.
[10/03/2024-10:47:59] [W]   If not already in use, locking GPU clock frequency or adding --useSpinWait may improve the stability.
[10/03/2024-10:47:59] [I] Explanations of the performance metrics are printed in the verbose logs.
[10/03/2024-10:47:59] [V]
[10/03/2024-10:47:59] [V] === Explanations of the performance metrics ===
[10/03/2024-10:47:59] [V] Total Host Walltime: the host walltime from when the first query (after warmups) is enqueued to when the last query is completed.
[10/03/2024-10:47:59] [V] GPU Compute Time: the GPU latency to execute the kernels for a query.
[10/03/2024-10:47:59] [V] Total GPU Compute Time: the summation of the GPU Compute Time of all the queries. If this is significantly shorter than Total Host Walltime, the GPU may be under-utilized because of host-side overheads or data transfers.
[10/03/2024-10:47:59] [V] Throughput: the observed throughput computed by dividing the number of queries by the Total Host Walltime. If this is significantly lower than the reciprocal of GPU Compute Time, the GPU may be under-utilized because of host-side overheads or data transfers.
[10/03/2024-10:47:59] [V] Enqueue Time: the host latency to enqueue a query. If this is longer than GPU Compute Time, the GPU may be under-utilized.
[10/03/2024-10:47:59] [V] H2D Latency: the latency for host-to-device data transfers for input tensors of a single query.
[10/03/2024-10:47:59] [V] D2H Latency: the latency for device-to-host data transfers for output tensors of a single query.
[10/03/2024-10:47:59] [V] Latency: the summation of H2D Latency, GPU Compute Time, and D2H Latency. This is the latency to infer a single query.
[10/03/2024-10:47:59] [I]
&&&& PASSED TensorRT.trtexec [TensorRT v8601] # trtexec --verbose --onnx=./parseq_recognizer_fix_dynamicbatch.onnx
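
Note that the trtexec command above passes no shape flags, so it may not cover the 1-to-10 batch range from the original report (as far as I know, trtexec then picks a default shape for dynamic dimensions). A sketch of how the matching command line could be assembled follows. The flags `--minShapes`, `--optShapes`, and `--maxShapes` are existing trtexec options; the helper names `shape_arg` and `trtexec_command` are made up for this example, and the input name "input" is taken from the reporter's script.

```python
from typing import List, Sequence


def shape_arg(input_name: str, shape: Sequence[int]) -> str:
    """Format a shape the way trtexec expects it, e.g. 'input:1x3x32x128'."""
    return f"{input_name}:{'x'.join(str(dim) for dim in shape)}"


def trtexec_command(
    onnx_path: str,
    input_name: str,
    min_shape: Sequence[int],
    opt_shape: Sequence[int],
    max_shape: Sequence[int],
) -> List[str]:
    """Build a trtexec argument list for one dynamic-shape optimization profile."""
    return [
        "trtexec",
        "--verbose",
        f"--onnx={onnx_path}",
        f"--minShapes={shape_arg(input_name, min_shape)}",
        f"--optShapes={shape_arg(input_name, opt_shape)}",
        f"--maxShapes={shape_arg(input_name, max_shape)}",
    ]


if __name__ == "__main__":
    # Mirrors the second profile (batch 1 to 10) from the reproduction script.
    cmd = trtexec_command(
        "./parseq_recognizer_fix_dynamicbatch.onnx",
        "input",
        (1, 3, 32, 128),
        (4, 3, 32, 128),
        (10, 3, 32, 128),
    )
    print(" ".join(cmd))
```

Running the printed command on both 8.5 and 8.6 would show whether the dynamic-batch build itself passes on the newer release, which the run above does not clearly establish.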