NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Can't run network with INonZeroLayer on TensorRT 8.6.1.6 and GPU NVIDIA GeForce RTX 3060 #3550

Closed: ArseniuML closed this issue 1 month ago

ArseniuML commented 8 months ago

Description

  1. Create a TensorRT network with an INonZeroLayer from scratch. Save the TensorRT engine.
  2. Deserialize the CUDA engine and try to create an execution context.
  3. The execution context is nullptr.

Code:

#include <NvInfer.h>
using namespace nvinfer1;

#include "cuda_runtime_api.h"

#include <iostream>
#include <fstream>
#include <assert.h>

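// Minimal logger that forwards every TensorRT message to stdout, regardless of severity.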
class Logger : public ILogger           
{
    void log(Severity severity, const char* msg) noexcept override
    {
        std::cout << msg << std::endl;
    }
} logger;

int main()
{

    {
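        // Build phase: define a one-layer network (NonZero), build the engine,
        // and serialize it to ../dynamic.engine.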
        IBuilder* builder = createInferBuilder(logger);
        INetworkDefinition* network = builder->createNetworkV2(1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));

        Dims dims;
        dims.nbDims = 1;
        dims.d[0] = 32;
        ITensor& input = *network->addInput("input", DataType::kINT32, dims);

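        // NonZero's output shape depends on the input values (data-dependent shape).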
        auto nzLayer = network->addNonZero(input);

        ITensor& output = *nzLayer->getOutput(0);
        output.setName("output");
        network->markOutput(output);

        IBuilderConfig* config = builder->createBuilderConfig();
        config->setMaxWorkspaceSize(1 << 20);

        ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);

        std::cout << "Serializing model" << std::endl;
        IHostMemory *serializedModel = engine->serialize();
        std::cout << "Model serialized" << std::endl;

        std::ofstream p("../dynamic.engine", std::ios::binary);
        if (!p) 
        {
            std::cerr << "could not create engine file" << std::endl;
            return -1;
        }

        p.write(reinterpret_cast<const char*>(serializedModel->data()), serializedModel->size());
        std::cout << "Engine file written" << std::endl;

        delete network;
        delete config;
        delete builder;
        delete serializedModel;
        delete engine;
    }

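    // Runtime phase: read the serialized plan back from disk.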
    std::ifstream file("../dynamic.engine", std::ios::binary);
    if (!file.good())
    {
        std::cout << "Engine is not good" << std::endl;
    }

    IRuntime* runtime = createInferRuntime(logger);
    assert(runtime);

    file.seekg(0, file.end);
    auto size = file.tellg();
    file.seekg(0, file.beg);
    auto trt_model_stream = new char[size];
    file.read(trt_model_stream, size);
    file.close();

    ICudaEngine* engine = runtime->deserializeCudaEngine(trt_model_stream, size);
    assert(engine);
    delete[] trt_model_stream;

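    // Fails here: createExecutionContext() returns nullptr, so the assert below fires.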
    IExecutionContext* context = engine->createExecutionContext();
    assert(context);

    return 0;
}
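
For reference, the same build can also be expressed with the TensorRT 8.x builder entry points that replace the deprecated setMaxWorkspaceSize() and buildEngineWithConfig() calls used above. This is only a sketch for comparison, not part of the original report; the helper name buildNonZeroPlan is just for illustration:

#include <NvInfer.h>
#include <fstream>
using namespace nvinfer1;

// Sketch only: same NonZero network, built with the non-deprecated 8.x API.
// Assumes an ILogger instance like the one above.
bool buildNonZeroPlan(ILogger& logger)
{
    IBuilder* builder = createInferBuilder(logger);
    INetworkDefinition* network = builder->createNetworkV2(
        1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));

    Dims dims;
    dims.nbDims = 1;
    dims.d[0] = 32;
    ITensor& input = *network->addInput("input", DataType::kINT32, dims);

    ITensor& output = *network->addNonZero(input)->getOutput(0);
    output.setName("output");
    network->markOutput(output);

    IBuilderConfig* config = builder->createBuilderConfig();
    // Replaces config->setMaxWorkspaceSize(1 << 20); same 1 MiB workspace limit.
    config->setMemoryPoolLimit(MemoryPoolType::kWORKSPACE, 1 << 20);

    // Replaces buildEngineWithConfig() followed by engine->serialize().
    IHostMemory* plan = builder->buildSerializedNetwork(*network, *config);
    bool ok = (plan != nullptr);
    if (ok)
    {
        std::ofstream p("../dynamic.engine", std::ios::binary);
        p.write(reinterpret_cast<const char*>(plan->data()), plan->size());
    }

    delete plan;
    delete config;
    delete network;
    delete builder;
    return ok;
}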

Trace:

[MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 15, GPU 641 (MiB)
Trying to load shared library libnvinfer_builder_resource.so.8.6.1
Loaded shared library libnvinfer_builder_resource.so.8.6.1
[MemUsageChange] Init builder kernel library: CPU +1449, GPU +252, now: CPU 1541, GPU 879 (MiB)
CUDA lazy loading is enabled.
Original: 2 layers
After dead-layer removal: 2 layers
Graph construction completed in 0.000297353 seconds.
After Myelin optimization: 2 layers
Applying ScaleNodes fusions.
After scale fusion: 2 layers
After dupe layer removal: 2 layers
After final dead-layer removal: 2 layers
After tensor merging: 2 layers
After vertical fusions: 2 layers
After dupe layer removal: 2 layers
After final dead-layer removal: 2 layers
After tensor merging: 2 layers
After slice removal: 2 layers
After concat removal: 2 layers
Trying to split Reshape and strided tensor
Graph optimization time: 0.000145561 seconds.
Building graph using backend strategy 2
Local timing cache in use. Profiling results in this builder pass will not be stored.
Constructing optimization profile number 0 [1/1].
Applying generic optimizations to the graph for inference.
Reserving memory for host IO tensors. Host: 0 bytes
=============== Computing costs for (Unnamed Layer* 0) [NonZero]
*************** Autotuning format combination: Int32(1) -> Int32((# 0 (VALUE (Unnamed Layer* 0) [NonZero][size])),1), Int32() ***************
--------------- Timing Runner: (Unnamed Layer* 0) [NonZero] (NonZero[0x80000033])
Tactic: 0x0000000000000000 Time: 0.0134912
(Unnamed Layer* 0) [NonZero] (NonZero[0x80000033]) profiling completed in 0.00521287 seconds. Fastest Tactic: 0x0000000000000000 Time: 0.0134912
>>>>>>>>>>>>>>> Chose Runner Type: NonZero Tactic: 0x0000000000000000
=============== Computing costs for (Unnamed Layer* 0) [NonZero][size][DevicetoShapeHostCopy]
*************** Autotuning format combination: Int32() ->  ***************
=============== Computing reformatting costs
=============== Computing reformatting costs
=============== Computing reformatting costs
=============== Computing reformatting costs
Formats and tactics selection completed in 0.00554525 seconds.
After reformat layers: 2 layers
Total number of blocks in pre-optimized block assignment: 2
Detected 1 inputs and 1 output network tensors.
Layer: (Unnamed Layer* 0) [NonZero] Host Persistent: 0 Device Persistent: 0 Scratch Memory: 771
Skipped printing memory information for 1 layers with 0 memory size i.e. Host Persistent + Device Persistent + Scratch Memory == 0.
Total Host Persistent Memory: 0
Total Device Persistent Memory: 0
Total Scratch Memory: 771
[MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 4 MiB
[BlockAssignment] Started assigning block shifts. This will take 2 steps to complete.
[BlockAssignment] Algorithm ShiftNTopDown took 0.007542ms to assign 2 blocks to 2 nodes requiring 1536 bytes.
Total number of blocks in optimized block assignment: 2
Total Activation Memory: 1536
Total number of generated kernels selected for the engine: 0
Disabling unused tactic source: EDGE_MASK_CONVOLUTIONS
Disabling unused tactic source: JIT_CONVOLUTIONS
Engine generation completed in 0.00749379 seconds.
Engine Layer Information:
Layer(NonZero): (Unnamed Layer* 0) [NonZero], Tactic: 0x0000000000000000, input (Int32[32]) -> output (Int32[1,-1]), (Unnamed Layer* 0) [NonZero][size] (Int32[])
Layer(DeviceToShapeHost): (Unnamed Layer* 0) [NonZero][size][DevicetoShapeHostCopy], Tactic: 0x0000000000000000, (Unnamed Layer* 0) [NonZero][size] (Int32[]) -> 
[MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
Serializing model
Adding 1 engine(s) to plan file.
Model serialized
Engine file written
Loaded engine size: 0 MiB
Deserialization required 300 microseconds.
[MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
Total per-runner device persistent memory is 0
Total per-runner host persistent memory is 0
Allocated activation device memory of size 1536
1: Unexpected exception vector<bool>::_M_range_check: __n (which is 0) >= this->size() (which is 0)
spconv_deploy: /home/arseniy.marin@nami.local/Projects/spconv_deploy/spconv_deploy.cpp:87: int main(): Assertion `context' failed.
Aborted (core dumped)

Environment

TensorRT Version: 8.6.1.6
GPU Type: NVIDIA GeForce RTX 3060
CUDA Version: 11.1
CUDNN Version: 8.9.0.131
Operating System: Ubuntu 20.04

zerollzeng commented 8 months ago

Checking

ttyio commented 1 month ago

Closing; this is a duplicate of https://github.com/NVIDIA/TensorRT/issues/3335, thanks!