NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Can't run network with INonZeroLayer on TensorRT 8.6.1.6 and GPU NVIDIA GeForce RTX 3060 #3550

Closed: ArseniuML closed this issue 1 month ago

ArseniuML commented 8 months ago

Description

  1. Create a TensorRT network with an INonZeroLayer from scratch. Save the TensorRT engine.
  2. Deserialize the CUDA engine and try to create an execution context.
  3. The execution context is nullptr.

Code:

#include <NvInfer.h>
using namespace nvinfer1;

#include "cuda_runtime_api.h"

#include <iostream>
#include <fstream>
#include <assert.h>

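// Minimal logger that forwards every TensorRT message to stdout, regardless of severity.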
class Logger : public ILogger           
{
    void log(Severity severity, const char* msg) noexcept override
    {
        std::cout << msg << std::endl;
    }
} logger;

int main()
{

    {
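        // Build phase: define a one-layer network (NonZero), build the engine,
        // and serialize it to ../dynamic.engine.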
        IBuilder* builder = createInferBuilder(logger);
        INetworkDefinition* network = builder->createNetworkV2(1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));

        Dims dims;
        dims.nbDims = 1;
        dims.d[0] = 32;
        ITensor& input = *network->addInput("input", DataType::kINT32, dims);

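        // NonZero's output shape depends on the input values (data-dependent shape).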
        auto nzLayer = network->addNonZero(input);

        ITensor& output = *nzLayer->getOutput(0);
        output.setName("output");
        network->markOutput(output);

        IBuilderConfig* config = builder->createBuilderConfig();
        config->setMaxWorkspaceSize(1 << 20);

        ICudaEngine* engine = builder->buildEngineWithConfig(*network, *config);

        std::cout << "Serializing model" << std::endl;
        IHostMemory *serializedModel = engine->serialize();
        std::cout << "Model serialized" << std::endl;

        std::ofstream p("../dynamic.engine", std::ios::binary);
        if (!p) 
        {
            std::cerr << "could not create engine file" << std::endl;
            return -1;
        }

        p.write(reinterpret_cast<const char*>(serializedModel->data()), serializedModel->size());
        std::cout << "Engine file written" << std::endl;

        delete network;
        delete config;
        delete builder;
        delete serializedModel;
        delete engine;
    }

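    // Runtime phase: read the serialized plan back from disk.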
    std::ifstream file("../dynamic.engine", std::ios::binary);
    if (!file.good())
    {
        std::cout << "Engine is not good" << std::endl;
    }

    IRuntime* runtime = createInferRuntime(logger);
    assert(runtime);

    file.seekg(0, file.end);
    auto size = file.tellg();
    file.seekg(0, file.beg);
    auto trt_model_stream = new char[size];
    file.read(trt_model_stream, size);
    file.close();

    ICudaEngine* engine = runtime->deserializeCudaEngine(trt_model_stream, size);
    assert(engine);
    delete[] trt_model_stream;

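    // Fails here: createExecutionContext() returns nullptr, so the assert below fires.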
    IExecutionContext* context = engine->createExecutionContext();
    assert(context);

    return 0;
}
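
For reference, the same build can also be expressed with the TensorRT 8.x builder entry points that replace the deprecated setMaxWorkspaceSize() and buildEngineWithConfig() calls used above. This is only a sketch for comparison, not part of the original report; the helper name buildNonZeroPlan is just for illustration:

#include <NvInfer.h>
#include <fstream>
using namespace nvinfer1;

// Sketch only: same NonZero network, built with the non-deprecated 8.x API.
// Assumes an ILogger instance like the one above.
bool buildNonZeroPlan(ILogger& logger)
{
    IBuilder* builder = createInferBuilder(logger);
    INetworkDefinition* network = builder->createNetworkV2(
        1U << static_cast<uint32_t>(NetworkDefinitionCreationFlag::kEXPLICIT_BATCH));

    Dims dims;
    dims.nbDims = 1;
    dims.d[0] = 32;
    ITensor& input = *network->addInput("input", DataType::kINT32, dims);

    ITensor& output = *network->addNonZero(input)->getOutput(0);
    output.setName("output");
    network->markOutput(output);

    IBuilderConfig* config = builder->createBuilderConfig();
    // Replaces config->setMaxWorkspaceSize(1 << 20); same 1 MiB workspace limit.
    config->setMemoryPoolLimit(MemoryPoolType::kWORKSPACE, 1 << 20);

    // Replaces buildEngineWithConfig() followed by engine->serialize().
    IHostMemory* plan = builder->buildSerializedNetwork(*network, *config);
    bool ok = (plan != nullptr);
    if (ok)
    {
        std::ofstream p("../dynamic.engine", std::ios::binary);
        p.write(reinterpret_cast<const char*>(plan->data()), plan->size());
    }

    delete plan;
    delete config;
    delete network;
    delete builder;
    return ok;
}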

Trace:

[MemUsageChange] Init CUDA: CPU +14, GPU +0, now: CPU 15, GPU 641 (MiB)
Trying to load shared library libnvinfer_builder_resource.so.8.6.1
Loaded shared library libnvinfer_builder_resource.so.8.6.1
[MemUsageChange] Init builder kernel library: CPU +1449, GPU +252, now: CPU 1541, GPU 879 (MiB)
CUDA lazy loading is enabled.
Original: 2 layers
After dead-layer removal: 2 layers
Graph construction completed in 0.000297353 seconds.
After Myelin optimization: 2 layers
Applying ScaleNodes fusions.
After scale fusion: 2 layers
After dupe layer removal: 2 layers
After final dead-layer removal: 2 layers
After tensor merging: 2 layers
After vertical fusions: 2 layers
After dupe layer removal: 2 layers
After final dead-layer removal: 2 layers
After tensor merging: 2 layers
After slice removal: 2 layers
After concat removal: 2 layers
Trying to split Reshape and strided tensor
Graph optimization time: 0.000145561 seconds.
Building graph using backend strategy 2
Local timing cache in use. Profiling results in this builder pass will not be stored.
Constructing optimization profile number 0 [1/1].
Applying generic optimizations to the graph for inference.
Reserving memory for host IO tensors. Host: 0 bytes
=============== Computing costs for (Unnamed Layer* 0) [NonZero]
*************** Autotuning format combination: Int32(1) -> Int32((# 0 (VALUE (Unnamed Layer* 0) [NonZero][size])),1), Int32() ***************
--------------- Timing Runner: (Unnamed Layer* 0) [NonZero] (NonZero[0x80000033])
Tactic: 0x0000000000000000 Time: 0.0134912
(Unnamed Layer* 0) [NonZero] (NonZero[0x80000033]) profiling completed in 0.00521287 seconds. Fastest Tactic: 0x0000000000000000 Time: 0.0134912
>>>>>>>>>>>>>>> Chose Runner Type: NonZero Tactic: 0x0000000000000000
=============== Computing costs for (Unnamed Layer* 0) [NonZero][size][DevicetoShapeHostCopy]
*************** Autotuning format combination: Int32() ->  ***************
=============== Computing reformatting costs
=============== Computing reformatting costs
=============== Computing reformatting costs
=============== Computing reformatting costs
Formats and tactics selection completed in 0.00554525 seconds.
After reformat layers: 2 layers
Total number of blocks in pre-optimized block assignment: 2
Detected 1 inputs and 1 output network tensors.
Layer: (Unnamed Layer* 0) [NonZero] Host Persistent: 0 Device Persistent: 0 Scratch Memory: 771
Skipped printing memory information for 1 layers with 0 memory size i.e. Host Persistent + Device Persistent + Scratch Memory == 0.
Total Host Persistent Memory: 0
Total Device Persistent Memory: 0
Total Scratch Memory: 771
[MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 0 MiB, GPU 4 MiB
[BlockAssignment] Started assigning block shifts. This will take 2 steps to complete.
[BlockAssignment] Algorithm ShiftNTopDown took 0.007542ms to assign 2 blocks to 2 nodes requiring 1536 bytes.
Total number of blocks in optimized block assignment: 2
Total Activation Memory: 1536
Total number of generated kernels selected for the engine: 0
Disabling unused tactic source: EDGE_MASK_CONVOLUTIONS
Disabling unused tactic source: JIT_CONVOLUTIONS
Engine generation completed in 0.00749379 seconds.
Engine Layer Information:
Layer(NonZero): (Unnamed Layer* 0) [NonZero], Tactic: 0x0000000000000000, input (Int32[32]) -> output (Int32[1,-1]), (Unnamed Layer* 0) [NonZero][size] (Int32[])
Layer(DeviceToShapeHost): (Unnamed Layer* 0) [NonZero][size][DevicetoShapeHostCopy], Tactic: 0x0000000000000000, (Unnamed Layer* 0) [NonZero][size] (Int32[]) -> 
[MemUsageChange] TensorRT-managed allocation in building engine: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
Serializing model
Adding 1 engine(s) to plan file.
Model serialized
Engine file written
Loaded engine size: 0 MiB
Deserialization required 300 microseconds.
[MemUsageChange] TensorRT-managed allocation in engine deserialization: CPU +0, GPU +0, now: CPU 0, GPU 0 (MiB)
Total per-runner device persistent memory is 0
Total per-runner host persistent memory is 0
Allocated activation device memory of size 1536
1: Unexpected exception vector<bool>::_M_range_check: __n (which is 0) >= this->size() (which is 0)
spconv_deploy: /home/arseniy.marin@nami.local/Projects/spconv_deploy/spconv_deploy.cpp:87: int main(): Assertion `context' failed.
Aborted (core dumped)

Environment

TensorRT Version: 8.6.1.6
GPU Type: NVIDIA GeForce RTX 3060
CUDA Version: 11.1
CUDNN Version: 8.9.0.131
Operating System: Ubuntu 20.04

zerollzeng commented 8 months ago

Checking

ttyio commented 1 month ago

Closing; this is a duplicate of https://github.com/NVIDIA/TensorRT/issues/3335, thanks!