ankandrew / fast-plate-ocr

Lightweight & fast OCR models for license plate text recognition.
https://ankandrew.github.io/fast-plate-ocr
MIT License

C++ TensorRT execution #18

Closed VitalyVaryvdin closed 1 day ago

VitalyVaryvdin commented 2 days ago

Hi, this issue is probably out of scope for the repo, but I've been struggling with it for a whole day now. I thought you might give me a hint on whether I'm doing something wrong in pre- and post-processing the model data.

What I've got in a nutshell:

- Stack: C++, TensorRT, CV-CUDA
- Model: 9 slots, 24-char alphabet
- Model input: 1 x 70 x 140 x 1
- Model output: 1 x 216 (9 slots x 24 chars)

YOLOv8 does the license plate detection; then I crop the license plate and pass it to fast-plate-ocr:

// license plate bbox
short4 bb = boxes[i];

// crop license plate from full frame
nvcv::Tensor cropTensor(1, { bb.z, bb.w }, nvcv::FMT_RGB8);
cvcuda::CustomCrop crop;
crop(stream, frame->preprocessedFrame->rgbTensor, cropTensor, { bb.x, bb.y, bb.z, bb.w });

// convert to grayscale
nvcv::Tensor colorTensor(1, {bb.z, bb.w}, nvcv::FMT_U8);
cvcuda::CvtColor color;
color(stream, cropTensor, colorTensor, NVCV_COLOR_RGB2GRAY);

// create input tensor for TensorRT inference
nvcv::Tensor::Requirements reqsInputLayer = nvcv::Tensor::CalcRequirements(1, { stepModel.input.dims[2], stepModel.input.dims[1] }, nvcv::FMT_U8);
nvcv::TensorDataStridedCuda::Buffer bufInputLayer;
std::copy(reqsInputLayer.strides, reqsInputLayer.strides + NVCV_TENSOR_MAX_RANK, bufInputLayer.strides);
CHECK_CUDA_ERROR(cudaMalloc((void**)&bufInputLayer.basePtr, CalcTotalSizeBytes(nvcv::Requirements{reqsInputLayer.mem}.cudaMem())));
nvcv::TensorDataStridedCuda inputLayerTensorData(nvcv::TensorShape{reqsInputLayer.shape, reqsInputLayer.rank, reqsInputLayer.layout}, nvcv::DataType{reqsInputLayer.dtype}, bufInputLayer);
nvcv::Tensor inputLayerTensor = TensorWrapData(inputLayerTensorData);

// resize cropped frame to model input size
cvcuda::Resize resize;
resize(stream, colorTensor, inputLayerTensor, NVCV_INTERP_LINEAR);
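A quick sanity check at this point (a standalone host-side sketch, not part of my pipeline) is to compute the dense NHWC strides the model expects for its 1x70x140x1 input and compare them with whatever the wrapped tensor actually reports:

#include <cstdint>
#include <cstdio>
#include <cstddef>

// Dense (unpadded) NHWC strides in bytes for a 1x70x140x1 uint8 tensor.
// Expected: row stride 140, total size 9800; if the allocator reports a
// larger row stride, the rows are padded.
int main() {
    const std::size_t H = 70, W = 140, C = 1, elem = sizeof(std::uint8_t);
    const std::size_t strideC = elem;          // 1 byte per channel
    const std::size_t strideW = C * strideC;   // 1 byte per pixel
    const std::size_t strideH = W * strideW;   // 140 bytes per row
    const std::size_t strideN = H * strideH;   // 9800 bytes per image
    std::printf("row stride: %zu, total: %zu\n", strideH, strideN);
    return 0;
}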

Then I do brute-force post-processing like this:

for (int i = 0; i < outputWidth; i += alphabetLen)
{
    int maxIdx = 0;
    float max = 0;

    for (int j = 0; j < alphabetLen; j++)
    {
        if (x[i + j] > max)
        {
            maxIdx = j;
            max = x[i + j];
        }
    }

    chars += lpr_alphabet[maxIdx];
}
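(The same per-slot argmax can be written more compactly with std::max_element, which also avoids seeding max with 0; a sketch assuming x points at the raw float output and the other names are as above:)

#include <algorithm>
#include <string>

// Per-slot argmax over the flat model output: each slot spans alphabetLen
// consecutive scores, and the predicted character is the index of the largest.
std::string decode(const float* x, int outputWidth, int alphabetLen,
                   const std::string& lpr_alphabet) {
    std::string chars;
    for (int i = 0; i < outputWidth; i += alphabetLen) {
        const float* slot = x + i;
        chars += lpr_alphabet[std::max_element(slot, slot + alphabetLen) - slot];
    }
    return chars;
}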

However, what I get is completely different from the Python inference output, whether via fast-plate-ocr inference or manual inference with post-processing pretty much copied from your script.

Recognized plate is completely wrong.

Here's an example of an image saved from inputLayerTensor (the pre-processed image passed to fast-plate-ocr inference):

[image: gray]

Sample output for the image: T351CT15_. This is also hugely unstable, jumping from one prediction to another all the time.

My best guess is the strange model input dims: 1, 70, 140, 1, which implies interleaved data is expected since a channel dimension is present. The tensor I create has NHWC shape; however, the data is purely planar (FMT_U8 data type).
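(Side note: for a single-channel image, interleaved NHWC and planar addressing actually produce identical byte offsets, so the layout interpretation alone may not explain this. A standalone sketch of that equivalence, not code from my pipeline:)

#include <cassert>
#include <cstddef>

// For C == 1, the interleaved NHWC offset ((y * W + x) * C + c) equals the
// planar offset (c * H * W + y * W + x), so a dense grayscale buffer reads
// the same under either interpretation.
int main() {
    const std::size_t H = 70, W = 140;
    for (std::size_t y = 0; y < H; ++y)
        for (std::size_t x = 0; x < W; ++x)
            assert((y * W + x) * 1 + 0 == 0 * H * W + y * W + x);
    return 0;
}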

Would love to hear back if you have any ideas, thanks!

ankandrew commented 1 day ago

Hi @VitalyVaryvdin!

I'd try debugging and comparing with the Python process to see where the problem comes from. Is the final output consistent with the fast-plate-ocr Python output?

I'm not familiar with CV-CUDA, but when I run the following C++ code using onnxruntime, it works well:

#include <iostream>
#include <limits>
#include <string>
#include <vector>
#include <opencv2/opencv.hpp>
#include <onnxruntime/onnxruntime_cxx_api.h>

cv::Mat preprocess_image(const cv::Mat &input_image, int img_height, int img_width) {
    cv::Mat gray_image, resized_image, final_image;
    // convert to grayscale
    cv::cvtColor(input_image, gray_image, cv::COLOR_BGR2GRAY);
    // resize image
    cv::resize(gray_image, resized_image, cv::Size(img_width, img_height));
    // uint8 format
    resized_image.convertTo(final_image, CV_8U);
    // add batch dimension and channel dim
    final_image = final_image.reshape(1, {1, img_height, img_width, 1});
    return final_image;
}

// postprocess model output
std::string postprocess_output(const std::vector<float> &output, int max_plate_slots, const std::string &alphabet) {
    auto alphabet_len = alphabet.size();
    std::string plate;
    for (int i = 0; i < max_plate_slots; ++i) {
        float max_val = -std::numeric_limits<float>::infinity();
        int max_idx = 0;
        for (int j = 0; j < alphabet_len; ++j) {
            if (output[i * alphabet_len + j] > max_val) {
                max_val = output[i * alphabet_len + j];
                max_idx = j;
            }
        }
        plate += alphabet[max_idx];
    }
    return plate;
}

int main(int argc, char *argv[]) {

    const std::string model_path = "./assets/arg_cnn_ocr_synth.onnx";
    const std::string image_path = "./assets/test_plate_1.png";
    const std::string alphabet = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_";
    const int max_plate_slots = 7;
    const int img_height = 70;
    const int img_width = 140;

    // init ONNX Runtime
    Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "test");
    Ort::SessionOptions session_options;
    Ort::Session session(env, model_path.c_str(), session_options);

    // read and preprocess image
    cv::Mat input_image = cv::imread(image_path);
    if (input_image.empty()) {
        std::cerr << "Failed to read image: " << image_path << std::endl;
        return 1;
    }
    cv::Mat processed_image = preprocess_image(input_image, img_height, img_width);

    // create input tensor
    Ort::MemoryInfo memory_info = Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
    std::vector<int64_t> input_shape = {1, img_height, img_width, 1};
    std::vector<uint8_t> input_tensor_values(processed_image.begin<uint8_t>(), processed_image.end<uint8_t>());
    Ort::Value input_tensor = Ort::Value::CreateTensor<uint8_t>(memory_info, input_tensor_values.data(),
                                                                input_tensor_values.size(), input_shape.data(),
                                                                input_shape.size());

    // define input and output nodes
    const char *input_node_names[] = {"input"};
    const char *output_node_names[] = {"concatenate"};

    // run model
    auto output_tensors = session.Run(Ort::RunOptions{nullptr}, input_node_names, &input_tensor, 1, output_node_names,
                                      1);
    std::vector<float> output_tensor_values(output_tensors.front().GetTensorMutableData<float>(),
                                            output_tensors.front().GetTensorMutableData<float>() +
                                            max_plate_slots * alphabet.size());

    // postprocess output
    std::string plate = postprocess_output(output_tensor_values, max_plate_slots, alphabet);
    std::cout << "Recognized plate: " << plate << std::endl;

    return 0;
}
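(If you want an extra guard while porting, you can verify the output element count against max_plate_slots * alphabet.size() before decoding. A sketch that would slot in right after session.Run, using the standard Ort::Value shape API; the check itself isn't in the snippet above:)

// confirm the model output really holds max_plate_slots * alphabet_len scores
auto shape_info = output_tensors.front().GetTensorTypeAndShapeInfo();
if (shape_info.GetElementCount() != max_plate_slots * alphabet.size()) {
    std::cerr << "Unexpected output element count: " << shape_info.GetElementCount() << std::endl;
    return 1;
}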

[image: test_plate_1]

The plate AD799KB is predicted correctly, and it matches the Python output very closely.

VitalyVaryvdin commented 1 day ago

Can you please check what your input tensor memory size is in the code above?

Update:

I tried the code you shared; the input tensor memory size is 9800 in my case, as it should be (1x70x140x1). The code using CV-CUDA reports tensor stride0 (buffer size) as 11200 and stride1 (row stride) as 160, and 160x70 gives exactly 11200.

Reading the raw buffer as an L8 image:

- Row stride 140 gives a messed-up image
- Row stride 160 gives the license plate image

It seems my image is actually stored as 160x70 in memory, i.e. each 140-byte row is padded out to a 160-byte pitch, thus giving wrong results when the buffer is read as densely packed.
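(For reference, the generic fix for this situation — a producer with a 160-byte row pitch feeding a consumer that expects dense 140-byte rows — is a pitched-to-dense copy with cudaMemcpy2D. A sketch with hypothetical d_pitched/d_dense device pointers, not what I ended up doing below:)

// copy the 140 payload bytes of each 160-byte-pitched row into a densely
// packed buffer that matches the model's expected 70x140 layout
const size_t width_bytes = 140, height = 70, src_pitch = 160;
CHECK_CUDA_ERROR(cudaMemcpy2D(d_dense, width_bytes,   // dst + dense pitch
                              d_pitched, src_pitch,   // src + padded pitch
                              width_bytes, height,
                              cudaMemcpyDeviceToDevice));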

VitalyVaryvdin commented 1 day ago

I kinda solved the issue, though in a somewhat hacky way:

// manually lay out dense NHWC strides (no row padding)
nvcv::TensorDataStridedCuda::Buffer inBuf;
inBuf.strides[3] = sizeof(uint8_t);          // channel stride: 1 byte
inBuf.strides[2] = 1 * inBuf.strides[3];     // pixel stride: 1 channel
inBuf.strides[1] = 140 * inBuf.strides[2];   // row stride: 140 bytes, unpadded
inBuf.strides[0] = 70 * inBuf.strides[1];    // image stride: 70 rows = 9800 bytes
CHECK_CUDA_ERROR(cudaMallocAsync((void**)&inBuf.basePtr, 1 * inBuf.strides[0], stream));

nvcv::Tensor::Requirements inReqs = nvcv::Tensor::CalcRequirements(1, {140, 70}, nvcv::FMT_U8);
nvcv::TensorDataStridedCuda inData(nvcv::TensorShape{inReqs.shape, inReqs.rank, inReqs.layout}, nvcv::DataType{inReqs.dtype}, inBuf);
nvcv::Tensor resizeTensor = TensorWrapData(inData);

Allocating the tensor manually with explicit stride sizes did solve the issue, but I'm not sure why CV-CUDA allocated padded strides in the first place.

Thanks for the help, your ONNX snippet helped me a lot!

ankandrew commented 1 day ago

Glad you fixed it!