read access violation exception

JasonRuan5 commented 1 year ago

Thanks for this great repository. We trained Yolov7 on our custom dataset and converted the model to onnx/trt with the following commands:

python export.py --weights path to trained.pt --dynamic --grid

trtexec.exe --onnx=name.onnx --saveEngine=name.trt --minShapes=images:1x3x640x640 --optShapes=images:4x3x640x640 --maxShapes=images:8x3x640x640 --buildOnly --fp16

We are using TensorRT-8.4.2.4 and cu113. It runs well most of the time, but sometimes num_boxes = -2147483648 and we get a read access violation at int keep_flag = ptr[6] in the postprocess method, as shown below. `
int num_boxes = std::min((int)(m_output_objects_host + bi * (m_param.topK * m_output_objects_width + 1))[0], m_param.topK);

    for (size_t i = 0; i < num_boxes; i++)
    {
        float* ptr = m_output_objects_host + bi * (m_param.topK * m_output_objects_width + 1) + m_output_objects_width * i + 1;
        int keep_flag = ptr[6];
        if (keep_flag)
        {
            float x_lt = m_dst2src.v0 * ptr[0] + m_dst2src.v1 * ptr[1] + m_dst2src.v2; // left & top
            float y_lt = m_dst2src.v3 * ptr[0] + m_dst2src.v4 * ptr[1] + m_dst2src.v5;
            float x_rb = m_dst2src.v0 * ptr[2] + m_dst2src.v1 * ptr[3] + m_dst2src.v2; // right & bottom
            float y_rb = m_dst2src.v3 * ptr[2] + m_dst2src.v4 * ptr[3] + m_dst2src.v5;
            m_objectss[bi].emplace_back(x_lt, y_lt, x_rb, y_rb, ptr[4], (int)ptr[5]); //
        }
    }

` What could cause this exception?

FeiYull commented 1 year ago

@JasonRuan5 As shown in the figure, the number of detected objects is derived as follows: num_boxes <- m_output_objects_host <- m_output_objects_device.

Steps:

1. Debug the variable m_output_objects_device.
2. Set 1 on line 374 and put a breakpoint on line 386.
3. Use the debugger to verify whether the cv::Mat prediction variable is empty.
4. If it is empty, continue checking the preceding code in the same way.
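
Independently of the steps above, the reported value itself is a hint: -2147483648 is INT_MIN, which is what a float-to-int cast typically produces on x86 when the float is NaN or out of range (formally this is undefined behaviour in C++). Because the loop counter in the quoted loop is a size_t, comparing it with a negative num_boxes converts the count to a huge unsigned value, so the loop walks past the end of the buffer. Below is a minimal sketch of a guard, assuming the count is stored as a float at the start of each batch slot as in the quoted code; safe_box_count is a hypothetical helper, not part of TensorRT-Alpha.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical helper (not part of TensorRT-Alpha): converts the raw float
// count read from the host buffer into a loop bound that is safe to use.
inline std::size_t safe_box_count(float raw_count, int top_k)
{
    // NaN, infinities and non-positive values mean the slot holds no valid count.
    if (!std::isfinite(raw_count) || raw_count <= 0.0f || top_k <= 0)
    {
        return 0;
    }
    // Clamp while still in floating point, so the cast below cannot overflow.
    const float clamped = std::min(raw_count, static_cast<float>(top_k));
    return static_cast<std::size_t>(clamped);
}
```

In the quoted loop, the bound would then be safe_box_count(slot[0], m_param.topK), where slot points to the start of batch bi. This only prevents the out-of-bounds read; finding out why the buffer holds garbage still needs the debugging steps above.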


JasonRuan5 commented 1 year ago

Thanks so much for such a rapid response and clear instructions. We tried to reproduce the problem by running several hundred big images (70000x70000) in batch. Unfortunately, we have not been able to reproduce it yet. However, we noticed that memory usage in the resource manager keeps rising very slowly. The deleaker tool also reported some memory leaks, mostly from line 52 of xmemory0, and we are not sure whether they are real. How can we make sure there are no memory leaks?

We need to switch between the Yolov7 model and a Yolov4 model, which runs in darknet, during analysis, so we load and unload the Yolov7 model for each image. When unloading, we call _yolo7->reset() before delete _yolo7. We will try loading both models without unloading them to see whether this fixes the problem. Are we doing this correctly? Any advice you have will be greatly appreciated.

JasonRuan5 commented 1 year ago

FeiYull, thanks again for your help! We have run the batch several times and seem to be able to reproduce the exception. If we load and unload yolov7 for each image, the exception occurs after about 200 images. However, if we load yolov7 only once, we can run over 500 images without any exception. We plan to run over 1000 images and expect that to be fine, too. We also found that it is OK to keep both the yolov4 and yolov7 models loaded, so we are OK for now! We are just not sure why the exception occurs when we load and unload repeatedly.
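
For reference, the load-once arrangement described above can be sketched as follows. This is only an illustration under assumptions: Detector and DetectorPool are hypothetical stand-ins, not the TensorRT-Alpha or darknet API, and the sketch does not explain why repeated loading and unloading fails.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a model wrapper; a real one would own the engine
// and its device buffers and release them in the destructor.
class Detector
{
public:
    explicit Detector(std::string model_path) : m_model_path(std::move(model_path)) { ++live(); }
    ~Detector() { --live(); }
    Detector(const Detector&) = delete;
    Detector& operator=(const Detector&) = delete;

    const std::string& model_path() const { return m_model_path; }
    static int live_count() { return live(); }

private:
    // Counts live instances so leaks or repeated loads are easy to spot.
    static int& live() { static int n = 0; return n; }
    std::string m_model_path;
};

// Loads both models once and hands out references for every image, so
// nothing is created or destroyed inside the per-image loop.
class DetectorPool
{
public:
    DetectorPool(const std::string& yolov7_path, const std::string& yolov4_path)
        : m_yolov7(std::make_unique<Detector>(yolov7_path)),
          m_yolov4(std::make_unique<Detector>(yolov4_path)) {}

    Detector& yolov7() { return *m_yolov7; }
    Detector& yolov4() { return *m_yolov4; }

private:
    std::unique_ptr<Detector> m_yolov7;
    std::unique_ptr<Detector> m_yolov4;
};
```

With this layout, std::unique_ptr releases each model exactly once when the pool goes out of scope, so no manual reset() plus delete sequence is needed.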

FeiYull / TensorRT-Alpha

read access violation exception #23