enazoe / yolo-tensorrt

TensorRT 8 support. Yolov5n, s, m, l, x. darknet -> tensorrt. Yolov4 and Yolov3 use raw darknet *.weights and *.cfg files. If the wrapper is useful to you, please star it.
MIT License

About yolov4 speed measurement #34

Closed: qinxianglinya closed this issue 4 years ago

qinxianglinya commented 4 years ago

Hi, I have successfully built yolov4 on Windows. Environment: Win10, TensorRT 6.0.1.5, CUDA 10.0, cuDNN 7.6.5, 1080Ti. I benchmarked a model I trained myself. The image size in the config file is 800x800x3, the TensorRT precision is FP16, and the batch size is 1. enqueue() + cudaMemcpyAsync() take 1 ms, but cudaStreamSynchronize() takes 29 ms. Is there any way to improve this? Many thanks.

enazoe commented 4 years ago

@qinxianglinya Could you describe the timing in more detail? cudaMemcpyAsync copies data to device memory asynchronously, so the call returns immediately without blocking. The cudaStreamSynchronize that follows is for synchronization: it waits for the queued work to finish. So the 29 ms you measured is essentially the actual execution time. If this is still unclear, look up what those two CUDA functions do.
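
(As a side note: a common way to time asynchronous GPU work without guessing is CUDA events recorded on the same stream. A minimal sketch, reusing the member names from the snippets in this thread and assuming NV_CUDA_CHECK is the repo's CUDA error-checking macro:)

    cudaEvent_t startEvt, stopEvt;
    NV_CUDA_CHECK(cudaEventCreate(&startEvt));
    NV_CUDA_CHECK(cudaEventCreate(&stopEvt));

    NV_CUDA_CHECK(cudaEventRecord(startEvt, m_CudaStream));  // mark begin on the stream
    m_Context->enqueue(batchSize, m_DeviceBuffers.data(), m_CudaStream, nullptr);
    NV_CUDA_CHECK(cudaEventRecord(stopEvt, m_CudaStream));   // mark end on the stream

    NV_CUDA_CHECK(cudaEventSynchronize(stopEvt));            // wait until the GPU reaches stopEvt
    float ms = 0.f;
    NV_CUDA_CHECK(cudaEventElapsedTime(&ms, startEvt, stopEvt));
    std::cout << "enqueue (GPU) time: " << ms << " ms" << std::endl;

    NV_CUDA_CHECK(cudaEventDestroy(startEvt));
    NV_CUDA_CHECK(cudaEventDestroy(stopEvt));

Unlike host-side clocks, the elapsed time between two events measures what actually ran on the GPU between them, regardless of when the host calls returned.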

qinxianglinya commented 4 years ago

Here is the timing code I use; I'm not sure whether the way I print the times is itself the problem.

    clock_t start = clock();
    m_Context->enqueue(batchSize, m_DeviceBuffers.data(), m_CudaStream, nullptr);
    clock_t end = clock();
    std::cout << "infer time:" << (float)(end - start) << "ms" << std::endl;
    std::cout << m_OutputTensors.size() << std::endl;

    clock_t start1 = clock();
    for (auto& tensor : m_OutputTensors)
    {
        NV_CUDA_CHECK(cudaMemcpyAsync(tensor.hostBuffer, m_DeviceBuffers.at(tensor.bindingIndex),
                                      batchSize * tensor.volume * sizeof(float),
                                      cudaMemcpyDeviceToHost, m_CudaStream));
    }
    clock_t end1 = clock();
    std::cout << "gpu to cpu:" << float(end1 - start1) << "ms" << std::endl;

    clock_t start2 = clock();
    cudaStreamSynchronize(m_CudaStream);
    clock_t end2 = clock();
    std::cout << "cudaStreamSynchronize time:" << float(end2 - start2) << "ms" << std::endl;
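
(The numbers above are misleading because enqueue and cudaMemcpyAsync only queue work on the stream; all of it completes inside the final cudaStreamSynchronize. To attribute time to each stage, the stream has to be synchronized before every timestamp. A minimal sketch of that idea, assuming <chrono> and <iostream> are included and reusing the same member names:)

    auto t0 = std::chrono::steady_clock::now();
    m_Context->enqueue(batchSize, m_DeviceBuffers.data(), m_CudaStream, nullptr);
    cudaStreamSynchronize(m_CudaStream);   // wait for inference to actually finish
    auto t1 = std::chrono::steady_clock::now();

    for (auto& tensor : m_OutputTensors)
    {
        NV_CUDA_CHECK(cudaMemcpyAsync(tensor.hostBuffer, m_DeviceBuffers.at(tensor.bindingIndex),
                                      batchSize * tensor.volume * sizeof(float),
                                      cudaMemcpyDeviceToHost, m_CudaStream));
    }
    cudaStreamSynchronize(m_CudaStream);   // wait for the copies to actually finish
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::cout << "infer time: " << ms(t1 - t0).count() << " ms" << std::endl;
    std::cout << "gpu to cpu: " << ms(t2 - t1).count() << " ms" << std::endl;

Note that the extra synchronization points serialize the pipeline, so this split is only for diagnosis; the end-to-end number from a single sync at the end is what matters in production.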

enazoe commented 4 years ago

uncomment line 536 and line 550

    Timer timer;
    assert(batchSize <= m_BatchSize && "Image batch size exceeds TRT engines batch size");
    // async host -> device copy of the input batch
    NV_CUDA_CHECK(cudaMemcpyAsync(m_DeviceBuffers.at(m_InputBindingIndex), input,
                                  batchSize * m_InputSize * sizeof(float), cudaMemcpyHostToDevice,
                                  m_CudaStream));
    // queue inference on the stream (returns immediately)
    m_Context->enqueue(batchSize, m_DeviceBuffers.data(), m_CudaStream, nullptr);
    // async device -> host copies of all output tensors
    for (auto& tensor : m_OutputTensors)
    {
        NV_CUDA_CHECK(cudaMemcpyAsync(tensor.hostBuffer, m_DeviceBuffers.at(tensor.bindingIndex),
                                      batchSize * tensor.volume * sizeof(float),
                                      cudaMemcpyDeviceToHost, m_CudaStream));
    }
    // block until all queued work has finished, then report total time
    cudaStreamSynchronize(m_CudaStream);
    timer.out("inference");
qinxianglinya commented 4 years ago

> uncomment line 536 and line 550


OK, thanks.

enazoe commented 4 years ago

@qinxianglinya If you find it useful, please give it a star.

qinxianglinya commented 4 years ago

> @qinxianglinya If you find it useful, please give it a star.

Sure thing.