NVIDIA / TensorRT

NVIDIA® TensorRT™ is an SDK for high-performance deep learning inference on NVIDIA GPUs. This repository contains the open source components of TensorRT.
https://developer.nvidia.com/tensorrt
Apache License 2.0

Multi-threaded inference latency is much higher than single-model inference #3403

Closed Garfield2005 closed 7 months ago

Garfield2005 commented 8 months ago

I am testing multi-threaded inference with two different models, but the inference latency is much higher than when running a single model:

  1. One model is a segmentation model; the other is a detection model.
  2. Running only the seg model: fps=23, gpu_util=60% (CPU used for post-processing).
  3. Running only the det model: fps=33, gpu_util=95% (GPU used for post-processing).
  4. Running seg and det simultaneously: fps=16, gpu_util=75%.

Using nsys to analyze the inference time:

  1. Only seg: enqueue time 26.5 ms
  2. Only det: enqueue time 24.6 ms
  3. Seg and det simultaneously: seg enqueue time 50-70 ms; det enqueue time 24-39 ms

nsys files (just remove the .zip extension): debug_seg_det.profile.nsys-rep.zip debug_seg.profile.nsys-rep.zip debug_det.profile.nsys-rep.zip

My questions are as follows:

  1. GPU utilization is only 75%, so why does the enqueue time increase so much?
  2. In the following profile image:
     1) What is the difference between the two timelines (e.g. red box vs. blue box; yellow box vs. green box)?
     2) In the blue box, what is cudaMemcpyAsync doing, and why does it consume so much time?
     3) Why does the OS hold a pthread_rwlock_wrlock while TRT is doing inference? It takes exactly the same amount of time as the layer Conv_934+Relu_935.
     4) The time consumption of Conv_934+Relu_935 reaches 20 ms, which seems abnormal. Why?

nsys_seg_det

Environment

TensorRT Version: 8.2.4
NVIDIA GPU: 2080Ti
NVIDIA Driver Version: 460.56
CUDA Version: 11.6
CUDNN Version: 8.4
Operating System: Ubuntu 20.04
Python Version (if applicable): python3.8
Libtorch Version (if applicable): 1.13.1
Docker Container (if so, version): nvcr.io/nvidia/tensorrt:22.04-py3

My code:

#include <chrono>
#include <cstring>
#include <iostream>
#include <memory>
#include <string>
#include <thread>

#include <torch/torch.h>
#include <ATen/cuda/CUDAContext.h>  // at::cuda stream utilities (header may vary by libtorch version)
#include <c10/cuda/CUDAGuard.h>     // at::cuda::CUDAStreamGuard

#include "concurrent_blocking_queue.h"
// The vpf::VideoCapture wrapper and the TensorRT helpers (LoadTrtModel, InferWithTrt, ...)
// come from project headers that are omitted here.

struct VideoFrameInfo {
  using Ptr = std::shared_ptr<VideoFrameInfo>;
  using ConstPtr = std::shared_ptr<const VideoFrameInfo>;
  torch::Tensor image_tensor;
};

void test_pipline_infer(int argc, char** argv){
  std::string seg_model_file = "seg_cls51_20230815_h1024w1224_fp16_trt8.2.4_2080Ti.onnx.engine";
  std::string det_model_file = "det_cls11_20230306_h1248w1248_fp16_trt8.2.4_2080Ti.onnx.engine";
  std::string video_file = "/data/deeplearning/videos/track68.mp4";

  if (argc != 3){
    std::cout << "Usage: exe ${seg} ${det}" << std::endl;
    return;
  }
  bool seg = std::strcmp(argv[1], "1") == 0;
  bool det = std::strcmp(argv[2], "1") == 0;
  std::cout << "seg: " << seg << std::endl;
  std::cout << "det: " << det << std::endl;

  // queue
  ConcurrentBlockingQueue<VideoFrameInfo::Ptr> seg_input_queue(50);
  ConcurrentBlockingQueue<VideoFrameInfo::Ptr> det_input_queue(50);

  std::thread seg_thread([&seg_input_queue, &seg_model_file, &seg]() {
    if (!seg){
      return;
    }
    torch::NoGradGuard no_grad;
    at::cuda::CUDAStream myStream = at::cuda::getStreamFromPool();
    {
      at::cuda::CUDAStreamGuard guard(myStream);
      // load trt model, e.g: 
      // nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger_);
      // engine_ = runtime->deserializeCudaEngine(trt_model_stream, model_size, nullptr);
      // context_ = engine_->createExecutionContext();
      // ...
      LoadTrtModel(seg_model_file);
      std::cout << "seg init success" << std::endl;

      int count = 0;
      auto starttime = std::chrono::system_clock::now();
      while(true){
        VideoFrameInfo::Ptr video_frame_info_ptr = nullptr;
        seg_input_queue.pop(video_frame_info_ptr);
        if (video_frame_info_ptr == nullptr) {
          break;
        }
        // preprocessing(GPU libtorch): resize, norm
        PreProcessingWithLibtorch(video_frame_info_ptr);

        // 1. at::cuda::CUDAStream currentStream = at::cuda::getCurrentCUDAStream();
        // 2. context_->enqueueV2(buffers_, currentStream.stream(), nullptr);
        InferWithTrt(video_frame_info_ptr);

        // 1. to cv::Mat
        // 2. cv::findContours
        PostProcessingWithOpencv(video_frame_info_ptr);

        std::cout << "seg[" << count++ << "] queue: " << seg_input_queue.size() << std::endl;
      }
      std::cout << "seg finish" << std::endl;
      auto endtime = std::chrono::system_clock::now();
      double time_all = std::chrono::duration_cast<std::chrono::milliseconds>(endtime - starttime).count();
      std::cout <<  "SEG_FPS = " << count << "/" << time_all/1000.0 << "=" << count/time_all*1000 << std::endl;
    }
  });

  std::thread det_thread([&det_input_queue, &det_model_file, &det]() {
    if (!det){
      return;
    }
    torch::NoGradGuard no_grad;
    at::cuda::CUDAStream myStream = at::cuda::getStreamFromPool();
    {
      at::cuda::CUDAStreamGuard guard(myStream);
      // load trt model, e.g: 
      // nvinfer1::IRuntime* runtime = nvinfer1::createInferRuntime(gLogger_);
      // engine_ = runtime->deserializeCudaEngine(trt_model_stream, model_size, nullptr);
      // context_ = engine_->createExecutionContext();
      // ...
      LoadTrtModel(det_model_file);
      std::cout << "det init success" << std::endl;

      int count = 0;
      auto starttime = std::chrono::system_clock::now();
      while(true){
        VideoFrameInfo::Ptr video_frame_info_ptr = nullptr;
        det_input_queue.pop(video_frame_info_ptr);
        if (video_frame_info_ptr == nullptr) {
          break;
        }
        // preprocessing(GPU libtorch): resize, norm
        PreProcessingWithLibtorch(video_frame_info_ptr);

        // 1. at::cuda::CUDAStream currentStream = at::cuda::getCurrentCUDAStream();
        // 2. context_->enqueueV2(buffers_, currentStream.stream(), nullptr);
        InferWithTrt(video_frame_info_ptr);

        // 1. bbox filter with libtorch(GPU)
        // 2. nms(GPU)
        PostProcessingWithLibtorch(video_frame_info_ptr);

        std::cout << "det[" << count++ << "] queue: " << det_input_queue.size() << std::endl;
      }
      std::cout << "det finish" << std::endl;
      auto endtime = std::chrono::system_clock::now();
      double time_all = std::chrono::duration_cast<std::chrono::milliseconds>(endtime - starttime).count();
      std::cout <<  "DET_FPS = " << count << "/" << time_all/1000.0 << "=" << count/time_all*1000 << std::endl;
    }
  });

  int gpu_id = 0;
  int frame_index = 0;
  auto starttime = std::chrono::system_clock::now();
  torch::NoGradGuard no_grad;
  at::cuda::CUDAStream myStream = at::cuda::getStreamFromPool();
  {
    at::cuda::CUDAStreamGuard guard(myStream);
    // read video frame with vpf wrapper(Video Processing Framework)
    vpf::VideoCapture capture(video_file, gpu_id);
    if (!capture.isOpened()) {
      std::cout << "failed to open VideoCapture" << std::endl;
      capture.release();
      return;
    }

    while (true){
      torch::Tensor frame_tensor; // CHW/RGB
      try{
        if (!capture.read(frame_tensor)){
          std::cout << "finish to read frame tensor from video:" << video_file << std::endl;
          break;
        }
        frame_tensor = frame_tensor.unsqueeze(0).to(torch::kFloat32); /** CHW->NCHW */
      }catch(std::exception &e){
        std::cerr << e.what() << std::endl;
        std::cout << "decode failed, please checking the video file" << std::endl;
        break;
      }
      frame_index++;
      std::cout << "processing " << frame_index << std::endl;
      VideoFrameInfo::Ptr video_frame_info = std::make_shared<VideoFrameInfo>();
      video_frame_info->image_tensor = frame_tensor.clone();
      if (seg){
        seg_input_queue.push(video_frame_info);
      }
      if (det){
        det_input_queue.push(video_frame_info);
      }
    }
  }
  if (seg){
    seg_input_queue.push(nullptr);
  }
  if (det){
    det_input_queue.push(nullptr);
  }
  if (seg){
    seg_thread.join();
  }
  if (det){
    det_thread.join();
  }

  auto endtime = std::chrono::system_clock::now();
  double time_all = std::chrono::duration_cast<std::chrono::milliseconds>(endtime - starttime).count();
  std::cout <<  "FPS_all = " << frame_index << "/" << time_all/1000.0 << "=" << frame_index/time_all*1000 << std::endl;

}

int main(int argc, char** argv)
{
  test_pipline_infer(argc, argv);   

  return 0;
}

nsys commands:

# run seg and det simultaneously:
/usr/local/bin/nsys profile -o debug_seg_det_parallel.profile ./bin/test_video_seg_det_parallel 1 1
# only run seg:
/usr/local/bin/nsys profile -o debug_seg.profile ./bin/test_video_seg_det_parallel 1 0
# only run det:
/usr/local/bin/nsys profile -o debug_det.profile ./bin/test_video_seg_det_parallel 0 1
nvpohanh commented 8 months ago

Could you show the code for PreProcessingWithLibtorch and PostProcessingWithLibtorch? It looks like these two functions have some cudaMemcpyAsync() calls from (or to) pageable host memory, and those copies would be implicitly synchronous. To fix that, allocate those buffers using cudaHostAlloc() so that the host memory is pinned instead of pageable.
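A minimal sketch of that suggestion, using the CUDA runtime API directly (CopyOutputToHost, device_buf, bytes, and stream are hypothetical stand-ins, not names from the code in this issue):

#include <cuda_runtime.h>

// Hypothetical helper: copy a device buffer to pinned host memory asynchronously.
void CopyOutputToHost(const float* device_buf, size_t bytes, cudaStream_t stream) {
  float* host_buf = nullptr;
  // Pinned (page-locked) allocation: unlike pageable memory from malloc/new,
  // this lets cudaMemcpyAsync overlap with work on other streams instead of
  // triggering an implicit device-wide synchronization.
  cudaHostAlloc(reinterpret_cast<void**>(&host_buf), bytes, cudaHostAllocDefault);
  cudaMemcpyAsync(host_buf, device_buf, bytes, cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);  // wait only for this stream, not the whole device
  // ... post-process host_buf on the CPU ...
  cudaFreeHost(host_buf);
}

In practice the pinned buffer would be allocated once at startup and reused, since pinning and unpinning host memory on every frame is itself expensive.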

nvpohanh commented 8 months ago

What is the difference between the two timelines (e.g. red box vs. blue box; yellow box vs. green box)?

Please refer to the docs here: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#nvprof

Some are GPU activities and some are CPU activities.

nvpohanh commented 8 months ago

Why does the OS hold a pthread_rwlock_wrlock while TRT is doing inference? It takes exactly the same amount of time as the layer Conv_934+Relu_935.

It is because the other thread is calling cudaMemcpyAsync(), which triggers implicit synchronization due to pageable host memory.

Garfield2005 commented 8 months ago

Could you show the code for PreProcessingWithLibtorch and PostProcessingWithLibtorch? It looks like these two functions have some cudaMemcpyAsync() calls from (or to) pageable host memory, and those copies would be implicitly synchronous. To fix that, allocate those buffers using cudaHostAlloc() so that the host memory is pinned instead of pageable.

The det model processing code:


#include "torch/torch.h"

namespace Index = torch::indexing;
namespace F = torch::nn::functional;

void PreProcessingWithLibtorch(VideoFrameInfo::Ptr video_frame_info_ptr) {
  // Note: src_height_, width_, ratio_, padding_*_, PAD_VELUE_, and the img_tensor /
  // buffers_ used below are members of the detector class, whose definition is
  // omitted from this snippet.
  torch::Tensor input_tensor = video_frame_info_ptr->image_tensor;

  src_height_ = input_tensor.size(2);
  src_width_ = input_tensor.size(3);
  ratio_ = std::min(height_ / float(src_height_),  width_ / float(src_width_));

  // new shape
  int64_t new_height = src_height_ * ratio_;
  int64_t new_width = src_width_ * ratio_;
  padding_left_ = (width_ - new_width) / 2;
  padding_right_ = width_ - new_width - padding_left_;
  padding_top_ = (height_ - new_height) / 2;
  padding_bottom_ = height_ - new_height - padding_top_;

  // resize and pad
  img_tensor = F::interpolate(input_tensor, 
                                F::InterpolateFuncOptions() 
                                .size(std::vector<int64_t>({new_height, new_width})) 
                                .mode(torch::kBilinear) 
                                .align_corners(false));
  img_tensor = F::pad(img_tensor, 
                        F::PadFuncOptions({padding_left_, padding_right_, padding_top_, padding_bottom_}) 
                        .mode(torch::kConstant).value(PAD_VELUE_));
  // norm
  img_tensor = img_tensor / 255.0;
  img_tensor = img_tensor.contiguous();

  // buffers_ is trt input buffer
  // buffers_[input_indexs_[0]] = img_tensor.data_ptr<float>();

}

bool PostProcessingWithLibtorch(std::vector<torch::Tensor> &preds, std::vector<kd::dl::InstanceAsset::Ptr> &objects_info)
{
  bool include_mask = true;

  torch::Tensor pred_mask = preds[0]; // trt output: 1*32*312*312, 312=1248/4(stride)
  torch::Tensor featuremap = preds[1]; // trt output: 1*47*31941, 47=4(box)+11(classes)+32(mask)
  int mask_feat_dim = pred_mask.size(1);
  int num_classes = featuremap.size(1) - 4 - mask_feat_dim; 

  featuremap = featuremap.permute({0, 2, 1}); // 1*47*N -> 1*N*47
  auto featuremap_boxes = featuremap.slice(2, 0, 4); // 1*N*4(box)
  auto featuremap_scores = featuremap.slice(2, 4, 4 + num_classes); // 1*N*11(classes)
  std::tuple<torch::Tensor, torch::Tensor> pred = torch::max(featuremap_scores, 2);
  torch::Tensor scores = std::get<0>(pred); // 1*N
  torch::Tensor labeles = std::get<1>(pred); // 1*N

  auto scores_index = torch::where(scores > conf_thresh_);
  auto boxes_nms = featuremap_boxes.index_select(1, scores_index[1]); // 1*N'*4
  auto score_nms = scores.index_select(1, scores_index[1]); // 1*N'
  auto labels_nms = labeles.index_select(1, scores_index[1]); // 1*N
  if (boxes_nms.size(1) == 0){
    C10_LOG_EVERY_MS(INFO, 0) << "no bbox after thresh filter";
    return true;
  }
  // xywh->xyxy
  auto xmin = boxes_nms.index({"...", 0}) - boxes_nms.index({"...", 2}) / 2.0;
  auto ymin = boxes_nms.index({"...", 1}) - boxes_nms.index({"...", 3}) / 2.0;
  auto xmax = boxes_nms.index({"...", 0}) + boxes_nms.index({"...", 2}) / 2.0;
  auto ymax = boxes_nms.index({"...", 1}) + boxes_nms.index({"...", 3}) / 2.0;
  boxes_nms.index({"...", 0}) = xmin;
  boxes_nms.index({"...", 1}) = ymin;
  boxes_nms.index({"...", 2}) = xmax;
  boxes_nms.index({"...", 3}) = ymax;

  // nms
  at::Tensor nms_index = kd::vision::ops::nms_cuda(boxes_nms.squeeze(0), score_nms.squeeze(0), nms_thresh_, inner_ignore_);
  auto final_boxes = boxes_nms.index_select(1, nms_index); // 1*N_nms*4, xywh
  auto final_scores = score_nms.index_select(1, nms_index); // 1*N_nms
  auto final_labeles = labels_nms.index_select(1, nms_index); // 1*N_nms

  // get mask
  torch::Tensor mask_output; // same shape with trt input shape
  if (include_mask){
    int pred_mask_h = pred_mask.size(2);
    int pred_mask_w = pred_mask.size(3);
    auto featuremap_mask = featuremap.slice(2, 4 + num_classes, featuremap.size(2)); // 1*N*32(mask)
    auto mask_nms = featuremap_mask.index_select(1, scores_index[1]); // 1*N'
    auto final_mask = mask_nms.index_select(1, nms_index); // 1*N_nms*32
    mask_output = final_mask.matmul(pred_mask.reshape({pred_mask.size(0), -1, pred_mask_h*pred_mask_w})); // 1xN_nmsx32 * 1x32x(312x312) = 1xN_nmsx(312x312)
    mask_output = torch::sigmoid(mask_output); // 1xN_nmsx(312x312)
    mask_output = mask_output.reshape({mask_output.size(0), mask_output.size(1), pred_mask_h, pred_mask_w});
    mask_output = F::interpolate(mask_output, F::InterpolateFuncOptions()
                                            .size(std::vector<int64_t>({height_, width_}))
                                            .mode(torch::kBilinear)
                                            .align_corners(false)
                                            .recompute_scale_factor(false));
    mask_output = mask_output > 0.5; // 1*N_nms*h*w
  }

  for (int k = 0; k < final_boxes.size(1); ++k){
    // The construction of object_ptr was omitted in the original snippet;
    // it is assumed to be something like the following:
    kd::dl::InstanceAsset::Ptr object_ptr = std::make_shared<kd::dl::InstanceAsset>();
    int xmin = final_boxes[0][k][0].item<float>();
    int ymin = final_boxes[0][k][1].item<float>();
    int xmax = final_boxes[0][k][2].item<float>();
    int ymax = final_boxes[0][k][3].item<float>();
    object_ptr->label_id = final_labeles[0][k].item<long>();
    object_ptr->confidence = final_scores[0][k].item<float>();
    objects_info.push_back(object_ptr);

    // get mask
    if (include_mask){
      // crop bbox
      torch::Tensor box_mask = mask_output.index({0, long(k), Index::Slice(ymin, ymax + 1), Index::Slice(xmin, xmax + 1)});
      box_mask = box_mask.to(torch::kByte).contiguous().to(torch::kCPU);
      cv::Mat mask_mat(ymax - ymin + 1, xmax - xmin + 1, CV_8UC1, box_mask.data_ptr());
      // get mask with opencv
      // ...
    }

  }
  return true;
}
Garfield2005 commented 8 months ago

Why does the OS hold a pthread_rwlock_wrlock while TRT is doing inference? It takes exactly the same amount of time as the layer Conv_934+Relu_935.

It is because the other thread is calling cudaMemcpyAsync(), which triggers implicit synchronization due to pageable host memory.

@nvpohanh Do you mean the cudaMemcpyAsync in the red box (det thread) leads to the pthread_rwlock_wrlock lock in the seg thread?

image

nvpohanh commented 8 months ago

Do you mean the cudaMemcpyAsync in the red box (det thread) leads to the pthread_rwlock_wrlock lock in the seg thread?

Yes, the implicit synchronization caused by cudaMemcpyAsync() with pageable host memory will cause the kernel launch to be blocked.

nvpohanh commented 8 months ago

I think the issue is this line:

      torch::Tensor box_mask = mask_output.index({0, long(k), Index::Slice(ymin, ymax + 1), Index::Slice(xmin, xmax + 1)});
      box_mask = box_mask.to(torch::kByte).contiguous().to(torch::kCPU);

The .to(torch::kCPU) call is copying into pageable host memory. I would suggest that you first allocate pinned memory for box_mask and then copy into that pinned-memory CPU tensor instead. See: https://stackoverflow.com/questions/63324584/pinned-memory-in-libtorch
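A minimal sketch of that suggestion applied to the loop above, assuming mask_output and the crop indices are as in the posted snippet (the synchronization point is illustrative and depends on the rest of the pipeline):

      // Crop on the GPU as before.
      torch::Tensor box_mask = mask_output.index({0, long(k), Index::Slice(ymin, ymax + 1), Index::Slice(xmin, xmax + 1)});
      box_mask = box_mask.to(torch::kByte).contiguous();

      // Pinned (page-locked) CPU destination of the same shape, so the device-to-host
      // copy can be truly asynchronous instead of forcing an implicit device-wide sync.
      torch::Tensor box_mask_cpu = torch::empty(
          box_mask.sizes(),
          torch::TensorOptions().dtype(torch::kByte).device(torch::kCPU).pinned_memory(true));
      box_mask_cpu.copy_(box_mask, /*non_blocking=*/true);

      // Wait only on the current stream before the CPU touches the data.
      at::cuda::getCurrentCUDAStream().synchronize();
      cv::Mat mask_mat(ymax - ymin + 1, xmax - xmin + 1, CV_8UC1, box_mask_cpu.data_ptr());

Since pinning memory has its own cost, a single pinned buffer sized for the largest crop, allocated once and reused across boxes and frames, would avoid paying that cost per object.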

Garfield2005 commented 8 months ago

I think the issue is this line:

      torch::Tensor box_mask = mask_output.index({0, long(k), Index::Slice(ymin, ymax + 1), Index::Slice(xmin, xmax + 1)});
      box_mask = box_mask.to(torch::kByte).contiguous().to(torch::kCPU);

The .to(torch::kCPU) call is copying into pageable host memory. I would suggest that you first allocate pinned memory for box_mask and then copy into that pinned-memory CPU tensor instead. See: https://stackoverflow.com/questions/63324584/pinned-memory-in-libtorch

Thanks, I am testing the pinned memory. First, I removed all the post-processing (for both the seg and det models), and gpu_util is now 100%:

  1. Sometimes TRT does nothing (e.g. the red box). Is this because of the pthread_rwlock_wrlock?
  2. Why are there so many locks? Is it because different CUDA kernels are competing with each other?

image

nvpohanh commented 8 months ago

Sometimes TRT does nothing (e.g. the red box). Is this because of the pthread_rwlock_wrlock?

No, the red box you showed is because the GPU is busy with a kernel (trt_turing...) on another stream. Kernels can only run in parallel if one kernel does not occupy all the compute resources on the GPU. In this case, that kernel probably already occupies all the resources (100% utilization), so kernels from the other thread cannot run.

nvpohanh commented 8 months ago

Why are there so many locks? Is it because different CUDA kernels are competing with each other?

I am not 100% sure, but it seems that the CPU is waiting for the GPU to digest some work (i.e. finish some kernel executions) before launching new kernels. I think this is because you have hit the limit on the number of kernel launches waiting in the queue.

However, this is not an issue as long as GPU utilization is already 100%. It simply means that the CPU is issuing more work than the GPU can handle, so the CPU must wait a bit.

ttyio commented 7 months ago

Closing since no activity for more than 3 weeks, thanks all!