Could you show the code for PreProcessingWithLibtorch and PostProcessingWithLibtorch? It looks like these two functions have some cudaMemcpyAsync() calls from (or to) pageable host memory, and those would be implicitly synchronous. To fix that, allocate those buffers using cudaHostAlloc() so that the host memory is pinned instead of pageable.
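For reference, a minimal sketch of the suggested change (the buffer size, names, and single stream here are illustrative, not taken from this issue):

#include <cuda_runtime.h>
#include <cstring>

int main() {
    constexpr size_t kNumBytes = 1 << 20;  // hypothetical buffer size

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float* device_buf = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&device_buf), kNumBytes);

    // Pinned (page-locked) host buffer: cudaMemcpyAsync() from/to this
    // buffer can overlap with kernels on other streams. With a plain
    // malloc()'d (pageable) buffer, the same call would be implicitly
    // synchronous and would stall work on other streams.
    float* host_buf = nullptr;
    cudaHostAlloc(reinterpret_cast<void**>(&host_buf), kNumBytes,
                  cudaHostAllocDefault);
    std::memset(host_buf, 0, kNumBytes);

    cudaMemcpyAsync(device_buf, host_buf, kNumBytes,
                    cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaFreeHost(host_buf);
    cudaFree(device_buf);
    cudaStreamDestroy(stream);
    return 0;
}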
What's the difference between the two timelines (e.g. red box vs. blue box; yellow box vs. green box)?
Please refer to the docs here: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#nvprof
Some are GPU activities and some are CPU activities.
Why does the OS hold a pthread_rwlock_wrlock lock while TRT is doing inference? It lasts exactly as long as the layer Conv_934+Relu_935.
It is because the other thread is calling cudaMemcpyAsync(), which triggers an implicit synchronization due to pageable host memory.
The Det process code:

#include "torch/torch.h"

namespace Index = torch::indexing;
namespace F = torch::nn::functional;

void PreProcessingWithLibtorch(VideoFrameInfo::Ptr video_frame_info_ptr) {
    torch::Tensor input_tensor = video_frame_info_ptr->image_tensor;  // NCHW
    src_height_ = input_tensor.size(2);
    src_width_ = input_tensor.size(3);
    ratio_ = std::min(height_ / float(src_height_), width_ / float(src_width_));
    // new shape
    int64_t new_height = src_height_ * ratio_;
    int64_t new_width = src_width_ * ratio_;
    padding_left_ = (width_ - new_width) / 2;
    padding_right_ = width_ - new_width - padding_left_;
    padding_top_ = (height_ - new_height) / 2;
    padding_bottom_ = height_ - new_height - padding_top_;
    // resize and pad
    torch::Tensor img_tensor = F::interpolate(
        input_tensor,
        F::InterpolateFuncOptions()
            .size(std::vector<int64_t>({new_height, new_width}))
            .mode(torch::kBilinear)
            .align_corners(false));
    img_tensor = F::pad(
        img_tensor,
        F::PadFuncOptions({padding_left_, padding_right_, padding_top_, padding_bottom_})
            .mode(torch::kConstant)
            .value(PAD_VELUE_));
    // norm
    img_tensor = img_tensor / 255.0;
    img_tensor = img_tensor.contiguous();
    // buffers_ is the TRT input buffer
    // buffers_[input_indexs_[0]] = img_tensor.data_ptr<float>();
}
bool PostProcessingWithLibtorch(std::vector<torch::Tensor> &preds,
                                std::vector<kd::dl::InstanceAsset::Ptr> &objects_info)
{
    bool include_mask = true;
    torch::Tensor pred_mask = preds[0];   // trt output: 1*32*312*312, 312 = 1248/4 (stride)
    torch::Tensor featuremap = preds[1];  // trt output: 1*47*31941, 47 = 4 (box) + 11 (classes) + 32 (mask)
    int mask_feat_dim = pred_mask.size(1);
    int num_classes = featuremap.size(1) - 4 - mask_feat_dim;
    featuremap = featuremap.permute({0, 2, 1});                        // 1*47*N -> 1*N*47
    auto featuremap_boxes = featuremap.slice(2, 0, 4);                 // 1*N*4 (box)
    auto featuremap_scores = featuremap.slice(2, 4, 4 + num_classes);  // 1*N*11 (classes)
    std::tuple<torch::Tensor, torch::Tensor> pred = torch::max(featuremap_scores, 2);
    torch::Tensor scores = std::get<0>(pred);  // 1*N
    torch::Tensor labels = std::get<1>(pred);  // 1*N
    auto scores_index = torch::where(scores > conf_thresh_);
    auto boxes_nms = featuremap_boxes.index_select(1, scores_index[1]);  // 1*N'*4
    auto score_nms = scores.index_select(1, scores_index[1]);            // 1*N'
    auto labels_nms = labels.index_select(1, scores_index[1]);           // 1*N'
    if (boxes_nms.size(1) == 0) {
        C10_LOG_EVERY_MS(INFO, 0) << "no bbox after thresh filter";
        return true;
    }
    // xywh -> xyxy
    auto xmin = boxes_nms.index({"...", 0}) - boxes_nms.index({"...", 2}) / 2.0;
    auto ymin = boxes_nms.index({"...", 1}) - boxes_nms.index({"...", 3}) / 2.0;
    auto xmax = boxes_nms.index({"...", 0}) + boxes_nms.index({"...", 2}) / 2.0;
    auto ymax = boxes_nms.index({"...", 1}) + boxes_nms.index({"...", 3}) / 2.0;
    boxes_nms.index({"...", 0}) = xmin;
    boxes_nms.index({"...", 1}) = ymin;
    boxes_nms.index({"...", 2}) = xmax;
    boxes_nms.index({"...", 3}) = ymax;
    // nms
    at::Tensor nms_index = kd::vision::ops::nms_cuda(boxes_nms.squeeze(0), score_nms.squeeze(0), nms_thresh_, inner_ignore_);
    auto final_boxes = boxes_nms.index_select(1, nms_index);   // 1*N_nms*4, xyxy
    auto final_scores = score_nms.index_select(1, nms_index);  // 1*N_nms
    auto final_labels = labels_nms.index_select(1, nms_index); // 1*N_nms
    // get mask
    torch::Tensor mask_output;  // same spatial shape as the trt input
    if (include_mask) {
        int pred_mask_h = pred_mask.size(2);
        int pred_mask_w = pred_mask.size(3);
        auto featuremap_mask = featuremap.slice(2, 4 + num_classes, featuremap.size(2));  // 1*N*32 (mask)
        auto mask_nms = featuremap_mask.index_select(1, scores_index[1]);  // 1*N'*32
        auto final_mask = mask_nms.index_select(1, nms_index);             // 1*N_nms*32
        mask_output = final_mask.matmul(pred_mask.reshape({pred_mask.size(0), -1, pred_mask_h * pred_mask_w}));  // 1xN_nmsx32 * 1x32x(312x312) = 1xN_nmsx(312x312)
        mask_output = torch::sigmoid(mask_output);  // 1xN_nmsx(312x312)
        mask_output = mask_output.reshape({mask_output.size(0), mask_output.size(1), pred_mask_h, pred_mask_w});
        mask_output = F::interpolate(mask_output,
                                     F::InterpolateFuncOptions()
                                         .size(std::vector<int64_t>({height_, width_}))
                                         .mode(torch::kBilinear)
                                         .align_corners(false)
                                         .recompute_scale_factor(false));
        mask_output = mask_output > 0.5;  // 1*N_nms*h*w
    }
    for (int k = 0; k < final_boxes.size(1); ++k) {
        int xmin = final_boxes[0][k][0].item<float>();
        int ymin = final_boxes[0][k][1].item<float>();
        int xmax = final_boxes[0][k][2].item<float>();
        int ymax = final_boxes[0][k][3].item<float>();
        // NOTE: object_ptr was not constructed in the original snippet;
        // a default construction is assumed here.
        auto object_ptr = std::make_shared<kd::dl::InstanceAsset>();
        object_ptr->label_id = final_labels[0][k].item<long>();
        object_ptr->confidence = final_scores[0][k].item<float>();
        objects_info.push_back(object_ptr);
        // get mask
        if (include_mask) {
            // crop bbox
            torch::Tensor box_mask = mask_output.index({0, long(k), Index::Slice(ymin, ymax + 1), Index::Slice(xmin, xmax + 1)});
            box_mask = box_mask.to(torch::kByte).contiguous().to(torch::kCPU);
            cv::Mat mask_mat(ymax - ymin + 1, xmax - xmin + 1, CV_8UC1, box_mask.data_ptr());
            // get mask with opencv
            // ...
        }
    }
    return true;
}
@nvpohanh Do you mean the cudaMemcpyAsync in the red box (det thread) leads to the pthread_rwlock_wrlock lock (seg thread)?
Yes, the implicit synchronization caused by cudaMemcpyAsync() with pageable host memory will cause the kernel launch to be blocked.
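To make the failure mode concrete, here is a minimal two-thread sketch (the kernel and buffer names are invented for illustration): a worker thread launches kernels in a loop while the main thread issues cudaMemcpyAsync() from a pageable buffer; the pageable copies serialize the device and stall the worker's launches, and swapping malloc()/free() for cudaHostAlloc()/cudaFreeHost() removes the stall.

#include <cuda_runtime.h>
#include <cstdlib>
#include <cstring>
#include <thread>

__global__ void busy_kernel(float* p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = p[i] * 1.0001f + 1.0f;
}

int main() {
    constexpr int kN = 1 << 20;
    constexpr size_t kBytes = kN * sizeof(float);

    float *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&d_a), kBytes);
    cudaMalloc(reinterpret_cast<void**>(&d_b), kBytes);

    // Worker thread: inference-like kernel launches on its own stream.
    std::thread worker([&] {
        cudaStream_t s1;
        cudaStreamCreate(&s1);
        for (int i = 0; i < 1000; ++i)
            busy_kernel<<<(kN + 255) / 256, 256, 0, s1>>>(d_a, kN);
        cudaStreamSynchronize(s1);
        cudaStreamDestroy(s1);
    });

    // Main thread: async copies from a PAGEABLE buffer. Each copy is
    // implicitly synchronous and blocks kernel launches in the worker.
    // Replacing malloc/free with cudaHostAlloc/cudaFreeHost fixes this.
    float* h_src = static_cast<float*>(std::malloc(kBytes));
    std::memset(h_src, 0, kBytes);
    cudaStream_t s2;
    cudaStreamCreate(&s2);
    for (int i = 0; i < 100; ++i)
        cudaMemcpyAsync(d_b, h_src, kBytes, cudaMemcpyHostToDevice, s2);
    cudaStreamSynchronize(s2);

    worker.join();
    cudaStreamDestroy(s2);
    std::free(h_src);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}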
I think the issue is these lines:

torch::Tensor box_mask = mask_output.index({0, long(k), Index::Slice(ymin, ymax + 1), Index::Slice(xmin, xmax + 1)});
box_mask = box_mask.to(torch::kByte).contiguous().to(torch::kCPU);

The .to(torch::kCPU) is copying to pageable host memory. I would suggest that you first allocate pinned memory for box_mask and then copy into that pinned-memory CPU tensor instead. See: https://stackoverflow.com/questions/63324584/pinned-memory-in-libtorch
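A sketch of that change, reusing the tensor names from the snippet above (the helper function and the pin-and-copy pattern are illustrative, not the exact fix from this thread):

#include "torch/torch.h"

// Hypothetical helper: crop one instance mask on the GPU and copy it into
// a pinned (page-locked) CPU tensor, so the device-to-host copy does not
// trigger the implicit synchronization that pageable memory would.
torch::Tensor crop_to_pinned_cpu(const torch::Tensor& mask_output,
                                 long k, int ymin, int ymax,
                                 int xmin, int xmax) {
    namespace Index = torch::indexing;
    torch::Tensor box_mask = mask_output.index(
        {0, k, Index::Slice(ymin, ymax + 1), Index::Slice(xmin, xmax + 1)});
    box_mask = box_mask.to(torch::kByte).contiguous();  // still on the GPU

    auto pinned_opts = torch::TensorOptions()
                           .dtype(torch::kByte)
                           .device(torch::kCPU)
                           .pinned_memory(true);
    torch::Tensor box_mask_cpu = torch::empty(box_mask.sizes(), pinned_opts);
    box_mask_cpu.copy_(box_mask, /*non_blocking=*/true);
    return box_mask_cpu;
}

Because the copy is non-blocking, the current CUDA stream must be synchronized (e.g. torch::cuda::synchronize()) before the returned tensor is read on the CPU, for example before wrapping its data pointer in a cv::Mat.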
Thanks, I am testing the pinned memory. First, I removed all of the post-processing (seg and det models), and GPU utilization is now 100%:
Sometimes TRT does nothing (e.g. the red box). Is that because of the pthread_rwlock_wrlock?
No, the red box you showed is because the GPU is busy with a kernel (trt_turing...) on another stream. Kernels can only run in parallel if a kernel does not occupy all the computation resources on the GPU. In this case, that kernel probably already occupies all the resources (100% utilization), so kernels from the other thread cannot run.
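For reference, this mechanism can be reproduced with a minimal sketch (kernel and sizes invented for illustration): two kernels launched on different streams overlap only if the first leaves SMs idle, and a grid large enough to fill the GPU serializes them in practice.

#include <cuda_runtime.h>

__global__ void spin_kernel(float* p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        for (int j = 0; j < 10000; ++j) p[i] += 1.0f;  // keep SMs busy
}

int main() {
    constexpr int kN = 1 << 22;  // large grid: enough blocks to fill the GPU
    float *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&d_a), kN * sizeof(float));
    cudaMalloc(reinterpret_cast<void**>(&d_b), kN * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Both launches are asynchronous and on separate streams, but in an
    // nsys timeline the second kernel only starts once the first frees up
    // SM resources: separate streams permit overlap, they do not
    // guarantee it.
    spin_kernel<<<(kN + 255) / 256, 256, 0, s1>>>(d_a, kN);
    spin_kernel<<<(kN + 255) / 256, 256, 0, s2>>>(d_b, kN);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}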
Why are there so many locks? Is it because different CUDA kernels are competing with each other?
I am not 100% sure, but it seems that the CPU is waiting for the GPU to digest the queued work (i.e. finish some kernel executions) before launching new kernels. I think this is because you have hit the limit on the number of kernels that can wait in the launch queue.
However, this is not an issue as long as GPU utilization is already 100%. It simply means the CPU is producing work faster than the GPU can consume it, so the CPU must wait a bit.
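If the launch-queue back-pressure ever did become a real problem, one common mitigation (a sketch of a general CUDA technique, not something proposed in this thread) is to pace the producer with CUDA events so the CPU never runs more than a bounded number of launches ahead of the GPU:

#include <cuda_runtime.h>

__global__ void work_kernel(float* p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

int main() {
    constexpr int kN = 1 << 20, kDepth = 4, kIters = 1000;
    float* d = nullptr;
    cudaMalloc(reinterpret_cast<void**>(&d), kN * sizeof(float));
    cudaStream_t s;
    cudaStreamCreate(&s);

    cudaEvent_t marks[kDepth];
    for (auto& e : marks) cudaEventCreateWithFlags(&e, cudaEventDisableTiming);

    for (int i = 0; i < kIters; ++i) {
        // Wait for the launch issued kDepth iterations ago, so at most
        // kDepth kernels are ever queued ahead of the GPU.
        if (i >= kDepth) cudaEventSynchronize(marks[i % kDepth]);
        work_kernel<<<(kN + 255) / 256, 256, 0, s>>>(d, kN);
        cudaEventRecord(marks[i % kDepth], s);
    }
    cudaStreamSynchronize(s);

    for (auto& e : marks) cudaEventDestroy(e);
    cudaStreamDestroy(s);
    cudaFree(d);
    return 0;
}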
Closing since no activity for more than 3 weeks, thanks all!
I am testing multi-threaded inference with two different models, but the inference latency is much higher than with a single model.
I used nsys to analyze the inference time:
My questions are as follows:
Environment
TensorRT Version: 8.2.4
NVIDIA GPU: 2080Ti
NVIDIA Driver Version: 460.56
CUDA Version: 11.6
CUDNN Version: 8.4
Operating System: ubuntu20.04
Python Version (if applicable): python3.8
Libtorch Version (if applicable): 1.13.1
Docker Container (if so, version): nvcr.io/nvidia/tensorrt:22.04-py3
My code:
nsys: