Closed: LinuxCup closed this issue 1 year ago
Have you tried DSVT_TrtEngine?
By the way, you should skip the early stages to bypass the GPU warmup process.
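As a rough illustration of the warmup advice, here is a minimal Python sketch of a timing loop that discards the first few iterations before averaging. The name `average_latency_ms` and its parameters are hypothetical and not part of the DSVT code or TensorRT. For GPU work, the timed callable is assumed to block until the device has finished (for example by calling `torch.cuda.synchronize()` in PyTorch); otherwise only the launch time is measured.

```python
import time
from statistics import mean


def average_latency_ms(fn, warmup=10, iters=50, clock=time.perf_counter):
    """Average wall-clock latency of fn() in milliseconds.

    The first `warmup` calls are executed but not timed, so one-off costs
    (lazy initialisation, GPU clock ramp-up) do not skew the average.
    `fn` is assumed to return only after its work has completed.
    """
    for _ in range(warmup):
        fn()  # warmup iterations: run, but do not record

    samples = []
    for _ in range(iters):
        start = clock()
        fn()
        samples.append((clock() - start) * 1000.0)
    return mean(samples)


if __name__ == "__main__":
    # Stand-in CPU workload; replace with the real inference call.
    print(average_latency_ms(lambda: sum(range(10_000)), warmup=5, iters=20))
```

Using the same helper for both the PyTorch and the TensorRT path keeps the two averages comparable.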
I built the engine by parsing the ONNX file. Loading the DSVT_TrtEngine file instead gives the same result: the elapsed time is unchanged. The times mentioned above are averages with the warmup iterations removed. Have you ever run into this situation? In your experience, how much faster is the TensorRT version than the PyTorch version? Do you have any other suggestions for solving this problem? Model link: dsvt_blocks.zip
Thanks!
I use the python API of TensorRT, and I have noticed the FP16 version of TensorRT is nearly twice as fast compared to the PyTorch version.
The following is the bash script I use to generate the TensorRT engine file. Do you see any problems with it? The output log file is attached.
./trtexec --onnx=dsvt_blocks.onnx \
--saveEngine=dsvt_blocks.engine \
--memPoolSize=workspace:4096 --verbose --buildOnly --device=0 --fp16 \
--tacticSources=+CUDNN,+CUBLAS,-CUBLAS_LT,+EDGE_MASK_CONVOLUTIONS \
--minShapes=src:1000x128,set_voxel_inds_tensor_shift_0:2x50x36,set_voxel_inds_tensor_shift_1:2x50x36,set_voxel_masks_tensor_shift_0:2x50x36,set_voxel_masks_tensor_shift_1:2x50x36,pos_embed_tensor:4x2x1000x128 \
--optShapes=src:24629x128,set_voxel_inds_tensor_shift_0:2x1156x36,set_voxel_inds_tensor_shift_1:2x834x36,set_voxel_masks_tensor_shift_0:2x1156x36,set_voxel_masks_tensor_shift_1:2x834x36,pos_embed_tensor:4x2x24629x128 \
--maxShapes=src:100000x128,set_voxel_inds_tensor_shift_0:2x5000x36,set_voxel_inds_tensor_shift_1:2x3200x36,set_voxel_masks_tensor_shift_0:2x5000x36,set_voxel_masks_tensor_shift_1:2x3200x36,pos_embed_tensor:4x2x100000x128 &>result.txt
Hi, I found that the dynamic shapes used in your TRT command were inappropriate, which leads to the following results: Test on RTX 3090: PyTorch: 36.0 ms, TRT-fp16: 32.9 ms
After analyzing the distribution of these dynamic shapes in the Waymo validation dataset, I suggest using the following command for optimal results:
trtexec --onnx=./deploy_files/dsvt.onnx --saveEngine=./deploy_files/dsvt.engine \
--memPoolSize=workspace:4096 --verbose --buildOnly --device=1 --fp16 \
--tacticSources=+CUDNN,+CUBLAS,-CUBLAS_LT,+EDGE_MASK_CONVOLUTIONS \
--minShapes=src:3000x192,set_voxel_inds_tensor_shift_0:2x170x36,set_voxel_inds_tensor_shift_1:2x100x36,set_voxel_masks_tensor_shift_0:2x170x36,set_voxel_masks_tensor_shift_1:2x100x36,pos_embed_tensor:4x2x3000x192 \
--optShapes=src:20000x192,set_voxel_inds_tensor_shift_0:2x1000x36,set_voxel_inds_tensor_shift_1:2x700x36,set_voxel_masks_tensor_shift_0:2x1000x36,set_voxel_masks_tensor_shift_1:2x700x36,pos_embed_tensor:4x2x20000x192 \
--maxShapes=src:35000x192,set_voxel_inds_tensor_shift_0:2x1500x36,set_voxel_inds_tensor_shift_1:2x1200x36,set_voxel_masks_tensor_shift_0:2x1500x36,set_voxel_masks_tensor_shift_1:2x1200x36,pos_embed_tensor:4x2x35000x192 \
> debug.log 2>&1
With this command, you should obtain the following results: Test on RTX 3090: PyTorch: 36.0 ms, TRT-fp16: 13.8 ms
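For anyone repeating this on their own data, here is a minimal Python sketch of turning observed per-frame sizes into min/opt/max values and formatting them in the `name:AxBxC` style the shape flags take. The helper names, the min/median/max rule, and the sample sizes are all assumptions for illustration; the thread does not say how the values in the command above were chosen.

```python
from statistics import median


def profile_bounds(observed):
    """Return (min, opt, max) for one dynamic dimension.

    Assumption: opt is taken as the median of the observed sizes.
    """
    values = sorted(observed)
    return values[0], int(median(values)), values[-1]


def shape_arg(shapes):
    """Format {input_name: dims} as 'name:AxB,name:CxD'."""
    return ",".join(
        f"{name}:{'x'.join(str(d) for d in dims)}" for name, dims in shapes.items()
    )


if __name__ == "__main__":
    # Made-up per-frame voxel counts; replace with counts from your dataset.
    observed_voxels = [3000, 12000, 20000, 25000, 35000]
    lo, opt, hi = profile_bounds(observed_voxels)
    for flag, n in (("--minShapes", lo), ("--optShapes", opt), ("--maxShapes", hi)):
        # Feature width 192 is copied from the command above.
        print(f"{flag}=" + shape_arg({"src": (n, 192), "pos_embed_tensor": (4, 2, n, 192)}))
```

The same pattern would apply to the `set_voxel_inds_tensor_shift_*` and `set_voxel_masks_tensor_shift_*` inputs, each with its own observed distribution.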
This issue seems to have been solved and will be closed.
I think this is due to the hardware. On my NVIDIA P4000, I implemented DSVT using the TensorRT network definition C++ API and plugins, and similarly saw no acceleration. You could try other hardware, such as a 2080 Ti.
Is the new trtexec command also slow on the P4000?
I have only tested it on the RTX 3090. It may depend on the device; the P4000 has fewer GPU cores.
This may be down to FP16. The P4000 has much lower FP16 compute throughput and fewer GPU cores. Please refer here. The RTX 3090 is 500x faster than the P4000 in FP16 computation, and our TensorRT deployment mainly focuses on FP16.
Thank you all. There is a huge difference in hardware performance between the P4000 and the 3090 Ti. This issue will be closed.
Nice!
I attempted to deploy the DSVT model to TensorRT following your deployment code. Following the official TensorRT example code, I used dynamic shapes for the dsvt_block model input. Model inference takes about 260 ms, whereas the PyTorch version takes less time, about 140 ms. Why does the TensorRT C++ code take longer?
Environment
TensorRT Version: 8.5.1.7
CUDA Version: 11.8
CUDNN Version: 8.6
Hardware GPU: P4000 (the rest is the same as the public)
inference code
According to the results, the average time cost of each stage is as follows:
t1-t0: 0.00860953
t2-t1: 0.0124242
t3-t2: 4.72069e-05
t4-t3: 8.10623e-06
t5-t4: 0.260188
t6-t5: 0.00110817
Does the C++ code take more time? Are there mistakes in the inference code?
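To see where the time goes, the stage timings listed above can be turned into shares of the total. This is a small Python sketch using the numbers copied from the post. The values are assumed to be in seconds (t5-t4 would then match the reported 260 ms or so), and the helper name `stage_shares` is hypothetical.

```python
# Stage timings copied from the post; assumed to be in seconds.
stage_seconds = {
    "t1-t0": 0.00860953,
    "t2-t1": 0.0124242,
    "t3-t2": 4.72069e-05,
    "t4-t3": 8.10623e-06,
    "t5-t4": 0.260188,
    "t6-t5": 0.00110817,
}


def stage_shares(timings):
    """Return each stage's fraction of the summed time."""
    total = sum(timings.values())
    return {name: value / total for name, value in timings.items()}


shares = stage_shares(stage_seconds)
slowest = max(shares, key=shares.get)
print(f"slowest stage: {slowest} ({shares[slowest]:.1%} of total)")
```

The t5-t4 stage accounts for roughly 92% of the summed time, so that single step is the one worth investigating first.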