gdlg / pytorch_nms

CUDA implementation of NMS for PyTorch

how to compile it with cuda c++ #3

Closed CasonTsai closed 5 years ago

CasonTsai commented 5 years ago

Hello, I ran into a problem when compiling this project with nvcc. I used

nvcc -o cudatest nms_kernel.cu -L -lcudart -I D:\Code-software\NNF\libtorch\libtorch\include -I D:\Code-software\NNF\libtorch\libtorch\include\torch\csrc\api\include

because I want to use it from C++, but got this error:

member "torch::jit::ArgumentSpecCreator::DEPTH_LIMIT" may not be initialized

What did I do wrong? Can I use this project with libtorch?

gdlg commented 5 years ago

Would you be able to provide the exact log produced by nvcc? What version of PyTorch and CUDA are you using?

CasonTsai commented 5 years ago

@gdlg I'm sorry, I forgot to include the necessary information.

PyTorch version: 1.2.0
CUDA (nvcc): 10.0
libtorch version: 1.2.0
System: Windows 10

Error:
D:/Code-software/NNF/libtorch/libtorch/include\torch/csrc/jit/argument_spec.h(181): error: member "torch::jit::ArgumentSpecCreator::DEPTH_LIMIT" may not be initialized
1 error detected in the compilation of "C:/Users/Cason/AppData/Local/Temp/tmpxft_00001b28_00000000-10_nms_kernel.cpp1.ii"

gdlg commented 5 years ago

It seems to be an error related to PyTorch compilation on Windows rather than to this specific project. I don't have a Windows machine to reproduce this bug, but please try the workaround mentioned in pytorch/extension-cpp#37.

CasonTsai commented 5 years ago

@gdlg Thanks. Yes, it may be an error related to PyTorch compilation, so I separated the libtorch code from the CUDA code, and it compiled successfully!

When I applied it to SSD (Single Shot MultiBox Detector), I found that nms_collect took 26 ms per image per class when there were many boxes whose scores exceeded the threshold. So I switched nms_collect to CPU mode, and it only took 5-6 ms; I think my CPU may be faster than a 1080 Ti for this serial step. After I changed nms_kernel and nms_collect to run in batches, the whole NMS operation took 16 ms with batch_size = 1 and class_num = 21, and only 10 ms with batch_size = 25 and class_num = 21. It's amazing. Thank you very much!

There is another small question: I don't know where to learn the ATen library. For example, I want to create a tensor whose dtype is long, with code like

auto longOptions = torch::TensorOptions().device(torch::kCUDA).dtype(torch::kLong).is_variable(true);

but I can't find this in the official website tutorials. Where can I find documentation for such operations?
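The split described above can be sketched roughly as follows. File names and paths here are hypothetical; the idea is simply that nvcc never sees the torch headers, and only the host compiler does:

```shell
# Hypothetical file layout (names are illustrative, not from this repo):
#   nms_kernel_impl.cu  - raw CUDA kernels plus a plain C launcher, no torch includes
#   nms_binding.cpp     - libtorch wrapper that calls the launcher

# 1. Compile the pure CUDA part with nvcc (no -I pointing at libtorch).
nvcc -c nms_kernel_impl.cu -o nms_kernel_impl.o

# 2. Compile the wrapper with the host compiler and the libtorch includes.
g++ -c nms_binding.cpp -o nms_binding.o \
    -I /path/to/libtorch/include \
    -I /path/to/libtorch/include/torch/csrc/api/include

# 3. Link both objects against libtorch and the CUDA runtime.
g++ nms_kernel_impl.o nms_binding.o -o app \
    -L /path/to/libtorch/lib -ltorch -lc10 -lcudart
```

This keeps the problematic torch/jit headers out of nvcc's translation unit entirely, which is what sidesteps the DEPTH_LIMIT error.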

gdlg commented 5 years ago

Thanks for sharing your stats; that's quite interesting. I never expected the serial part to be particularly fast, but I assumed that it would still be faster than synchronously copying the data back to the CPU. I might have been wrong.

Yes, the documentation for ATen is still in its infancy and the API is still evolving quite a bit. In general, I have found that you can guess the prototype of a C++ operation by looking at the documentation of the Python equivalent. The only tricky bit is the dtype; it's often much easier to copy the type of an existing tensor. For instance:

auto output = torch::zeros({B, C, H, W}, input.type());

When I can't guess the prototype, I usually have a look at ATen's C++ header files.

CasonTsai commented 4 years ago

@gdlg Okay, thanks a lot!