Bug in computation of h_mask_size_

Hi,

There is a bug in the code that computes h_mask_size_.

As a reminder, det_num_ is computed first, as follows:

det_num_ = param_.feature_size.x * param_.feature_size.y * param_.num_anchors;  // 216 x 248 x 6 = 321408

Then h_mask_size_ is computed as follows (with #define DIVUP(x, y) (x + y - 1) / y):

h_mask_size_ = det_num_ * DIVUP(det_num_, NMS_THREADS_PER_BLOCK) * sizeof(uint64_t);

Because the macro body is not wrapped in parentheses, the preprocessor expands this line to:

h_mask_size_ = det_num_ * (det_num_ + NMS_THREADS_PER_BLOCK - 1) / NMS_THREADS_PER_BLOCK * sizeof(uint64_t);

The multiplication is evaluated first, giving 321408 * (321408 + 64 - 1) = 103323351168, which is far too large to fit in an unsigned int. The product wraps around, and the final h_mask_size_ is 30517008 (about 30 MB) instead of the 12915418896 (about 12 GB) that the same expression gives without overflow.

If I add parentheses around DIVUP to get the intended value (321408 * 5022 * sizeof(uint64_t) = 12912887808, still about 12 GB), then

checkRuntime(cudaMemsetAsync(h_mask_, 0, h_mask_size_, _stream));

takes far too much time.
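The three values discussed above can be reproduced with a small stand-alone sketch. It assumes that unsigned int is 32 bits wide and that NMS_THREADS_PER_BLOCK is 64 (the value used in the arithmetic above). The function and constant names are made up for illustration and do not come from the repository.

```cpp
#include <cstdint>

// Macro exactly as quoted in this issue: the body has no outer parentheses.
#define DIVUP(x, y) (x + y - 1) / y

// Assumed values, taken from the numbers quoted in this issue.
constexpr std::uint32_t kNmsThreadsPerBlock = 64;
constexpr std::uint32_t kDetNum = 216u * 248u * 6u;  // 321408

// Expression as written, with a 32-bit det_num: the product
// det_num * (det_num + 64 - 1) wraps around before the division.
inline std::uint64_t mask_size_as_written() {
    std::uint32_t det_num = kDetNum;
    return det_num * DIVUP(det_num, kNmsThreadsPerBlock) * sizeof(std::uint64_t);
}

// Same expanded expression with a 64-bit det_num: no wrap-around, but the
// multiplication still happens before the division.
inline std::uint64_t mask_size_expanded_64bit() {
    std::uint64_t det_num = kDetNum;
    return det_num * DIVUP(det_num, kNmsThreadsPerBlock) * sizeof(std::uint64_t);
}

// DIVUP wrapped in parentheses at the call site: the ceiling division is
// evaluated first, which is presumably what was intended.
inline std::uint64_t mask_size_parenthesized() {
    std::uint64_t det_num = kDetNum;
    return det_num * (DIVUP(det_num, kNmsThreadsPerBlock)) * sizeof(std::uint64_t);
}
```

Under those assumptions, mask_size_as_written() returns 30517008, mask_size_expanded_64bit() returns 12915418896, and mask_size_parenthesized() returns 12912887808.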

Are you sure about this h_mask_size_ computation? I'm not an expert in NMS, so I can't fix it myself :/
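For reference, here is a minimal sketch of how the size could be computed without the precedence and overflow problems, assuming the intended layout is one row of ceil(det_num / threads) 64-bit words per detection. The helpers div_up and nms_mask_bytes are hypothetical names, not part of the repository, and this is not a proposed patch. The sketch does not settle whether the mask should be sized from all anchors or from a smaller candidate set.

```cpp
#include <cstdint>

// Ceiling division as a function: arguments are evaluated once and the
// result cannot be re-associated by surrounding operators, unlike the macro.
constexpr std::uint64_t div_up(std::uint64_t x, std::uint64_t y) {
    return (x + y - 1) / y;
}

// Hypothetical helper: bytes needed for det_num rows of
// ceil(det_num / threads_per_block) uint64_t words, computed in 64 bits.
constexpr std::uint64_t nms_mask_bytes(std::uint64_t det_num,
                                       std::uint64_t threads_per_block) {
    return det_num * div_up(det_num, threads_per_block) * sizeof(std::uint64_t);
}

// With every anchor counted (321408), the mask is about 12 GB.
static_assert(nms_mask_bytes(321408, 64) == 12912887808ULL,
              "full-anchor mask size");

// The size grows quadratically with det_num. With an illustrative cap of
// 4096 candidates (a made-up number), the mask is 2 MiB.
static_assert(nms_mask_bytes(4096, 64) == 2097152ULL,
              "capped mask size");
```

The second static_assert only illustrates the quadratic growth: any step that reduces the number of boxes entering NMS shrinks the mask, and the time spent clearing it, by the square of that reduction.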

NVIDIA-AI-IOT / CUDA-PointPillars, issue #118