Open glenn-jocher opened 4 years ago
Looking at batched_nms, it looks like what we call cross_class NMS, but I'm not sure what that would make multi_cls=True.
Anyway, here's our implementation of Fast NMS: https://github.com/dbolya/yolact/blob/092554ad707c2749631dfe545c8a953b2b3f4a68/layers/functions/detection.py#L137-L180
It works on boxes, so you can just ignore the mask stuff. The relevant inputs are boxes ([N, 4]) and scores ([N, num_classes]). The inputs and outputs should both be on the GPU (or whatever your fastest device is; make sure nothing ever touches the CPU in this function). We pass in all detections with > 0.05 confidence, but I don't think passing everything in will hurt performance much since we take the top 200 anyway. Also, read the big comment about the second threshold.
Most of the code is setup and postprocessing, the core of the algorithm is actually just:
iou = jaccard(boxes, boxes)
iou.triu_(diagonal=1)
iou_max, _ = iou.max(dim=1)
# Now just filter out the ones higher than the threshold
keep = (iou_max <= iou_threshold)
which is what's in the paper.
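For readers without the YOLACT code at hand, here is a minimal sketch of the same idea in plain Python (no PyTorch, so it runs anywhere). The helper names `box_iou` and `fast_nms` are illustrative, not the YOLACT API. It assumes boxes in (x1, y1, x2, y2) format and sorts by descending score first, so that "upper triangular" means "IoU with a higher-scoring box". In the linked YOLACT code the IoU tensor appears to carry a leading class dimension, so `dim=1` there would correspond to a per-column max of a plain [N, N] matrix, which is what this sketch computes.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fast_nms(boxes, scores, iou_threshold):
    """Return indices of kept boxes, in decreasing score order."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    keep = []
    for col, j in enumerate(order):
        # Column j of the upper-triangular IoU matrix: overlap with every
        # higher-scoring box, whether or not that box was itself suppressed.
        iou_max = max((box_iou(boxes[i], boxes[j]) for i in order[:col]), default=0.0)
        if iou_max <= iou_threshold:
            keep.append(j)
    return keep


boxes = [(0, 0, 10, 10), (4, 0, 14, 10), (8, 0, 18, 10)]
scores = [0.9, 0.8, 0.7]
kept = fast_nms(boxes, scores, iou_threshold=0.4)
```

In this example box 1 overlaps box 0 (IoU of about 0.43) and box 2 overlaps box 1 by the same amount, so Fast NMS keeps only box 0. Sequential NMS would also keep box 2, because box 1 is already removed by the time box 2 is examined. That difference is one plausible source of a small mAP gap between the two methods.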
@dbolya great thanks! torchvision.ops.boxes.batched_nms() just means that the function accepts multiple classes at once. torchvision has a separate torchvision.ops.boxes.nms() function that only accepts single-class boxes, which you'd need to drop into a for c in classes: type of python loop.
We use BCE for class loss, not CE, so it's possible that multiple classes may be above threshold for a given box in our repo. multi_cls=True means that we output multiple detections (same box, different classes) in this case. multi_cls=False means we only pick the very top class above threshold.
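The two modes can be illustrated with a small plain-Python sketch. The function name `select_labels` and the tuple layout are hypothetical and only meant to show the selection logic described above, not the repository's actual code.

```python
def select_labels(class_scores, conf_thres, multi_cls=True):
    """Turn per-box class scores into (box_index, class_index, score) detections."""
    detections = []
    for box_idx, scores in enumerate(class_scores):
        if multi_cls:
            # One detection per class above threshold (same box, different classes).
            for cls, s in enumerate(scores):
                if s > conf_thres:
                    detections.append((box_idx, cls, s))
        else:
            # Only the single best class, and only if it clears the threshold.
            cls = max(range(len(scores)), key=scores.__getitem__)
            if scores[cls] > conf_thres:
                detections.append((box_idx, cls, scores[cls]))
    return detections


class_scores = [[0.9, 0.6, 0.1], [0.2, 0.1, 0.05]]
multi = select_labels(class_scores, conf_thres=0.5, multi_cls=True)
single = select_labels(class_scores, conf_thres=0.5, multi_cls=False)
```

With multi_cls=True the first box yields two candidate detections, so NMS sees more boxes, which is consistent with that mode being somewhat slower.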
I will try to implement this week and post the results here if I'm successful!
@dbolya @Zzh-tju I've imported the YOLACT FastNMS functions into ultralytics/yolov3, and get the following results. The times are for inference+NMS on the 5k COCO2014 val images using a Google Colab instance with Tesla T4.
fast_batched below is the YOLACT FastNMS. I call it batched because I only call it once per image (it handles all classes at once). It is faster than torchvision.ops.boxes.batched_nms(), but with a mAP penalty of about 0.3-0.4 unfortunately. It may be much faster than torchvision; it's unclear from these tests, as the below operations are likely dominated by inference time rather than NMS time. When I have time I will rerun on a large GCP VM with 16 cores and a V100 for the best comparison metrics.
NMS method | Time ms/img | Time mm:ss | mAP @0.5:0.95 | mAP @0.5 |
---|---|---|---|---|
'vision_batched' | 49ms | 4:03 | 41.9 | 61.8 |
'merge' | 120ms | 9:58 | 42.3 | 62.0 |
'fast_batched' | 44ms | 3:41 | 41.5 | 61.5 |
Very interesting, and yeah I'm guessing the situations where fast nms would offer a huge speed increase depend on the detector and the rest of the code. Maybe the setup and post processing are a little too bloated too.
Also, now that you mention it, I'm fairly sure I could create a fast merge nms that would be slightly worse than what you list there but almost as fast as fast NMS. This will have to wait until after a certain very close deadline tho >.>
Update: I discovered a majority of time in ultralytics/yolov3/test.py was spent building pycocotools JSON files for official mAPs. If I turn off this functionality (compute mAP only with internal repo code) I get the following much improved times for the 5k COCO2014 val images. Machine is a 12-vCPU V100 instance.
python3 test.py --weights yolov3-spp-ultralytics.pt --cfg yolov3-spp --img 608
NMS method | Time ms/img | Time mm:ss | mAP @0.5:0.95 | mAP @0.5 |
---|---|---|---|---|
'vision_batched' (default) | 15.2 ms | 1:16 | 41.9 | 61.8 |
'merge' | 103.0 ms | 8:35 | 42.3 | 62.0 |
'fast_batched' | 14.6 ms | 1:13 | 41.5 | 61.5 |
I get a 4% drop in time for a 1% drop in mAP by switching to fast from vision batched, which isn't bad, though I suspect img-size reductions may yield slightly more favorable ratios. In any case, both implementations are much faster than the python for loop nms used in merge. Merge simply creates new boxes using a score-weighted combination of overlapping boxes rather than deleting lower score boxes outright. It seems to provide a +0.4 mAP bump, which might take fast nms back to the same mAP produced by vision_batched, but then we'd be back where we started unfortunately.
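As a rough illustration of the merge idea, the sketch below replaces each kept box with the score-weighted average of every box that overlaps it above the IoU threshold. This is one reading of the description above, written in plain Python; `merge_boxes` and its signature are assumptions for illustration and not the repository's implementation. It also assumes positive scores and non-degenerate boxes, so each kept box always contributes to its own average.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge_boxes(boxes, scores, keep, iou_thres):
    """For each kept index, average all boxes overlapping it, weighted by score."""
    merged = []
    for i in keep:
        # Weight is the box's score if it overlaps kept box i enough, else 0.
        # Box i overlaps itself with IoU 1, so it is always included.
        weights = [s if box_iou(boxes[i], b) > iou_thres else 0.0
                   for b, s in zip(boxes, scores)]
        total = sum(weights)
        merged.append(tuple(sum(w * b[k] for w, b in zip(weights, boxes)) / total
                            for k in range(4)))
    return merged


boxes = [(0, 0, 10, 10), (2, 0, 12, 10)]
scores = [0.75, 0.25]
merged = merge_boxes(boxes, scores, keep=[0], iou_thres=0.5)
```

Here the kept box is pulled a quarter of the way toward its lower-scoring neighbour, ending up at roughly (0.5, 0, 10.5, 10).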
To further clarify the timing, I added profiling code to test.py that specifically tracks inference and NMS times in https://github.com/ultralytics/yolov3/commit/e482392161c30d4e4dbf4b4eebdb4672fcc6a134. This can be accessed with the --profile flag:
python3 test.py --weights yolov3-spp-ultralytics.pt --img 608 --conf 0.001 --profile
I ran with both default torchvision NMS and the yolact FastNMS, and actually saw a slight speed decrease with FastNMS:
Default: Profile results: 1.3/6.9/8.1 ms inference/NMS/total per image
FastNMS: Profile results: 1.3/7.1/8.4 ms inference/NMS/total per image
So perhaps the slight speed increase from FastNMS observed in the total test time is due simply to a reduced box count produced by this NMS method, which results in less postprocessing work during testing (mAP calculation etc.).
The other surprise was the great amount of total time spent on NMS vs inference. Even under the default settings 6.9/8.1 = 85% of the total time is spent on NMS!
CORRECTION: My previous analysis was incorrect: it lacked the torch.cuda.synchronize() operations necessary when profiling CUDA operations. I've fixed this in https://github.com/ultralytics/yolov3/commit/1430a1e4083609ab197cf1947a12ab8692b20593. Corrected results, consistent across several runs:
python3 test.py --weights yolov3-spp-ultralytics.pt --img 608 --conf 0.001 --profile
Default: Profile results: 6.6/1.6/8.2 ms inference/NMS/total per image
FastNMS: Profile results: 6.6/1.9/8.5 ms inference/NMS/total per image
Conclusion is that inference uses most (80%) of the runtime in both cases, and that FastNMS appears to run slightly slower than default torchvision.ops.boxes.batched_nms().
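The reason synchronization matters is that CUDA kernels are launched asynchronously: without a sync, the clock can stop before the GPU has finished, and the outstanding work is probably billed to whichever later step first waits on the result (here, NMS). A minimal sketch of a timing helper that synchronizes on both sides of the timed region is shown below. The helper `time_call` is hypothetical; the names `model`, `imgs` and `non_max_suppression` in the usage comment are placeholders.

```python
import time


def time_call(fn, sync=None):
    """Return (result, elapsed_ms) for fn(), calling sync() around the timed region."""
    if sync is not None:
        sync()  # drain work queued before the timed region starts
    t0 = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()  # wait for work launched by fn() to actually finish
    return result, (time.perf_counter() - t0) * 1000.0


# Hypothetical usage with PyTorch on a CUDA device:
#   import torch
#   pred, t_inf = time_call(lambda: model(imgs), sync=torch.cuda.synchronize)
#   out, t_nms = time_call(lambda: non_max_suppression(pred), sync=torch.cuda.synchronize)
```

Passing the synchronization function in as an argument keeps the helper usable on CPU-only machines, where `sync` can simply be left as None.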
thanks, so it is carried out on one class, isn't it? e.g., cc_fast_NMS, collapse all the classes into 1. How about multi-class?
And how many boxes do you choose? (top n)
@Zzh-tju I imported the FastNMS code here. It's very clever, but unfortunately it seems to be a dead end, as it's slower and produces worse mAP than the default method. https://github.com/ultralytics/yolov3/blob/8b6c8a53182b2415fd61459fc9a0ccbdef8dc904/utils/utils.py#L558-L568
I use all boxes above the --conf threshold, I don't discard any boxes or put any upper limit on the number of boxes.
The times and tests above are for the usual 5000 image COCO val set using yolov3-spp-ultralytics.pt for all 80 classes. Everything is the exact same in the tests between the default output and the FastNMS output. You can reproduce by simply running
python3 test.py --weights yolov3-spp-ultralytics.pt --img 608 --conf 0.001 --profile
@Zzh-tju perhaps I'm not understanding the purpose of the top n boxes. I assumed this was a memory saver or speed enhancer, so I neglected to implement it as I saw no out of memory errors when running on full size COCO images, so I assumed all was well.
Is it possible that since I did not implement the top n boxes I'm not recreating FastNMS correctly? The code I have is very simple, I think it captures the core intention (the upper triangular iou matrix):
# Batched NMS
if batched:
    c = pred[:, 5] * 0 if agnostic else pred[:, 5]  # class index (all zeros for class-agnostic NMS)
    boxes, scores = pred[:, :4].clone(), pred[:, 4]
    if method == 'vision_batch':
        i = torchvision.ops.boxes.batched_nms(boxes, scores, c, iou_thres)
    elif method == 'fast_batch':  # FastNMS from https://github.com/dbolya/yolact
        boxes += c.view(-1, 1) * max_wh  # separate boxes by class
        iou = box_iou(boxes, boxes).triu_(diagonal=1)  # upper triangular iou matrix
        i = iou.max(dim=0)[0] < iou_thres
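The `boxes += c.view(-1, 1) * max_wh` line is what makes a single NMS call behave per class: each class is shifted into its own region of the plane, so boxes of different classes can never overlap. A small plain-Python sketch of that trick follows. The helper `offset_by_class` is illustrative, and the value 4096 for `max_wh` is an assumption that only needs to exceed the largest image dimension.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def offset_by_class(boxes, classes, max_wh):
    """Shift every coordinate of each box by class_index * max_wh."""
    return [tuple(v + c * max_wh for v in box) for box, c in zip(boxes, classes)]


boxes = [(10, 10, 50, 50), (10, 10, 50, 50)]  # two identical boxes
same = offset_by_class(boxes, [3, 3], max_wh=4096)  # same class
diff = offset_by_class(boxes, [3, 7], max_wh=4096)  # different classes
iou_same = box_iou(same[0], same[1])
iou_diff = box_iou(diff[0], diff[1])
```

Identical boxes of the same class still overlap fully after the shift, while identical boxes of different classes end up with zero IoU and therefore cannot suppress each other.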
yeah, I mean your batch_nms is cross-class NMS, right? And I'm confused by what you mentioned above: what is the difference between multi_cls=True and False? Do you mean YOLO provides a box with multiple labels? And if set to False, there will be only one class per box. But the NMS for both modes is cross-class. Since 'False' can be faster than 'True', I wonder what the setting differences between the two are.
Another question is how about doing NMS for each class? (Fast NMS vs traditional NMS)
@Zzh-tju it's very simple. multi_cls=True allows more than one label per box.
@Zzh-tju ah I think I understand your confusion. Maybe I should rename multi_cls as multi_label to better explain it. This is what it is doing.
https://en.wikipedia.org/wiki/Multi-label_classification
It's intended for multi-label datasets like OIv5 where a 'person' can also be a 'man' or a 'woman' (i.e. two correct labels for one object). It also helps out COCO mAP a bit, despite it being a single label dataset.
Update: fixed in https://github.com/ultralytics/yolov3/commit/692b006f4dda066a81800b94a34ec51c574c380f
@glenn-jocher yeah, now I just want to know the speed when doing NMS for each class. For traditional NMS, it must run in a for c loop, right? So I guess Fast NMS will be faster since it runs once for all classes simultaneously.
@Zzh-tju the speeds provided are for NMS for all 80 COCO classes for each image: 1.6 ms per image for all classes. The batched methods do all classes simultaneously.
def batched_nms(boxes, scores, idxs, iou_threshold):
    # type: (Tensor, Tensor, Tensor, float)
    """
    Performs non-maximum suppression in a batched fashion.

    Each index value correspond to a category, and NMS
    will not be applied between elements of different categories.

    Parameters
    ----------
    boxes : Tensor[N, 4]
        boxes where NMS will be performed. They
        are expected to be in (x1, y1, x2, y2) format
    scores : Tensor[N]
        scores for each one of the boxes
    idxs : Tensor[N]
        indices of the categories for each one of the boxes.
    iou_threshold : float
        discards all overlapping boxes
        with IoU > iou_threshold

    Returns
    -------
    keep : Tensor
        int64 tensor with the indices of
        the elements that have been kept by NMS, sorted
        in decreasing order of scores
    """
I saw a new matrix nms. https://arxiv.org/abs/2003.10152 https://github.com/aim-uofa/AdelaiDet/
@Gaondong yes I already tried to implement it, and was unable to reproduce their results.
Thanks.
@Gaondong see https://github.com/ultralytics/yolov3/issues/679#issuecomment-604164825
I used this code for Matrix (Soft) NMS:
    elif method == 'matrix':  # Matrix NMS from https://arxiv.org/abs/2003.10152
        iou = box_iou(boxes, boxes).triu_(diagonal=1)  # upper triangular iou matrix
        m = iou.max(0)[0].view(-1, 1)  # max values
        decay = torch.exp(-(iou ** 2 - m ** 2) / 0.5).min(0)[0]  # gauss with sigma=0.5
        scores *= decay
        i = torch.full((boxes.shape[0],), fill_value=1).bool()
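To make the decay step easier to follow, here is a plain-Python sketch of the same Gaussian formula used in the snippet above. For each box j, it takes the minimum over higher-scoring boxes i of exp(-(iou_ij^2 - m_i^2) / sigma), where m_i is the largest overlap of box i with any box scored above it. The function name `matrix_nms_decay` is illustrative. This mirrors the snippet rather than the reference implementation from the paper, so treat it as an assumption-laden sketch.

```python
import math


def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def matrix_nms_decay(boxes, scores, sigma=0.5):
    """Return scores after Gaussian Matrix-NMS decay (no box is removed)."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    n = len(order)
    # iou[i][j] is filled only for i < j in score order (upper triangle).
    iou = [[box_iou(boxes[order[i]], boxes[order[j]]) if i < j else 0.0
            for j in range(n)] for i in range(n)]
    # m[i]: largest overlap of box i with any higher-scoring box.
    m = [max((iou[k][i] for k in range(i)), default=0.0) for i in range(n)]
    decayed = list(scores)
    for j in range(n):
        decay = min((math.exp(-(iou[i][j] ** 2 - m[i] ** 2) / sigma) for i in range(j)),
                    default=1.0)
        decayed[order[j]] = scores[order[j]] * decay
    return decayed


boxes = [(0, 0, 10, 10), (4, 0, 14, 10), (8, 0, 18, 10)]
scores = [0.9, 0.8, 0.7]
decayed = matrix_nms_decay(boxes, scores)
```

As in the snippet, every box is kept and only its score changes, so a confidence threshold applied afterwards decides what is actually discarded.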
@dbolya we've had a request from @Zzh-tju to implement FastNMS in https://github.com/ultralytics/yolov3 per https://github.com/ultralytics/yolov3/issues/679. Can you point us to the location in your code where the function is? Can we use it for boxes rather than masks?
We currently use multi-class torchvision.ops.boxes.batched_nms() (middle row) as a compromise between speed and accuracy. We apply it once per image (all classes at once), and see an inference speed of 49ms/img (inference + NMS) at 608 image size, conf_thresh=0.001 on a Tesla T4, giving us about 42.0/62.0 mAP@0.5:0.95/0.5 on COCO2014. We do not do masks though, only boxes.
BTW, we also developed the merge nms method below, which is slower simply because it is implemented in python rather than C, but it may be possible to combine fast and merge together to get the best of both worlds.
 | s/img | mm:ss | @0.5:0.95 | @0.5 |
---|---|---|---|---|
'vision_batched', multi_cls=False | | | | |
'vision_batched', multi_cls=True | | | | |
'merge', multi_cls=True | | | | |