FastNMS on Ultralytics YOLOv3

glenn-jocher commented 4 years ago

@dbolya we've had a request from @Zzh-tju to implement FastNMS in https://github.com/ultralytics/yolov3 per https://github.com/ultralytics/yolov3/issues/679. Can you point us to the location in your code where the function is? Can we use it for boxes rather than masks?

We currently use multi-class torchvision.ops.boxes.batched_nms() (middle row) as a compromise between speed and accuracy. We apply it once per image (all classes at once), and see an inference speed of 49ms/img (inference + NMS) at 608 image size, conf_thresh=0.001 on a Tesla T4, giving us about 42.0/62.0 mAP@0.5:0.95/mAP@0.5 on COCO2014. We do not do masks though, only boxes.

BTW, we also developed the merge nms method below, which is slower simply because it is implemented in python rather than C, but it may be possible to combine fast and merge together to get the best of both worlds.

NMS method	Time ms/img	Time mm:ss	mAP @0.5:0.95	mAP @0.5
`'vision_batched', multi_cls=False`	46ms	3:50	41.2	60.8
`'vision_batched', multi_cls=True`	49ms	4:03	41.9	61.8
`'merge', multi_cls=True`	120ms	9:58	42.3	62.0

dbolya commented 4 years ago

Looking at batched_nms, it looks like what we call cross_class NMS, but I'm not sure what that would make multi_cls=True.

Anyway, here's our implementation of Fast NMS: https://github.com/dbolya/yolact/blob/092554ad707c2749631dfe545c8a953b2b3f4a68/layers/functions/detection.py#L137-L180

It works on boxes, so you can just ignore the mask stuff. The relevant inputs are boxes ([N, 4]) and scores ([N, num_classes]). The inputs and outputs should both be on the GPU (or whatever your fastest device is, and make sure nothing ever touches the CPU in this function), and we pass in all detections with > 0.05 confidence, but I don't think passing everything in will hurt performance much since we take the top 200 anyway. Also, read the big comment about the second threshold.

Most of the code is setup and postprocessing; the core of the algorithm is actually just:

     iou = jaccard(boxes, boxes) 
     iou.triu_(diagonal=1) 
     iou_max, _ = iou.max(dim=1) 

     # Now just filter out the ones higher than the threshold 
     keep = (iou_max <= iou_threshold)

which is what's in the paper.
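For readers without the YOLACT code at hand, here is a minimal sketch of the same idea in plain Python (no PyTorch), next to a traditional greedy NMS for comparison. The function names `box_iou`, `fast_nms` and `greedy_nms` and the three example boxes are illustrative assumptions, not code from either repo. The example is chosen so that box B overlaps both A and C while A and C barely overlap: Fast NMS lets the already-suppressed B suppress C, which traditional NMS does not.

```python
def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fast_nms(boxes, scores, iou_threshold):
    """Indices kept by Fast NMS, highest score first."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    keep = []
    for j, idx in enumerate(order):
        # Column max of the upper-triangular IoU matrix: the largest overlap
        # with ANY higher-scoring box, whether or not that box was itself kept.
        iou_max = max(
            (box_iou(boxes[order[i]], boxes[idx]) for i in range(j)), default=0.0
        )
        if iou_max <= iou_threshold:
            keep.append(idx)
    return keep


def greedy_nms(boxes, scores, iou_threshold):
    """Traditional NMS: only boxes that were kept can suppress others."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    keep = []
    for idx in order:
        if all(box_iou(boxes[k], boxes[idx]) <= iou_threshold for k in keep):
            keep.append(idx)
    return keep


# A overlaps B, B overlaps C, but A and C overlap only slightly.
boxes = [(0, 0, 10, 10), (0, 3, 10, 13), (0, 6, 10, 16)]
scores = [0.9, 0.8, 0.7]
fast_keep = fast_nms(boxes, scores, 0.5)      # keeps only A
greedy_keep = greedy_nms(boxes, scores, 0.5)  # keeps A and C
```

The plain-Python loops are quadratic and only meant to make the logic readable; the tensor version above computes the same column maximum in a single batched operation.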

glenn-jocher commented 4 years ago

@dbolya great thanks! torchvision.ops.boxes.batched_nms() just means that the function accepts multiple classes at once. torchvision has a separate torchvision.ops.boxes.nms() function that only accepts single-class boxes, which you'd need to drop into a for c in classes: type of Python loop.

We use BCE for class loss, not CE, so it's possible that multiple classes may be above threshold for a given box in our repo. multi_cls=True means that we output multiple detections (same box, different classes) in this case. multi_cls=False means we only pick the single top class, provided it is above threshold.
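To make the two modes concrete, here is a rough sketch in plain Python. The function `expand_detections` and its arguments are hypothetical names for illustration, and it assumes each box simply carries one independent score per class (it leaves out any objectness term the real code may multiply in).

```python
def expand_detections(boxes, class_scores, conf_thres, multi_cls=True):
    """Turn per-box class scores into (box, score, class) detections.

    multi_cls=True  -> one detection per class above the threshold.
    multi_cls=False -> at most one detection per box (its top class).
    """
    detections = []
    for box, scores in zip(boxes, class_scores):
        if multi_cls:
            for cls, s in enumerate(scores):
                if s > conf_thres:
                    detections.append((box, s, cls))
        else:
            cls = max(range(len(scores)), key=lambda c: scores[c])
            if scores[cls] > conf_thres:
                detections.append((box, scores[cls], cls))
    return detections


boxes = [(0, 0, 10, 10)]
class_scores = [[0.6, 0.4, 0.05]]  # sigmoid outputs need not sum to 1
multi = expand_detections(boxes, class_scores, conf_thres=0.1, multi_cls=True)
single = expand_detections(boxes, class_scores, conf_thres=0.1, multi_cls=False)
```

With multi_cls=True the same box appears twice (classes 0 and 1), so NMS sees more candidates, which would be consistent with the slightly longer times in the first table.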

I will try to implement this week and post the results here if I'm successful!

glenn-jocher commented 4 years ago

@dbolya @Zzh-tju I've imported the YOLACT FastNMS functions into ultralytics/yolov3, and get the following results. The times are for inference+NMS on the 5k COCO2014 val images using a Google Colab instance with Tesla T4.

fast_batched below is the YOLACT FastNMS. I call it batched because I only call it once per image (it handles all classes at once). It is faster than torchvision.ops.boxes.batched_nms(), but unfortunately with a mAP penalty of about 0.3-0.4. It may be much faster than torchvision in isolation; that's unclear from these tests, since the timings below are likely dominated by inference time rather than NMS time. When I have time I will rerun on a large GCP VM with 16 cores and a V100 for the best comparison metrics.

NMS method	Time ms/img	Time mm:ss	mAP @0.5:0.95	mAP @0.5
`'vision_batched'`	49ms	4:03	41.9	61.8
`'merge'`	120ms	9:58	42.3	62.0
`'fast_batched'`	44ms	3:41	41.5	61.5

dbolya commented 4 years ago

Very interesting, and yeah I'm guessing the situations where fast nms would offer a huge speed increase depend on the detector and the rest of the code. Maybe the setup and post processing are a little too bloated too.

Also, now that you mention it, I'm fairly sure I could create a fast merge NMS that would be slightly worse than what you list there but almost as fast as Fast NMS. This will have to wait until after a certain very close deadline tho >.>

glenn-jocher commented 4 years ago

Update: I discovered a majority of time in ultralytics/yolov3/test.py was spent building pycocotools JSON files for official mAPs. If I turn off this functionality (compute mAP only with internal repo code) I get the following much improved times for the 5k COCO2014 val images. Machine is a 12-vCPU V100 instance.

python3 test.py --weights yolov3-spp-ultralytics.pt --cfg yolov3-spp --img 608

NMS method	Time ms/img	Time mm:ss	mAP @0.5:0.95	mAP @0.5
`'vision_batched'` (default)	15.2 ms	1:16	41.9	61.8
`'merge'`	103.0 ms	8:35	42.3	62.0
`'fast_batched'`	14.6 ms	1:13	41.5	61.5

I get a 4% drop in time for a 1% drop in mAP by switching from vision_batched to fast_batched, which isn't bad, though I suspect img-size reductions may yield slightly more favorable ratios. In any case, both implementations are much faster than the Python for-loop NMS used in merge. Merge simply creates new boxes from a combination of the overlapping boxes weighted by their scores, rather than deleting lower-score boxes outright. It seems to provide a +0.4 mAP bump, which might take fast NMS back to the same mAP produced by vision_batched, but then we'd be back where we started, unfortunately.
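As a rough illustration of the merge idea, here is a plain-Python sketch. It assumes that "weighted combination" means a score-weighted average of the coordinates of all boxes that overlap the current top box above the IoU threshold; `merge_nms` and `box_iou` are illustrative names, and the repo's actual implementation may differ in detail.

```python
def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge_nms(boxes, scores, iou_threshold):
    """Return a list of (merged_box, score), highest score first."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    merged = []
    while order:
        top = order[0]
        # The top box plus every remaining box that overlaps it strongly.
        group = [top] + [
            k for k in order[1:] if box_iou(boxes[top], boxes[k]) > iou_threshold
        ]
        total = sum(scores[k] for k in group)
        # Score-weighted average of each coordinate over the group (assumption).
        box = tuple(
            sum(scores[k] * boxes[k][c] for k in group) / total for c in range(4)
        )
        merged.append((box, scores[top]))
        order = [k for k in order if k not in group]
    return merged


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.6, 0.8]
merged = merge_nms(boxes, scores, 0.5)
```

Here the first two boxes are fused into one box pulled slightly toward the lower-scoring one, and the third box is left unchanged. The outer while loop is the Python-level loop that makes this style of NMS slow compared to the batched methods.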

glenn-jocher commented 4 years ago

To further clarify the timing, I added profiling code to test.py that specifically tracks inference and NMS times in https://github.com/ultralytics/yolov3/commit/e482392161c30d4e4dbf4b4eebdb4672fcc6a134. This can be accessed with the --profile flag:

python3 test.py --weights yolov3-spp-ultralytics.pt --img 608 --conf 0.001 --profile

I ran with both default torchvision NMS and the yolact FastNMS, and actually saw a slight speed decrease with FastNMS:

Default: Profile results: 1.3/6.9/8.1 ms inference/NMS/total per image
FastNMS: Profile results: 1.3/7.1/8.4 ms inference/NMS/total per image

So perhaps the slight speed increase from FastNMS observed in the total test time is due simply to a reduced box count produced by this NMS method, which results in less postprocessing work during testing (mAP calculation etc.).

The other surprise was the great amount of total time spent on NMS vs inference. Even under the default settings 6.9/8.1 = 85% of the total time is spent on NMS!

glenn-jocher commented 4 years ago

CORRECTION: My previous analysis was incorrect: it lacked the torch.cuda.synchronize() calls necessary when profiling CUDA operations. I've fixed this in https://github.com/ultralytics/yolov3/commit/1430a1e4083609ab197cf1947a12ab8692b20593. Corrected results, consistent across several runs:

python3 test.py --weights yolov3-spp-ultralytics.pt --img 608 --conf 0.001 --profile

Default: Profile results: 6.6/1.6/8.2 ms inference/NMS/total per image
FastNMS: Profile results: 6.6/1.9/8.5 ms inference/NMS/total per image

Conclusion is that inference uses most (80%) of the runtime in both cases, and that FastNMS appears to run slightly slower than default torchvision.ops.boxes.batched_nms().
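For anyone timing their own pipeline: CUDA kernels are launched asynchronously, so a timer that stops right after the forward pass can return before the GPU has finished, and the missing time then shows up in whichever later step blocks first (here, NMS). A minimal sketch of a timing helper that accounts for this is below; `timed` and its `sync` parameter are hypothetical names, not the profiling code from the commit above.

```python
import time


def timed(fn, *args, sync=None):
    """Run fn(*args) and return (result, elapsed_seconds).

    sync is an optional zero-argument callable that blocks until all queued
    device work is done, e.g. torch.cuda.synchronize when running on CUDA.
    """
    if sync is not None:
        sync()  # finish work queued before the timer starts
    start = time.perf_counter()
    result = fn(*args)
    if sync is not None:
        sync()  # wait for fn's asynchronous work before stopping the timer
    return result, time.perf_counter() - start


# CPU-only demonstration with a stand-in sync that just counts its calls.
sync_calls = []
result, elapsed = timed(sum, [1, 2, 3], sync=lambda: sync_calls.append(1))
```

On a GPU the call would look something like `pred, t = timed(model, imgs, sync=torch.cuda.synchronize)`, where `model` and `imgs` stand for whatever is being profiled.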

Zzh-tju commented 4 years ago

Thanks, so it is carried out on one class, isn't it? E.g., cc_fast_NMS, which collapses all the classes into one. How about multi-class?

Zzh-tju commented 4 years ago

And how many boxes do you choose? (top n)

glenn-jocher commented 4 years ago

@Zzh-tju I imported the FastNMS code here. It's very clever, but unfortunately it seems to be a dead end, as it's slower and produces worse mAP than the default method. https://github.com/ultralytics/yolov3/blob/8b6c8a53182b2415fd61459fc9a0ccbdef8dc904/utils/utils.py#L558-L568

I use all boxes above --conf threshold, I don't discard any boxes or put any upper limit on the number of boxes.

The times and tests above are for the usual 5000 image COCO val set using yolov3-spp-ultralytics.pt for all 80 classes. Everything is the exact same in the tests between the default output and the FastNMS output. You can reproduce by simply running

python3 test.py --weights yolov3-spp-ultralytics.pt --img 608 --conf 0.001 --profile

glenn-jocher commented 4 years ago

@Zzh-tju perhaps I'm not understanding the purpose of the top n boxes. I assumed this was a memory saver or speed enhancer, so I neglected to implement it as I saw no out of memory errors when running on full size COCO images, so I assumed all was well.

Is it possible that since I did not implement the top n boxes I'm not recreating FastNMS correctly? The code I have is very simple, I think it captures the core intention (the upper triangular iou matrix):

        # Batched NMS
        if batched:
            c = pred[:, 5] * 0 if agnostic else pred[:, 5]  # class-agnostic NMS
            boxes, scores = pred[:, :4].clone(), pred[:, 4]
            if method == 'vision_batch':
                i = torchvision.ops.boxes.batched_nms(boxes, scores, c, iou_thres)
            elif method == 'fast_batch':  # FastNMS from https://github.com/dbolya/yolact
                boxes += c.view(-1, 1) * max_wh  # separate boxes by class
                iou = box_iou(boxes, boxes).triu_(diagonal=1)  # upper triangular iou matrix
                i = iou.max(dim=0)[0] < iou_thres
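
The class offset in the fast_batch branch is what keeps classes from suppressing each other: shifting every box by its class index times a distance larger than any image moves each class into its own region, so boxes of different classes end up with zero IoU. A small plain-Python sketch of that trick follows; `offset_by_class` is an illustrative name and the `max_wh=4096` default is an assumed value that only needs to exceed the largest box coordinate.

```python
def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def offset_by_class(boxes, classes, max_wh=4096):
    """Shift each box by class_index * max_wh in both x and y."""
    return [tuple(v + c * max_wh for v in box) for box, c in zip(boxes, classes)]


# Two identical boxes with different classes, plus a same-class neighbour.
boxes = [(10, 10, 50, 50), (10, 10, 50, 50), (12, 12, 52, 52)]
classes = [0, 1, 0]
shifted = offset_by_class(boxes, classes)

iou_before = box_iou(boxes[0], boxes[1])            # identical boxes overlap fully
iou_after = box_iou(shifted[0], shifted[1])         # different classes no longer overlap
iou_same_class = box_iou(shifted[0], shifted[2])    # same class: IoU is unchanged
```

Setting every class index to 0 (the agnostic case in the snippet above) leaves all boxes in the same region, so NMS is then applied across classes.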

Zzh-tju commented 4 years ago

Yeah, I mean your batched NMS is cross-class NMS, right? And I'm confused by what you mentioned above: what is the difference between multi_cls=True and False? Do you mean YOLO provides a box with multiple labels, and if it is set to False there will be only one class per box? But the NMS for the two modes is cross-class either way. Since False is faster than True, I wonder what the setting differences between the two are.

Another question is how about doing NMS for each class? (Fast NMS vs traditional NMS)

glenn-jocher commented 4 years ago

@Zzh-tju it's very simple.

All classes are processed correctly, no matter the class count.
multi_cls allows more than one label per box.
FastNMS appears slower, and produces worse mAP regardless of class count or multi_cls.

glenn-jocher commented 4 years ago

@Zzh-tju ah I think I understand your confusion. Maybe I should rename multi_cls as multi_label to better explain it. This is what it is doing. https://en.wikipedia.org/wiki/Multi-label_classification

It's intended for multi-label datasets like OIv5 where a 'person' can also be a 'man' or a 'woman' (i.e. two correct labels for one object). It also helps out COCO mAP a bit, despite it being a single label dataset.

Update: fixed in https://github.com/ultralytics/yolov3/commit/692b006f4dda066a81800b94a34ec51c574c380f

Zzh-tju commented 4 years ago

@glenn-jocher yeah, now I just want to know the speed when doing NMS for each class. Traditional NMS must loop over the classes (for c in classes), right? So I guess Fast NMS will be faster since it handles all classes simultaneously in one pass.

glenn-jocher commented 4 years ago

@Zzh-tju the speeds provided are for NMS for all 80 COCO classes for each image: 1.6 ms per image for all classes. The batched methods do all classes simultaneously.

glenn-jocher commented 4 years ago

https://github.com/pytorch/vision/blob/b6f28ec1a8c5fdb8d01cc61946e8f87dddcfa830/torchvision/ops/boxes.py#L39

def batched_nms(boxes, scores, idxs, iou_threshold):
    # type: (Tensor, Tensor, Tensor, float)
    """
    Performs non-maximum suppression in a batched fashion.
    Each index value correspond to a category, and NMS
    will not be applied between elements of different categories.
    Parameters
    ----------
    boxes : Tensor[N, 4]
        boxes where NMS will be performed. They
        are expected to be in (x1, y1, x2, y2) format
    scores : Tensor[N]
        scores for each one of the boxes
    idxs : Tensor[N]
        indices of the categories for each one of the boxes.
    iou_threshold : float
        discards all overlapping boxes
        with IoU > iou_threshold
    Returns
    -------
    keep : Tensor
        int64 tensor with the indices of
        the elements that have been kept by NMS, sorted
        in decreasing order of scores
    """

Gaondong commented 4 years ago

I saw a new matrix nms. https://arxiv.org/abs/2003.10152 https://github.com/aim-uofa/AdelaiDet/

glenn-jocher commented 4 years ago

@Gaondong yes I already tried to implement it, and was unable to reproduce their results.

Gaondong commented 4 years ago

> @Gaondong yes I already tried to implement it, and was unable to reproduce their results.

Thanks.

glenn-jocher commented 4 years ago

@Gaondong see https://github.com/ultralytics/yolov3/issues/679#issuecomment-604164825

I used this code for Matrix (Soft) NMS:

            elif method == 'matrix':  # Matrix NMS from https://arxiv.org/abs/2003.10152
                iou = box_iou(boxes, boxes).triu_(diagonal=1)  # upper triangular iou matrix
                m = iou.max(0)[0].view(-1, 1)  # max values
                decay = torch.exp(-(iou ** 2 - m ** 2) / 0.5).min(0)[0]  # gauss with sigma=0.5
                scores *= decay
                i = torch.full((boxes.shape[0],), fill_value=1).bool()
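To spell out what the tensor expression above computes, here is a plain-Python sketch of the same decay under the Gaussian kernel. It is only my reading of the snippet, not a verified reimplementation of the paper: each box's score is multiplied by the smallest value of exp(-(iou_ij^2 - m_i^2) / sigma) over all boxes i, where m_i is the largest overlap that box i itself has with any higher-scoring box. The names `matrix_nms_decay` and `box_iou` are illustrative.

```python
import math


def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def matrix_nms_decay(boxes, scores, sigma=0.5):
    """Return decayed scores in the original box order (no box is removed)."""
    order = sorted(range(len(boxes)), key=lambda k: scores[k], reverse=True)
    n = len(order)
    # Upper-triangular IoU matrix over score-sorted boxes: row i is the
    # higher-scoring box, column j the lower-scoring one.
    iou = [
        [box_iou(boxes[order[i]], boxes[order[j]]) if i < j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    # m[j]: largest overlap of box j with any higher-scoring box (column max).
    m = [max(iou[i][j] for i in range(n)) for j in range(n)]
    out = [0.0] * n
    for j in range(n):
        decay = min(math.exp(-(iou[i][j] ** 2 - m[i] ** 2) / sigma) for i in range(n))
        out[order[j]] = scores[order[j]] * decay
    return out


# Two identical boxes: the lower-scoring duplicate is decayed by exp(-1 / 0.5).
boxes = [(0, 0, 10, 10), (0, 0, 10, 10), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
decayed = matrix_nms_decay(boxes, scores)
```

Because every box is kept and only its score changes, any effect on mAP comes from re-ranking the detections rather than from removing them.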

dbolya / yolact

FastNMS on Ultralytics YOLOv3 #366