NVIDIA / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

detectnet/clustering.py incorrectly calculates bounding box height #557

Closed samsparks closed 5 years ago

samsparks commented 5 years ago

vote_boxes() in clustering.py calculates each detection height by subtracting each bounding box's index 1 from index 3.

However, since a bounding box is a cv::Rect, index 3 is height and index 1 is y. The clustering algorithm should test bounding box height as follows:

            if rect[3] >= self.min_height:
samsparks commented 5 years ago

Actually, as I look closer, the issue is more invasive. clustering.py assumes [[x1, y1, x2, y2]] boxes as both the input and the output of vote_boxes(). Therefore, the algorithm needs to convert the input to [x, y, width, height] prior to the call to groupRectangles(), test the height using rect[3], and convert back to [x1, y1, x2, y2] when populating detections_per_image.
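The conversion described above can be sketched as a pair of helpers (hypothetical names for illustration; clustering.py would more likely inline these as list comprehensions):

```python
def xyxy_to_xywh(box):
    """Convert [x1, y1, x2, y2] (corner pair) to [x, y, w, h] (cv::Rect layout)."""
    x1, y1, x2, y2 = box
    return [x1, y1, x2 - x1, y2 - y1]

def xywh_to_xyxy(rect):
    """Convert [x, y, w, h] back to the [x1, y1, x2, y2] form the rest of clustering.py expects."""
    x, y, w, h = rect
    return [x, y, x + w, y + h]

# Example: a 154x207-pixel box anchored at (547, 432)
rect = xyxy_to_xywh([547, 432, 701, 639])
assert rect == [547, 432, 154, 207]                # rect[3] is now the true height
assert xywh_to_xyxy(rect) == [547, 432, 701, 639]  # round-trips losslessly
```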

samsparks commented 5 years ago

Is there a better venue for this discussion? I haven't received a response on NVIDIA's forums. TIA

drnikolaev commented 5 years ago

Hi @samsparks, sorry for the delay. I'm trying this out, but I'd appreciate a sample/unit test to check that the fix works correctly.

samsparks commented 5 years ago

Sure, @drnikolaev, I would be happy to provide example code. However, as this is an interface issue between clustering.py and opencv, I'm not sure what to provide beyond an inspection of the code.

Lines 167-172 show the extraction of the top-left and bottom-right coordinates of each bounding box into candidate boxes:

    x1 = (np.asarray([net_boxes[0][y[i]][x[i]] for i in list(range(x.size))]) + mx)
    y1 = (np.asarray([net_boxes[1][y[i]][x[i]] for i in list(range(x.size))]) + my)
    x2 = (np.asarray([net_boxes[2][y[i]][x[i]] for i in list(range(x.size))]) + mx)
    y2 = (np.asarray([net_boxes[3][y[i]][x[i]] for i in list(range(x.size))]) + my)

    boxes = np.transpose(np.vstack((x1, y1, x2, y2)))

These coordinates are returned from gridbox_to_boxes() and passed to vote_boxes() on lines 224-226

            propose_boxes, propose_cvgs, mask = gridbox_to_boxes(cur_cvg, cur_boxes, self)
            # Vote across the proposals to get bboxes
            boxes_cur_image = vote_boxes(propose_boxes, propose_cvgs, mask, self)

Finally (unless I am missing something), these values are passed without being converted properly to (x, y, width, height) in vote_boxes() on line 189

    nboxes, weights = cv.groupRectangles(
        np.array(propose_boxes).tolist(),
        self.gridbox_rect_thresh,
        self.gridbox_rect_eps)

This looks wrong based on the OpenCV documentation.

Additionally, after posting this question on their forum, I rebuilt OpenCV to test the interface. By adding debug statements to the implementation of groupRectangles(), I was able to confirm that the Python binding expects (x, y, width, height) rectangles.
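For reference, OpenCV decides whether two rectangles belong to the same group with a predicate along these lines (a Python paraphrase of the C++ SimilarRects functor; the exact form may vary between OpenCV versions):

```python
def similar_rects(r1, r2, eps):
    """Paraphrase of OpenCV's SimilarRects predicate; r1 and r2 are (x, y, w, h).

    The tolerance delta scales with the rectangles' sizes. If corner pairs
    (x1, y1, x2, y2) are passed instead, the width/height slots hold absolute
    coordinates and delta is inflated accordingly.
    """
    delta = eps * (min(r1[2], r2[2]) + min(r1[3], r2[3])) * 0.5
    return (abs(r1[0] - r2[0]) <= delta and
            abs(r1[1] - r2[1]) <= delta and
            abs(r1[0] + r1[2] - r2[0] - r2[2]) <= delta and
            abs(r1[1] + r1[3] - r2[1] - r2[3]) <= delta)

# Two nearby detections in (x, y, w, h) form group as expected...
assert similar_rects((547, 432, 154, 207), (557, 435, 143, 205), eps=0.2)
# ...while a distant one does not.
assert not similar_rects((547, 432, 154, 207), (88, 443, 248, 220), eps=0.2)
```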

Do you have an idea for what I can provide as example code? I am happy to do whatever I can to help.

drnikolaev commented 5 years ago

@samsparks please just give me an example of how exactly you execute clustering.py, against what dataset and/or model, and what you expect as the correct outcome.

samsparks commented 5 years ago

Hi @drnikolaev - I have not forgotten about this. Unfortunately, I do not have a trained model I can provide, and DIGITS does not allow testing of pretrained models :-(. So I am going to have to train something from scratch.

In the meantime, I have an example where I modified clustering.py to print out the input to groupRectangles() right before it is called:

    print("proposed: {}".format(np.array(propose_boxes).tolist()))

This output the following set of bounding boxes from clustering.py:

    [[547,432,701,639],[557,435,700,640],[560,438,695,641],[560,438,694,640],[88,443,336,663],[83,444,357,671],[83,444,373,676],[87,449,377,676],[87,454,380,677],[76,453,388,680],[72,447,394,683],[80,437,393,683],[101,430,392,678],[547,433,702,641],[555,433,701,645],[558,437,696,647],[556,440,696,644],[84,443,357,664],[73,448,369,665],[74,449,375,664],[81,451,373,664],[85,454,375,666],[81,454,385,672],[74,452,392,676],[77,445,396,679],[91,433,392,680],[547,430,705,644],[553,429,704,649],[555,434,697,649],[552,438,695,649],[85,445,365,661],[69,451,376,662],[69,452,379,663],[76,453,374,663],[80,452,377,666],[79,451,382,671],[74,449,388,673],[77,445,393,673],[90,434,389,674],[546,429,706,643],[553,428,703,647],[554,432,695,649],[553,435,693,654],[81,445,370,663],[68,454,383,664],[67,455,388,664],[72,454,384,667],[77,452,382,669],[71,448,386,671],[66,443,388,672],[73,438,389,671],[92,429,388,673],[545,429,706,642],[553,429,703,643],[553,432,695,647],[553,432,696,658],[79,450,367,664],[72,459,379,663],[71,459,387,665],[75,458,388,667],[75,455,390,666],[65,448,389,668],[63,441,387,669],[73,433,384,672],[100,425,388,675],[549,429,707,648],[550,429,701,652],[554,434,703,662],[79,462,356,665],[73,462,374,665],[74,461,383,666],[73,460,387,667],[69,457,391,664],[60,447,390,668],[63,435,385,673],[81,430,384,676],[116,433,390,677]]

And groupRectangles() returned:

    [[553, 433, 700, 647], [75, 449, 382, 669], [95, 430, 390, 676]]

Passing the same values to groupRectangles() in C++ returns:

    [[546,431,704,642],[70,447,389,672],[555,435,696,647],[74,455,381,666]]

I expect these two results to match, but they do not. I think the problem is in how clustering.py is calling groupRectangles().

The full source of the example can be found here

samsparks commented 5 years ago

Hi @drnikolaev -

I used the default DIGITS DetectNet (KITTI) model and KITTI images contained in data_object_image_2.zip.

The two images 003716.png and 003719.png provide good examples for the problem.

I can reproduce this reliably in jetson-inference only by malforming the construction of the cv::Rect objects.

I believe the current implementation of clustering.py works most of the time because groupRectangles() is grouping similar rectangles. It is reasonably forgiving if you pass in [x1, y1, x2, y2] instead of [x, y, width, height], because it is then just matching a pair of points instead of a point and a width/height. However, it does not work as well when detections are in the bottom right (too inclusive) or top left (too exclusive) of the image.
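That position dependence can be seen numerically. In OpenCV's similarity test, the merge tolerance scales with the values occupying the width/height slots; with corner pairs those slots hold absolute x2/y2 coordinates, so the tolerance grows toward the bottom right of the image (a sketch, using the delta formula from OpenCV's SimilarRects functor; exact details may differ across versions):

```python
def merge_delta(r1, r2, eps=0.2):
    """Tolerance used by OpenCV's grouping predicate: grows with the
    values in the width (index 2) and height (index 3) slots."""
    return eps * (min(r1[2], r2[2]) + min(r1[3], r2[3])) * 0.5

# The same 100x100 box shape, near the top left vs the bottom right,
# misinterpreted as (x, y, w, h) when it is really (x1, y1, x2, y2):
top_left     = (10, 10, 110, 110)      # really a 100x100 box at (10, 10)
bottom_right = (900, 900, 1000, 1000)  # really a 100x100 box at (900, 900)

print(merge_delta(top_left, top_left))          # 22.0  -> tight, too exclusive
print(merge_delta(bottom_right, bottom_right))  # 200.0 -> loose, too inclusive
```

With the correct (x, y, w, h) interpretation, both placements would yield the same tolerance (20.0 here), independent of position.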

See my fork of jetson-inference for the "broken" C++ code that replicates clustering.py. There is a define of REPLICATE_CLUSTERING_PY in detectNet.cpp that switches between the correct and incorrect construction of the cv::Rect objects.

Please note this will change the required values for epsilon. I plan on retraining my network after applying the following patch:

index 380df4a..d5c0589 100644
--- a/python/caffe/layers/detectnet/clustering.py
+++ b/python/caffe/layers/detectnet/clustering.py
@@ -188,14 +188,14 @@ def vote_boxes(propose_boxes, propose_cvgs, mask, self):
     # GROUP RECTANGLES Clustering
     ######################################################################
     nboxes, weights = cv.groupRectangles(
-        np.array(propose_boxes).tolist(),
+        [[e[0],e[1],e[2]-e[0],e[3]-e[1]] for e in np.array(propose_boxes).tolist()],
         self.gridbox_rect_thresh,
         self.gridbox_rect_eps)
     if len(nboxes):
         for rect, weight in zip(nboxes, weights):
-            if (rect[3] - rect[1]) >= self.min_height:
+            if rect[3] >= self.min_height:
                 confidence = math.log(weight[0])
-                detection = [rect[0], rect[1], rect[2], rect[3], confidence]
+                detection = [rect[0], rect[1], rect[0]+rect[2], rect[1]+rect[3], confidence]
                 detections_per_image.append(detection)

     return detections_per_image
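As a sanity check on the patch above, the two conversions it introduces (corner pair to cv::Rect layout on the way in, and back on the way out) should be exact inverses, and the new rect[3] height test should equal the old rect[3] - rect[1]:

```python
# Two of the proposal boxes from the debug output above, in [x1, y1, x2, y2] form
propose_boxes = [[547, 432, 701, 639], [88, 443, 336, 663]]

# Patch, way in: [x1, y1, x2, y2] -> [x, y, w, h] before groupRectangles()
rects = [[e[0], e[1], e[2] - e[0], e[3] - e[1]] for e in propose_boxes]

for box, rect in zip(propose_boxes, rects):
    # The new height test (rect[3]) matches the old one (box[3] - box[1])
    assert rect[3] == box[3] - box[1]
    # Patch, way out: [x, y, w, h] -> [x1, y1, x2, y2] when building detections
    assert [rect[0], rect[1], rect[0] + rect[2], rect[1] + rect[3]] == box
```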
drnikolaev commented 5 years ago

Fixed in v0.17.3