experiencor / keras-yolo3

Training and Detecting Objects with YOLO3
MIT License

Question about true_boxes in yolo.py #312

Closed ghost closed 2 years ago

ghost commented 3 years ago

I am testing several modifications to YOLO, one of which is to use four detection scales instead of the original three. However, modifying the shapes of some variables is a little confusing. One in particular is true_boxes on line 236 of yolo.py: true_boxes = Input(shape=(1, 1, 1, max_box_per_image, 4)). An issue posted a few days ago discussed grid scales, and from what I understand, changing grid_scales from [1, 1, 1] to [1, 1, 1, 1] would mean there would be four grids, which is helpful for my problem. Similarly, what is the intuition behind shape=(1, 1, 1, ...)? Are there three 1's because there are three detection layers? Are there any other variables I should be considering? YOLOv3 assumes three detection layers, so these numbers appear as defaults and there usually aren't any comments explaining them.

lexuansanh commented 3 years ago

true_boxes has shape (1, 1, 1, max_box_per_image, 4). The 4 is for (center_x, center_y, w, h), where center_x and center_y are the center of the object and depend on the grid cell. In other words, center_x and center_y are in the range (0, 13), (0, 26), or (0, 52). w and h are the size of the object, as output by the _aug_image function in the BatchGenerator class. true_boxes is reshaped to (1, 1, 1, max_box_per_image, 4) so that the computation can use broadcasting instead of multiple loops.

If you want four detection scales instead of three, you must modify your model, because the YOLOv3 model has three outputs with shapes (None, 13, 13, 3*(4+1+number_class)), (None, 26, 26, 3*(4+1+number_class)), and (None, 52, 52, 3*(4+1+number_class)). This means you must modify both the backbone (Darknet53) and its outputs.

You should check the YoloLayer and BatchGenerator classes for more information.
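To make the three output shapes above concrete, here is a minimal sketch that derives them from the network input size. The helper name head_output_shapes and the stride values (32, 16, 8) are assumptions for illustration, inferred from the 13/26/52 grids at a 416-pixel input; they are not taken from yolo.py.

```python
def head_output_shapes(net_size, nb_class, strides=(32, 16, 8), anchors_per_scale=3):
    """Return (grid_h, grid_w, depth) for each detection head.

    Hypothetical helper: the strides are an assumption based on the
    13/26/52 grids quoted for a 416x416 input.
    """
    # Each anchor predicts 4 box values, 1 objectness score and nb_class scores.
    depth = anchors_per_scale * (4 + 1 + nb_class)
    return [(net_size // s, net_size // s, depth) for s in strides]


# Three heads for a 416x416 input and 80 classes.
three_heads = head_output_shapes(416, 80)

# A fourth head at stride 4 (an assumption) would produce a 104x104 grid.
four_heads = head_output_shapes(416, 80, strides=(32, 16, 8, 4))
```

The batch dimension (None) is left out; only the per-image shape is computed.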

ghost commented 3 years ago

Thank you very much for replying so quickly. Just one more question, about line 237 in yolo.py: true_yolo_1 = Input(shape=(None, None, len(anchors)//6, 4+1+nb_class)) # grid_h, grid_w, nb_anchor, 5+nb_class. The original YOLOv3 model has 9 anchors, and I am assuming that means 3 anchors per detection layer, so I changed gen_anchors.py to generate 12 anchors instead of 9. To keep 3 anchors per detection layer, would I have to change the above code to true_yolo_1 = Input(shape=(None, None, len(anchors)//8, 4+1+nb_class)) # grid_h, grid_w, nb_anchor, 5+nb_class? 9 (# of anchors in original) * 2 / 6 coincidentally equals 3, so following the same pattern, 12 (# of anchors) * 2 / 8 = 3. I tried training it and got 0 mAP, so either using 4 detection layers is really bad or, more likely, I made some errors in implementing it.
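The arithmetic in this question can be checked with a small sketch. It assumes anchors is a flat list of width/height values, so its length is twice the number of anchors; the helper name anchors_per_scale and the placeholder zero values are hypothetical.

```python
def anchors_per_scale(anchors, num_scales):
    """Number of anchors each detection layer gets.

    Assumes `anchors` is a flat list [w0, h0, w1, h1, ...], so it holds
    two numbers per anchor.
    """
    if len(anchors) % (2 * num_scales) != 0:
        raise ValueError("anchor list does not split evenly across scales")
    return len(anchors) // (2 * num_scales)


# Placeholder values only; real anchors would come from gen_anchors.py.
nine_anchors = [0] * 18     # 9 anchors * 2 values
twelve_anchors = [0] * 24   # 12 anchors * 2 values

per_scale_original = anchors_per_scale(nine_anchors, 3)     # same as len(anchors)//6
per_scale_four = anchors_per_scale(twelve_anchors, 4)       # same as len(anchors)//8
```

Under that assumption the divisor is 2 * number of detection layers, which is why 6 becomes 8 when going from three layers to four.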

lexuansanh commented 3 years ago

You have changed true_yolo_box_1, but this is not enough. You also need to add true_yolo_box_4 and dummy_loss_4, and modify your model to create pred_yolo_4. That is hard to do, because the backbone of the model is Darknet53 with 3 outputs: [13, 13], [26, 26], and [52, 52]. If you want to add one more output, maybe its shape is [104, 104], haha.

ghost commented 3 years ago

yolo.py changes:

true_yolo_4 = Input(shape=(None, None, len(anchors)//8, 4+1+nb_class)) # grid_h, grid_w, nb_anchor, 5+nb_class

# added these on after 3rd yolo detection layer
x = _conv_block(x, [{'filter': 64, 'kernel': 1, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 108}], do_skip=False)
# Upsampling
x = UpSampling2D(2)(x)
# concatenate
x = concatenate([x, skip_11])

x = _conv_block(x, [{'filter': 64, 'kernel': 1, 'stride': 1, 'bnorm': True,  'leaky': True,  'layer_idx': 111},
                    {'filter': 128, 'kernel': 3, 'stride': 1, 'bnorm': True,  'leaky': True,  'layer_idx': 112},
                    {'filter': 64, 'kernel': 1, 'stride': 1, 'bnorm': True,  'leaky': True,  'layer_idx': 113},
                    {'filter': 128, 'kernel': 3, 'stride': 1, 'bnorm': True,  'leaky': True,  'layer_idx': 114},
                    {'filter': 64, 'kernel': 1, 'stride': 1, 'bnorm': True,  'leaky': True,  'layer_idx': 115}], do_skip=False)

pred_yolo_4 = _conv_block(x, [{'filter': 128, 'kernel': 3, 'stride': 1, 'bnorm': True,  'leaky': True, 'layer_idx': 116},
                              {'filter': (3*(5+nb_class)), 'kernel': 1, 'stride': 1, 'bnorm': False, 'leaky': False, 'layer_idx': 117}], do_skip=False)

loss_yolo_4 = YoloLayer(anchors[:6],
                        [8*num for num in max_grid],
                        batch_size,
                        warmup_batches,
                        ignore_thresh,
                        grid_scales[3],
                        obj_scale,
                        noobj_scale,
                        xywh_scale,
                        class_scale)([input_image, pred_yolo_4, true_yolo_4, true_boxes])

For the darknet model, I added these lines and added the skip connection in layer 11. Also, at the end, of course I changed the return model variables to:

train_model = Model([input_image, true_boxes, true_yolo_1, true_yolo_2, true_yolo_3, true_yolo_4], [loss_yolo_1, loss_yolo_2, loss_yolo_3, loss_yolo_4])
infer_model = Model(input_image, [pred_yolo_1, pred_yolo_2, pred_yolo_3, pred_yolo_4])

In generator.py, lines 61-65, I added

yolo_4 = np.zeros((r_bound - l_bound, 8*base_grid_h,  8*base_grid_w, len(self.anchors)//4, 4+1+len(self.labels))) # desired network output 4
yolos = [yolo_4, yolo_3, yolo_2, yolo_1]
dummy_yolo_4 = np.zeros((r_bound - l_bound, 1))

and on line 95

# determine the yolo to be responsible for this bounding box
yolo = yolos[max_index//4]

That was the extent of my changes, along with generating 12 anchors instead of the original 9, changing grid_scales to [1, 1, 1, 1] in config.json, and fixing the anchor slices for the other three loss_yolo layers (for example, changing to anchors[6:12] for loss_yolo_3 to account for the changed anchor count). Still, I get a very low mAP. There was a paper that discussed using 4 detection layers as an optimization, so I am sure I messed up the implementation somewhere.
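One thing that may be worth checking is whether the generator's layer selection agrees with the anchor slices. The sketch below assumes that max_index indexes the 12 anchors in the same order as the flat anchor list, and that each detection layer owns 3 consecutive anchors (anchors[:6] for loss_yolo_4, anchors[6:12] for loss_yolo_3, and so on, as described above). The helper names are hypothetical and nothing here is taken from generator.py.

```python
def layer_and_slot(max_index, anchors_per_layer=3):
    """Map a best-anchor index to (position in `yolos`, anchor slot in that layer).

    Assumes consecutive groups of `anchors_per_layer` anchors belong to the
    same detection layer.
    """
    return divmod(max_index, anchors_per_layer)


def flat_anchor_slice(layer_pos, anchors_per_layer=3):
    """Slice of the flat [w0, h0, w1, h1, ...] list owned by one layer."""
    start = 2 * anchors_per_layer * layer_pos
    return slice(start, start + 2 * anchors_per_layer)


# Dividing by anchors per layer (3) reaches all four entries of `yolos`.
layers_div3 = sorted({layer_and_slot(i)[0] for i in range(12)})

# Dividing by the number of layers (4) only ever reaches three of them.
layers_div4 = sorted({i // 4 for i in range(12)})
```

Under these assumptions, max_index//4 never selects the last entry of yolos and groups the anchors in fours, while the loss layers group them in threes. If the assumptions hold, dividing by the number of anchors per layer would keep the two files consistent.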

lexuansanh commented 3 years ago

Your changes look reasonable. You can call model.summary() to see the size of pred_yolo_4. The input image size should be divisible by this size, as in 416/32 = 13, 416/16 = 26, 416/8 = 52. That's just my opinion, because I haven't tried adding a fourth detection grid to my model ^^!
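The divisibility check suggested here can be sketched as a small validation step. The function name grid_sizes is hypothetical, and the stride of 4 for the fourth layer is an assumption based on the 8*max_grid factor and the [104, 104] guess earlier in the thread.

```python
def grid_sizes(net_size, strides=(32, 16, 8, 4)):
    """Return the grid size per detection layer, or raise if any stride
    does not divide the input size evenly.

    The stride of 4 for a fourth layer is an assumption.
    """
    bad = [s for s in strides if net_size % s != 0]
    if bad:
        raise ValueError(f"input size {net_size} is not divisible by strides {bad}")
    return [net_size // s for s in strides]


grids_416 = grid_sizes(416)
```

The grid sizes printed by model.summary() for the four pred_yolo outputs can then be compared against this list.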

ghost commented 3 years ago

Should I change true_boxes = Input(shape=(1, 1, 1, max_box_per_image, 4))? Sorry, I still don't really understand the point of those three 1's and whether they need modifying for 4 detection layers. Is there anything else I could have missed? I am still getting 0 mAP, although the default model gives 0.80 mAP.

lexuansanh commented 3 years ago

We have: true_boxes = Input(shape=(1, 1, 1, max_box_per_image, 4)), and y_pred has shape (grid_h, grid_w, 3*(4+1+num_class)).

When training:
true_boxes shape = (batch_size, 1, 1, 1, max_box_per_image, 4)
y_pred shape = (batch_size, grid_h, grid_w, 3*(4+1+num_class))

First, y_pred is reshaped to (batch_size, grid_h, grid_w, 3, 4+1+num_class). Then:
Rank of true_boxes is 6
Rank of y_pred is 5

Check lines 57 to 106 of yolo.py. You will see that, to compute the IOU, y_pred is expanded by one dimension (expand_dims) so that its rank equals the rank of true_boxes. The purpose is broadcasting. Specifically:
line 86: pred_xy = tf.expand_dims(pred_box_xy / grid_factor, 4)
line 87: pred_wh = tf.expand_dims(tf.exp(pred_box_wh) * self.anchors / net_factor, 4)
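The shape bookkeeping above can be reproduced with NumPy as a stand-in for the TensorFlow ops. This is only a sketch under the shapes stated in this comment, not the code from yolo.py; the function name broadcast_iou is hypothetical, and boxes are assumed to be (center_x, center_y, w, h) in a common unit.

```python
import numpy as np


def broadcast_iou(pred_xy, pred_wh, true_boxes):
    """IoU of every predicted box against every true box via broadcasting.

    pred_xy, pred_wh: (batch, grid_h, grid_w, nb_anchor, 2)
    true_boxes:       (batch, 1, 1, 1, max_box_per_image, 4)
    returns:          (batch, grid_h, grid_w, nb_anchor, max_box_per_image)
    """
    # Insert an axis at position 4: (batch, grid_h, grid_w, nb_anchor, 1, 2).
    pred_xy = np.expand_dims(pred_xy, 4)
    pred_wh = np.expand_dims(pred_wh, 4)

    true_xy = true_boxes[..., 0:2]  # (batch, 1, 1, 1, max_box, 2)
    true_wh = true_boxes[..., 2:4]

    pred_min, pred_max = pred_xy - pred_wh / 2.0, pred_xy + pred_wh / 2.0
    true_min, true_max = true_xy - true_wh / 2.0, true_xy + true_wh / 2.0

    # The size-1 axes of true_boxes stretch over grid_h, grid_w and nb_anchor;
    # the size-1 axis of the predictions stretches over max_box.
    inter_wh = np.maximum(np.minimum(pred_max, true_max) - np.maximum(pred_min, true_min), 0.0)
    inter = inter_wh[..., 0] * inter_wh[..., 1]
    union = pred_wh[..., 0] * pred_wh[..., 1] + true_wh[..., 0] * true_wh[..., 1] - inter
    return inter / np.maximum(union, 1e-9)  # guard against division by zero
```

Read this way, the three 1's in true_boxes line up with the grid_h, grid_w, and anchor axes of a single detection layer rather than with the number of detection layers, so the same true_boxes input can be passed to each YoloLayer, as the loss_yolo_4 call earlier in the thread already does.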