Closed ghost closed 2 years ago
true_boxes has shape (1, 1, 1, max_box_per_image, 4). The 4 is for (center_x, center_y, w, h), where center_x and center_y are the center of the object expressed in grid-cell units. In other words, center_x and center_y are in the range (0, 13), (0, 26) or (0, 52), depending on the grid. w and h are the width and height of the object, as output by the _aug_image function in the BatchGenerator class. true_boxes is reshaped to (1, 1, 1, max_box_per_image, 4) so that the loss can use broadcasting instead of nested loops.
If you want four detection scales instead of 3, you must modify your model. The YOLOv3 model has 3 outputs, with shapes (None, 13, 13, 3*(4+1+number_class)), (None, 26, 26, 3*(4+1+number_class)) and (None, 52, 52, 3*(4+1+number_class)). That means you must modify both the backbone (Darknet53) and its outputs.
Check the YoloLayer and BatchGenerator classes for more information.
Thank you very much for replying so quickly. Just one more question: line 237 in yolo.py.
true_yolo_1 = Input(shape=(None, None, len(anchors)//6, 4+1+nb_class)) # grid_h, grid_w, nb_anchor, 5+nb_class
The original yolov3 model has 9 anchors, and I am assuming it is 3 anchors per detection layer, so I changed generating 9 anchors to 12 in gen_anchors.py. Would that mean, to keep the 3 anchors per detection layer constant, I would have to change the above code to
true_yolo_1 = Input(shape=(None, None, len(anchors)//8, 4+1+nb_class)) # grid_h, grid_w, nb_anchor, 5+nb_class
?
9 (# of anchors in the original) * 2 / 6 coincidentally equals 3, so following the same pattern, 12 (# of anchors) * 2 / 8 = 3. I tried training it and got 0 mAP, so either using 4 detection layers is really bad or, more likely, I made some errors in implementing it.
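To make that arithmetic concrete, here is a minimal sketch. It assumes `anchors` is the flat list of width/height values (so 9 anchors are 18 numbers); the helper name `anchors_per_scale` and the placeholder values are made up for illustration.

```python
def anchors_per_scale(anchors, num_scales):
    """Anchors per detection layer for a flat [w0, h0, w1, h1, ...] list."""
    values_per_anchor = 2  # one width and one height per anchor
    return len(anchors) // (values_per_anchor * num_scales)

anchors_9 = list(range(18))   # placeholder values: 9 anchors -> 18 numbers
anchors_12 = list(range(24))  # placeholder values: 12 anchors -> 24 numbers

print(anchors_per_scale(anchors_9, 3))   # same as len(anchors) // 6
print(anchors_per_scale(anchors_12, 4))  # same as len(anchors) // 8
```

Both calls give 3 anchors per detection layer, which matches the `//6` to `//8` change asked about above.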
You have changed true_yolo_box_1, but this is not enough. You also need to add true_yolo_box_4 and dummy_loss_4, and modify your model to create pred_yolo_4. These are hard to do, because the backbone of the model is Darknet53 with 3 outputs: [13,13], [26,26], [52,52]. If you want to add 1 more output, maybe its shape is [104,104] haha
yolo.py changes
true_yolo_4 = Input(shape=(None, None, len(anchors)//8, 4+1+nb_class)) # grid_h, grid_w, nb_anchor, 5+nb_class
# added these on after 3rd yolo detection layer
x = _conv_block(x, [{'filter': 64, 'kernel': 1, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 108}], do_skip=False)
# Upsampling
x = UpSampling2D(2)(x)
# concatenate
x = concatenate([x, skip_11])
x = _conv_block(x, [{'filter': 64, 'kernel': 1, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 111},
                    {'filter': 128, 'kernel': 3, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 112},
                    {'filter': 64, 'kernel': 1, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 113},
                    {'filter': 128, 'kernel': 3, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 114},
                    {'filter': 64, 'kernel': 1, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 115}], do_skip=False)
pred_yolo_4 = _conv_block(x, [{'filter': 128, 'kernel': 3, 'stride': 1, 'bnorm': True, 'leaky': True, 'layer_idx': 116},
                              {'filter': (3*(5+nb_class)), 'kernel': 1, 'stride': 1, 'bnorm': False, 'leaky': False, 'layer_idx': 117}], do_skip=False)
loss_yolo_4 = YoloLayer(anchors[:6],
                        [8*num for num in max_grid],
                        batch_size,
                        warmup_batches,
                        ignore_thresh,
                        grid_scales[3],
                        obj_scale,
                        noobj_scale,
                        xywh_scale,
                        class_scale)([input_image, pred_yolo_4, true_yolo_4, true_boxes])
For the darknet model, I added these lines and added the skip connection in layer 11. Also, at the end, of course I changed the return model variables to:
train_model = Model([input_image, true_boxes, true_yolo_1, true_yolo_2, true_yolo_3, true_yolo_4], [loss_yolo_1, loss_yolo_2, loss_yolo_3, loss_yolo_4])
infer_model = Model(input_image, [pred_yolo_1, pred_yolo_2, pred_yolo_3, pred_yolo_4])
In generator.py, line 61 - 65, I added
yolo_4 = np.zeros((r_bound - l_bound, 8*base_grid_h, 8*base_grid_w, len(self.anchors)//4, 4+1+len(self.labels))) # desired network output 4
yolos = [yolo_4, yolo_3, yolo_2, yolo_1]
dummy_yolo_4 = np.zeros((r_bound - l_bound, 1))
and on line 95
# determine the yolo to be responsible for this bounding box
yolo = yolos[max_index//4]
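One thing that may be worth double-checking here, offered as a hedged sketch rather than a diagnosis: if every scale keeps 3 anchors, the divisor that picks the scale would be the number of anchors per scale (3), not the number of scales (4). The helper name `assign` is hypothetical, and the sketch assumes `max_index` runs over 12 anchors.

```python
def assign(max_index, anchors_per_scale=3):
    """Map a best-matching anchor index to (scale index, anchor slot)."""
    return max_index // anchors_per_scale, max_index % anchors_per_scale

num_anchors = 12
# Dividing by the anchors per scale reaches all four entries of `yolos`.
scales_div3 = sorted({assign(i)[0] for i in range(num_anchors)})
# Dividing by 4 instead only ever selects the first three entries.
scales_div4 = sorted({i // 4 for i in range(num_anchors)})

print(scales_div3)  # [0, 1, 2, 3]
print(scales_div4)  # [0, 1, 2]
```

With `max_index//4`, the last entry of `yolos` would never receive a box, and anchors would be grouped in fours even though each output tensor only has 3 anchor slots.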
This was the extent of my changes, along with generating 12 anchors instead of the original 9, changing grid_scales to [1, 1, 1, 1] in config.json, and fixing the anchor slices for the other three loss_yolos (for example, changing to anchors[6:12] for loss_yolo_3 in response to the changed anchor count). Still, I get a disgustingly low mAP. There was a paper that discussed using 4 detection layers as an optimization, so I am sure I somehow messed up the implementation.
I see you have made reasonable changes. You can call model.summary() to see the size of pred_yolo_4. The input image size should be divisible by this size, like 416/32 = 13, 416/16 = 26, 416/8 = 52. That's just my opinion, because I haven't tried adding 1 more detection grid to my model ^^!
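As a quick way to sanity-check those sizes, a small sketch. The stride of 4 for a fourth scale is an assumption (it is what would give the [104,104] grid mentioned earlier), and `grid_sizes` is a hypothetical helper, not something in the repo.

```python
def grid_sizes(input_size, strides=(32, 16, 8, 4)):
    """Return the grid side length per stride; raise if not divisible."""
    sizes = []
    for stride in strides:
        if input_size % stride != 0:
            raise ValueError(f"{input_size} is not divisible by stride {stride}")
        sizes.append(input_size // stride)
    return sizes

print(grid_sizes(416))  # [13, 26, 52, 104]
```

The spatial size that model.summary() reports for pred_yolo_4 can then be compared against the last entry.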
Should I change
true_boxes = Input(shape=(1, 1, 1, max_box_per_image, 4))
?
Sorry, I still don't really understand the point of those three 1's and whether they need modifying for 4 detection layers. Is there anything else I could have missed? I am still getting 0 mAP, although the default model gives 0.80 mAP.
We have:
true_boxes = Input(shape=(1, 1, 1, max_box_per_image, 4))
y_pred has shape (grid_h, grid_w, 3*(4+1+num_class))
when training:
true_boxes shape = (batch_size, 1, 1, 1, max_box_per_image, 4)
y_pred shape = (batch_size, grid_h, grid_w, 3*(4+1+num_class))
First, y_pred is reshaped to (batch_size, grid_h, grid_w, 3, (4+1+num_class)). Then:
Rank of true_boxes is 6
Rank of y_pred is 5
Check lines 57 to 106 of yolo.py. You will see that, to compute the IoU, the predicted boxes get one extra dimension via expand_dims so that their rank equals the rank of true_boxes. The purpose is broadcasting.
Specifically:
line 86: pred_xy = tf.expand_dims(pred_box_xy / grid_factor, 4)
line 87: pred_wh = tf.expand_dims(tf.exp(pred_box_wh) * self.anchors / net_factor, 4)
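The broadcasting can be sketched in plain NumPy. This is an illustration with random data, not the repo's code: the sizes and variable values are made up, and the real layer works on TensorFlow tensors after the scaling shown in lines 86 and 87. The three 1's in true_boxes stand in for grid_h, grid_w and nb_anchor, so in this sketch they do not depend on how many detection layers there are.

```python
import numpy as np

batch, grid_h, grid_w, nb_anchor, max_boxes = 2, 13, 13, 3, 5
rng = np.random.default_rng(0)

# Predicted centers and sizes: one box per (cell, anchor).
pred_xy = rng.random((batch, grid_h, grid_w, nb_anchor, 2))
pred_wh = rng.random((batch, grid_h, grid_w, nb_anchor, 2))
# Ground-truth boxes, shared by every cell and anchor.
true_boxes = rng.random((batch, 1, 1, 1, max_boxes, 4))

# Add an axis so predictions have rank 6, like true_boxes.
pred_xy = np.expand_dims(pred_xy, 4)  # (2, 13, 13, 3, 1, 2)
pred_wh = np.expand_dims(pred_wh, 4)
true_xy, true_wh = true_boxes[..., 0:2], true_boxes[..., 2:4]

# Corner coordinates, then intersection over union via broadcasting.
pred_min, pred_max = pred_xy - pred_wh / 2, pred_xy + pred_wh / 2
true_min, true_max = true_xy - true_wh / 2, true_xy + true_wh / 2
inter_wh = np.maximum(np.minimum(pred_max, true_max) - np.maximum(pred_min, true_min), 0.0)
inter = inter_wh[..., 0] * inter_wh[..., 1]
union = pred_wh[..., 0] * pred_wh[..., 1] + true_wh[..., 0] * true_wh[..., 1] - inter
iou = inter / union  # every predicted box against every true box

best_iou = iou.max(axis=4)  # best match per predicted box
print(iou.shape, best_iou.shape)
```

Every predicted box is compared against every true box in one shot, with no loop over cells, anchors or boxes.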
I am testing out several modifications to yolo, and one of them is to use four detection scales instead of the original 3. However, modifying the shapes of some variables can be a little confusing. One in particular is true_boxes on line 236 in yolo.py.
true_boxes = Input(shape=(1, 1, 1, max_box_per_image, 4))
There was an issue posted a few days ago about grid scales, and from what I understand, changing it from [1, 1, 1] to [1, 1, 1, 1] would mean there would be four grids, which is helpful for my problem. Similarly, what is the intuition behind shape=(1, 1, 1, ...)? Are there three 1's because there are three detection layers? Are there any other variables I should be considering? The assumption in yolov3 is that there are three detection layers, so the numbers are hard-coded defaults and there usually aren't any comments about them.