JohnWnx commented 6 years ago

Hi may I know what needs to be changed for training with 4-point coordinates labels, rather than xywh?

I have been trying to edit the current version of YOLO to train labels containing such format: x1,y1,x2,y2,x3,y3,x4,y4 rather than the current xywh format.

1) what does index & entry_index() in yolo_layer.c do? I understand that the values "i & j" are used in this function wheref "i" is related to truth.x while "j" is related to truth.y. In this case of x1-x4 and y1-y4, will i need j1-4 and i1-4?

2) Replacing instances of (4+1) with (8+1) in yolo_layer.c I have replaced instances of " int class_id = state.truth[t(4 + 1) + bl.truths + 4];" with: "int class_id = state.truth[t(8 + 1) + bl.truths + 8];" I have replaced 4 with 8, as there are 8 parameters (excluding class id) for each bounding box instead of the original 4 (xywh). I have also performed the changes on: box truth = float_to_box_stride(state.truth + t(8 + 1) + bl.truths, 1); //UPDATED

Would I also need to replace 4 to 8 for the following functions? static int entry_index(layer l, int batch, int location, int entry) { int n = location / (l.wl.h); int loc = location % (l.wl.h); return batchl.outputs + nl.w*l.h(4+l.classes+1) + entryl.wl.h + loc; } I have also tried changing the following line: //l.outputs = hwn(classes + 4 + 1); to l.outputs = hwn*(classes + 8 + 1);

However, I receive the following error when attempting to run: "Error: l.outputs == params.inputs filters= in the [convolutional]-layer doesn't correspond to classes= or mask= in [yolo]-layer "

3) Is this the correct method to predict the coordinates for the 4 coordinates of the bounding boxes? (I don't see the connection between the prediction equations in figure 2 of the yolov3 paper being related to the calculations performed in get_yolo_box() or delta_yolo_box(). )

in get_yolo_box(): of yolo_layer.c I'm no longer using this: b.w = exp(x[index + 2stride]) biases[2n] / w; Instead, I predict the 8 values of the 4 coordinates (except of my code is shown below: ie. b.x1 = (i + x[index + 0stride]) / lw; stored in x --> x[] b.y1 = (j + x[index + 1stride]) / lh; stored in x --> x[] b.x2 = (i + x[index + 2stride]) / lw; stored in x --> x[] b.y2 = (j + x[index + 3stride]) / lh;

Also in delta_yolo_box() of yolo_layer.c: i'm no longer using this: float tw = log(truth.ww / biases[2n]); Instead, I predict the 8 values of the 4 coordinates (except of my code is shown below): float tx1 = (truth.x1lw - i); float ty1 = (truth.y1lh - j); float tx2 = (truth.x2lw - i); float ty2 = (truth.y2lh - j);

delta[index + 0*stride] = scale * (tx1 - x[index + 0*stride]); 
delta[index + 1*stride] = scale * (ty1 - x[index + 1*stride]);
delta[index + 2*stride] = scale * (tx2 - x[index + 2*stride]); 
delta[index + 3*stride] = scale * (ty2 - x[index + 3*stride]);

Thank you.

*Thus far, I have made changes to mainly data.c (handling the reading of new label format), yolo_layer.c (for predictions) and box.c (for computation of IOU).

AlexeyAB commented 6 years ago

@JohnWnx Hi,

Function entry_index() is required to extract (x,y,w,h,objectness,class_prob_0,class_porb1, ...) from Channel-axis of final activations [width, height, channel]

You should rewrite yolo_layer.c - change each function except backward_yolo_layer() and backward_yolo_layer_gpu() - Including you have to change almost every place where there is a number 4. Also change data loading read_boxes(), fill_truth_detection(), ... in data.c, also get_network_boxes(), do_nms_sort(), draw_detections_v3(), ...

However, I receive the following error when attempting to run: "Error: l.outputs == params.inputs filters= in the [convolutional]-layer doesn't correspond to classes= or mask= in [yolo]-layer "

You should change filters= in your cfg-file before each of 3 [yolo]-layers to (1 + 8 + classes)*3

You should use

delta[index + 0*stride] = scale * (tx - x[index + 0*stride]);
delta[index + 1*stride] = scale * (ty - x[index + 1*stride]);
delta[index + 2*stride] = scale * (tw - x[index + 2*stride]);
delta[index + 3*stride] = scale * (th - x[index + 3*stride]);
delta[index + 4*stride] = scale * (tx - x[index + 4*stride]);
delta[index + 5*stride] = scale * (ty - x[index + 5*stride]);
delta[index + 6*stride] = scale * (tw - x[index + 6*stride]);
delta[index + 7*stride] = scale * (th - x[index + 7*stride]);

and etc...

(I don't see the connection between the prediction equations in figure 2 of the yolov3 paper being related to the calculations performed in get_yolo_box() or delta_yolo_box(). )

As you can see: https://github.com/AlexeyAB/darknet/blob/18d5e4f39c1441f2c21043ac3204b5cb279f8758/src/yolo_layer.c#L84-L92 Logistic activations for Tx and Ty: https://github.com/AlexeyAB/darknet/blob/18d5e4f39c1441f2c21043ac3204b5cb279f8758/src/yolo_layer.c#L175

In the get_yolo_box() is calculated:

    b.x = (Cx + Tx_logistic_activated) / layer_w;
    b.y = (Cy + Ty_logistic_activated) / layer_h;
    b.w = exp(Tw) * Pw   / layer_w;
    b.h = exp(Th) * Ph / layer_h;

that is almost the same as in the image: 45011093-5464f500-b043-11e8-9d27-64b81193a7b1

JohnWnx commented 6 years ago

Hi @AlexeyAB ,

(Sorry for the lengthy post) I understand that YOLO uses a fixed rectangle bounding box defined by (xywh) for training and prediction. The labels which I have been trying to supply to YOLO is in the format of (x1,y1,x2,y2,x3,y3,x4,y4) which is no longer a fixed rectangle shape. (The labels are text datasets from ICDAR 2015, 1280 by 720 size images) After the performing the changes,the IOU is always zero. After plotting out the actual coordinates, the predicted box (left) seems to be extremely off compared to the ground truth box (right).

From my understanding, I am using the detector_train() function in detector.c during the training process. The functions do_nms_sort(), draw_detections_v3() from image.c does not seem be be used during training. Although, I understand that they will be used during either demo() or detector_test() when using YOLO for detecting images. Therefore, would I still need to edit these functions in order for training the new labels?

I have made the following changes:

yolo-text.cfg: I have changed the filters in my cfg file before each of the 3 yolo-layers to (1+8+classes)3 = (1+8+1)3=30

1) data.c:

read_boxes(): modified to scan ground truth bounding box labels in the format of:(x1,y1,x2,y2,x3,,y3,x4,y4)

Fill_truth_detection(): truth[i*9+0] = x1; etc is now being used instead as each truth[] now consists of 8 +1 parameters. ((x1,y1,x2,y2,x3,,y3,x4,y4 + id)

Load_data_detection(): I have changed make_matrix ‘s value of 5 to 9, as there are now (8 parameters + id =9).

* what are “pleft,pright,ptop,pbot”**? I realized that they will be used later in image_data_augmentation(). Would training still work for (x1,y1,x2,y2,x3,y3,x4,y4) format? I realized that the “p-values” are just randomly generated numbers within the range of image height and width of the image. Thus, I have also created p-values for (x1, y1,x2,y2,x3, y3,x4,y4). These values would be later used for correct_yolo_boxes().

correct_yolo_boxes(): I have replaced the adjustment of boxes’s parameters to (x1,y1,x2,y2,x3,y3,x4,y4) format. ***Would the “flip” process still be required? I have commented out the “flip” portion as flip seems to be no longer applicable here as the bounding box is no longer defined as (left,top,bottom,right), but rather as (x1,y1,x2,y2,x3,y3,x4,y4).

2) yolo_layer.c

Make_yolo_layer(): replaced all instances of (classes + 4 + 1) to (classes + 8 + 1)

Resize_yolo_layer(): replaced all instances of (classes + 4 + 1) to (classes + 8 + 1)

Get_yolo_box(): b.x1 = (i + x[index + 0*stride]) / lw;, replaced from (xywh) to (x1,y1,x2,y2,x3,y3,x4,y4)

Delta_yolo_box():
float tx1 = (truth.x1lw - i); delta[index + 0stride] = scale (tx1 - x[index + 0stride]); Replaced from (xywh) to (x1,y1,x2,y2,x3,y3,x4,y4)

Entry_index(): replaced all instances of (classes + 4 + 1) to (classes + 8 + 1)

float_to_box_stride(): Replaced from (xywh) to (x1,y1,x2,y2,x3,y3,x4,y4).

Forward_yolo_layer(): Replaced all instances of (classes + 4 + 1) to (classes + 8 + 1). Replaced the break condition to: if(!truth.x1 && !truth.x2 && !truth.x3 && !truth.x4 && !truth.y1 && !truth.y2 && !truth.y3 && !truth.y4) break; **what does “ i = (truth.x l.w); and j = (truth.y * l.h);” do? Would I need to replace them as well?

Correct_yolo_boxes(): b.x1 = (b.x1 - (netw - new_w)/2./netw) / ((float)new_w/netw);Replaced from (xywh) to (x1,y1,x2,y2,x3,y3,x4,y4). ***What does the”if(!relative)” condition mean?

Avg_flipped_yolo(): replaced all instances of (classes + 4 + 1) to (classes + 8 + 1)

Get_yolo_detections():replaced all instances of (classes + 4 + 1) to (classes + 8 + 1)

Forward_yolo_layer_gpu():replaced all instances of (classes + 4 + 1) to (classes + 8 + 1)

3) box.c

Box_iou(): I have replaced the original IOU calculation functions with a function that calculates the IOU of 2 four-sided polygons.

Thank you.

AlexeyAB commented 6 years ago

You should NOT do this

You should change this function: https://github.com/AlexeyAB/darknet/blob/18d5e4f39c1441f2c21043ac3204b5cb279f8758/src/data.c#L301 and this: https://github.com/AlexeyAB/darknet/blob/18d5e4f39c1441f2c21043ac3204b5cb279f8758/src/data.c#L187-L222

For changing your 8 points by using these values pleft,pright,ptop,pbot.

what are “pleft,pright,ptop,pbot”? I realized that they will be used later in image_data_augmentation().

Yes

Would training still work for (x1,y1,x2,y2,x3,y3,x4,y4) format?

If you will change correct_boxes() and other functions, then yes.

Would the “flip” process still be required?

If you want to use data augmentation - then flip is required.

**what does “ i = (truth.x l.w); and j = (truth.y * l.h);” do? Would I need to replace them as well?

It is required to find the most suitable final activation. You need to calculate 4 values i,j for each of 4 points, and use average i,j.

https://github.com/AlexeyAB/darknet/blob/18d5e4f39c1441f2c21043ac3204b5cb279f8758/src/yolo_layer.c#L243-L244

***What does the”if(!relative)” condition mean?

It means that coordinates is relative to image size, i.e. it is always from 0.0 ti 1.0 for Yolo.

JohnWnx commented 6 years ago

@AlexeyAB

It is required to find the most suitable final activation. You need to calculate 4 values i,j for each of 4 points, and use average i,j.

I understand that forward_yolo_layer() is organized in the following manner:

The red section adjusts the center coordinates of the predicted box (x,y). Where j is adjusted along the height of the network layer testing size, l.h. Similarly for i is adjusted along the width, l.w. n refers to the numbers of images within the current batch. Does that mean the red section loop will break upon obtaining an optimal IOU, thereby the current i and j will present the best (x,y) offset values? Did you mean that I have to repeat the all the processes in the red section for 4 times to calculate the 4 values each for i and j?

The green section adjusts the width and height of the predicted box (w,h). Since I am only trying to adjust x1,x2,x3,x4,y1,y2,y3,y4, Is this section still required then? Or did you mean that I should edit the section in green such that I obtain the average value which will be truncated to an integer value: i.e. something like this: int i1 = (truth.x1 l.w); int i2 = (truth.x2 l.w); int i3 = (truth.x3 l.w); int i4 = (truth.x4 l.w); int j1 = (truth.y1 l.h); int j2 = (truth.y2 l.h); int j3 = (truth.y3 l.h); int j4 = (truth.y4 l.h); i = (i1+i2+i3+i4)/4; j = (j1+j2+j3+j4)/4;

Also, the section in blue seems to be adjusting the height and width of the box. Is this sectionno longer be required?

Thank you.

AlexeyAB commented 6 years ago

Red section - search for final activations where are no objects
Green section - search for final activations where are objects

n refers to the numbers of images within the current batch.

n - refers to the number of anchors in this layer (3 for yolov3, 5 for yolov2) - it is not required if you don't use Anchors
b - refers to the numbers of images within the current batch.

Did you mean that I have to repeat the all the processes in the red section for 4 times to calculate the 4 values each for i and j?

No. In the red-section you should calculate (average_x,average_y) of your 4 points, and use it instead of bounding box cetner.

Or did you mean that I should edit the section in green such that I obtain the average value which will be truncated to an integer value:

Yes. And you should pass these averege (i,j) to the rewriten delta_yolo_box() function.

The green section adjusts the width and height of the predicted box (w,h). Since I am only trying to adjust x1,x2,x3,x4,y1,y2,y3,y4, Is this section still required then?

This section still required. Green section:

adjusts: width and height
adjusts: x and y - get the most suitable 4-points (instead of bbox) for the current (i,j) https://github.com/AlexeyAB/darknet/blob/18d5e4f39c1441f2c21043ac3204b5cb279f8758/src/yolo_layer.c#L261
adjusts: classes probability - you should do it if you have several classes: https://github.com/AlexeyAB/darknet/blob/18d5e4f39c1441f2c21043ac3204b5cb279f8758/src/yolo_layer.c#L270

Also, the section in blue seems to be adjusting the height and width of the box. Is this sectionno longer be required?

If you don't use Anchors, then Blue section is not required.

JohnWnx commented 6 years ago

Hi @AlexeyAB , Here are my rewritten functions:

No. In the red-section you should calculate (average_x,average_y) of your 4 points, and use it instead of bounding box cetner.

Unlike the green section where i = (truth.x * l.w); j = (truth.y * l.h); (lines 243-244 of original yolo_layer.c)

i & j is not specified here. Their values seems to change with the "for" loops. How did you calculate the original (x,y) value in the code?

AlexeyAB commented 6 years ago

Unlike the green section where i = (truth.x l.w); j = (truth.y l.h); (lines 243-244 of original yolo_layer.c)

i & j is not specified here. Their values seems to change with the "for" loops. How did you calculate the original (x,y) value in the code?

Yes, you are right. There in Green-section you shouldn't get average (i,j). You should get the most suitable 4-points (instead of bbox) for the current (i,j) in this part of code: https://github.com/AlexeyAB/darknet/blob/18d5e4f39c1441f2c21043ac3204b5cb279f8758/src/yolo_layer.c#L247-L256

JohnWnx commented 6 years ago

@AlexeyAB did you mean for the green section:

With the newly predicted 4-points based on the average (i,j) calculated, the IOU between pred and truth is now calculated again.

Meanwhile, nothing will be needed to be changed in the red section, except for the functions which have already been reflected in my previous post?

I have tried running the training process, but the IOU still shows 0.

JohnWnx commented 6 years ago

Here's my forward_yolo_layer() function:

void forward_yolo_layer(const layer l, network_state state) { int i,j,b,t,n; memcpy(l.output, state.input, l.outputsl.batchsizeof(float));

ifndef GPU //unused, as GPU is used.

for (b = 0; b < l.batch; ++b){
    for(n = 0; n < l.n; ++n){
        int index = entry_index(l, b, n*l.w*l.h, 0); 
        activate_array(l.output + index, 2*l.w*l.h, LOGISTIC);
        //index = entry_index(l, b, n*l.w*l.h, 4); //Replaced
        index = entry_index(l, b, n*l.w*l.h, 8); 
        activate_array(l.output + index, (1+l.classes)*l.w*l.h, LOGISTIC);
    }
}

endif

memset(l.delta, 0, l.outputs * l.batch * sizeof(float));
if(!state.train) return;
float avg_iou = 0;
float recall = 0;
float recall75 = 0;
float avg_cat = 0;
float avg_obj = 0;
float avg_anyobj = 0;
int count = 0;
int class_count = 0;
*(l.cost) = 0;

for (b = 0; b < l.batch; ++b) {
    for (j = 0; j < l.h; ++j) { 
        for (i = 0; i < l.w; ++i) { 
            for (n = 0; n < l.n; ++n) { 
                int box_index = entry_index(l, b, n*l.w*l.h + j*l.w + i, 0); 
                box pred = get_yolo_box(l.output, l.biases, l.mask[n], box_index, i, j, l.w, l.h, state.net.w, state.net.h, l.w*l.h);
                float best_iou = 0;
                int best_t = 0;
                for(t = 0; t < l.max_boxes; ++t){
                    //box truth = float_to_box_stride(state.truth + t*(4 + 1) + b*l.truths, 1); //Replaced
        box truth = float_to_box_stride(state.truth + t*(8 + 1) + b*l.truths, 1);
                    //int class_id = state.truth[t*(4 + 1) + b*l.truths + 4]; //Replaced
        int class_id = state.truth[t*(8 + 1) + b*l.truths + 8];         
                    if (class_id >= l.classes) {
                        printf(" Warning: in txt-labels class_id=%d >= classes=%d in cfg-file. In txt-labels class_id should be [from 0 to %d] \n", class_id, l.classes, l.classes - 1);
                        getchar();
                        continue; // if label contains class_id more than number of classes in the cfg-file
                    }           
                    //if(!truth.x) break; //Replaced as we need to check truth.x1 to truth.x4 now
        if(!truth.x1 && !truth.x2 && !truth.x3 && !truth.x4 && !truth.y1 && !truth.y2 && !truth.y3 && !truth.y4) break; 
                    float iou = box_iou(pred, truth);
                    if (iou > best_iou) {
                        best_iou = iou;
                        best_t = t;  
                    }
                }
                //int obj_index = entry_index(l, b, n*l.w*l.h + j*l.w + i, 4); //Replaced
        int obj_index = entry_index(l, b, n*l.w*l.h + j*l.w + i, 8); 
                avg_anyobj += l.output[obj_index]; 
                l.delta[obj_index] = 0 - l.output[obj_index]; 
                if (best_iou > l.ignore_thresh) {
                    l.delta[obj_index] = 0; 
                }
                if (best_iou > l.truth_thresh) {
                    l.delta[obj_index] = 1 - l.output[obj_index];
                    //int class_id = state.truth[best_t*(4 + 1) + b*l.truths + 4]; //Replaced
        int class_id = state.truth[best_t*(8 + 1) + b*l.truths + 8]; 
                    if (l.map) class_id = l.map[class_id];
                    //int class_index = entry_index(l, b, n*l.w*l.h + j*l.w + i, 4 + 1); //Replaced
        int class_index = entry_index(l, b, n*l.w*l.h + j*l.w + i, 8 + 1); 
                    delta_yolo_class(l.output, l.delta, class_index, class_id, l.classes, l.w*l.h, 0, l.focal_loss); 
                    //box truth = float_to_box_stride(state.truth + best_t*(4 + 1) + b*l.truths, 1); //Replaced
            box truth = float_to_box_stride(state.truth + best_t*(8 + 1) + b*l.truths, 1);
                    delta_yolo_box(truth, l.output, l.biases, l.mask[n], box_index, i, j, l.w, l.h, state.net.w, state.net.h, l.delta, (2-truth.w*truth.h), l.w*l.h);
                }
            }
        }
    }
    for(t = 0; t < l.max_boxes; ++t){

        //box truth = float_to_box_stride(state.truth + t*(4 + 1) + b*l.truths, 1); //Replaced
        box truth = float_to_box_stride(state.truth + t*(8 + 1) + b*l.truths, 1); 
        //int class_id = state.truth[t*(4 + 1) + b*l.truths + 4]; //Replaced
    int class_id = state.truth[t*(8 + 1) + b*l.truths + 8]; 
        if (class_id >= l.classes) continue; // if label contains class_id more than number of classes in the cfg-file
        //if(!truth.x) break; //Replaced
    if(!truth.x1 && !truth.x2 && !truth.x3 && !truth.x4 && !truth.y1 && !truth.y2 && !truth.y3 && !truth.y4) break;

        float best_iou = 0;
        int best_n = 0;
        //i = (truth.x * l.w); //Replaced
        //j = (truth.y * l.h); // Replaced

        int i1 = (truth.x1 * l.w);
        int i2 = (truth.x2 * l.w);
        int i3 = (truth.x3 * l.w);
        int i4 = (truth.x4 * l.w);

        int j1 = (truth.y1 * l.h);
        int j2 = (truth.y2 * l.h);
        int j3 = (truth.y3 * l.h);
        int j4 = (truth.y4 * l.h);

    i = (i1+i2+i3+i4)/4; //truncated to integer, eg. 10/4 = 2.5 --> 2 
    j = (j1+j2+j3+j4)/4;                        

        box truth_shift = truth;
        //truth_shift.x = truth_shift.y = 0; //Replaced
    truth_shift.x1 = truth_shift.y1 = truth_shift.x2 = truth_shift.y2 = truth_shift.x3 = truth_shift.y3 = truth_shift.x4 = truth_shift.y4 = 0; 

        for(n = 0; n < l.total; ++n){  
            box pred = {0};
            //pred.w = l.biases[2*n]/ state.net.w; //Replaced
            //pred.h = l.biases[2*n+1]/ state.net.h; //Replaced

        pred.x1 = (i + truth.x1) / state.net.w; 
        pred.y1 = (j + truth.y1) / state.net.h; 
        pred.x2 = (i + truth.x2) / state.net.w; 
        pred.y2 = (j + truth.y2) / state.net.h; 
        pred.x3 = (i + truth.x3) / state.net.w; 
        pred.y3 = (j + truth.y3) / state.net.h; 
        pred.x4 = (i + truth.x4) / state.net.w; 
        pred.y4 = (j + truth.y4) / state.net.h; 

            float iou = box_iou(pred, truth_shift); 
            if (iou > best_iou){
                best_iou = iou;
                best_n = n;
            }
        }

        int mask_n = int_index(l.mask, best_n, l.n);
        if(mask_n >= 0){ 
            int box_index = entry_index(l, b, mask_n*l.w*l.h + j*l.w + i, 0); 
            float iou = delta_yolo_box(truth, l.output, l.biases, best_n, box_index, i, j, l.w, l.h, state.net.w, state.net.h, l.delta, (2-truth.w*truth.h), l.w*l.h);
            //int obj_index = entry_index(l, b, mask_n*l.w*l.h + j*l.w + i, 4); //Replaced
    int obj_index = entry_index(l, b, mask_n*l.w*l.h + j*l.w + i, 8); 
            avg_obj += l.output[obj_index];         
            l.delta[obj_index] = 1 - l.output[obj_index]; 
            //int class_id = state.truth[t*(4 + 1) + b*l.truths + 4]; //Replaced
    int class_id = state.truth[t*(8 + 1) + b*l.truths + 8]; 

            if (l.map) class_id = l.map[class_id]; 
            //int class_index = entry_index(l, b, mask_n*l.w*l.h + j*l.w + i, 4 + 1); //Replaced
    int class_index = entry_index(l, b, mask_n*l.w*l.h + j*l.w + i, 8 + 1); 
            delta_yolo_class(l.output, l.delta, class_index, class_id, l.classes, l.w*l.h, &avg_cat, l.focal_loss); 

            ++count;
            ++class_count;
            if(iou > .5) recall += 1;
            if(iou > .75) recall75 += 1;
            avg_iou += iou;
        }
    }
}
*(l.cost) = pow(mag_array(l.delta, l.outputs * l.batch), 2); 
printf("Region %d Avg IOU: %f, Class: %f, Obj: %f, No Obj: %f, .5R: %f, .75R: %f,  count: %d\n", state.index, avg_iou/count, avg_cat/class_count, avg_obj/count, avg_anyobj/(l.w*l.h*l.n*l.batch), recall/count, recall75/count, count);

}

AlexeyAB commented 6 years ago

did you mean for the green section:

No, you shouldn't do this:

        int i1 = (truth.x1 * l.w);
        int i2 = (truth.x2 * l.w);
        int i3 = (truth.x3 * l.w);
        int i4 = (truth.x4 * l.w);

        int j1 = (truth.y1 * l.h);
        int j2 = (truth.y2 * l.h);
        int j3 = (truth.y3 * l.h);
        int j4 = (truth.y4 * l.h);

    i = (i1+i2+i3+i4)/4; //truncated to integer, eg. 10/4 = 2.5 --> 2 
    j = (j1+j2+j3+j4)/4;

leave it as it is:

        i = (truth.x * l.w); //Replaced
        j = (truth.y * l.h); // Replaced

        pred.x1 = (i + truth.x1) / state.net.w; 
        pred.y1 = (j + truth.y1) / state.net.h; 
        pred.x2 = (i + truth.x2) / state.net.w; 
        pred.y2 = (j + truth.y2) / state.net.h; 
        pred.x3 = (i + truth.x3) / state.net.w; 
        pred.y3 = (j + truth.y3) / state.net.h; 
        pred.x4 = (i + truth.x4) / state.net.w; 
        pred.y4 = (j + truth.y4) / state.net.h; 

            float iou = box_iou(pred, truth_shift); 
            if (iou > best_iou){
                best_iou = iou;
                best_n = n;
            }

Yes, you can do it. Ot you can try to write your custom pointes4_iou()-function (instead of box_iou()) that will calculate IoU for non-rectangle quadrangle.

JohnWnx commented 6 years ago

@AlexeyAB

Ot you can try to write your custom pointes4_iou()-function (instead of box_iou()) that will calculate IoU for non-rectangle quadrangle.

Yes, I have already written a function to calculate IOU of non-rectangle quadrangle, which will be called in box_iou in box.c

leave it as it is:

    i = (truth.x * l.w); //Replaced
    j = (truth.y * l.h); // Replaced

I am confused here as there will be no more truth.x and truth.y. Thus, i and j will always be zero:

** In data.c and all other functions of yolo_layer.c, truth boxes are defined as:

truth.x1, truth.x2, truth.x3, truth.x4, truth.y1, truth.y2, truth.y3, truth.y4

Don't I still need to calculate calculate (average_x,average_y) of the 4 points? which part of the red secton should it be done?

Thanks

JohnWnx commented 6 years ago

Or did you mean something like this in the green section?

From my understanding, the Green section is split into 2 parts: Part 1:

if(!truth.x) break; float best_iou = 0; int best_n = 0; i = (truth.x l.w); j = (truth.y l.h); box truth_shift = truth; truth_shift.x = truth_shift.y = 0; for(n = 0; n < l.total; ++n){ box pred = {0}; pred.w = l.biases[2n]/ state.net.w; pred.h = l.biases[2n+1]/ state.net.h; float iou = box_iou(pred, truth_shift); if (iou > best_iou){ best_iou = iou; best_n = n; }

This part fixes the truth box with center @(x=0,y=0) truth box: (0, 0, w, h) pred box: (0, 0, predicted_w, predicted_h)

iou(pred, truth), in order to find the best width and best height In the end, the best anchor "n" will be calculated.

should I delete this whole part then? since I am not using w and h

part 2:

int mask_n = int_index(l.mask, best_n, l.n); if(mask_n >= 0){ int box_index = entry_index(l, b, mask_nl.wl.h + jl.w + i, 0); float iou = delta_yolo_box(truth, l.output, l.biases, best_n, box_index, i, j, l.w, l.h, state.net.w, state.net.h, l.delta, (2-truth.wtruth.h), l.w*l.h);

            int obj_index = entry_index(l, b, mask_n*l.w*l.h + j*l.w + i, 4);
            avg_obj += l.output[obj_index];
            l.delta[obj_index] = 1 - l.output[obj_index];

            int class_id = state.truth[t*(4 + 1) + b*l.truths + 4];
            if (l.map) class_id = l.map[class_id];
            int class_index = entry_index(l, b, mask_n*l.w*l.h + j*l.w + i, 4 + 1);
            delta_yolo_class(l.output, l.delta, class_index, class_id, l.classes, l.w*l.h, &avg_cat, l.focal_loss);

            ++count;
            ++class_count;
            if(iou > .5) recall += 1;
            if(iou > .75) recall75 += 1;
            avg_iou += iou;
        }

This part further adjusts the predicted box previously calculated in the RED section via this function: delta_yolo_box() **However, I need the value of "best_n" calculated which was in part 1. "best_n" will be used to calculate: "int mask_n = int_index(l.mask, best_n, l.n);" required by this part.

How can I obtain "best_n" in this case for: "int mask_n = int_index(l.mask, best_n, l.n);" ?

Also, may I know what is mask_n?

Thank you.

AlexeyAB commented 6 years ago

Or did you mean something like this in the green section?

Yes.

should I delete this whole part then? since I am not using w and h

Yes, you can delete part with box_iou().

How can I obtain "best_n" in this case for: "int mask_n = int_index(l.mask, best_n, l.n);" ?

Also, may I know what is mask_n?

mask_n is an index of the most suitable anchor for this bbox: https://github.com/AlexeyAB/darknet/blob/57e878b4f9512cf9995ff6b5cd6e0d7dc1da9eaf/cfg/yolov3.cfg#L608 I.e. mask_n defines what is the most suitable anchor and what is the most suitable [yolo]-layer (of 3 [yolo]-layers).

So even if you don't need for anchors, you must somehow determine which layer (the layer with what scale) is the most suitable for these 4-points.

or use anchors as it is done
or use something else - hardcode it in C, and determine whether the given [yolo]-layer is suitable for this object, depending on the size of this object, or something else

JohnWnx commented 6 years ago

@AlexeyAB , thank you for the explanations, however I'm still getting IOU =0.

or use anchors as it is done

How may I use anchors if I do not have box.w and box.h?

Red section - search for final activations where are no objects Green section - search for final activations where are objects

Why would there be predicted bounding boxes generated when there are "no objects" in the red section?

box pred = get_yolo_box(l.output, l.biases, l.mask[n], box_index, i, j, l.w, l.h, state.net.w, state.net.h, l.w*l.h);

From my understanding, these two nested "for loops" in the red section are trying to adjust the predicted (x,y) coordinate from (0 to l.w) and (0 to l.h). Is this correct? What is l.w and l.h?

Because till now, I'm still getting IOU = 0, for all cases in both green and red section. when trying the original unedited version of YOLO, I realised than IOU is not always zero.

//================ I have confirmed that my truth (x1,y1,x2,y2,x3,y3,x4,y4) values are always correctly reflected in yolo layer.c. I have also verified the modified box_iou() function correctly returns the IOU. I believe the problem now lies with the predicted box values.

The predicted box is obtained via the function: get_yolo_box() Here, b.x1 = (i + l.output[box_index + 0stride])/l.w The newly predicted x1 value takes the originally predicted value: l.output[box_index + 0stride] and is adjusted by adding "i" which its value increases progressively from 0 to l.w in forward_yolo_layer() function.

The most plausible reason that I'm still getting IOU =0, *I believe that the value of l.output[box_index + 0stride]) is incorrect.**

Here's my concern: Originally, YOLO trains the network by telling it defining the training data contained in a box in (x,y,w,h) format: Based on what I observed in yolo_layer.c, after the network predicts (x,y) --> l.output, it is then further adjusted by adding i,j that increases from 0 to l.w and l.h respectively, to find the best IOU, when comparing to the given truth box. For this, can I know which part of the original code, which passes the actual pixel information into the network for training?

Now, when defining the box with 4 points with (x1,y1,x2,y2,x3,y3,x4,y4) format:

How will the network know where is the actual pixel data used for telling the network that "This is text" Considering that in data.c, the training "pixel" to be learned by the network is in a non-rotated fixed shape rectangle:

image

ai = image_data_augmentation(src, w, h, pleft, ptop, swidth, sheight, flip, jitter, dhue, dsat, dexp);
d.X.vals[i] = ai.data;

In data.c, while loading in the training data:

I am supplying YOLO image ai = image_data_augmentation(src, w, h, pleft, ptop, swidth, sheight, flip, jitter, dhue, dsat, dexp);
d.X.vals[i] = ai.data; Here: I'm randomly cropping and augmenting a patch of the image with information: d.X.vals[i], which will be passed into the network as args.d. **What is d.X.vals[i]? Is it the pixels used for training by the network?**

fill_truth_detection(filename, boxes, d.y.vals[i], classes, flip, dx, dy, 1./sx, 1./sy, small_object, w, h, pleft, ptop, pright, pbot, ow, oh); Here: I'm telling YOLO: This 4 points (x1,y1), (x2,y2), (x3,y3), (x4,y4) forms a quadrangle which contains "text" These coodinates are stored and returned as: d.y.vals[i] which will be passed into the network as args.d.

AlexeyAB commented 6 years ago

How may I use anchors if I do not have box.w and box.h?

You can use anchors (w,h) only to find the most suitabel scale by calculation the area of a quadrilateral (your 4 points), i.e. find the closes 4_points_area(p1,p2,p3,p4) ~= anchors_area(w,h) instead of box_iou().

Why would there be predicted bounding boxes generated when there are "no objects" in the red section?

Each anchor in each final activation always predicts an object. Just with different (may be low) probability.

Red section - get deltas for final activations where are no objects
Green section - get deltas for final activations where are objects

From my understanding, these two nested "for loops" in the red section are trying to adjust the predicted (x,y) coordinate from (0 to l.w) and (0 to l.h). Is this correct? What is l.w and l.h?

l.w, l.h - are the width and height of the final feature map (final activation), i.e. 13x13, 26x26 and 52x52 for 3 [yolo]-layers. (i, j) iterates through this final feature map to get deltas for final activation where shouldn't be objects (red section), and where should be objects (green section).

JohnWnx commented 6 years ago

In Yolo, how is the pixel to be trained by the network identified? (Where can I find the section performing this?)

I.e in the original YOLO code: Is the entire “patch” of pixels defined by the rectangular box passed into the network or just the center coordinates (x,y)?

AlexeyAB commented 6 years ago

What do you mean?

Each of the final activation sees almost the whole image. Then each final activation adjusts inital values of:

x,y (center of bbox) - based on the current i/l.w and j/l.h
w,h (size of bbox) - based on the current anchor (w,h)

JohnWnx commented 6 years ago

Oh when you mentioned “each of the final activation sees almost the whole picture” do you mean: the final activation sees the whole picture where is pixels are contained in d.X.vals[i] as defined in data.c?

During the loading of data in data.c, fill_truth_detections() is used to get the ground truth box coordinates which are stored as d.y.vals[i]

After this process, how is the actual pixel information passed into the network?

I.e. how do I tell yolo that this particular pixel is “text? Is the the center coordinates pixel trained as text or all the pixels contained in the bounding box) are sent into the network?

AlexeyAB commented 6 years ago

Oh when you mentioned “each of the final activation sees almost the whole picture” do you mean: the final activation sees the whole picture where is pixels are contained in d.X.vals[i] as defined in data.c?

Yes.

I.e. how do I tell yolo this particular pixel is “text?

Yolo doesn't tell you that this pixel belongs to this object.

There are 3 the most usable approaches:

Classifier - network says what objects are on the image
Detector - network says what objects are on the image, and where are on the image (x,y,w,h)
Segmenter - network says which pixels belong to this object

So Yolo is Detector.

On the one hand, the Segmenter more precisiely gives you shape of the object than Detector and separate it from the background. From the other hand, the Segmenter usually can't separate several objects.

Detector is more suitable for detecting and counting objects
Segmenter is more suitable for detecting: sky, road, grass, ... - which in fact is not a specific object and can have any shape

JohnWnx commented 6 years ago

@AlexeyAB is the pred.w and pred.h normalized with respect to the image width and height?

Should multiply them with state.net.w and state.net.h or with the (image_width=1280) and (image_height=720) respectively?

I realised that pred.w and pred.h after un-normalizing are actually the anchor values defined in yolo-text.cfg: ==> anchors = 10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326 Do I need to recalculate these anchors or I can leave them as is it?

Also, can I clarify if this is the correct method (below) to obtain "i" and "j" now since "truth.x" is already replaced by truth.x1, truth.x2, truth.x3 and truth.x4.

AlexeyAB commented 6 years ago

Do I need to recalculate these anchors or I can leave them as is it?

I think you can leave them. Because if you want to recalculate anchors, then you should rewrite code for anchors calculation by using your 4-points objects.

is the pred.w and pred.h normalized with respect to the image width and height?

Here pred.w and pred.h are normalized from 0.0 to 1.0: https://github.com/AlexeyAB/darknet/blob/57e878b4f9512cf9995ff6b5cd6e0d7dc1da9eaf/src/yolo_layer.c#L249-L250

If you want to map it to the network input, then multiply it by state.net.w and state.net.h
If you want to map it to the input image, then multiply it by image_width (1920) and image_height (1080)

Also, can I clarify if this is the correct method (below) to obtain "i" and "j" now since "truth.x" is already replaced by truth.x1, truth.x2, truth.x3 and truth.x4.

I think yes.

JohnWnx commented 6 years ago

Hi @AlexeyAB

If you want to map it to the network input, then multiply it by state.net.w and state.net.h

I don't quite understand this, I am trying to

use anchors (w,h) only to find the most suitabel scale by calculation the area of a quadrilateral (your 4 points), i.e. find the closes 4_points_area(p1,p2,p3,p4) ~= anchors_area(w,h) instead of box_iou().

Would I need to:

float anchors_area = (pred.w1280)(pred.h720); //printf("anchors_area: %f \n",anchors_area); float poly_points[20][2] = {{truth.x11280,truth.y1720}, {truth.x21280,truth.y2720}, {truth.x31280,truth.y3720}, {truth.x41280,truth.y4*720}};

or

float anchors_area = (pred.wstate.net.w)(pred.hstate.net.h); //printf("anchors_area: %f \n",anchors_area); float poly_points[20][2] = {{truth.x1state.net.w,truth.y1state.net.h}, {truth.x2state.net.w,truth.y2state.net.h}, {truth.x3state.net.w,truth.y3state.net.h}, {truth.x4state.net.w,truth.y4*state.net.h}};

In order to obtain the area, I need to un-normalize it. So do I map to network input or input image?

===

I also realised that IOU does not always become 0 when I un-comment the correct_yolo_boxes() in data.c. May I know what does the correct_yolo_boxes() function do?

AlexeyAB commented 6 years ago

@JohnWnx

Use this:

float anchors_area = (pred.wstate.net.w)(pred.hstate.net.h); //printf("anchors_area: %f \n",anchors_area); float poly_points[20][2] = {{truth.x1state.net.w,truth.y1state.net.h}, {truth.x2state.net.w,truth.y2state.net.h}, {truth.x3state.net.w,truth.y3state.net.h}, {truth.x4state.net.w,truth.y4*state.net.h}};

Anchors is related to the input network size - width= height= in the cfg-file. I.e. Anchor is the size of object on the image that is resized to the network size.

I also realised that IOU does not always become 0 when I un-comment the correct_yolo_boxes() in data.c. May I know what does the correct_yolo_boxes() function do?

correct_yolo_boxes() changes coordinates and sizes of objects if you use letterbox_image() instead of resize_image(): https://github.com/AlexeyAB/darknet/issues/232#issuecomment-336955485

JohnWnx commented 6 years ago

For the comparing area of (anchors_area vs. truth_area): **Similar for truth.x1, truth.x2, truth.x3, turh.x4, truth.y1, truth.y2, truth.y3,, truth.y4? But truth.x1 was obtained by actual_x1 divided by image__width. and truth.y2 is actual_y1 divided by image_height.

Anchors is related to the input network size - width= height= in the cfg-file. I.e. Anchor is the size of object on the image that is resized to the network size.

For calculating IOU in general ( box_iou(pred,truth) ): In the original YOLO, the truth parameters (x,y,w,h) are normalized with respect to image_width and image height, they are then compared with the predicted (x,y,w,h) using the box_iou() function. I understand that the normalized values (0 to 1) were used to calculate the IOU.

in my modified box_iou(pred, truth) function, I have to un-normalize both pred and truth parameters in order to find their actual area size before calculating the IOU. In this case, should I multiply by net.w or the image_width?** (During training)

Thus, during the actual testing of the detector (eg. giving a new test image for YOLO to detect), should I multiply by the image_width too? (During testing)

JohnWnx commented 6 years ago

Hi @AlexeyAB, I realised that my ground truth box has been distorted by the correct_boxes() function in data.c

As a result the ground truth looks like this in yolo_layer.c:

Should correct_boxes be disabled in this case ?as it is distorting the original ground truth labelled box.

Or at least the part(below) which changes the coordinates should be removed, leaving only the flip portion for data augmentation. Also, leaving the constrain portion.

By removing this:

And keeping this:

AlexeyAB commented 6 years ago

@JohnWnx

For comparison pred.w and pred.h with anchor_w and anchor_h, you should multiply pred.w by net.w and pred.h by net.h

Hi @AlexeyAB, I realised that my ground truth box has been distorted by the correct_boxes() function in data.c

correct_boxes() - should distort coordinates.

I made a mistake - I spoke here about correct_yolo_boxes() that shouldn't change coordinates if you use letter=0 relative=1: https://github.com/AlexeyAB/darknet/issues/1532#issuecomment-419715111

But other function correct_boxes() should change coordinates of your 4-points during data augmentation. Without it you will get very low accuracy. Just change your 4 points in the same way, as changed 2-points (x,y of bounded box) in the default Yolo.

JohnWnx commented 6 years ago

@AlexeyAB

For comparison pred.w and pred.h with anchor_w and anchor_h, you should multiply pred.w by net.w and pred.h by net.h I understand we need to multiply pred.w by state.net.w. However, my concern is now with the truth.x1.

*Shouldn't polygon_area (truth.x1 x 1280, truth.y1 x 720, truth.x2 x 1280, truth.y2 x 720, truth.x3 x 1280, truth.y3720, truth.x4 x1280, truth.y4 x720)

compared area with:

anchors_area = (pred.w*state.net.w)*(pred.h*state.net.h) ?**

Because originally, truth.x1 = real_truth.x1 / image_width = real_truth.x1 / 1280.

If I take: truth.x1*state.net.w, it will be (real_truth.x1 / 1280)*state.net.w

So I should not use: polygon_area (truth.x1 x state.net.w, truth.y1 x state.net.h, .... truth.x4 x state.net.width, truth.y4 x state.net.height)?

Thank you.

JohnWnx commented 6 years ago

Hi @AlexeyAB, I understand that you've mentioned that it is still possible to train YOLO using (x1,y1,x2,y2,x3,y3,x4,y4) format. However, I am still getting IOU <0.1 after training over 8 hours. There are no instances of IOU >0.1

I have checked using the old version of YOLO (x,y,w,h), I get instances of IOU >0.1 in less than 1 minute of training.

Can (x1,y1,x2,y2,x3,y3,x4,y4) still be trained with darknet53.conv.74 pre-trained weights? i.e. I'm concerned that the pre-trained weights were in (x,y,w,h) format, so it won't be compatiable.

Thank you.

AlexeyAB commented 6 years ago

@JohnWnx Hi,

You should compare (truth.w*state.net.w), (truth.h*state.net.h) with anchors from cfg-file directly.
Or you should compare truth.w, truth.h with pred.w,pred.h

Can (x1,y1,x2,y2,x3,y3,x4,y4) still be trained with darknet53.conv.74 pre-trained weights?

Yes.

i.e. I'm concerned that the pre-trained weights were in (x,y,w,h) format, so it won't be compatiable.

darknet53.conv.74 is only for middle layers, so it isn't related to any final layers or coordinates.

JohnWnx commented 6 years ago

Hi @AlexeyAB ,

As you have previously advised:

You can use anchors (w,h) only to find the most suitabel scale by calculation the area of a quadrilateral (your 4 points), i.e. find the closes 4_points_area(p1,p2,p3,p4) ~= anchors_area(w,h) instead of box_iou(). Thus, I calculated the 4_points_area(p1,p2,p3,p4) and then calculated anchors_area = (pred.wstate.net.w), (pred.hstate.net.h) And compared these two areas.

I'm still confused when you mentioned:

You should compare (pred.wstate.net.w), (pred.hstate.net.h) with anchors from cfg-file directly.

pred.w = l.biases[2*n]/ state.net.w;

Thus, pred.wstate.net.w = l.biases[2n], which are the anchors corresponding to what is defined in cfg:

10,13, 16,30, 33,23, 30,61, 62,45, 59,119, 116,90, 156,198, 373,326

The anchors value changes with the index "n". So the green part tries to find the best anchor pair (width and height) by comparing it's area with 4_points_area(p1,p2,p3,p4).

So there are two areas need to be calculated now:

anchors_area = (pred.wstate.net.w)(pred.h*state.net.h)
4_points_area, using custom polygon function(x1,y1,x2,y2,x3,y3,x4,y4). However, I cannot just supply the raw truth values without un-normalizing it.

*So do I use custom polygon function(x1image_width, y1image_height, x2image_width, y2image_height, x3image_width ,y3_image_height, x4image_width ,y4*image_height).

OR custom polygon function(x1net.w, y1net.h, x2net.w, y2net.h, x3net.w ,y3net.h, x4net.w ,y4net.h).**

My concern is: truth values of (x1,y1,x2,y2,x3,y3,x4,y4) were all obtained by dividing their original image_width and image_height respectively. i.e. In the labels: x1 = original_x1/image_width

JohnWnx commented 6 years ago

@AlexeyAB To add on, I am concerned with this because my predicted values seem to be very off from the actual values. Therefore, my IOU is always close to "0" and doesn't really improve.

However, my loss seems to be stable now as it decreases, and no longer diverges to "nan"

Thus, when calculating IOU, should I multiply by image_width and image_height OR multiply by net.w and net.h in order to obtain the un-normalized points to calculate IOU?

===

No matter how I adjust, my IOU seems to remain 0, and the boxes never overlap. The pred box is generated by this:

x was generated from:

    image ai = image_data_augmentation(src, w, h, pleft, ptop, swidth, sheight, flip, jitter, dhue, dsat, dexp);

d.X.vals[i] = ai.data;

Can I verify is this this the correct way to obtain the predicted box? All the predictions seem to never overlap with the truth boxes.

Thank you very much, greatly appreciate your responses.

AlexeyAB commented 6 years ago

@JohnWnx I fixed my answer: https://github.com/AlexeyAB/darknet/issues/1532#issuecomment-420250409

JohnWnx commented 6 years ago

@JohnWnx Hi,

You should compare (truth.w*state.net.w), (truth.h*state.net.h) with anchors from cfg-file directly.

Or you should compare truth.w, truth.h with pred.w,pred.h

Can (x1,y1,x2,y2,x3,y3,x4,y4) still be trained with darknet53.conv.74 pre-trained weights?

Yes.

i.e. I'm concerned that the pre-trained weights were in (x,y,w,h) format, so it won't be compatiable.

darknet53.conv.74 is only for middle layers, so it isn't related to any final layers or coordinates.

Hi @AlexeyAB , there is no truth,w and truth.h . There are no width and height components in the ground truth labels. only: x1,y1,x2,y2,x3,y3,x4,y4.

Or are you suggesting to average { (x2-x1) + (x3-x4) }/ 2 ===> (width_1 + width_2) / 2 ? I am concerned this may not be appropriate as the detected box is no longer a fixed rectangle, it could be a quadrangle, where width_1 =/= width_2. (I have tried this method, IOU is still "0")

How can I obtain truth.w and truth.h?

Thank you.

JohnWnx commented 6 years ago

Hi @AlexeyAB , I would like to have a private conversation with you. Could you send me an email at johnwnx@gmail.com?

Thank you.

JohnWnx commented 6 years ago

Hi @AlexeyAB I have applied the changes, however the predicted (x1,x2,x3,x4) seems to be very random and I am still getting IOU = 0. The points are not in clockwise manner. Thus I have tried two options: 1) skip the iteration if predicted (p1,p2,p3,p4) is not in clockwise manner 2) rearrange the points to ensure (p1,p2,p3,p4) is in clockwise manner (althought i think this shouldn't be done)

Both methods have also led to "0" IOU. I highly suspect the problem lies with the process of obtaining the pred. data.

in get_yolo_box(): I am not sure if this is the correct method to calculate the predicted box. Could I check what is x[index +0*stride]? Could the obtained x[] be wrong in this case? Therefore, the predicted 4 points almost always never overlap with the truth 4 points.

Also, could I have a quick private discussion with you ?

Thank you.

AlexeyAB commented 6 years ago

Hi @AlexeyAB , there is no truth,w and truth.h . There are no width and height components in the ground truth labels. only: x1,y1,x2,y2,x3,y3,x4,y4.

Or are you suggesting to average { (x2-x1) + (x3-x4) }/ 2 ===> (width_1 + width_2) / 2 ? I am concerned this may not be appropriate as the detected box is no longer a fixed rectangle, it could be a quadrangle, where width_1 =/= width_2. (I have tried this method, IOU is still "0")

How can I obtain truth.w and truth.h?

Thank you.

You can try to use Gauss's area formula: https://en.wikipedia.org/wiki/Shoelace_formula#Proof_for_a_quadrilateral_and_general_polygon

And compare

box pred = {0};
pred.w = l.biases[2*n]/ state.net.w;
pred.h = l.biases[2*n+1]/ state.net.h;
float S_anchor = pred.w * pred.h;
float S_points = (1./2.)*(x1*y2 + x2*y3 + x3*y4 + x4*y1 - x2*y1 - x3*y2 - x4*y3 - x1*y4);
float iou = 1 - fabs(S_anchor - S_points) / fmax(S_anchor, S_points);
if (iou > best_iou){
    best_iou = iou;
    best_n = n;
}

instead of: https://github.com/AlexeyAB/darknet/blob/57e878b4f9512cf9995ff6b5cd6e0d7dc1da9eaf/src/yolo_layer.c#L248-L255

Where are pred.w, pred.h, x1, y1, x2, y2, .... normalized values [0.0 - 1.0].

quadri

AlexeyAB commented 6 years ago

in get_yolo_box(): I am not sure if this is the correct method to calculate the predicted box. Could I check what is x[index +0*stride]?

What do you mean?

Could the obtained x[] be wrong in this case? Therefore, the predicted 4 points almost always never overlap with the truth 4 points.

What is the maximum IoU did you get?

Also, could I have a quick private discussion with you ?

I wrote you. But I do not answer that often.

Can you show final part of training log?

JohnWnx commented 6 years ago

Hi @AlexeyAB

in get_yolo_box(): I am not sure if this is the correct method to calculate the predicted box. Could I check what is x[index +0*stride]?

What do you mean?

Could the obtained x[] be wrong in this case? Therefore, the predicted 4 points almost always never overlap with the truth 4 points.

How is this highlighted value obtained? I believe this is actually the prediction by the logistic activation before adjustment. However, The value is always incorrect. Therefore, the predicted box and truth box will never overlap. As I believe this value is "randomly" generated here: by data.c, I can't find the link to fix this problem such that I can make sure YOLO predicts x1,y1,x2,y2,x3,y3,x4,y4 correctly (as mentioned in my email), such that all the points are in clockwise. Therefore, the IOU is always zero ad the boxes will never overlap.

What is the maximum IoU did you get?

Also, could I have a quick private discussion with you ?

I wrote you. But I do not answer that often.

Can you show final part of training log?

Are you referring as this for training log?

Maximum IOU is less than 0.1.

AlexeyAB commented 6 years ago

@JohnWnx Hi,

How is this highlighted value obtained? I believe this is actually the prediction by the logistic activation before adjustment. However, The value is always incorrect. Therefore, the predicted box and truth box will never overlap. As I believe this value is "randomly" generated here: image by data.c, I can't find the link to fix this problem such that I can make sure YOLO predicts x1,y1,x2,y2,x3,y3,x4,y4 correctly (as mentioned in my email), such that all the points are in clockwise. Therefore, the IOU is always zero ad the boxes will never overlap.

This values https://github.com/AlexeyAB/darknet/blob/ca43bbdaaede5c9cbf82a8a0aa5e2d0a4bdcabc0/src/yolo_layer.c#L87 is obtained by using Logistic activation (sigmoid) here (as it is done for x,y coordinates in the default yolo): https://github.com/AlexeyAB/darknet/blob/ca43bbdaaede5c9cbf82a8a0aa5e2d0a4bdcabc0/src/yolo_layer.c#L175 so you should change here 2*l.w*l.h to the 8*l.w*l.h.

Prior to this, this value is taken from outout of the previous convolutional layer (without activation - i.e. linear activation): https://github.com/AlexeyAB/darknet/blob/master/cfg/yolov3.cfg#L599-L604

For the simplest debuging, try to:

disable data augmentation (don't change training image and don't change coordinates of labels in data.c)
leave only 1 image in the train.txt file
comment burn_in=1000 parameter in the cfg-file

It should successfully train your model only for one this object(4 points), so avg-loss should decreases.

And show screenshot of training log with the Avg IOU and avg loss.

Also change this line: https://github.com/AlexeyAB/darknet/blob/ca43bbdaaede5c9cbf82a8a0aa5e2d0a4bdcabc0/src/yolo_layer.c#L276 to this

if( !isnan(iou) ) avg_iou += iou;

to avoid nan values.

JohnWnx commented 6 years ago

Hi @AlexeyAB , With regards to training on single image for debugging:

When trying to write this line: d.X.vals[i] = load_image(random_paths[i], 0, 0, c); I can't disable image augmentation as I get this error:

I am currently training with image_augmentation and correct_boxes on a single image that contains label indicating there are two bounding boxes.

Here is my training loss:

AlexeyAB commented 6 years ago

Use this:

image orig_im = load_image(random_paths[i], 0, 0, c);
d.X.vals[i] = orig_im.data;

And with commented: correct_box().

JohnWnx commented 6 years ago

Hi @AlexeyAB, After making the changes, the process still couldn't be run though. I have also commented correct_box()

JohnWnx commented 6 years ago

Hi @AlexeyAB,

With regards to changing the values to activate the logistic functions:

I have swapped 2 to 10, because I intend to keep the previous (x,w,y,h) while adding in (x1,y1,x2,y2,x3,y3,x4,y4). In that case, the number of x,y to be predicted should be 10.

After changing all truths and in yolo_layer.c from 4 to 12 in the original yolo code, I have verified that the results are working.

However, upon performing the changes to activate logistic activations from 2 to 10, YOLO is no longer able to detect objects on the weights (models) that it trains.

Here is my training log: logistics_10

JohnWnx commented 6 years ago

I have managed to resolve the issue, as: activate_array(l.output + index, 10l.wl.h, LOGISTIC); assigns the logistic variable to the first 10 predicted array "X" counting from the current index. Thus, all the variables used must be sorted accordingly. Leaving the last 2 variables which we do not want to use LOGISTIC placed at the back in the following manner:

b.x1 = f[0 * stride];
b.y1 = f[1 * stride];
b.x2 = f[2 * stride];
b.y2 = f[3 * stride];
b.x3 = f[4 * stride];
b.y3 = f[5 * stride];
b.x4 = f[6 * stride];
b.y4 = f[7 * stride];
b.x = f[8 * stride]; 
b.y = f[9 * stride];

b.w = f[10 stride]; (Logistic not used here, only first 10 of X is activated). b.h = f[11 stride]; (Logistic not used here, only first 10 of X is activated).

JohnWnx commented 6 years ago

Hi @AlexeyAB

I realised that in my training process, correct_yolo_boxes() in yolo_layer.c was not used. Will it be used for demo() "video detection" or detector_test() "image detection"?
What is the purpose of if(!truth.x) break; in forward_yolo_layer() of yolo_layer.c? Under what scenario would truth.x be 0? Is used to prevent incorrectly adjusted truth.x values adjusted in correct_boxes() in data.c? If so, why truth.y is not checked as well?
What is the function of delta_yolo_boxes() in yolo_layer.c? From what I observed, the deltas (loss = truth_parameter-predicted_parameter) are calculated here. The deltas are then later summed up to produce the total average loss. tx represents the truth while x[index + 8*stride] represents the predicted. However, why do we need to subtract "i" off truth.x?
How does the IOU affect the training? I observed that in the red section, the following section declares that delta (loss) is 0, because when IOU>0.7, the ignore threshold, we can assume no more changes is required. if (best_iou > l.ignore_thresh) { l.delta[obj_index] = 0; //No more changes required } However, the above section is not seen in the green section. Thus, how would the calculated IOU in the green contribute to the training process, apart from only being displayed as: "avg_iou += iou; "?
From your suggestion: When this is done in the green section: i = (truth.x1 + truth.x2 + truth.x3 + truth.x4)/4 * l.w, Are we actually approxmating the center point (x,y) of the bounding box? Does that mean YOLO can only detect the center of the object rather than the 4 corner points?

Thank you.

kmsravindra commented 5 years ago

@JohnWnx, Wanted to know if you finally were able to train successfully using a 4 coordinate system? If so, could you please share? I too am looking for a similar solution and it would be helpful.

One thought is that you could probably consider using a different algorithm like mask-rcnn where the annotations can be like polygon shaped and need not be a strict rectangle as in here.

JohnWnx commented 5 years ago

@kmsravindra
Sorry for the late reply, I realised that YOLO being an object detector finds the exact centre coordinates of the object and then approximate its width and height via the anchors precalculated from K-means clustering. Thus from my understanding, it is not possible to train a 4-coordinate system.

i-chaochen commented 5 years ago

Hi @AlexeyAB I am not sure what does this following one doing?
box truth = float_to_box_stride(state.truth + t*(4 + 1) + b*l.truths, 1);

It seems transferring the state.truth as box of truth {x, y, w, h}, but I'm not understood why you need t*5 // 5=4+1, which t is from max_boxes (90), so t could be [0,1,2,3...,89].

Also, you mentioned that

Red section - get deltas for final activations where are no objects
Green section - get deltas for final activations where are objects

I am confused that why do you need to calculate deltas twice for non-object and object, respectively? You already have the truth and obj_index, why you can't just calculate deltas at once?

AlexeyAB commented 5 years ago

@i-chaochen

I am confused that why do you need to calculate deltas twice for non-object and object, respectively?

Many final activations may generate bounded boxes for the same one object, so in this section, you can increase T0(objectness) for many cells and anchors if(best_iou > l.truth_thresh), or at least you may not decrease T0(objectness) for many cells and anchors if (best_iou > l.ignore_thresh) https://github.com/AlexeyAB/darknet/blob/cce34712f6928495f1fbc5d69332162fc23491b9/src/yolo_layer.c#L255-L268

If you want to detect low number of objects on one image which are far from each other, then use low truth_thresh, ignore_thresh and nms_threshold
If you want to detect crowds (high number of objects on one image which are close each to other), then use high truth_thresh, ignore_thresh and nms_threshold

i-chaochen commented 5 years ago

@i-chaochen

I am confused that why do you need to calculate deltas twice for non-object and object, respectively?

Many final activations may generate bounded boxes for the same one object, so in this section, you can increase T0(objectness) for many cells and anchors if(best_iou > l.truth_thresh), or at least you may not decrease T0(objectness) for many cells and anchors if (best_iou > l.ignore_thresh)

https://github.com/AlexeyAB/darknet/blob/cce34712f6928495f1fbc5d69332162fc23491b9/src/yolo_layer.c#L255-L268

If you want to detect low number of objects on one image which are far from each other, then use low truth_thresh, ignore_thresh and nms_threshold

If you want to detect crowds (high number of objects on one image which are close each to other), then use high truth_thresh, ignore_thresh and nms_threshold

Thanks for reply.

Just to make sure, the section you referred to “so in this section, you can increase T0 or at least not decrease T0”, is the green section, the section to calculate deltas for objects?

But still, why you need to calculate the deltas for non-object? Those deltas seem can be calculated at object sections as well? It’s “nonobject” anyway during the training.

AlexeyAB / darknet

what does index & entry_index() in yolo_layer.c do? #1532

1) data.c:

2) yolo_layer.c

3) box.c

ifndef GPU //unused, as GPU is used.

endif