I recently decided to try to make a YOLO V1 implementation as my first serious project, based on your guide, but doing all the pre-training and training the full model myself. I have succeeded in making a sort of working model, though there are probably still some mistakes as it is not optimal. For reference, my repository is here.
Doing this led me to notice some issues with your implementation of the loss function:
Your target confidence (the tensor `torch.flatten(exists_box * target[..., 20:21])`) is 1 for every cell where a box exists and 0 for every cell where it does not. In fact, `target[..., 20:21]` is the same thing as `exists_box`. This is not true to the paper, which instead asks that, for a responsible predictor, the target confidence equal the IOU of the currently predicted box with the ground-truth box. The correct target tensor is `exists_box * iou_maxes.unsqueeze(3)` (not tested, but this is the right idea). There is currently an open pull request (#44) which would fix this.
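A minimal sketch of the IOU-based confidence target, assuming made-up shapes and random stand-in values rather than the repository's actual tensors. The names `ious`, `N`, `S` and `B` are hypothetical, and the `detach()` call is an added assumption (so that no gradient flows through the target), not something the paper spells out:

```python
import torch

# Hypothetical shapes: N images, an S x S grid, B = 2 predictors per cell.
N, S, B = 2, 7, 2
torch.manual_seed(0)

# Stand-in IOUs of each predicted box with the cell's ground-truth box.
ious = torch.rand(N, S, S, B)
# 1.0 where a ground-truth object falls in the cell, 0.0 elsewhere.
exists_box = (torch.rand(N, S, S, 1) > 0.8).float()

# Best IOU per cell and the index of the predictor that achieved it.
iou_maxes, bestbox = torch.max(ious, dim=-1, keepdim=True)

# Target confidence: the best IOU in object cells, 0 in empty cells.
# detach() is an assumption: it treats the target as a constant.
target_conf = exists_box * iou_maxes.detach()
```

With `keepdim=True` the result already has shape `(N, S, S, 1)`, so no extra `unsqueeze` is needed in this sketch; whether one is needed in the real code depends on the shape of its IOU tensors.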
Your no-object loss does not factor in non-responsible predictors that share a cell with a responsible predictor. It should, because the "1_ij^noobj" indicator from the paper is 1 for these.
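One way to build the two masks so that non-responsible predictors in object cells still contribute to the no-object term is sketched below. All tensors are random stand-ins, and the names (`ious`, `pred_conf`, `responsible`, `noobj`) are hypothetical rather than taken from the repository:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: N images, an S x S grid, B = 2 predictors per cell.
N, S, B = 2, 7, 2
torch.manual_seed(0)

ious = torch.rand(N, S, S, B)                        # stand-in IOUs per predictor
exists_box = (torch.rand(N, S, S, 1) > 0.8).float()  # 1.0 in object cells
pred_conf = torch.rand(N, S, S, B)                   # stand-in predicted confidences

# Index of the predictor with the highest IOU in each cell.
bestbox = ious.argmax(dim=-1)                        # shape (N, S, S)

# responsible[n, i, j, b] is 1 only for the best predictor of an object cell.
responsible = exists_box * F.one_hot(bestbox, num_classes=B).float()

# Every other predictor counts as "no object", including the
# non-responsible predictor that shares a cell with a responsible one.
noobj = 1.0 - responsible

# Sum-squared error against a target confidence of 0 for those predictors.
noobj_loss = torch.sum(noobj * pred_conf ** 2)
```

In an object cell exactly one of the B entries of `responsible` is 1, so the remaining B - 1 predictors in that cell fall under `noobj`.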
You set your MSE function with `reduction='sum'`, but then do not normalize for batch size. This means that the loss scales linearly with the batch size, which results in much larger losses (forcing low learning rates) and also entangles the batch size with the learning rate, which is bad. The correct implementation is to calculate the sum-squared error for each sample in the batch independently, then average them. To fix this, replace `return loss` with `return loss / float(predictions.size()[0])` (you will have to use a larger learning rate, but this is a good thing!).
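The scaling effect can be checked with a small experiment. This is only an illustration with random tensors of an assumed `(batch, 7, 7, 30)` shape, not the actual loss function:

```python
import torch
import torch.nn as nn

mse = nn.MSELoss(reduction="sum")
torch.manual_seed(0)

# Random stand-ins for predictions and targets (shape is an assumption).
pred = torch.rand(4, 7, 7, 30)
target = torch.rand(4, 7, 7, 30)

# Same data at batch size 4 and, duplicated, at batch size 8.
loss_4 = mse(pred, target)
loss_8 = mse(torch.cat([pred, pred]), torch.cat([target, target]))

# Dividing by the batch size gives the mean per-sample sum-squared error.
per_sample_4 = loss_4 / pred.size(0)
per_sample_8 = loss_8 / (2 * pred.size(0))
```

Here `loss_8` comes out at twice `loss_4` (up to floating-point rounding), while the two per-sample values agree, which is the batch-size independence the division is meant to provide.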
The flatten calls are totally unnecessary, or rather, they do nothing. Torch's MSE loss can be given any two tensors of the same shape.
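A quick check of this claim, using random tensors with an assumed `(2, 7, 7, 4)` shape:

```python
import torch
import torch.nn as nn

mse = nn.MSELoss(reduction="sum")
torch.manual_seed(0)

# Random stand-ins for box predictions and targets (shape is an assumption).
a = torch.rand(2, 7, 7, 4)
b = torch.rand(2, 7, 7, 4)

# Loss computed directly on the 4-D tensors.
loss_nd = mse(a, b)

# Loss computed after flattening all but the last dimension.
loss_flat = mse(torch.flatten(a, end_dim=-2), torch.flatten(b, end_dim=-2))
```

Both values agree because a summed element-wise squared error does not depend on how the elements are arranged.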
In `dataset.py`, you calculate the width and height target values for each box relative to the cell dimensions: `width_cell, height_cell = (width * self.S, height * self.S,)`. This is incorrect: they are supposed to be relative to the dimensions of the entire image (even though x and y are relative to the dimensions of a cell!). The reason, as stated in the paper, is so that each element of [x, y, w, h] falls between 0.0 and 1.0. To fix this, just remove the multiplication by `self.S`. The same change is needed on the other end, where you convert predicted labels back to boxes for visualization. This is really more about the data loading than the loss function, but because it unbalances the loss it has the same sort of effect: failing to fix it causes mode collapse on object classification when you try to generalize the model.
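A minimal sketch of the encoding described here, with the matching decode step. The helper names `encode_box` and `decode_box` are hypothetical, and the sketch assumes the input box is given relative to the image with x and y strictly below 1.0:

```python
def encode_box(x, y, w, h, S=7):
    """Turn an image-relative box into a grid cell index plus cell target."""
    i, j = int(S * y), int(S * x)          # row and column of the owning cell
    x_cell, y_cell = S * x - j, S * y - i  # centre offset within that cell
    # Width and height stay relative to the whole image: no scaling by S.
    return i, j, (x_cell, y_cell, w, h)


def decode_box(i, j, box, S=7):
    """Invert encode_box: recover the image-relative box."""
    x_cell, y_cell, w, h = box
    return ((j + x_cell) / S, (i + y_cell) / S, w, h)


# A box centred in the image, 40% of its width and 60% of its height.
i, j, cell_box = encode_box(0.5, 0.5, 0.4, 0.6)
restored = decode_box(i, j, cell_box)
```

All four target values stay within [0, 1] under this encoding, whereas multiplying w and h by S would let them grow as large as S.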
Obviously your project is just about overfitting the model, and none of these issues are apparent when attempting to overfit. They do, however, cause serious issues when you try to train the full model. If you want to fix them, feel free to refer to my re-implementation of the loss function, which should be compatible with yours but is rewritten to mimic the paper's formula as closely as possible. Do bear in mind, though, that mine evidently isn't perfect either (I can't get my model stable under 1e-2 learning rate, indicating a probable scaling mistake somewhere).