MohsenZand / ObjectBox

(ECCV 22 Oral) ObjectBox: From Centers to Boxes for Anchor-Free Object Detection
GNU General Public License v3.0

A few questions about the ObjectBox #20

Open Icecream-blue-sky opened 1 year ago

Icecream-blue-sky commented 1 year ago

1. Unfair comparison with state-of-the-art methods. State-of-the-art detectors such as VFNet do not use heavy data augmentation and are trained for only 24 epochs, whereas ObjectBox uses CutMix and Mosaic augmentation and needs 300 epochs of training. Why?

2. The advantage of using SDIoU. The only difference between SDIoU and DIoU is that the squares of the Euclidean distances are used instead of box areas in the IoU calculation. What is the advantage of doing this? Moreover, the performance gap between SDIoU and GIoU is very small (as shown in Table S.3); since DIoU can do everything SDIoU can, would DIoU perform even better than SDIoU? In addition, what is meant by a "scale-invariant distance-based IoU", given that an IoU loss is itself scale-invariant?

3. The regression targets of ObjectBox. I think the regression target of ObjectBox is essentially the same as in FCOS. Have you tried using the FCOS regression targets directly?
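For reference on point 3, the FCOS-style regression targets being referred to are the distances from a location $(x, y)$ inside a ground-truth box $(x_0, y_0, x_1, y_1)$ to its four sides (standard FCOS notation, not taken from this repository):

$$
l^{*} = x - x_0, \qquad t^{*} = y - y_0, \qquad r^{*} = x_1 - x, \qquad b^{*} = y_1 - y .
$$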

MohsenZand commented 1 year ago

1) We believe this is a matter of convergence rather than fairness. Since ObjectBox uses a different backbone architecture, it naturally trains and converges differently. We did not compare convergence or training speed because we make no claims about them. We do, however, report inference speed, which matters far more than training speed. We also employ the same augmentation strategies as YOLOv4.

2) As we explained in the paper, the standard IoU losses cannot be used directly in our case. Moreover, we never have two fully separate boxes, since the GT and predicted boxes always share at least one point. For the advantages of using squared Euclidean distances, please see the scale-balanced loss paper.
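For comparison only, here is a minimal sketch of the standard DIoU penalty discussed in the question (this is not the repository's SDIoU implementation; as described above, SDIoU replaces the area-based terms with squared Euclidean distances between points):

```python
import torch

def diou_loss(pred, target, eps=1e-7):
    """Standard DIoU loss for axis-aligned boxes in (x1, y1, x2, y2) format.
    Shown only as a reference point for the SDIoU discussion above."""
    # Intersection area
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]

    # Union area
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)

    # Squared distance between box centers
    center_p = (pred[:, :2] + pred[:, 2:]) / 2
    center_t = (target[:, :2] + target[:, 2:]) / 2
    d2 = ((center_p - center_t) ** 2).sum(dim=1)

    # Squared diagonal of the smallest enclosing box
    enc_lt = torch.min(pred[:, :2], target[:, :2])
    enc_rb = torch.max(pred[:, 2:], target[:, 2:])
    c2 = ((enc_rb - enc_lt) ** 2).sum(dim=1)

    return 1 - iou + d2 / (c2 + eps)
```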

3) This has been explained several times throughout the paper. The paper is in fact built around the distinctions between ObjectBox and FCOS and other detectors.

xiaobai111111223 commented 1 year ago

Why is the area of the non-overlapping regions expressed in this way?

MohsenZand commented 1 year ago

Please see this.

Zzh-tju commented 1 year ago

I can't believe that ObjectBox is based on YOLOv5x (a strong, highly optimized CNN-based detector with 50.7 AP), yet your method reaches only 46.8 AP. Why is there such a large degradation? I was uneasy when I read the Implementation Details of the paper.

  1. I fully agree with @Icecream-blue-sky's concerns about the unfair comparison. We all know that YOLOv5 adopts many tricks to improve performance, such as training from scratch, 300-epoch training, hyper-parameter evolution, and CutMix/Mosaic data augmentation. It is hard to be convinced that the improvement in Table 1 is owing to your proposed methods; aligning the backbone network to ResNet-101 is far from enough.

  2. The advantage of the new bbox representation is ambiguous. Because of drawback 1, it is unclear whether this representation is better than the FCOS-style representation or others; no ablation is presented. Moreover, in theory the new representation may not form a valid box: for example, the predicted right edge can end up to the left of the predicted left edge, and the bottom edge above the top edge (see the sketch after this list). This is a failure mode that the FCOS box representation avoids entirely. What's more, according to ATSS, what is essential is the label assignment, not the bbox representation.

  3. From Table 3, it is hard to believe that the MSE loss yields only 22.6 AP. Doesn't that suggest the new bbox representation itself is very poor? An MSE-trained FCOS would certainly not perform that badly.

  4. From Table 2 part C, with scale assignment the performance drops from 46.8 to 35.8 AP. I'm afraid the original YOLOv5 is more accurate than that, and YOLOv5's label assignment certainly uses scale assignment or something similar, or at least an N-to-1 matching strategy.
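A minimal sketch of the concern in point 2, assuming a generic four-offset parameterization around a reference location (this is a hypothetical decoding for illustration, not the exact decoding used in this repository): if raw offsets are regressed without a positivity constraint, the decoded box can be inverted, whereas an FCOS-style parameterization keeps the distances non-negative by construction.

```python
import torch

def decode_plain(cx, cy, offsets):
    """Hypothetical decoding of raw (left, top, right, bottom) offsets around a
    reference point (cx, cy). Negative offsets can yield x2 < x1 or y2 < y1,
    i.e. an inverted box."""
    l, t, r, b = offsets.unbind(-1)
    return torch.stack([cx - l, cy - t, cx + r, cy + b], dim=-1)

def decode_fcos_style(cx, cy, raw):
    """FCOS-style decoding: raw outputs pass through exp() (or ReLU), so the
    four distances are non-negative and the box is always well-formed."""
    l, t, r, b = raw.exp().unbind(-1)
    return torch.stack([cx - l, cy - t, cx + r, cy + b], dim=-1)

# Example: one raw prediction with a negative "right" offset
raw = torch.tensor([[2.0, 1.5, -3.0, 2.0]])
cx, cy = torch.tensor(10.0), torch.tensor(10.0)
print(decode_plain(cx, cy, raw))       # x2 < x1: inverted box
print(decode_fcos_style(cx, cy, raw))  # always a valid box
```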

MohsenZand commented 1 year ago

Thank you for your critical comments. In the paper and the code, however, we did our best to include all the necessary information.

Clearly, we do not use all of YOLOv5's data augmentation, hyper-parameter evolution, or other strategies such as anchor design and selection. In judging the results and the comparison, you overlook the fact that ObjectBox is an anchor-free detector while YOLOv5 is anchor-based (the impact of anchor boxes is ignored).

Nevertheless, our method is primarily proposed to tackle label assignment and generalization issues. It introduces a novel and effective label assignment approach (as a result of the bbox representation), as well as a highly general anchor-free detector with no data-dependent hyper-parameters.

We are, however, attempting to further enhance it and solve its shortcomings.