deepcam-cn / yolov5-face

YOLO5Face: Why Reinventing a Face Detector (https://arxiv.org/abs/2105.12931) (ECCV Workshops 2022)
GNU General Public License v3.0
2.04k stars 495 forks

Mismatch in architecture between the paper and the pretrained model #97

Open 07embracebootie opened 2 years ago

07embracebootie commented 2 years ago

I noticed that the pretrained network's architecture is slightly shallower than the one proposed in the paper. For instance, the paper has 4 C3 blocks before the SPP block, but the pretrained model has only 3. Could you please explain this? Or am I wrong? I simply called "print(model)" in a notebook and reconstructed the architecture by hand. Is there something wrong with how PyTorch prints the model, or is the model really missing some parts compared to the one in your paper?

Update: I did a quick check. Only the P6 version of your pretrained model matches the paper. That doesn't make sense to me, because the paper doesn't mention changing the backbone when the P6 output is removed. It reads as if the P6 block were just a removable head, with the backbone and the neck left unchanged. Is this the conventional meaning of "removing the P6 output" in computer vision?
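Counting blocks by hand from "print(model)" output is easy to get wrong, so a small helper can do the count instead. The sketch below is only an illustration: `count_before` and the stand-in classes `Conv`, `C3` and `SPP` are hypothetical names written for this example, not code from the repository, and the toy layer list is not the real backbone. With a loaded PyTorch model, the assumption is that you would pass `model.modules()`, which yields submodules in registration order.

```python
from typing import Iterable


def count_before(modules: Iterable[object], block: str = "C3", stop: str = "SPP") -> int:
    """Count modules whose class name is `block` until the first `stop` module."""
    count = 0
    for m in modules:
        name = type(m).__name__
        if name == stop:
            return count
        if name == block:
            count += 1
    raise ValueError(f"no {stop!r} module found")


# Hypothetical stand-ins so the snippet runs without PyTorch or weights.
class Conv: ...
class C3: ...
class SPP: ...


# Toy layer order, NOT the real YOLO5Face backbone.
toy_backbone = [Conv(), C3(), Conv(), C3(), Conv(), C3(), SPP(), C3()]
print(count_before(toy_backbone))  # C3 stand-ins that appear ahead of SPP
```

Running the same count on the P6 and non-P6 checkpoints would make the difference described above directly comparable.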

Two more questions. First, Fig. 1e in the paper shows a C3 block with a CONV layer between the bottleneck layers and the CONCAT layer, but that layer does not seem to appear in the code (in models/common.py, the BottleneckCSP module has it, but the C3 module does not). Second, the depth multiple is (1.0, 0.67, 0.33) in the code, but (1.0, 0.5, 0.33) in the paper. Is that a typo?
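Whether 0.67 versus 0.5 matters depends on how the depth multiple is turned into a repeat count. The sketch below assumes the rounding rule used in upstream YOLOv5's `parse_model` (`max(round(n * gd), 1)` for `n > 1`), and that this repository, being a fork, applies the same rule. The helper name `scaled_repeats` and the base repeat counts are hypothetical values chosen for illustration, not read from the repository's YAML files.

```python
def scaled_repeats(n: int, depth_multiple: float) -> int:
    """Scale a block's repeat count by the depth multiple (assumed YOLOv5 rule)."""
    return max(round(n * depth_multiple), 1) if n > 1 else n


# Hypothetical base repeat counts, for illustration only.
BASE_REPEATS = (3, 9)

for gd in (1.0, 0.67, 0.5, 0.33):
    print(gd, [scaled_repeats(n, gd) for n in BASE_REPEATS])
```

Under these assumptions a base count of 3 scales to 2 for both 0.67 and 0.5, while a base count of 9 scales to 6 and 4 respectively, so the two values would only produce different depths for blocks with a larger base repeat count.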

QAQEthan commented 2 years ago

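@Chad77L "there are 4 C3 blocks before reaching to the SPP block in the paper, but in the pretrained model, there are only 3." Could it be that you loaded the wrong pretrained model? The paper shows the architecture that includes the P6 layer. Did you perhaps load a pretrained model without P6?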