WongKinYiu / yolov7

Implementation of paper - YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors
GNU General Public License v3.0

Why is the effect of distillation for YOLOv7 not good? #380

Open Errol-golang opened 2 years ago

Errol-golang commented 2 years ago

Recently, I tried to improve the performance of YOLOv7-tiny by distilling knowledge from YOLOv7. I have tried both logits-based and feature-based distillation, but neither worked for YOLOv7, even though the same code works on YOLOv5. I'm wondering whether you have tried to distill YOLOv7-tiny, and why the distillation effect is poor. Is the reason the architecture of YOLOv7, the loss computation, or something else? Looking forward to your reply. If any other friends have run into the same problem, could we have a conversation about it?
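
As a point of reference, here is a minimal sketch of what the logits-based term can look like for a sigmoid-based head such as YOLOv7's. It is not the code used in this thread; the function name, the temperature value, and the assumption that student and teacher predictions have already been gathered for the same grid cells and anchors are all illustrative.

```python
import torch
import torch.nn.functional as F

def logits_kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-target BCE between student class logits and the teacher's softened outputs.

    YOLO-style heads predict per-class sigmoids rather than a softmax, so BCE against
    the teacher's sigmoid probabilities is a natural soft-label distillation term.
    Both inputs are assumed to be shape [N, num_classes] for matched predictions.
    """
    soft_targets = torch.sigmoid(teacher_logits / temperature).detach()
    return F.binary_cross_entropy_with_logits(student_logits / temperature, soft_targets)
```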

WongKinYiu commented 2 years ago

YOLOv7-tiny should be able to work with knowledge distillation. One thing to notice is that YOLOv5 uses depth/width ratios to control the number of layers and channels, so the indices of the output blocks of YOLOv5n, s, m, l, and x are the same. Another thing to notice is that the default YOLOv7 and YOLOv7-tiny have different anchors and are trained with different input sizes. You may need to check the index and anchor parts of your knowledge distillation code.
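
As an illustration of the anchor check suggested above, the sketch below compares the two detection heads. It assumes the released checkpoint layout and YOLOv5-style head attributes (model.model[-1].anchors and .stride); verify these names against the actual yolov7 code before relying on them.

```python
import torch

# Run from inside the yolov7 repo so the pickled model classes can be resolved.
teacher = torch.load("yolov7.pt", map_location="cpu")["model"].float()
student = torch.load("yolov7-tiny.pt", map_location="cpu")["model"].float()

for name, model in (("teacher", teacher), ("student", student)):
    head = model.model[-1]  # detection head is conventionally the last module
    # anchors are stored divided by the stride; multiply back to get pixel sizes
    print(name, "strides:", head.stride.tolist())
    print(name, "anchors (px):\n", head.anchors * head.stride.view(-1, 1, 1))
```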

Errol-golang commented 2 years ago

YOLOv7-tiny should be able to work with knowledge distillation. One thing to notice is that YOLOv5 uses depth/width ratios to control the number of layers and channels, so the indices of the output blocks of YOLOv5n, s, m, l, and x are the same. Another thing to notice is that the default YOLOv7 and YOLOv7-tiny have different anchors and are trained with different input sizes. You may need to check the index and anchor parts of your knowledge distillation code.

Thanks. I have found another issue about distillation, and I'm waiting for the author's open-source code.

Errol-golang commented 2 years ago

YOLOv7-tiny should be able to work with knowledge distillation. One thing to notice is that YOLOv5 uses depth/width ratios to control the number of layers and channels, so the indices of the output blocks of YOLOv5n, s, m, l, and x are the same. Another thing to notice is that the default YOLOv7 and YOLOv7-tiny have different anchors and are trained with different input sizes. You may need to check the index and anchor parts of your knowledge distillation code.

Thanks for your reply. I've tried replacing the anchors of YOLOv7-tiny with the anchors of YOLOv7, but the distillation still does not work. Do you think it is possible that this is due to the model gap between YOLOv7 and YOLOv7-tiny?

lin1github commented 1 year ago

@Errol-golang did you make any progress? I also ran some distillation experiments (YOLOv7 as teacher and YOLOv7-tiny as student, using FGD/CWD/MGD/PKD distillation losses), but I cannot get the kind of improvement that other anchor-based detectors show.

Fushier commented 1 year ago

@Errol-golang did you make any progress? I also ran some distillation experiments (YOLOv7 as teacher and YOLOv7-tiny as student, using FGD/CWD/MGD/PKD distillation losses), but I cannot get the kind of improvement that other anchor-based detectors show.

I conducted experiments on the VOC2007 dataset, using YOLOv7 as the teacher model and YOLOv7-tiny as the student model. YOLOv7-tiny indeed achieved better performance and converged faster. Drawing on feature distillation in CNNs, I registered hook functions in the down-sampling layers of the backbone network to obtain feature maps for feature distillation (such as SP, AT, and SemCKD). Specifically, in YOLOv7, the layers were at 3, 16, 29, and 42, while in YOLOv7-tiny, they were at 1, 8, 15, and 22 (adaptive layers are needed to match the dimensions of these features). Of course, these layer numbers are not necessarily correct, but I hope this information is helpful to you.
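
A minimal sketch of the hook-based feature distillation described above, using the layer indices mentioned in this comment (3/16/29/42 for YOLOv7, 1/8/15/22 for YOLOv7-tiny). The model.model[i] indexing, the channel counts passed to the adapter, and the plain MSE mimicking loss are assumptions; SP, AT, or SemCKD would replace the loss term.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

TEACHER_LAYERS = [3, 16, 29, 42]  # down-sampling layers in YOLOv7 (per the comment above)
STUDENT_LAYERS = [1, 8, 15, 22]   # counterparts in YOLOv7-tiny

def register_feature_hooks(model, layer_ids):
    """Attach forward hooks that stash the output feature map of each chosen layer."""
    feats, handles = {}, []
    for i in layer_ids:
        def hook(_module, _inputs, output, idx=i):
            feats[idx] = output
        handles.append(model.model[i].register_forward_hook(hook))
    return feats, handles

class FeatureAdapter(nn.Module):
    """1x1 convs ("adaptive layers") projecting student features to the teacher's widths."""
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        self.projs = nn.ModuleList(
            nn.Conv2d(sc, tc, kernel_size=1)
            for sc, tc in zip(student_channels, teacher_channels)
        )

    def forward(self, student_feats):
        return [proj(f) for proj, f in zip(self.projs, student_feats)]

def feature_distill_loss(student_feats, teacher_feats, adapter):
    """Plain feature-mimicking (MSE) loss; SP/AT/SemCKD would replace this term."""
    loss = 0.0
    for ps, tf in zip(adapter(student_feats), teacher_feats):
        if ps.shape[-2:] != tf.shape[-2:]:  # align spatial size if strides differ
            ps = F.interpolate(ps, size=tf.shape[-2:], mode="bilinear", align_corners=False)
        loss = loss + F.mse_loss(ps, tf.detach())  # teacher features are frozen targets
    return loss
```

During training the teacher would run under torch.no_grad() on the same batch, and this term would be added to the usual detection loss with a weighting factor.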

lin1github commented 1 year ago

@Errol-golang did you make any progress? I also ran some distillation experiments (YOLOv7 as teacher and YOLOv7-tiny as student, using FGD/CWD/MGD/PKD distillation losses), but I cannot get the kind of improvement that other anchor-based detectors show.

I conducted experiments on the VOC2007 dataset, using YOLOv7 as the teacher model and YOLOv7-tiny as the student model. YOLOv7-tiny indeed achieved better performance and converged faster. Drawing on feature distillation in CNNs, I registered hook functions in the down-sampling layers of the backbone network to obtain feature maps for feature distillation (such as SP, AT, and SemCKD). Specifically, in YOLOv7, the layers were at 3, 16, 29, and 42, while in YOLOv7-tiny, they were at 1, 8, 15, and 22 (adaptive layers are needed to match the dimensions of these features). Of course, these layer numbers are not necessarily correct, but I hope this information is helpful to you.

Thanks for the information. In the past few weeks, my experiments have also shown some improvement when distilling YOLOv7-tiny with YOLOv7 as the teacher. Unlike your experiments, mine were conducted on the COCO dataset, with the FPN feature layers as the distillation targets.
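
For context, one of the losses mentioned earlier in this thread, channel-wise distillation (CWD), applied to a pair of matched FPN feature maps could look like the sketch below. The tensor shapes, the temperature, and the assumption that the student features have already been projected to the teacher's channel count are all illustrative, not the commenter's code.

```python
import torch.nn.functional as F

def cwd_loss(student_feat, teacher_feat, tau=1.0):
    """KL divergence between per-channel spatial distributions (softmax over H*W)."""
    n, c = student_feat.shape[:2]
    s = F.log_softmax(student_feat.view(n, c, -1) / tau, dim=-1)
    t = F.softmax(teacher_feat.view(n, c, -1) / tau, dim=-1).detach()
    return F.kl_div(s, t, reduction="sum") * (tau ** 2) / (n * c)
```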

zgzhengSEU commented 1 year ago

I used CMD for knowledge distillation, with YOLOv7x as the teacher network and YOLOv7-tiny as the student network. In my experiments, the accuracy of YOLOv7-tiny actually decreased.

godhj93 commented 12 months ago

Hi @Fushier, I'm interested in your experiment. I am also trying to apply knowledge distillation to the YOLOv7-tiny model, but on the VisDrone dataset, and it is difficult for me to make the tiny model converge faster. If you can share your code, that would be great. If you would rather not share the code, could you give me some advice, for example on the loss function for KD?