LutingWang / OADP

Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection
Apache License 2.0

Reproduction of oadp_ov_lvis.py #13

parang99 opened this issue 1 year ago (status: Open)

parang99 commented 1 year ago

Thank you for sharing this great work.
I tried to reproduce the results (20.6 bbox APr) on OV-LVIS. The command I used is `torchrun --nproc_per_node=8 -m oadp.dp.train oadp_ov_lvis configs/dp/oadp_ov_lvis.py --override .trainer.evaluation.interval:24`. The results are as follows:

| OD | APr | APc | APf | AP |
| --- | --- | --- | --- | --- |
| checkpoint | 20.7 | 28.2 | 32.3 | 28.5 |
| reproduce | 18.2 | 27.3 | 32.1 | 27.6 |
However, there is a performance gap, especially in APr. The 'checkpoint' row refers to the test result obtained with the checkpoint you provided.
To see what the problem might be, I checked the LVIS training log on Baidu and noticed that it contains no configs related to Global KD. So I reproduced the run again without Global KD, matching the log. The results are as follows:
| OD | APr | APc | APf | AP |
| --- | --- | --- | --- | --- |
| checkpoint | 20.7 | 28.2 | 32.3 | 28.5 |
| reproduce without Global KD | 19.1 | 27.9 | 32.4 | 28.1 |

APr improves slightly, but it is still lower than 20.7.
This leaves me with two questions. Q1: Is it correct that the final OV-LVIS model does not use Global KD? Q2: Did I do something wrong in my reproduction?

These are my experiment settings:

LutingWang commented 1 year ago

For Q1, we intentionally removed Global KD, since the ablation study shows that doing so improves performance. In your case, Global KD seems to improve APr, which may be due to the lower baseline. For Q2, I don't see any mistake in the command or environment info.

Could you please evaluate the other checkpoints? Sometimes the last checkpoint is not the best one. It would also be insightful if you could share the losses of the first and last several iterations.

parang99 commented 1 year ago

Thank you for your quick reply, and sorry for the late response. I really wanted to understand why there is a performance gap, so there were a few things I needed to check first.

First, following your advice, I evaluated the other checkpoints for both cases (reproduction with and without Global KD). The results are as follows:

I think the best epochs in the two cases are the 17th and the 20th, respectively. But in both cases, 20.7 APr cannot be reached. Based on these results, I concluded that the performance difference arises because the environments and GPUs cannot be completely identical.

Second, I captured the loss curves from wandb in case they provide more insight (three screenshots attached).

Lastly, another question arose: the test results reported right after training are slightly different from those obtained by re-evaluating the saved checkpoint. Can you tell me the reason for this?

LutingWang commented 1 year ago

From the test results and loss curves you provided, we did not find any abnormalities. Therefore, over the past week we ran some experiments to reproduce this phenomenon. We found that the training process can be influenced by random factors, resulting in unstable accuracy. We therefore recommend training with different random seeds, which may yield higher results.
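(Purely as a generic illustration, not OADP's own seeding code, varying the seed usually amounts to something like the sketch below; how the seed is actually passed to training depends on the repo's config and CLI.)

```python
# Generic sketch of seeding the usual RNGs in a PyTorch training script.
# Not OADP's own mechanism; shown only to illustrate what "different random
# seeds" means in practice.
import random

import numpy as np
import torch


def set_seed(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


set_seed(3407)  # rerun training with a few different values and compare APr
```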

Additionally, we analyzed the reasons for the accuracy fluctuations and believe the cause is suppression of the rare classes by the base classes. Specifically, since the model is trained with base-class annotations, it tends to predict objects as base classes. To eliminate this effect, we suggest applying the following patch and retesting:

```diff
diff --git a/oadp/dp/roi_heads.py b/oadp/dp/roi_heads.py
index e702981..e0bca08 100644
--- a/oadp/dp/roi_heads.py
+++ b/oadp/dp/roi_heads.py
@@ -109,6 +109,7 @@ class ViLDEnsembleRoIHead(StandardRoIHead):
         cls_score[:, -1] = 1 - cls_score[:, :-1].sum(-1)
 
         bbox_results['cls_score'] = cls_score.log()
+        bbox_results['cls_score'][:, :866] = float('-inf')
         return bbox_results
 
     def _object_forward(
```

This patch will be added to the main branch.
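For clarity, here is a minimal standalone sketch (not the repository's code, just an illustration) of what the added line does, assuming the LVIS v1 ordering in which the first 866 score columns are base (common + frequent) classes, the next 337 are rare classes, and the last column is background:

```python
import torch

num_base, num_rare = 866, 337  # assumed LVIS v1 split: 866 base + 337 rare classes

# Dummy per-RoI class probabilities with a trailing background column.
cls_score = torch.rand(4, num_base + num_rare + 1).softmax(dim=-1)
log_score = cls_score.log()

# The patched line masks the base-class columns so they can never win the
# argmax, preventing base classes from suppressing rare-class predictions
# at test time.
log_score[:, :num_base] = float('-inf')

pred = log_score[:, :-1].argmax(dim=-1)  # predictions now fall in the rare-class range
assert (pred >= num_base).all()
```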

For the last question, we have also observed this phenomenon. We suspect it is due to floating-point precision (fp32 during testing vs. fp16 during training), but we have not investigated it thoroughly yet, so we cannot confirm this. Our suggestion is to test the model's accuracy after training and report that accuracy as the final result, since, given the checkpoint, it can always be reproduced.
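As a toy illustration of the kind of discrepancy we mean (unrelated to OADP's actual code paths), casting values to fp16 and back already introduces small errors:

```python
import torch

torch.manual_seed(0)
w = torch.rand(1000)                      # stand-in for weights or activations

w_half = w.half()                         # fp16 copy, as used in mixed-precision training
roundtrip_error = (w - w_half.float()).abs().max()
print(f"max fp16 round-trip error: {roundtrip_error.item():.2e}")  # non-zero

# Tiny per-element errors like this can accumulate through a deep network and
# nudge borderline detections, which is one plausible source of the small gap
# between the two evaluation results.
```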

Sorry for the inconvenience. Please feel free to ask if you have any further questions.