dynamic reweighting causes performance degradation in reproducing

Charles-Xie commented 2 years ago

Hi, thanks for sharing the code! Great work!

I have a small question about reproducing your results. I ran the CDN-S model (res50, 3+3). It gave about 31.5 and 31.2 (I ran it twice) after the first training stage (training the whole model with the regular loss). But after the second training stage (decoupled training) finished, the performance dropped to 31.0 and 30.4 for these two runs respectively. The trick does not seem to help for full mAP, rare mAP, or non-rare mAP.

So I wonder what could have gone wrong in my reproduction, or what the reason might be. I will paste the commands and logs below. Thanks. Nice day :3

Charles-Xie commented 2 years ago

command (exactly the same as the one provided in readme, except the output_dir):


python3 -m torch.distributed.launch \
        --nproc_per_node=8 \
        --use_env \
        main.py \
        --pretrained params/detr-r50-pre-2stage-q64.pth \
        --output_dir logs/base \
        --dataset_file hico \
        --hoi_path data/hico_20160224_det \
        --num_obj_classes 80 \
        --num_verb_classes 117 \
        --backbone resnet50 \
        --num_queries 64 \
        --dec_layers_hopd 3 \
        --dec_layers_interaction 3 \
        --epochs 90 \
        --lr_drop 60 \
        --use_nms_filter

python3 -m torch.distributed.launch \
        --nproc_per_node=8 \
        --use_env \
        main.py \
        --pretrained logs/base/checkpoint_last.pth \
        --output_dir logs/base \
        --dataset_file hico \
        --hoi_path data/hico_20160224_det \
        --num_obj_classes 80 \
        --num_verb_classes 117 \
        --backbone resnet50 \
        --num_queries 64 \
        --dec_layers_hopd 3 \
        --dec_layers_interaction 3 \
        --epochs 10 \
        --freeze_mode 1 \
        --obj_reweight \
        --verb_reweight \
        --use_nms_filter

echo "base"

corresponding result (log): 31.5 after 1st training stage and 31.0 after 2nd training stage: log.txt

Charles-Xie commented 2 years ago

for the 2nd run: command (exactly the same as the one provided in readme, except the output_dir):

python3 -m torch.distributed.launch \
        --nproc_per_node=8 \
        --use_env \
        main.py \
        --pretrained params/detr-r50-pre-2stage-q64.pth \
        --output_dir logs/base_4worker \
        --dataset_file hico \
        --hoi_path data/hico_20160224_det \
        --num_obj_classes 80 \
        --num_verb_classes 117 \
        --backbone resnet50 \
        --num_queries 64 \
        --dec_layers_hopd 3 \
        --dec_layers_interaction 3 \
        --epochs 90 \
        --lr_drop 60 \
        --use_nms_filter \
        --num_workers 4

python3 -m torch.distributed.launch \
        --nproc_per_node=8 \
        --use_env \
        main.py \
        --pretrained logs/base_4worker/checkpoint_last.pth \
        --output_dir logs/base_4worker \
        --dataset_file hico \
        --hoi_path data/hico_20160224_det \
        --num_obj_classes 80 \
        --num_verb_classes 117 \
        --backbone resnet50 \
        --num_queries 64 \
        --dec_layers_hopd 3 \
        --dec_layers_interaction 3 \
        --epochs 10 \
        --freeze_mode 1 \
        --obj_reweight \
        --verb_reweight \
        --use_nms_filter \
        --num_workers 4

echo "base_4worker"

corresponding result (log): 31.2 after 1st training stage and 30.4 after 2nd training stage: log.txt

YueLiao commented 2 years ago

This module was implemented by @zhangaixi2008, and he will reply to you later.

YueLiao commented 2 years ago

aha 31.5%, a new SOTA with CDN-S.

zhangaixi2008 commented 2 years ago

You can have a try with the following command.

python -m torch.distributed.launch \
        --master_port 10026 \
        --nproc_per_node=4 \
        --use_env \
        main.py \
        --pretrained logs/base/checkpoint_last.pth \
        --output_dir logs/base \
        --hoi \
        --dataset_file hico \
        --hoi_path data/hico_20160224_det \
        --num_obj_classes 80 \
        --num_verb_classes 117 \
        --backbone resnet50 \
        --set_cost_bbox 2.5 \
        --set_cost_giou 1 \
        --bbox_loss_coef 2.5 \
        --giou_loss_coef 1 \
        --num_queries 64 \
        --dec_layers_stage1 3 \
        --dec_layers_stage2 3 \
        --epochs 10 \
        --freeze_mode 1 \
        --obj_reweight \
        --verb_reweight \
        --queue_size 9408 \
        --p_obj 0.7 \
        --p_verb 0.7 \
        --lr 5e-6 \
        --lr_backbone 5e-7 \
        --use_nms_filter

boringwar commented 2 years ago

@zhangaixi2008 It does not work for me either. The reweighting retraining leads to a performance drop. Details are as follows:

Using given script training on HICO-DET:

CDN S:

best in first 90 epoch: 31.71
fine-tune degrades to 30.96

CDN B:

best in first 90 epochs: 31.6
fine-tune degrades to 30.6

I'm using the above script to re-run cdn-s finetuning.

zhangaixi2008 commented 2 years ago

@haak0 please upload your model here, let me have a look.

boringwar commented 2 years ago

@zhangaixi2008 Hi, some of my checkpoints were overwritten. I am re-running the experiments.

boringwar commented 2 years ago

Hi, I re-ran the experiments, and here are the logs.

CDN small:

best in first 90 epochs: 30.99
fine-tune degrades to 30.3

Here's the script and log: small.txt

CDN base:

best in first 90 epochs: 31.98
fine-tune degrades to 30.6

Here's the script and log: base.txt

zhangaixi2008 commented 2 years ago

Hi, I made a mistake in the previous readme for running the fine-tune process. Please use the script I provided above in this issue. As we stated in the paper, we use a small learning rate to fine-tune the model from the first stage. Thus, we set lr to 5e-6 and lr_backbone to 5e-7 for bs=8, or lr to 1e-5 and lr_backbone to 1e-6 for bs=16. Please try again and let us see the results. Sorry for our carelessness; we have already updated the readme.
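The two settings above (bs=8 and bs=16) are consistent with scaling both learning rates linearly with the batch size. A minimal sketch of that rule follows. The helper name `scaled_lrs` is hypothetical, and applying the rule to batch sizes other than 8 and 16 is an assumption, not something confirmed in this thread.

```python
def scaled_lrs(batch_size, ref_batch_size=8, ref_lr=5e-6, ref_lr_backbone=5e-7):
    """Return (lr, lr_backbone) for the fine-tune stage.

    Assumption: the learning rates scale linearly with batch size, which
    matches the two settings quoted in this thread (bs=8 and bs=16).
    Other batch sizes are an extrapolation.
    """
    scale = batch_size / ref_batch_size
    return ref_lr * scale, ref_lr_backbone * scale


if __name__ == "__main__":
    for bs in (8, 16):
        lr, lr_backbone = scaled_lrs(bs)
        # These values would be passed as --lr and --lr_backbone.
        print(f"bs={bs}: --lr {lr:g} --lr_backbone {lr_backbone:g}")
```

The resulting values correspond to the `--lr` and `--lr_backbone` flags in the fine-tune command.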

boringwar commented 2 years ago

@zhangaixi2008 Hi, I reproduced the fine-tune result following your script, and the result is reasonable.

CDN-base:

first 90 epochs: 32.05
fine-tune: 32.12

All the results are evaluated using the python script.

BTW, what's the meaning of the "vis_tag" in hico_eval.py?

zhangaixi2008 commented 2 years ago

For CDN-base, you have already surpassed the results reported in our paper (official matlab 31.78, python 31.86). Good job^^ For 'vis_tag', you can see the evaluation script. In short, we filter out the already-matched ground-truth HOIs when calculating fp and tp during evaluation.
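To illustrate the idea, here is a minimal sketch of the usual "visited tag" pattern in detection evaluation. It is not the code from hico_eval.py: the function name `mark_tp_fp`, its arguments, and the rule of choosing the best-overlapping ground truth are all assumptions made for illustration.

```python
def mark_tp_fp(pred_scores, pred_candidates, num_gt):
    """Label predictions as true/false positives, one ground truth per match.

    pred_scores[i]     : confidence of prediction i
    pred_candidates[i] : list of (gt_index, overlap) pairs for the ground
                         truths that prediction i overlaps sufficiently
    num_gt             : number of ground-truth HOIs for this category
    (All names here are hypothetical, not taken from hico_eval.py.)
    """
    vis_tag = [False] * num_gt  # True once a ground truth has been matched
    # Process predictions from highest to lowest confidence.
    order = sorted(range(len(pred_scores)), key=lambda i: -pred_scores[i])
    tp, fp = [], []
    for i in order:
        # Assumption: match against the best-overlapping ground truth.
        best = max(pred_candidates[i], key=lambda c: c[1], default=None)
        if best is not None and not vis_tag[best[0]]:
            vis_tag[best[0]] = True  # this ground truth is now used up
            tp.append(1)
            fp.append(0)
        else:
            # No candidate, or the ground truth was already matched.
            tp.append(0)
            fp.append(1)
    return tp, fp


if __name__ == "__main__":
    # Two predictions hit the same ground truth; only the first counts.
    print(mark_tp_fp([0.9, 0.8, 0.3], [[(0, 0.8)], [(0, 0.7)], []], num_gt=1))
```

In this sketch a duplicate detection of an already-matched ground truth counts as a false positive, so each ground truth contributes at most one true positive.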

YueLiao commented 2 years ago

The issue with the "Re-weight module" seems to be solved. If you have any other issues, feel free to open a new one.

YueLiao / CDN

dynamic reweighting causes performance degradation in reproducing #4