[Feature] Refine DETA and H-Deformable-DETR project

rentainhe commented 1 year ago

TODO

[x] refine DETA and H-Deformable-DETR project
[x] refine README
[x] update ModelZoo
[x] align training config with the official repo and reproduce the result for DETA and H-Deform.DETR

Experiments

using hacked train_net.py and aligned hyper-param (batch_size=16, max_iters=90000) got 49.9AP with 12ep training, using a random seed
aligned hyper-param, H-Deformable-DETR-R50 achieves 49.1AP with topk=300 for evaluation, outperforming the 48.9AP baseline in official repo.
get better result with no frozen backbone on DETA: 50.2AP

jozhang97 commented 1 year ago

Thanks for the nice work! I'm looking into the performance discrepancy.

I noticed that I cannot reproduce the improved DeformDETR baseline in detrex (47.7 vs 46.5). I'm looking into this right now, will update if I find out why.

rentainhe commented 1 year ago

Thanks for the nice work! I'm looking into the performance discrepancy.

I noticed that I cannot reproduce the improved DeformDETR baseline in detrex (47.7 vs 46.5). I'm looking into this right now, will update if I find out why.

Hello @jozhang97 ! there are some different hyper-param of Deform.DETR in detrex:

we set lr=1e-4
we use loss_class_weight=1
the lr of sampling_offsets and reference_points is also 1e-4 in detrex, which are both 1e-5 in the official implementation
We did only freeze the stem of the backbone before

We've found this hyper-param is better for the two-stage 50eps baseline (we got 48.2AP), but I think it's not better for the improved baseline, and we're also trying to reproduce the result by fixing the hyper-params. We hacked into the train_net file for this, and we're trying to reproduce the results recently, we will strickly align the training hyper-param to see if the results can be reproduced or not.

BTW, I'm not sure if all the other hyper-param for DETA model is aligned now, that will be great if you would like to help us to double check it~ thanks a lot!

rentainhe commented 1 year ago

Why using a hacked train_net.py

We did aligning the training hyper-param by modified the config before by setting

optimizer.params.lr_factor_func = lambda module_name: 0.1 if "backbone" or "sampling_offsets or "reference_points" in module_name else 1

Which get a worse result, and I setting it's not correct to set learning rate in this way, so we decided to use a hacked train_net to totally align the param_dict in optimizer to see if the result can be reproduced, we will work on this recently

rentainhe commented 1 year ago

Thanks for the nice work! I'm looking into the performance discrepancy.

I noticed that I cannot reproduce the improved DeformDETR baseline in detrex (47.7 vs 46.5). I'm looking into this right now, will update if I find out why.

The improved Deform.DETR baseline can achieve 47.7AP in 12epochs and 46.5 AP in detrex?

rentainhe commented 1 year ago

Thanks for the nice work! I'm looking into the performance discrepancy.

I noticed that I cannot reproduce the improved DeformDETR baseline in detrex (47.7 vs 46.5). I'm looking into this right now, will update if I find out why.

Hello @jozhang97 ! When I'm training DETA with the hacked train_net.py and aligned hyper-param, it got 49.9AP with 12epoch s training, I‘m not sure if this counts as successfully reproducing the algorithm results DETA.

jozhang97 commented 1 year ago

Hello @jozhang97 ! When I'm training DETA with the hacked train_net.py and aligned hyper-param, it got 49.9AP with 12epoch s training, I‘m not sure if this counts as successfully reproducing the algorithm results DETA.

This is pretty good! I also feel that DETA should work without hacked train_net.py. Code changes should be simple: as long as its DeformDETR equivalent is reproduced, it should work better. I'm using your nice impl of DETA to take a stab at this. Thanks!

(fyi: one notable difference between the repos is the feature scales. DeformDETR uses res3 to res6 whereas here is res2 to res5: https://github.com/fundamentalvision/Deformable-DETR/blob/11169a60c33333af00a4849f1808023eba96a931/models/deformable_detr.py#L140 Also with 300 predictions instead of standard 100 predictions, I expect performance to be slightly higher. I'm looking into this right now.)

The improved Deform.DETR baseline can achieve 47.7AP in 12epochs and 46.5 AP in detrex?

Yes, in our modified DeformDETR repo we got 47.7AP. This is likely due to one of our changes to the queries/proposals where we use image features (not only box features) as queries/proposals to feed into the second stage. The discrepancy makes sense. Adding this in DeformDETR in detrex gets 47.3. Impl is found here: https://github.com/IDEA-Research/detrex/blob/69b50b89b72f207cb2c402cc999fc5f0bdc82332/projects/deta/modeling/deformable_transformer.py#L491

rentainhe commented 1 year ago

I think the key is that when modifying our config as:

optimizer.params.lr=2e-4
optimizer.params.lr_factor_func = lambda module_name: 0.1 if "backbone" or "sampling_offsets or "reference_points" in module_name else 1

there're more params' lr to be set to 2e-5 in d2's optimizer function, which is not aligned with the official implement. There're something wrong with this config modification but I'm still not figure it out.

When using the hacked train_net.py, I manually set the param_dict and optimizer as:

    # this is an hack of train_net
    param_dicts = [
        {
            "params": [
                p
                for n, p in model.named_parameters()
                if not match_name_keywords(n, ["backbone"])
                and not match_name_keywords(n, ["reference_points", "sampling_offsets"])
                and p.requires_grad
            ],
            "lr": 2e-4,
        },
        {
            "params": [
                p
                for n, p in model.named_parameters()
                if match_name_keywords(n, ["backbone"]) and p.requires_grad
            ],
            "lr": 2e-5,
        },
        {
            "params": [
                p
                for n, p in model.named_parameters()
                if match_name_keywords(n, ["reference_points", "sampling_offsets"])
                and p.requires_grad
            ],
            "lr": 2e-5,
        },
    ]
    optim = torch.optim.AdamW(param_dicts, 2e-4, weight_decay=1e-4)

which is totally aligned with the Deform.DETR's hyper-params.

rentainhe commented 1 year ago

Thank you very much for your correction @jozhang97 ! I believe that the key issue lies in the misalignment of the hyperparameters. I have observed the same phenomenon in H-Deform-DETR.

I set two kinds of hyper-params in detrex:

First one: aligned with DINO, using lr=1e-4 for encoder/decoder/sampling_offsets/reference_points, loss_class_coef=1, lr=1e-5 for backbone
Second one: aligned with official Deform.DETR, using lr=2e-4 for encoder/decoder, loss_class_coef=2.0, lr=2e-5 for backbone/sampling_offsets/reference_points

The first one works better on:

Deform-DETR-50ep: 48.2AP

but not that good in DETA, like only 49.4AP for 12ep training

The second one works better on:

DETA-12ep: 49.9AP (using random seed, sry to not set the training seed lol)
H-Deform-DETR-12ep: 49.1AP in detrex vs 48.9AP in official repo

rentainhe commented 1 year ago

I'm going to merge this PR, feel free to open a new PR to correct us at any time! @jozhang97 Thanks a lot!

rentainhe commented 1 year ago

I think for improved deformable-detr baseline, you can try to set the loss_class to 2.0

jozhang97 commented 1 year ago

@rentainhe Thank you for the heroic effort in hunting down the discrepancies between the repos! I don't have enough time to figure out the last discrepancies, but hopefully their impact is insignificant. Thanks again!

jozhang97 commented 1 year ago

@rentainhe instead of half the batch size, did you try half or twice the learning rate?

rentainhe commented 1 year ago

@rentainhe instead of half the batch size, did you try half or twice the learning rate?

I did some other experiments on DETA and DINO:

lr=2e-4 with hacked train_net for DINO: 49.0AP -> 49.4AP
only frozen the stem of the backbone for DETA: got 50.2AP

All the pretrained checkpoint is released in DETA project, you can check the config for more details. And I will pay my attention to DETA project and to see if there are some better hyper-params for DETA. It's amazing to achieve 50.2AP with only 12ep training now.

@jozhang97

IDEA-Research / detrex

[Feature] Refine DETA and H-Deformable-DETR project #235

TODO

Experiments