IDEA-Research / detrex

detrex is a research platform for DETR-based object detection, segmentation, pose estimation and other visual recognition tasks.
https://detrex.readthedocs.io/en/latest/
Apache License 2.0

Convergence problem of DAB-DETR on COCO #222

Closed: ustcwhy closed this issue 1 year ago

ustcwhy commented 1 year ago

Thanks for your wonderful work. I am trying to reproduce the results of DAB-DETR on COCO 2017 using detrex. I train dab_detr_r50_50ep with the default settings on 2 RTX 3090 GPUs, with a batch size of 8 per GPU. However, the model diverges very easily. I tried learning rates of 1e-4 (the default setting), 5e-5, 3e-5, and 1e-5; only the run with 1e-5 converges. Could you give me some suggestions?

My setting is as follows:

from detrex.config import get_config
from .models.dab_detr_r50 import model
from fvcore.common.param_scheduler import MultiStepParamScheduler, LinearParamScheduler

from detectron2.config import LazyCall as L
from detectron2.solver import WarmupParamScheduler

# optimizer
optimizer = get_config("common/optim.py").AdamW
optimizer.lr = 3e-5
optimizer.betas = (0.9, 0.999)
optimizer.weight_decay = 1e-4
optimizer.params.lr_factor_func = lambda module_name: 0.1 if "backbone" in module_name else 1

# schedule lengths
acc_grad_iter = 1
total_steps = 375000 * acc_grad_iter
decay_steps = 300000 * acc_grad_iter
warmup_steps = 0 * acc_grad_iter
warmup_factor = 0.001

# learning rate multiplier
lr_multiplier = L(WarmupParamScheduler)(
    scheduler=L(MultiStepParamScheduler)(
        values=[1.0, 0.1],
        milestones=[decay_steps, total_steps],
    ),
    warmup_length=warmup_steps / total_steps,
    warmup_method="linear",
    warmup_factor=warmup_factor,
)

# dataloader
dataloader = get_config("common/data/coco_detr.py").dataloader
dataloader.train.num_workers = 4
dataloader.train.total_batch_size = 8

# training settings
train = get_config("common/train.py").train
train.init_checkpoint = "detectron2://ImageNetPretrained/torchvision/R-50.pkl"
train.tensorboard_dir = f"{train.output_dir}/tb-logs"
train.acc_grad_iter = acc_grad_iter
train.ddp.fp16_compression = True
train.amp.enabled = True
train.log_period = 20 * train.acc_grad_iter
train.max_iter = total_steps
train.eval_period = 5000 * train.acc_grad_iter
train.checkpointer.period = 5000 * train.acc_grad_iter
train.seed = 42

train.clip_grad.enabled = True
train.clip_grad.params.max_norm = 0.1
train.clip_grad.params.norm_type = 2

train.device = "cuda"
model.device = train.device

dataloader.evaluator.output_dir = train.output_dir
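
For reference, a config file like this is normally launched through detrex's training script; an illustrative 2-GPU command, where the config file path is an assumption, would be:

python tools/train_net.py --config-file projects/dab_detr/configs/dab_detr_r50_2gpu.py --num-gpus 2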

rentainhe commented 1 year ago

Maybe you can try setting the total batch size to 4, which means 2 for each GPU, and using a longer 4x training schedule (4x the training iterations).
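
For reference, a minimal sketch of those two suggested changes applied on top of the config above; the 4x step counts below are illustrative, scaled from the 375k-iteration schedule in the original post:

# Illustrative only: halve the total batch size and stretch the schedule 4x.
dataloader.train.total_batch_size = 4  # 2 images per GPU on 2 GPUs

total_steps = 375000 * 4  # 4x the schedule from the original post
decay_steps = 300000 * 4
train.max_iter = total_steps
lr_multiplier.scheduler.milestones = [decay_steps, total_steps]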

rentainhe commented 1 year ago

To the best of my knowledge, we have not previously tested this situation. In the original detr repository, the batch size is set for each GPU rather than for the total batch size. Therefore, when using fewer GPUs, the training process requires more iterations.

We will test the situation you described later and summarize our engineering experience.
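
To make "more iterations with fewer GPUs" concrete, here is a rough back-of-the-envelope sketch; the 8 GPUs x 2 images per GPU reference comes from the original DETR setup, while the COCO image count and the 2-GPU example below are illustrative assumptions:

# Keep the number of epochs (total images seen) fixed and scale the iteration
# count inversely with the effective (total) batch size. Numbers are illustrative.
coco_train_images = 118_287
epochs = 50

reference_batch = 8 * 2   # original DETR: 8 GPUs x 2 images per GPU = 16 total
small_batch = 2 * 2       # e.g. 2 GPUs x 2 images per GPU = 4 total

reference_iters = epochs * coco_train_images // reference_batch  # ~370k iterations
small_batch_iters = epochs * coco_train_images // small_batch    # ~1.48M iterations, i.e. 4x longer
print(reference_iters, small_batch_iters)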

ustcwhy commented 1 year ago

Thanks for your advice. I will try later ~

ustcwhy commented 1 year ago

I tried settings with a total batch size of 4 on 2 GPUs, 8 on 4 GPUs, and 16 on 4 GPUs. All of them diverge in the early stage of training (before 15k steps)...

ustcwhy commented 1 year ago

I resolved the convergence problem by increasing the total batch size from 16 to 32.
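
For completeness, the reported fix amounts to a one-line change relative to the config above; how the 32 images are split across GPUs, and whether the learning rate or schedule were also adjusted, is not stated in this thread:

# Reported fix: raise the effective (total) batch size; everything else stays as above.
dataloader.train.total_batch_size = 32

In this setup, the larger effective batch size is what stabilized the early stage of training where the divergence had occurred.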