AmbiakaTT opened 1 year ago
My DINO config is as follows:

```python
from detrex.config import get_config
from .models.dino_r50 import model

# get default config
dataloader = get_config("common/data/coco_detr.py").dataloader
optimizer = get_config("common/optim.py").AdamW
lr_multiplier = get_config("common/coco_schedule.py").lr_multiplier_12ep
train = get_config("common/train.py").train

# modify training config
train.init_checkpoint = "detectron2://ImageNetPretrained/torchvision/R-50.pkl"
train.output_dir = "./output/dino_r50_4scale_12ep"

# max training iterations
train.max_iter = 90000

# run evaluation every 5000 iters
train.eval_period = 5000

# log training information every 20 iters
train.log_period = 20

# save a checkpoint every 5000 iters
train.checkpointer.period = 5000

# gradient clipping for training
train.clip_grad.enabled = True
train.clip_grad.params.max_norm = 0.1
train.clip_grad.params.norm_type = 2

# set training devices
train.device = "cuda"
model.device = train.device

# modify optimizer config
optimizer.lr = 1e-4
optimizer.betas = (0.9, 0.999)
optimizer.weight_decay = 1e-4
optimizer.params.lr_factor_func = lambda module_name: 0.1 if "backbone" in module_name else 1

# modify dataloader config
dataloader.train.num_workers = 16

# note that this is the total batch size: suppose you're using
# 4 GPUs for training, then the batch size per GPU is 16 / 4 = 4
dataloader.train.total_batch_size = 16

# dump the testing results into output_dir for visualization
dataloader.evaluator.output_dir = train.output_dir
```
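For clarity on the `lr_factor_func` line: it returns a per-module multiplier on `optimizer.lr`, so the backbone trains at 0.1x the base rate. A minimal standalone sketch of the arithmetic (the module names below are hypothetical, only to illustrate the effect):

```python
base_lr = 1e-4
lr_factor_func = lambda module_name: 0.1 if "backbone" in module_name else 1

# hypothetical module names, just to show the multiplier at work
for name in ["backbone.res2.conv1", "transformer.decoder.layers.0"]:
    print(f"{name}: {lr_factor_func(name) * base_lr:.0e}")
# backbone.res2.conv1: 1e-05
# transformer.decoder.layers.0: 1e-04
```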
The training loss is very low:

```
[04/17 06:35:07] d2.utils.events INFO: eta: 0:00:00 iter: 19999 total_loss: 13.36 loss_class: 0.3083 loss_bbox: 0.09793 loss_giou: 0.5997 loss_class_0: 0.4523 loss_bbox_0: 0.08994 loss_giou_0: 0.5179 loss_class_1: 0.3937 loss_bbox_1: 0.0916 loss_giou_1: 0.5637 loss_class_2: 0.3323 loss_bbox_2: 0.08917 loss_giou_2: 0.6124 loss_class_3: 0.3066 loss_bbox_3: 0.09503 loss_giou_3: 0.583 loss_class_4: 0.3075 loss_bbox_4: 0.1018 loss_giou_4: 0.6114 loss_class_enc: 0.49 loss_bbox_enc: 0.08078 loss_giou_enc: 0.5312
```

What could be the reason for this?
Hello! How many GPUs are you using to run this experiment?
Hi, I used 2 GPUs for training with broadcast_buffers=True.
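For context: with 2 GPUs and `dataloader.train.total_batch_size = 16`, each GPU processes 16 / 2 = 8 images per iteration. As for `broadcast_buffers=True`, in plain PyTorch terms it makes `DistributedDataParallel` re-sync module buffers (e.g. BatchNorm running stats) from rank 0 before every forward pass. A minimal sketch of what the flag means, not the actual detrex wrapping code (detrex/detectron2 wrap the model for you; this assumes a launch via `torchrun`, which sets `LOCAL_RANK`):

```python
import os
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

# assumes launch via torchrun, which sets LOCAL_RANK and friends
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])

# toy model with a buffer-carrying layer (BatchNorm running stats)
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8)).to(local_rank)

# broadcast_buffers=True: rank 0's buffers are broadcast to all
# ranks at the start of each forward pass
ddp_model = DDP(model, device_ids=[local_rank], broadcast_buffers=True)
```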
Hi, I have the same problem. Have you solved it?