Open j-rausch opened 2 years ago
I am facing a very similar issue. Did you find a reason for this behaviour, and do you have any suggestions on how to fix it?
I'm still facing the issue. I haven't debugged this in more detail, but judging only by the losses of the three runs, loss_cls appears to differ the most at the beginning of training.
Other issues that were closed in the past (e.g. https://github.com/facebookresearch/detectron2/issues/2480) pointed to PyTorch's non-determinism. Perhaps revisiting them with the new deterministic training flags in PyTorch could give new pointers.
Is there any news or advice on possible reasons for this issue?
Hi, I'm working on an experiment where I noticed large differences between models trained with identical configs and random seeds. I'm trying to understand the causes for this.
I've upgraded to a more recent PyTorch version that introduced flags for deterministic training between multiple executions: https://pytorch.org/docs/1.11/notes/randomness.html?highlight=reproducibility
However, despite using these flags and the most recent detectron2 sources, the final trained models and their validation accuracies can differ greatly on a custom dataset of mine (~2 AP). These differences occur across multiple runs on the same machine (identical device, code, config, random seed).
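For reference, here is a minimal sketch of how the settings described in the PyTorch reproducibility notes are commonly enabled before training. The helper name `enable_determinism` is hypothetical (it is not part of detectron2 or PyTorch) and may not match exactly what was used in the runs reported here. The `torch` and `numpy` imports are guarded only so that the snippet also runs in an environment where they are not installed.

```python
import os
import random


def enable_determinism(seed: int = 42) -> dict:
    """Hypothetical helper: seed the RNGs and request deterministic kernels.

    Returns a dict describing which settings were applied.
    """
    applied = {"seed": seed}

    # cuBLAS requires this variable for deterministic behaviour on CUDA >= 10.2;
    # it should be set before the first CUDA call.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    applied["CUBLAS_WORKSPACE_CONFIG"] = os.environ["CUBLAS_WORKSPACE_CONFIG"]

    # Python's own RNG.
    random.seed(seed)

    # NumPy's global RNG, if NumPy is available.
    try:
        import numpy as np

        np.random.seed(seed)
        applied["numpy"] = True
    except ImportError:
        applied["numpy"] = False

    # PyTorch settings, if PyTorch is available.
    try:
        import torch
    except ImportError:
        applied["torch"] = False
        return applied

    torch.manual_seed(seed)
    # Disable cuDNN autotuning, which may pick different kernels per run.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    # Raise an error when an op has no deterministic implementation.
    torch.use_deterministic_algorithms(True)
    applied["torch"] = True
    return applied


if __name__ == "__main__":
    print(enable_determinism(42))
```

With `torch.use_deterministic_algorithms(True)`, operations that lack a deterministic implementation raise a `RuntimeError`, which can help locate non-deterministic ops in the model.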
I've been looking into reproducing this problem and also observe it with the unaltered detectron2 demo training code. I've added a minimal script to reproduce the training, and I see fairly large differences between the first logged losses of three consecutive runs.
Instructions To Reproduce the Issue:
deterministic_example.py
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer, default_argument_parser, default_setup, launch


def setup(args):
    """
    Create configs and perform basic setups.
    """
    cfg = get_cfg()
    cfg.merge_from_file(args.config_file)
    cfg.merge_from_list(args.opts)
    cfg.freeze()
    default_setup(cfg, args)
    return cfg
def main(args):
if __name__ == "__main__":
    args = default_argument_parser().parse_args()
    print("Command Line Args:", args)
    launch(
        main,
        args.num_gpus,
        num_machines=args.num_machines,
        machine_rank=args.machine_rank,
        dist_url=args.dist_url,
        args=(args,),
    )
git rev-parse HEAD; git diff
e091a07ef573915056f8c2191b774aad0e38d09c
CUDA_VISIBLE_DEVICES=0 python deterministic_example.py --num-gpus 1 --config-file ./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml SOLVER.IMS_PER_BATCH 1 SEED 42 DATALOADER.NUM_WORKERS 1
Command Line Args: Namespace(config_file='./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml', resume=False, eval_only=False, num_gpus=1, num_machines=1, machine_rank=0, dist_url='tcp://127.0.0.1:53650', opts=['SOLVER.IMS_PER_BATCH', '1', 'SEED', '42', 'DATALOADER.NUM_WORKERS', '1'])
[05/23 15:49:06 detectron2]: Rank of current process: 0. World size: 1
[05/23 15:49:08 detectron2]: Environment info:
PyTorch built with:
[05/23 15:49:08 detectron2]: Command line arguments: Namespace(config_file='./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml', resume=False, eval_only=False, num_gpus=1, num_machines=1, machine_rank=0, dist_url='tcp://127.0.0.1:53650', opts=['SOLVER.IMS_PER_BATCH', '1', 'SEED', '42', 'DATALOADER.NUM_WORKERS', '1'])
[05/23 15:49:08 detectron2]: Contents of args.config_file=./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml:
_BASE_: "../Base-RCNN-FPN.yaml"
MODEL:
  WEIGHTS: "detectron2://ImageNetPretrained/MSRA/R-50.pkl"
  MASK_ON: True
  RESNETS:
    DEPTH: 50
  FILTER_EMPTY_ANNOTATIONS: true
  NUM_WORKERS: 1
  REPEAT_THRESHOLD: 0.0
  SAMPLER_TRAIN: TrainingSampler
DATASETS:
  PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000
  PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000
  PROPOSAL_FILES_TEST: []
  PROPOSAL_FILES_TRAIN: []
  TEST:
  1.0
  PROPOSAL_GENERATOR:
    MIN_SIZE: 0
    NAME: RPN
  RESNETS:
    DEFORM_MODULATED: false
    DEFORM_NUM_GROUPS: 1
    DEFORM_ON_PER_STAGE:
[05/23 15:49:08 detectron2]: Full config saved to ./output/config.yaml
backbone.fpn_output2.{bias, weight}
backbone.fpn_output3.{bias, weight}
backbone.fpn_output4.{bias, weight}
backbone.fpn_output5.{bias, weight}
proposal_generator.rpn_head.anchor_deltas.{bias, weight}
proposal_generator.rpn_head.conv.{bias, weight}
proposal_generator.rpn_head.objectness_logits.{bias, weight}
roi_heads.box_head.fc1.{bias, weight}
roi_heads.box_head.fc2.{bias, weight}
roi_heads.box_predictor.bbox_pred.{bias, weight}
roi_heads.box_predictor.cls_score.{bias, weight}
roi_heads.mask_head.deconv.{bias, weight}
roi_heads.mask_head.mask_fcn1.{bias, weight}
roi_heads.mask_head.mask_fcn2.{bias, weight}
roi_heads.mask_head.mask_fcn3.{bias, weight}
roi_heads.mask_head.mask_fcn4.{bias, weight}
roi_heads.mask_head.predictor.{bias, weight}
WARNING [05/23 15:50:04 fvcore.common.checkpoint]: The checkpoint state_dict contains keys that are not used by the model:
fc1000.{bias, weight}
stem.conv1.bias
[05/23 15:50:04 d2.engine.train_loop]: Starting training from iteration 0
/rootpath/anaconda3/envs/sgg_torch111_detectron06/lib/python3.10/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2228.)
  return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]
[05/23 15:50:12 d2.utils.events]: eta: 7:44:48 iter: 19 total_loss: 2.345 loss_cls: 0.5814 loss_box_reg: 0.01275 loss_mask: 0.6936 loss_rpn_cls: 0.6719 loss_rpn_loc: 0.0908 time: 0.3151 data_time: 0.0139 lr: 0.00039962 max_mem: 1481M
[05/23 15:50:19 d2.utils.events]: eta: 8:08:10 iter: 39 total_loss: 1.601 loss_cls: 0.4312 loss_box_reg: 0.04747 loss_mask: 0.6906 loss_rpn_cls: 0.4376 loss_rpn_loc: 0.0764 time: 0.3254 data_time: 0.0026 lr: 0.00079922 max_mem: 1481M
[05/23 15:50:26 d2.utils.events]: eta: 8:17:54 iter: 59 total_loss: 1.641 loss_cls: 0.4153 loss_box_reg: 0.09799 loss_mask: 0.691 loss_rpn_cls: 0.3649 loss_rpn_loc: 0.1253 time: 0.3259 data_time: 0.0028 lr: 0.0011988 max_mem: 1481M
[05/23 15:50:32 d2.utils.events]: eta: 8:20:12 iter: 79 total_loss: 1.439 loss_cls: 0.3282 loss_box_reg: 0.09175 loss_mask: 0.6924 loss_rpn_cls: 0.2477 loss_rpn_loc: 0.05234 time: 0.3288 data_time: 0.0027 lr: 0.0015984 max_mem: 1481M
[05/23 15:50:39 d2.utils.events]: eta: 8:20:06 iter: 99 total_loss: 1.285 loss_cls: 0.2667 loss_box_reg: 0.1191 loss_mask: 0.6891 loss_rpn_cls: 0.154 loss_rpn_loc: 0.05424 time: 0.3274 data_time: 0.0025 lr: 0.001998 max_mem: 1481M
[05/23 15:50:45 d2.utils.events]: eta: 8:15:39 iter: 119 total_loss: 1.52 loss_cls: 0.346 loss_box_reg: 0.1504 loss_mask: 0.6818 loss_rpn_cls: 0.2181 loss_rpn_loc: 0.09391 time: 0.3256 data_time: 0.0025 lr: 0.0023976 max_mem: 1481M
[05/23 15:50:51 d2.utils.events]: eta: 8:12:57 iter: 139 total_loss: 1.546 loss_cls: 0.2511 loss_box_reg: 0.1242 loss_mask: 0.6869 loss_rpn_cls: 0.2738 loss_rpn_loc: 0.04643 time: 0.3242 data_time: 0.0027 lr: 0.0027972 max_mem: 1481M
[05/23 15:50:58 d2.utils.events]: eta: 8:12:51 iter: 159 total_loss: 1.687 loss_cls: 0.3452 loss_box_reg: 0.09927 loss_mask: 0.6778 loss_rpn_cls: 0.2546 loss_rpn_loc: 0.1271 time: 0.3253 data_time: 0.0028 lr: 0.0031968 max_mem: 1481M
[05/23 15:51:05 d2.utils.events]: eta: 8:15:19 iter: 179 total_loss: 1.557 loss_cls: 0.4099 loss_box_reg: 0.1837 loss_mask: 0.6872 loss_rpn_cls: 0.1388 loss_rpn_loc: 0.06568 time: 0.3271 data_time: 0.0027 lr: 0.0035964 max_mem: 1481M
[05/23 15:51:12 d2.utils.events]: eta: 8:16:06 iter: 199 total_loss: 1.931 loss_cls: 0.5021 loss_box_reg: 0.2378 loss_mask: 0.6843 loss_rpn_cls: 0.2495 loss_rpn_loc: 0.1568 time: 0.3284 data_time: 0.0035 lr: 0.003996 max_mem: 1481M
[05/23 15:52:57 d2.utils.events]: eta: 7:49:54 iter: 19 total_loss: 2.349 loss_cls: 0.5801 loss_box_reg: 0.01275 loss_mask: 0.6936 loss_rpn_cls: 0.6719 loss_rpn_loc: 0.09081 time: 0.3190 data_time: 0.0176 lr: 0.00039962 max_mem: 1481M
[05/23 15:53:04 d2.utils.events]: eta: 8:10:18 iter: 39 total_loss: 1.603 loss_cls: 0.4004 loss_box_reg: 0.04758 loss_mask: 0.6906 loss_rpn_cls: 0.4404 loss_rpn_loc: 0.07629 time: 0.3276 data_time: 0.0025 lr: 0.00079922 max_mem: 1481M
[05/23 15:53:10 d2.utils.events]: eta: 8:19:58 iter: 59 total_loss: 1.646 loss_cls: 0.4176 loss_box_reg: 0.1167 loss_mask: 0.6912 loss_rpn_cls: 0.3633 loss_rpn_loc: 0.1252 time: 0.3274 data_time: 0.0026 lr: 0.0011988 max_mem: 1481M
[05/23 15:53:17 d2.utils.events]: eta: 8:21:51 iter: 79 total_loss: 1.428 loss_cls: 0.299 loss_box_reg: 0.0902 loss_mask: 0.6921 loss_rpn_cls: 0.2449 loss_rpn_loc: 0.05256 time: 0.3296 data_time: 0.0026 lr: 0.0015984 max_mem: 1481M
[05/23 15:53:23 d2.utils.events]: eta: 8:21:44 iter: 99 total_loss: 1.319 loss_cls: 0.2876 loss_box_reg: 0.1062 loss_mask: 0.6898 loss_rpn_cls: 0.1512 loss_rpn_loc: 0.05531 time: 0.3289 data_time: 0.0027 lr: 0.001998 max_mem: 1481M
[05/23 15:53:30 d2.utils.events]: eta: 8:17:13 iter: 119 total_loss: 1.441 loss_cls: 0.28 loss_box_reg: 0.1317 loss_mask: 0.6835 loss_rpn_cls: 0.2149 loss_rpn_loc: 0.09209 time: 0.3274 data_time: 0.0025 lr: 0.0023976 max_mem: 1481M
[05/23 15:53:36 d2.utils.events]: eta: 8:15:03 iter: 139 total_loss: 1.496 loss_cls: 0.272 loss_box_reg: 0.1103 loss_mask: 0.6876 loss_rpn_cls: 0.2564 loss_rpn_loc: 0.04832 time: 0.3262 data_time: 0.0025 lr: 0.0027972 max_mem: 1481M
[05/23 15:53:43 d2.utils.events]: eta: 8:14:56 iter: 159 total_loss: 1.737 loss_cls: 0.3486 loss_box_reg: 0.06897 loss_mask: 0.678 loss_rpn_cls: 0.2603 loss_rpn_loc: 0.1359 time: 0.3266 data_time: 0.0025 lr: 0.0031968 max_mem: 1481M
[05/23 15:53:49 d2.utils.events]: eta: 8:16:21 iter: 179 total_loss: 1.525 loss_cls: 0.3834 loss_box_reg: 0.1672 loss_mask: 0.6877 loss_rpn_cls: 0.1623 loss_rpn_loc: 0.08118 time: 0.3272 data_time: 0.0026 lr: 0.0035964 max_mem: 1481M
[05/23 15:53:56 d2.utils.events]: eta: 8:16:14 iter: 199 total_loss: 1.598 loss_cls: 0.3331 loss_box_reg: 0.1141 loss_mask: 0.6792 loss_rpn_cls: 0.2563 loss_rpn_loc: 0.1831 time: 0.3270 data_time: 0.0026 lr: 0.003996 max_mem: 1481M
[05/23 15:56:10 d2.utils.events]: eta: 7:45:39 iter: 19 total_loss: 2.348 loss_cls: 0.5763 loss_box_reg: 0.01275 loss_mask: 0.6936 loss_rpn_cls: 0.6719 loss_rpn_loc: 0.0908 time: 0.3167 data_time: 0.0122 lr: 0.00039962 max_mem: 1481M
[05/23 15:56:16 d2.utils.events]: eta: 8:10:26 iter: 39 total_loss: 1.605 loss_cls: 0.3891 loss_box_reg: 0.04755 loss_mask: 0.6906 loss_rpn_cls: 0.4403 loss_rpn_loc: 0.07635 time: 0.3277 data_time: 0.0027 lr: 0.00079922 max_mem: 1481M
[05/23 15:56:23 d2.utils.events]: eta: 8:23:04 iter: 59 total_loss: 1.679 loss_cls: 0.4163 loss_box_reg: 0.1102 loss_mask: 0.6912 loss_rpn_cls: 0.3563 loss_rpn_loc: 0.1251 time: 0.3293 data_time: 0.0031 lr: 0.0011988 max_mem: 1481M
[05/23 15:56:30 d2.utils.events]: eta: 8:21:28 iter: 79 total_loss: 1.433 loss_cls: 0.3133 loss_box_reg: 0.07978 loss_mask: 0.6921 loss_rpn_cls: 0.2468 loss_rpn_loc: 0.05257 time: 0.3303 data_time: 0.0028 lr: 0.0015984 max_mem: 1481M
[05/23 15:56:36 d2.utils.events]: eta: 8:22:50 iter: 99 total_loss: 1.317 loss_cls: 0.2764 loss_box_reg: 0.1469 loss_mask: 0.6895 loss_rpn_cls: 0.1487 loss_rpn_loc: 0.05474 time: 0.3291 data_time: 0.0027 lr: 0.001998 max_mem: 1481M
[05/23 15:56:43 d2.utils.events]: eta: 8:20:03 iter: 119 total_loss: 1.455 loss_cls: 0.3264 loss_box_reg: 0.1456 loss_mask: 0.6827 loss_rpn_cls: 0.209 loss_rpn_loc: 0.09486 time: 0.3281 data_time: 0.0030 lr: 0.0023976 max_mem: 1481M
[05/23 15:56:49 d2.utils.events]: eta: 8:16:57 iter: 139 total_loss: 1.475 loss_cls: 0.2835 loss_box_reg: 0.09706 loss_mask: 0.6861 loss_rpn_cls: 0.2541 loss_rpn_loc: 0.04725 time: 0.3260 data_time: 0.0027 lr: 0.0027972 max_mem: 1481M
[05/23 15:56:56 d2.utils.events]: eta: 8:18:19 iter: 159 total_loss: 1.675 loss_cls: 0.3287 loss_box_reg: 0.1219 loss_mask: 0.6776 loss_rpn_cls: 0.2344 loss_rpn_loc: 0.1299 time: 0.3269 data_time: 0.0028 lr: 0.0031968 max_mem: 1481M
[05/23 15:57:02 d2.utils.events]: eta: 8:19:43 iter: 179 total_loss: 1.568 loss_cls: 0.4459 loss_box_reg: 0.1866 loss_mask: 0.6875 loss_rpn_cls: 0.124 loss_rpn_loc: 0.06825 time: 0.3279 data_time: 0.0027 lr: 0.0035964 max_mem: 1481M
[05/23 15:57:09 d2.utils.events]: eta: 8:19:37 iter: 199 total_loss: 1.803 loss_cls: 0.4938 loss_box_reg: 0.1835 loss_mask: 0.6884 loss_rpn_cls: 0.2585 loss_rpn_loc: 0.1701 time: 0.3281 data_time: 0.0029 lr: 0.003996 max_mem: 1481M
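To compare runs like these without reading the numbers by eye, the logged losses can be parsed and the per-iteration spread across runs computed. The following is a rough sketch using only the standard library; `parse_losses` and `loss_spread` are hypothetical helper names, and the sample strings are the first two entries of each run above, truncated after `loss_cls`.

```python
import re

ITER_RE = re.compile(r"iter:\s*(\d+)")
LOSS_RE = re.compile(r"\b(total_loss|loss_\w+):\s*(\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)")
# Zero-width split point in front of every "[MM/DD HH:MM:SS " timestamp, so
# logs with several entries flattened onto one line are handled as well.
ENTRY_SPLIT_RE = re.compile(r"(?=\[\d{2}/\d{2} \d{2}:\d{2}:\d{2} )")


def parse_losses(log_text: str) -> dict:
    """Return {iteration: {loss_name: value}} for one run's log text."""
    losses = {}
    for entry in ENTRY_SPLIT_RE.split(log_text):
        match = ITER_RE.search(entry)
        if match is None:
            continue  # not a d2.utils.events metrics entry
        losses[int(match.group(1))] = {
            name: float(value) for name, value in LOSS_RE.findall(entry)
        }
    return losses


def loss_spread(runs: list, name: str) -> dict:
    """Return {iteration: max - min of `name` across runs} for shared iterations."""
    shared = set(runs[0]).intersection(*runs[1:])
    return {
        it: max(r[it][name] for r in runs) - min(r[it][name] for r in runs)
        for it in sorted(shared)
    }


# Truncated excerpts from the three runs above (first two entries each).
RUN_1 = (
    "[05/23 15:50:12 d2.utils.events]: eta: 7:44:48 iter: 19 total_loss: 2.345 loss_cls: 0.5814 "
    "[05/23 15:50:19 d2.utils.events]: eta: 8:08:10 iter: 39 total_loss: 1.601 loss_cls: 0.4312"
)
RUN_2 = (
    "[05/23 15:52:57 d2.utils.events]: eta: 7:49:54 iter: 19 total_loss: 2.349 loss_cls: 0.5801 "
    "[05/23 15:53:04 d2.utils.events]: eta: 8:10:18 iter: 39 total_loss: 1.603 loss_cls: 0.4004"
)
RUN_3 = (
    "[05/23 15:56:10 d2.utils.events]: eta: 7:45:39 iter: 19 total_loss: 2.348 loss_cls: 0.5763 "
    "[05/23 15:56:16 d2.utils.events]: eta: 8:10:26 iter: 39 total_loss: 1.605 loss_cls: 0.3891"
)

if __name__ == "__main__":
    runs = [parse_losses(text) for text in (RUN_1, RUN_2, RUN_3)]
    for iteration, spread in loss_spread(runs, "loss_cls").items():
        print(f"iter {iteration}: loss_cls spread across runs = {spread:.4f}")
```

In the logs above, the first entry (iter 19) already shows the pattern: loss_box_reg, loss_mask and loss_rpn_cls are identical across the three runs, while loss_cls differs (0.5814, 0.5801, 0.5763).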