facebookresearch / detectron2

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
https://detectron2.readthedocs.io/en/latest/
Apache License 2.0

Repeated training not deterministic despite identical setup and reproducibility flags #4260

Open j-rausch opened 2 years ago

j-rausch commented 2 years ago

Hi, I'm working on an experiment in which I noticed large differences between models trained with identical configs and random seeds, and I'm trying to understand the causes.

I've upgraded to a more recent PyTorch version that introduced flags for deterministic training across multiple executions: https://pytorch.org/docs/1.11/notes/randomness.html?highlight=reproducibility

However, despite using these flags and the most recent detectron2 sources, the final trained models and their validation accuracies can differ greatly on a custom dataset of mine (~2 AP). These differences occur across multiple runs on the same machine (identical device, code, config, and random seed).

While looking into reproducing this problem, I observed it with the unaltered detectron2 demo training code as well. Below is a minimal script that reproduces the training; note the rather large differences between the first logged losses of three consecutive runs.

Instructions To Reproduce the Issue:

  1. Full runnable code or full changes you made: script to reproduce the experiment (deterministic_example.py)
    
    import os
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    import torch
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True)

    from detectron2.config import get_cfg
    from detectron2.engine import DefaultTrainer, default_argument_parser, default_setup, launch


    def setup(args):
        """
        Create configs and perform basic setups.
        """
        cfg = get_cfg()
        cfg.merge_from_file(args.config_file)
        cfg.merge_from_list(args.opts)
        cfg.freeze()
        default_setup(cfg, args)
        return cfg


    def main(args):
        cfg = setup(args)
        trainer = DefaultTrainer(cfg)
        trainer.resume_or_load(resume=False)
        return trainer.train()


    if __name__ == "__main__":
        args = default_argument_parser().parse_args()
        print("Command Line Args:", args)
        launch(
            main,
            args.num_gpus,
            num_machines=args.num_machines,
            machine_rank=args.machine_rank,
            dist_url=args.dist_url,
            args=(args,),
        )

git rev-parse HEAD; git diff e091a07ef573915056f8c2191b774aad0e38d09c
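(Not part of the original script: the PyTorch reproducibility notes linked above additionally recommend seeding Python/NumPy and the DataLoader workers. detectron2's SEED config and its own data loader may already cover this, so the following is only a minimal sketch of that recipe for a plain torch DataLoader; `my_dataset` is a placeholder.)

    import random

    import numpy as np
    import torch
    from torch.utils.data import DataLoader

    SEED = 42
    random.seed(SEED)
    np.random.seed(SEED)
    torch.manual_seed(SEED)  # also seeds all CUDA devices

    def seed_worker(worker_id):
        # Each DataLoader worker derives its seed from the torch seed;
        # reseed the libraries that PyTorch's worker seeding does not cover.
        worker_seed = torch.initial_seed() % 2**32
        np.random.seed(worker_seed)
        random.seed(worker_seed)

    g = torch.Generator()
    g.manual_seed(SEED)

    # loader = DataLoader(my_dataset, batch_size=1, num_workers=1,
    #                     worker_init_fn=seed_worker, generator=g)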

2. What exact command you run:

CUDA_VISIBLE_DEVICES=0 python deterministic_example.py --num-gpus 1 --config-file ./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml SOLVER.IMS_PER_BATCH 1 SEED 42 DATALOADER.NUM_WORKERS 1


3. Full logs or other relevant observations:

Command Line Args: Namespace(config_file='./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml', resume=False, eval_only=False, num_gpus=1, num_machines=1, machine_rank=0, dist_url='tcp://127.0.0.1:53650', opts=['SOLVER.IMS_PER_BATCH', '1', 'SEED', '42', 'DATALOADER.NUM_WORKERS', '1'])
[05/23 15:49:06 detectron2]: Rank of current process: 0. World size: 1
[05/23 15:49:08 detectron2]: Environment info:


sys.platform            linux
Python                  3.10.4 packaged by conda-forge (main, Mar 24 2022, 17:39:04) [GCC 10.3.0]
numpy                   1.22.3
detectron2              0.6 @/rootpath/git/detectron2/detectron2
Compiler                GCC 9.3
CUDA compiler           CUDA 11.5
detectron2 arch flags   6.1
DETECTRON2_ENV_MODULE
PyTorch                 1.11.0+cu115 @/rootpath/anaconda3/envs/sgg_torch111_detectron06/lib/python3.10/site-packages/torch
PyTorch debug build     False
GPU available           Yes
GPU 0                   NVIDIA TITAN Xp (arch=6.1)
Driver version          510.47.03
CUDA_HOME               /usr/local/cuda-11.5
Pillow                  9.1.0
torchvision             0.12.0+cu115 @/rootpath/anaconda3/envs/sgg_torch111_detectron06/lib/python3.10/site-packages/torchvision
torchvision arch flags  3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore                  0.1.5.post20220504
iopath                  0.1.9
cv2                     4.5.5

PyTorch built with:

[05/23 15:49:08 detectron2]: Command line arguments: Namespace(config_file='./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml', resume=False, eval_only=False, num_gpus=1, num_machines=1, machine_rank=0, dist_url='tcp://127.0.0.1:53650', opts=['SOLVER.IMS_PER_BATCH', '1', 'SEED', '42', 'DATALOADER.NUM_WORKERS', '1'])
[05/23 15:49:08 detectron2]: Contents of args.config_file=./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml:
_BASE_: "../Base-RCNN-FPN.yaml"
MODEL:
  WEIGHTS: "detectron2://ImageNetPretrained/MSRA/R-50.pkl"
  MASK_ON: True
  RESNETS:
    DEPTH: 50

[excerpt of the full config dump, truncated]
DATALOADER:
  FILTER_EMPTY_ANNOTATIONS: true
  NUM_WORKERS: 1
  REPEAT_THRESHOLD: 0.0
  SAMPLER_TRAIN: TrainingSampler
DATASETS:
  PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000
  PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000
  PROPOSAL_FILES_TEST: []
  PROPOSAL_FILES_TRAIN: []
  TEST:

[05/23 15:49:08 detectron2]: Full config saved to ./output/config.yaml

[model architecture printout truncated. The model is the standard GeneralizedRCNN: a ResNet-50+FPN backbone built from BottleneckBlocks with FrozenBatchNorm2d, an RPN with StandardRPNHead and DefaultAnchorGenerator, and StandardROIHeads with ROIAlign box/mask poolers, a FastRCNNConvFCHead box head, FastRCNNOutputLayers, and a MaskRCNNConvUpsampleHead.]

[05/23 15:49:30 d2.data.datasets.coco]: Loading datasets/coco/annotations/instances_train2017.json takes 18.03 seconds.
[05/23 15:49:31 d2.data.datasets.coco]: Loaded 118287 images in COCO format from datasets/coco/annotations/instances_train2017.json
[05/23 15:49:37 d2.data.build]: Removed 1021 images with no usable annotations. 117266 images left.
[05/23 15:49:43 d2.data.build]: Distribution of instances among all 80 categories:
category        #instances   category        #instances   category        #instances
person 257253 bicycle 7056 car 43533
motorcycle 8654 airplane 5129 bus 6061
train 4570 truck 9970 boat 10576
traffic light 12842 fire hydrant 1865 stop sign 1983
parking meter 1283 bench 9820 bird 10542
cat 4766 dog 5500 horse 6567
sheep 9223 cow 8014 elephant 5484
bear 1294 zebra 5269 giraffe 5128
backpack 8714 umbrella 11265 handbag 12342
tie 6448 suitcase 6112 frisbee 2681
skis 6623 snowboard 2681 sports ball 6299
kite 8802 baseball bat 3273 baseball gl.. 3747
skateboard 5536 surfboard 6095 tennis racket 4807
bottle 24070 wine glass 7839 cup 20574
fork 5474 knife 7760 spoon 6159
bowl 14323 banana 9195 apple 5776
sandwich 4356 orange 6302 broccoli 7261
carrot 7758 hot dog 2884 pizza 5807
donut 7005 cake 6296 chair 38073
couch 5779 potted plant 8631 bed 4192
dining table 15695 toilet 4149 tv 5803
laptop 4960 mouse 2261 remote 5700
keyboard 2854 cell phone 6422 microwave 1672
oven 3334 toaster 225 sink 5609
refrigerator 2634 book 24077 clock 6320
vase 6577 scissors 1464 teddy bear 4729
hair drier 198 toothbrush 1945
total 849949
[05/23 15:49:43 d2.data.dataset_mapper]: [DatasetMapper] Augmentations used in training: [ResizeShortestEdge(short_edge_length=(640, 672, 704, 736, 768, 800), max_size=1333, sample_style='choice'), RandomFlip()]
[05/23 15:49:43 d2.data.build]: Using training sampler TrainingSampler
[05/23 15:49:43 d2.data.common]: Serializing 117266 elements to byte tensors and concatenating them all ...
[05/23 15:49:47 d2.data.common]: Serialized dataset takes 451.21 MiB
[05/23 15:50:04 fvcore.common.checkpoint]: [Checkpointer] Loading from detectron2://ImageNetPretrained/MSRA/R-50.pkl ...
[05/23 15:50:04 d2.checkpoint.c2_model_loading]: Renaming Caffe2 weights ......
[05/23 15:50:04 d2.checkpoint.c2_model_loading]: Following weights matched with submodule backbone.bottom_up:
Names in Model    Names in Checkpoint    Shapes
res2.0.conv1.* res2_0_branch2a_{bn_*,w} (64,) (64,) (64,) (64,) (64,64,1,1)
res2.0.conv2.* res2_0_branch2b_{bn_*,w} (64,) (64,) (64,) (64,) (64,64,3,3)
res2.0.conv3.* res2_0_branch2c_{bn_*,w} (256,) (256,) (256,) (256,) (256,64,1,1)
res2.0.shortcut.* res2_0_branch1_{bn_*,w} (256,) (256,) (256,) (256,) (256,64,1,1)
res2.1.conv1.* res2_1_branch2a_{bn_*,w} (64,) (64,) (64,) (64,) (64,256,1,1)
res2.1.conv2.* res2_1_branch2b_{bn_*,w} (64,) (64,) (64,) (64,) (64,64,3,3)
res2.1.conv3.* res2_1_branch2c_{bn_*,w} (256,) (256,) (256,) (256,) (256,64,1,1)
res2.2.conv1.* res2_2_branch2a_{bn_*,w} (64,) (64,) (64,) (64,) (64,256,1,1)
res2.2.conv2.* res2_2_branch2b_{bn_*,w} (64,) (64,) (64,) (64,) (64,64,3,3)
res2.2.conv3.* res2_2_branch2c_{bn_*,w} (256,) (256,) (256,) (256,) (256,64,1,1)
res3.0.conv1.* res3_0_branch2a_{bn_*,w} (128,) (128,) (128,) (128,) (128,256,1,1)
res3.0.conv2.* res3_0_branch2b_{bn_*,w} (128,) (128,) (128,) (128,) (128,128,3,3)
res3.0.conv3.* res3_0_branch2c_{bn_*,w} (512,) (512,) (512,) (512,) (512,128,1,1)
res3.0.shortcut.* res3_0_branch1_{bn_*,w} (512,) (512,) (512,) (512,) (512,256,1,1)
res3.1.conv1.* res3_1_branch2a_{bn_*,w} (128,) (128,) (128,) (128,) (128,512,1,1)
res3.1.conv2.* res3_1_branch2b_{bn_*,w} (128,) (128,) (128,) (128,) (128,128,3,3)
res3.1.conv3.* res3_1_branch2c_{bn_*,w} (512,) (512,) (512,) (512,) (512,128,1,1)
res3.2.conv1.* res3_2_branch2a_{bn_*,w} (128,) (128,) (128,) (128,) (128,512,1,1)
res3.2.conv2.* res3_2_branch2b_{bn_*,w} (128,) (128,) (128,) (128,) (128,128,3,3)
res3.2.conv3.* res3_2_branch2c_{bn_*,w} (512,) (512,) (512,) (512,) (512,128,1,1)
res3.3.conv1.* res3_3_branch2a_{bn_*,w} (128,) (128,) (128,) (128,) (128,512,1,1)
res3.3.conv2.* res3_3_branch2b_{bn_*,w} (128,) (128,) (128,) (128,) (128,128,3,3)
res3.3.conv3.* res3_3_branch2c_{bn_*,w} (512,) (512,) (512,) (512,) (512,128,1,1)
res4.0.conv1.* res4_0_branch2a_{bn_*,w} (256,) (256,) (256,) (256,) (256,512,1,1)
res4.0.conv2.* res4_0_branch2b_{bn_*,w} (256,) (256,) (256,) (256,) (256,256,3,3)
res4.0.conv3.* res4_0_branch2c_{bn_*,w} (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.0.shortcut.* res4_0_branch1_{bn_*,w} (1024,) (1024,) (1024,) (1024,) (1024,512,1,1)
res4.1.conv1.* res4_1_branch2a_{bn_*,w} (256,) (256,) (256,) (256,) (256,1024,1,1)
res4.1.conv2.* res4_1_branch2b_{bn_*,w} (256,) (256,) (256,) (256,) (256,256,3,3)
res4.1.conv3.* res4_1_branch2c_{bn_*,w} (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.2.conv1.* res4_2_branch2a_{bn_*,w} (256,) (256,) (256,) (256,) (256,1024,1,1)
res4.2.conv2.* res4_2_branch2b_{bn_*,w} (256,) (256,) (256,) (256,) (256,256,3,3)
res4.2.conv3.* res4_2_branch2c_{bn_*,w} (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.3.conv1.* res4_3_branch2a_{bn_*,w} (256,) (256,) (256,) (256,) (256,1024,1,1)
res4.3.conv2.* res4_3_branch2b_{bn_*,w} (256,) (256,) (256,) (256,) (256,256,3,3)
res4.3.conv3.* res4_3_branch2c_{bn_*,w} (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.4.conv1.* res4_4_branch2a_{bn_*,w} (256,) (256,) (256,) (256,) (256,1024,1,1)
res4.4.conv2.* res4_4_branch2b_{bn_*,w} (256,) (256,) (256,) (256,) (256,256,3,3)
res4.4.conv3.* res4_4_branch2c_{bn_*,w} (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)

backbone.fpn_output2.{bias, weight}
backbone.fpn_output3.{bias, weight}
backbone.fpn_output4.{bias, weight}
backbone.fpn_output5.{bias, weight}
proposal_generator.rpn_head.anchor_deltas.{bias, weight}
proposal_generator.rpn_head.conv.{bias, weight}
proposal_generator.rpn_head.objectness_logits.{bias, weight}
roi_heads.box_head.fc1.{bias, weight}
roi_heads.box_head.fc2.{bias, weight}
roi_heads.box_predictor.bbox_pred.{bias, weight}
roi_heads.box_predictor.cls_score.{bias, weight}
roi_heads.mask_head.deconv.{bias, weight}
roi_heads.mask_head.mask_fcn1.{bias, weight}
roi_heads.mask_head.mask_fcn2.{bias, weight}
roi_heads.mask_head.mask_fcn3.{bias, weight}
roi_heads.mask_head.mask_fcn4.{bias, weight}
roi_heads.mask_head.predictor.{bias, weight}
WARNING [05/23 15:50:04 fvcore.common.checkpoint]: The checkpoint state_dict contains keys that are not used by the model:
fc1000.{bias, weight}
stem.conv1.bias
[05/23 15:50:04 d2.engine.train_loop]: Starting training from iteration 0
/rootpath/anaconda3/envs/sgg_torch111_detectron06/lib/python3.10/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2228.)
return _VF.meshgrid(tensors, **kwargs) # type: ignore[attr-defined]

run1:

[05/23 15:50:12 d2.utils.events]: eta: 7:44:48 iter: 19 total_loss: 2.345 loss_cls: 0.5814 loss_box_reg: 0.01275 loss_mask: 0.6936 loss_rpn_cls: 0.6719 loss_rpn_loc: 0.0908 time: 0.3151 data_time: 0.0139 lr: 0.00039962 max_mem: 1481M
[05/23 15:50:19 d2.utils.events]: eta: 8:08:10 iter: 39 total_loss: 1.601 loss_cls: 0.4312 loss_box_reg: 0.04747 loss_mask: 0.6906 loss_rpn_cls: 0.4376 loss_rpn_loc: 0.0764 time: 0.3254 data_time: 0.0026 lr: 0.00079922 max_mem: 1481M
[05/23 15:50:26 d2.utils.events]: eta: 8:17:54 iter: 59 total_loss: 1.641 loss_cls: 0.4153 loss_box_reg: 0.09799 loss_mask: 0.691 loss_rpn_cls: 0.3649 loss_rpn_loc: 0.1253 time: 0.3259 data_time: 0.0028 lr: 0.0011988 max_mem: 1481M
[05/23 15:50:32 d2.utils.events]: eta: 8:20:12 iter: 79 total_loss: 1.439 loss_cls: 0.3282 loss_box_reg: 0.09175 loss_mask: 0.6924 loss_rpn_cls: 0.2477 loss_rpn_loc: 0.05234 time: 0.3288 data_time: 0.0027 lr: 0.0015984 max_mem: 1481M
[05/23 15:50:39 d2.utils.events]: eta: 8:20:06 iter: 99 total_loss: 1.285 loss_cls: 0.2667 loss_box_reg: 0.1191 loss_mask: 0.6891 loss_rpn_cls: 0.154 loss_rpn_loc: 0.05424 time: 0.3274 data_time: 0.0025 lr: 0.001998 max_mem: 1481M
[05/23 15:50:45 d2.utils.events]: eta: 8:15:39 iter: 119 total_loss: 1.52 loss_cls: 0.346 loss_box_reg: 0.1504 loss_mask: 0.6818 loss_rpn_cls: 0.2181 loss_rpn_loc: 0.09391 time: 0.3256 data_time: 0.0025 lr: 0.0023976 max_mem: 1481M
[05/23 15:50:51 d2.utils.events]: eta: 8:12:57 iter: 139 total_loss: 1.546 loss_cls: 0.2511 loss_box_reg: 0.1242 loss_mask: 0.6869 loss_rpn_cls: 0.2738 loss_rpn_loc: 0.04643 time: 0.3242 data_time: 0.0027 lr: 0.0027972 max_mem: 1481M
[05/23 15:50:58 d2.utils.events]: eta: 8:12:51 iter: 159 total_loss: 1.687 loss_cls: 0.3452 loss_box_reg: 0.09927 loss_mask: 0.6778 loss_rpn_cls: 0.2546 loss_rpn_loc: 0.1271 time: 0.3253 data_time: 0.0028 lr: 0.0031968 max_mem: 1481M
[05/23 15:51:05 d2.utils.events]: eta: 8:15:19 iter: 179 total_loss: 1.557 loss_cls: 0.4099 loss_box_reg: 0.1837 loss_mask: 0.6872 loss_rpn_cls: 0.1388 loss_rpn_loc: 0.06568 time: 0.3271 data_time: 0.0027 lr: 0.0035964 max_mem: 1481M
[05/23 15:51:12 d2.utils.events]: eta: 8:16:06 iter: 199 total_loss: 1.931 loss_cls: 0.5021 loss_box_reg: 0.2378 loss_mask: 0.6843 loss_rpn_cls: 0.2495 loss_rpn_loc: 0.1568 time: 0.3284 data_time: 0.0035 lr: 0.003996 max_mem: 1481M


run2:

[05/23 15:52:57 d2.utils.events]: eta: 7:49:54 iter: 19 total_loss: 2.349 loss_cls: 0.5801 loss_box_reg: 0.01275 loss_mask: 0.6936 loss_rpn_cls: 0.6719 loss_rpn_loc: 0.09081 time: 0.3190 data_time: 0.0176 lr: 0.00039962 max_mem: 1481M
[05/23 15:53:04 d2.utils.events]: eta: 8:10:18 iter: 39 total_loss: 1.603 loss_cls: 0.4004 loss_box_reg: 0.04758 loss_mask: 0.6906 loss_rpn_cls: 0.4404 loss_rpn_loc: 0.07629 time: 0.3276 data_time: 0.0025 lr: 0.00079922 max_mem: 1481M
[05/23 15:53:10 d2.utils.events]: eta: 8:19:58 iter: 59 total_loss: 1.646 loss_cls: 0.4176 loss_box_reg: 0.1167 loss_mask: 0.6912 loss_rpn_cls: 0.3633 loss_rpn_loc: 0.1252 time: 0.3274 data_time: 0.0026 lr: 0.0011988 max_mem: 1481M
[05/23 15:53:17 d2.utils.events]: eta: 8:21:51 iter: 79 total_loss: 1.428 loss_cls: 0.299 loss_box_reg: 0.0902 loss_mask: 0.6921 loss_rpn_cls: 0.2449 loss_rpn_loc: 0.05256 time: 0.3296 data_time: 0.0026 lr: 0.0015984 max_mem: 1481M
[05/23 15:53:23 d2.utils.events]: eta: 8:21:44 iter: 99 total_loss: 1.319 loss_cls: 0.2876 loss_box_reg: 0.1062 loss_mask: 0.6898 loss_rpn_cls: 0.1512 loss_rpn_loc: 0.05531 time: 0.3289 data_time: 0.0027 lr: 0.001998 max_mem: 1481M
[05/23 15:53:30 d2.utils.events]: eta: 8:17:13 iter: 119 total_loss: 1.441 loss_cls: 0.28 loss_box_reg: 0.1317 loss_mask: 0.6835 loss_rpn_cls: 0.2149 loss_rpn_loc: 0.09209 time: 0.3274 data_time: 0.0025 lr: 0.0023976 max_mem: 1481M
[05/23 15:53:36 d2.utils.events]: eta: 8:15:03 iter: 139 total_loss: 1.496 loss_cls: 0.272 loss_box_reg: 0.1103 loss_mask: 0.6876 loss_rpn_cls: 0.2564 loss_rpn_loc: 0.04832 time: 0.3262 data_time: 0.0025 lr: 0.0027972 max_mem: 1481M
[05/23 15:53:43 d2.utils.events]: eta: 8:14:56 iter: 159 total_loss: 1.737 loss_cls: 0.3486 loss_box_reg: 0.06897 loss_mask: 0.678 loss_rpn_cls: 0.2603 loss_rpn_loc: 0.1359 time: 0.3266 data_time: 0.0025 lr: 0.0031968 max_mem: 1481M
[05/23 15:53:49 d2.utils.events]: eta: 8:16:21 iter: 179 total_loss: 1.525 loss_cls: 0.3834 loss_box_reg: 0.1672 loss_mask: 0.6877 loss_rpn_cls: 0.1623 loss_rpn_loc: 0.08118 time: 0.3272 data_time: 0.0026 lr: 0.0035964 max_mem: 1481M
[05/23 15:53:56 d2.utils.events]: eta: 8:16:14 iter: 199 total_loss: 1.598 loss_cls: 0.3331 loss_box_reg: 0.1141 loss_mask: 0.6792 loss_rpn_cls: 0.2563 loss_rpn_loc: 0.1831 time: 0.3270 data_time: 0.0026 lr: 0.003996 max_mem: 1481M


run3:

[05/23 15:56:10 d2.utils.events]: eta: 7:45:39 iter: 19 total_loss: 2.348 loss_cls: 0.5763 loss_box_reg: 0.01275 loss_mask: 0.6936 loss_rpn_cls: 0.6719 loss_rpn_loc: 0.0908 time: 0.3167 data_time: 0.0122 lr: 0.00039962 max_mem: 1481M
[05/23 15:56:16 d2.utils.events]: eta: 8:10:26 iter: 39 total_loss: 1.605 loss_cls: 0.3891 loss_box_reg: 0.04755 loss_mask: 0.6906 loss_rpn_cls: 0.4403 loss_rpn_loc: 0.07635 time: 0.3277 data_time: 0.0027 lr: 0.00079922 max_mem: 1481M
[05/23 15:56:23 d2.utils.events]: eta: 8:23:04 iter: 59 total_loss: 1.679 loss_cls: 0.4163 loss_box_reg: 0.1102 loss_mask: 0.6912 loss_rpn_cls: 0.3563 loss_rpn_loc: 0.1251 time: 0.3293 data_time: 0.0031 lr: 0.0011988 max_mem: 1481M
[05/23 15:56:30 d2.utils.events]: eta: 8:21:28 iter: 79 total_loss: 1.433 loss_cls: 0.3133 loss_box_reg: 0.07978 loss_mask: 0.6921 loss_rpn_cls: 0.2468 loss_rpn_loc: 0.05257 time: 0.3303 data_time: 0.0028 lr: 0.0015984 max_mem: 1481M
[05/23 15:56:36 d2.utils.events]: eta: 8:22:50 iter: 99 total_loss: 1.317 loss_cls: 0.2764 loss_box_reg: 0.1469 loss_mask: 0.6895 loss_rpn_cls: 0.1487 loss_rpn_loc: 0.05474 time: 0.3291 data_time: 0.0027 lr: 0.001998 max_mem: 1481M
[05/23 15:56:43 d2.utils.events]: eta: 8:20:03 iter: 119 total_loss: 1.455 loss_cls: 0.3264 loss_box_reg: 0.1456 loss_mask: 0.6827 loss_rpn_cls: 0.209 loss_rpn_loc: 0.09486 time: 0.3281 data_time: 0.0030 lr: 0.0023976 max_mem: 1481M
[05/23 15:56:49 d2.utils.events]: eta: 8:16:57 iter: 139 total_loss: 1.475 loss_cls: 0.2835 loss_box_reg: 0.09706 loss_mask: 0.6861 loss_rpn_cls: 0.2541 loss_rpn_loc: 0.04725 time: 0.3260 data_time: 0.0027 lr: 0.0027972 max_mem: 1481M
[05/23 15:56:56 d2.utils.events]: eta: 8:18:19 iter: 159 total_loss: 1.675 loss_cls: 0.3287 loss_box_reg: 0.1219 loss_mask: 0.6776 loss_rpn_cls: 0.2344 loss_rpn_loc: 0.1299 time: 0.3269 data_time: 0.0028 lr: 0.0031968 max_mem: 1481M
[05/23 15:57:02 d2.utils.events]: eta: 8:19:43 iter: 179 total_loss: 1.568 loss_cls: 0.4459 loss_box_reg: 0.1866 loss_mask: 0.6875 loss_rpn_cls: 0.124 loss_rpn_loc: 0.06825 time: 0.3279 data_time: 0.0027 lr: 0.0035964 max_mem: 1481M
[05/23 15:57:09 d2.utils.events]: eta: 8:19:37 iter: 199 total_loss: 1.803 loss_cls: 0.4938 loss_box_reg: 0.1835 loss_mask: 0.6884 loss_rpn_cls: 0.2585 loss_rpn_loc: 0.1701 time: 0.3281 data_time: 0.0029 lr: 0.003996 max_mem: 1481M


## Expected behavior:

I would expect the losses to be (largely) identical in the default training setup when using an identical machine, code, random seed, and config together with the PyTorch flags for deterministic training.
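(An aside, not part of the original report: if some kernel silently lacks a deterministic implementation, PyTorch 1.11's warn-only mode can enumerate the offending ops instead of relying on loss curves alone. A minimal sketch:)

    import warnings

    import torch

    # warn_only=True (new in PyTorch 1.11) turns "no deterministic
    # implementation" errors into warnings, so the run continues and the
    # offending ops can be collected afterwards.
    torch.use_deterministic_algorithms(True, warn_only=True)

    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        # ... run a few training iterations here ...
        for w in caught:
            print(w.category.__name__, w.message)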
jhindel commented 2 years ago

I am facing a very similar issue. Did you find a reason for this behaviour, and do you have any suggestions for how to fix it?

j-rausch commented 2 years ago

I'm still facing the issue. I haven't debugged this in more detail yet, but judging from the losses of the three runs, loss_cls appears to differ the most at the beginning of training.

There have been other issues that were closed in the past (e.g. https://github.com/facebookresearch/detectron2/issues/2480), pointing to PyTorch's non-determinism. Perhaps revisiting them with the new deterministic-training flags in PyTorch could give new pointers.
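(Not part of the original comment, but for anyone triaging this: a minimal sketch for localizing the divergence, assuming both runs save their model state after the same number of iterations, e.g. with torch.save(trainer.model.state_dict(), "run1_model.pth").)

    import torch

    def first_divergence(path1, path2):
        """Compare two saved state_dicts and report the first differing tensor."""
        sd1 = torch.load(path1, map_location="cpu")
        sd2 = torch.load(path2, map_location="cpu")
        for key in sd1:
            if not torch.equal(sd1[key], sd2[key]):
                max_diff = (sd1[key].float() - sd2[key].float()).abs().max().item()
                return key, max_diff
        return None  # state_dicts are bitwise identical

    print(first_divergence("run1_model.pth", "run2_model.pth"))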

j-rausch commented 2 years ago

Is there any news or advice on possible causes of this issue?