facebookresearch / detectron2

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
https://detectron2.readthedocs.io/en/latest/
Apache License 2.0

Repeated training not deterministic despite identical setup and reproducibility flags #4260

Open j-rausch opened 2 years ago

j-rausch commented 2 years ago

Hi, I'm working on an experiment where I noticed large differences between models trained with identical configs and random seeds. I'm trying to understand the causes for this.

I've upgraded to a more recent PyTorch version, which introduced flags for deterministic training across multiple executions: https://pytorch.org/docs/1.11/notes/randomness.html?highlight=reproducibility

However, despite using these flags and the most recent detectron2 sources, the final trained models and their validation accuracies can differ greatly (~2 AP) on a custom dataset of mine. These differences occur across multiple runs on the same machine (identical device, code, config, random seed).
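For completeness, the linked notes also describe seeding DataLoader workers, which the flags alone don't control. A minimal plain-PyTorch sketch of that recipe (illustrative only: detectron2 builds its dataloader internally, and the toy `TensorDataset` here stands in for a real detection dataset):

    import random
    import numpy as np
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    def seed_worker(worker_id):
        # Derive per-worker numpy/random seeds from the torch base seed,
        # as recommended in the PyTorch reproducibility notes.
        worker_seed = torch.initial_seed() % 2**32
        np.random.seed(worker_seed)
        random.seed(worker_seed)

    g = torch.Generator()
    g.manual_seed(42)

    dataset = TensorDataset(torch.arange(8).float())  # toy stand-in dataset

    loader = DataLoader(
        dataset,
        batch_size=1,
        num_workers=1,
        worker_init_fn=seed_worker,
        generator=g,
    )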

I've looked into reproducing this problem and observe it for the unaltered detectron2 demo training code as well. I've added a minimal script below that reproduces the training and shows sizeable differences between the first logged losses of three consecutive runs.

Instructions To Reproduce the Issue:

  1. Full runnable code or full changes you made: script to reproduce the experiment (deterministic_example.py)
    
    import os

    # must be set before CUDA/cuBLAS is initialized, for deterministic cuBLAS
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

    import torch

    # flags for deterministic training (cf. the PyTorch reproducibility notes)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.use_deterministic_algorithms(True)

    from detectron2.config import get_cfg
    from detectron2.engine import DefaultTrainer, default_argument_parser, default_setup, launch


    def setup(args):
        """
        Create configs and perform basic setups.
        """
        cfg = get_cfg()
        cfg.merge_from_file(args.config_file)
        cfg.merge_from_list(args.opts)
        cfg.freeze()
        default_setup(cfg, args)
        return cfg


    def main(args):
        cfg = setup(args)
        trainer = DefaultTrainer(cfg)
        trainer.resume_or_load(resume=False)
        return trainer.train()


    if __name__ == "__main__":
        args = default_argument_parser().parse_args()
        print("Command Line Args:", args)
        launch(
            main,
            args.num_gpus,
            num_machines=args.num_machines,
            machine_rank=args.machine_rank,
            dist_url=args.dist_url,
            args=(args,),
        )

git rev-parse HEAD; git diff e091a07ef573915056f8c2191b774aad0e38d09c

2. What exact command you run:

CUDA_VISIBLE_DEVICES=0 python deterministic_example.py --num-gpus 1 --config-file ./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml SOLVER.IMS_PER_BATCH 1 SEED 42 DATALOADER.NUM_WORKERS 1


3. __Full logs__ or other relevant observations:

Command Line Args: Namespace(config_file='./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml', resume=False, eval_only=False, num_gpus=1, num_machines=1, machine_rank=0, dist_url='tcp://127.0.0.1:53650', opts=['SOLVER.IMS_PER_BATCH', '1', 'SEED', '42', 'DATALOADER.NUM_WORKERS', '1'])
[05/23 15:49:06 detectron2]: Rank of current process: 0. World size: 1
[05/23 15:49:08 detectron2]: Environment info:


sys.platform             linux
Python                   3.10.4 packaged by conda-forge (main, Mar 24 2022, 17:39:04) [GCC 10.3.0]
numpy                    1.22.3
detectron2               0.6 @/rootpath/git/detectron2/detectron2
Compiler                 GCC 9.3
CUDA compiler            CUDA 11.5
detectron2 arch flags    6.1
DETECTRON2_ENV_MODULE
PyTorch                  1.11.0+cu115 @/rootpath/anaconda3/envs/sgg_torch111_detectron06/lib/python3.10/site-packages/torch
PyTorch debug build      False
GPU available            Yes
GPU 0                    NVIDIA TITAN Xp (arch=6.1)
Driver version           510.47.03
CUDA_HOME                /usr/local/cuda-11.5
Pillow                   9.1.0
torchvision              0.12.0+cu115 @/rootpath/anaconda3/envs/sgg_torch111_detectron06/lib/python3.10/site-packages/torchvision
torchvision arch flags   3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore                   0.1.5.post20220504
iopath                   0.1.9
cv2                      4.5.5

PyTorch built with:

[05/23 15:49:08 detectron2]: Command line arguments: Namespace(config_file='./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml', resume=False, eval_only=False, num_gpus=1, num_machines=1, machine_rank=0, dist_url='tcp://127.0.0.1:53650', opts=['SOLVER.IMS_PER_BATCH', '1', 'SEED', '42', 'DATALOADER.NUM_WORKERS', '1'])
[05/23 15:49:08 detectron2]: Contents of args.config_file=./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml:
_BASE_: "../Base-RCNN-FPN.yaml"
MODEL:
  WEIGHTS: "detectron2://ImageNetPretrained/MSRA/R-50.pkl"
  MASK_ON: True
  RESNETS:
    DEPTH: 50

[... running config dump truncated; DATALOADER/DATASETS excerpt:]
DATALOADER:
  FILTER_EMPTY_ANNOTATIONS: true
  NUM_WORKERS: 1
  REPEAT_THRESHOLD: 0.0
  SAMPLER_TRAIN: TrainingSampler
DATASETS:
  PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000
  PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000
  PROPOSAL_FILES_TEST: []
  PROPOSAL_FILES_TRAIN: []
  TEST:
[...]

[05/23 15:49:08 detectron2]: Full config saved to ./output/config.yaml

[... full model printout truncated: GeneralizedRCNN with ResNet-50-FPN backbone (BottleneckBlocks with FrozenBatchNorm2d), RPN (StandardRPNHead, DefaultAnchorGenerator), and StandardROIHeads (ROIPooler with ROIAlign, FastRCNNConvFCHead box head, FastRCNNOutputLayers box predictor, MaskRCNNConvUpsampleHead mask head) ...]

[05/23 15:49:30 d2.data.datasets.coco]: Loading datasets/coco/annotations/instances_train2017.json takes 18.03 seconds.
[05/23 15:49:31 d2.data.datasets.coco]: Loaded 118287 images in COCO format from datasets/coco/annotations/instances_train2017.json
[05/23 15:49:37 d2.data.build]: Removed 1021 images with no usable annotations. 117266 images left.
[05/23 15:49:43 d2.data.build]: Distribution of instances among all 80 categories:
category        #instances   category        #instances   category        #instances
person 257253 bicycle 7056 car 43533
motorcycle 8654 airplane 5129 bus 6061
train 4570 truck 9970 boat 10576
traffic light 12842 fire hydrant 1865 stop sign 1983
parking meter 1283 bench 9820 bird 10542
cat 4766 dog 5500 horse 6567
sheep 9223 cow 8014 elephant 5484
bear 1294 zebra 5269 giraffe 5128
backpack 8714 umbrella 11265 handbag 12342
tie 6448 suitcase 6112 frisbee 2681
skis 6623 snowboard 2681 sports ball 6299
kite 8802 baseball bat 3273 baseball gl.. 3747
skateboard 5536 surfboard 6095 tennis racket 4807
bottle 24070 wine glass 7839 cup 20574
fork 5474 knife 7760 spoon 6159
bowl 14323 banana 9195 apple 5776
sandwich 4356 orange 6302 broccoli 7261
carrot 7758 hot dog 2884 pizza 5807
donut 7005 cake 6296 chair 38073
couch 5779 potted plant 8631 bed 4192
dining table 15695 toilet 4149 tv 5803
laptop 4960 mouse 2261 remote 5700
keyboard 2854 cell phone 6422 microwave 1672
oven 3334 toaster 225 sink 5609
refrigerator 2634 book 24077 clock 6320
vase 6577 scissors 1464 teddy bear 4729
hair drier 198 toothbrush 1945
total 849949
[05/23 15:49:43 d2.data.dataset_mapper]: [DatasetMapper] Augmentations used in training: [ResizeShortestEdge(short_edge_length=(640, 672, 704, 736, 768, 800), max_size=1333, sample_style='choice'), RandomFlip()]
[05/23 15:49:43 d2.data.build]: Using training sampler TrainingSampler
[05/23 15:49:43 d2.data.common]: Serializing 117266 elements to byte tensors and concatenating them all ...
[05/23 15:49:47 d2.data.common]: Serialized dataset takes 451.21 MiB
[05/23 15:50:04 fvcore.common.checkpoint]: [Checkpointer] Loading from detectron2://ImageNetPretrained/MSRA/R-50.pkl ...
[05/23 15:50:04 d2.checkpoint.c2_model_loading]: Renaming Caffe2 weights ......
[05/23 15:50:04 d2.checkpoint.c2_model_loading]: Following weights matched with submodule backbone.bottom_up:
Names in Model      Names in Checkpoint        Shapes
res2.0.conv1.* res2_0branch2a{bn_*,w} (64,) (64,) (64,) (64,) (64,64,1,1)
res2.0.conv2.* res2_0branch2b{bn_*,w} (64,) (64,) (64,) (64,) (64,64,3,3)
res2.0.conv3.* res2_0branch2c{bn_*,w} (256,) (256,) (256,) (256,) (256,64,1,1)
res2.0.shortcut.* res2_0branch1{bn_*,w} (256,) (256,) (256,) (256,) (256,64,1,1)
res2.1.conv1.* res2_1branch2a{bn_*,w} (64,) (64,) (64,) (64,) (64,256,1,1)
res2.1.conv2.* res2_1branch2b{bn_*,w} (64,) (64,) (64,) (64,) (64,64,3,3)
res2.1.conv3.* res2_1branch2c{bn_*,w} (256,) (256,) (256,) (256,) (256,64,1,1)
res2.2.conv1.* res2_2branch2a{bn_*,w} (64,) (64,) (64,) (64,) (64,256,1,1)
res2.2.conv2.* res2_2branch2b{bn_*,w} (64,) (64,) (64,) (64,) (64,64,3,3)
res2.2.conv3.* res2_2branch2c{bn_*,w} (256,) (256,) (256,) (256,) (256,64,1,1)
res3.0.conv1.* res3_0branch2a{bn_*,w} (128,) (128,) (128,) (128,) (128,256,1,1)
res3.0.conv2.* res3_0branch2b{bn_*,w} (128,) (128,) (128,) (128,) (128,128,3,3)
res3.0.conv3.* res3_0branch2c{bn_*,w} (512,) (512,) (512,) (512,) (512,128,1,1)
res3.0.shortcut.* res3_0branch1{bn_*,w} (512,) (512,) (512,) (512,) (512,256,1,1)
res3.1.conv1.* res3_1branch2a{bn_*,w} (128,) (128,) (128,) (128,) (128,512,1,1)
res3.1.conv2.* res3_1branch2b{bn_*,w} (128,) (128,) (128,) (128,) (128,128,3,3)
res3.1.conv3.* res3_1branch2c{bn_*,w} (512,) (512,) (512,) (512,) (512,128,1,1)
res3.2.conv1.* res3_2branch2a{bn_*,w} (128,) (128,) (128,) (128,) (128,512,1,1)
res3.2.conv2.* res3_2branch2b{bn_*,w} (128,) (128,) (128,) (128,) (128,128,3,3)
res3.2.conv3.* res3_2branch2c{bn_*,w} (512,) (512,) (512,) (512,) (512,128,1,1)
res3.3.conv1.* res3_3branch2a{bn_*,w} (128,) (128,) (128,) (128,) (128,512,1,1)
res3.3.conv2.* res3_3branch2b{bn_*,w} (128,) (128,) (128,) (128,) (128,128,3,3)
res3.3.conv3.* res3_3branch2c{bn_*,w} (512,) (512,) (512,) (512,) (512,128,1,1)
res4.0.conv1.* res4_0branch2a{bn_*,w} (256,) (256,) (256,) (256,) (256,512,1,1)
res4.0.conv2.* res4_0branch2b{bn_*,w} (256,) (256,) (256,) (256,) (256,256,3,3)
res4.0.conv3.* res4_0branch2c{bn_*,w} (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.0.shortcut.* res4_0branch1{bn_*,w} (1024,) (1024,) (1024,) (1024,) (1024,512,1,1)
res4.1.conv1.* res4_1branch2a{bn_*,w} (256,) (256,) (256,) (256,) (256,1024,1,1)
res4.1.conv2.* res4_1branch2b{bn_*,w} (256,) (256,) (256,) (256,) (256,256,3,3)
res4.1.conv3.* res4_1branch2c{bn_*,w} (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.2.conv1.* res4_2branch2a{bn_*,w} (256,) (256,) (256,) (256,) (256,1024,1,1)
res4.2.conv2.* res4_2branch2b{bn_*,w} (256,) (256,) (256,) (256,) (256,256,3,3)
res4.2.conv3.* res4_2branch2c{bn_*,w} (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.3.conv1.* res4_3branch2a{bn_*,w} (256,) (256,) (256,) (256,) (256,1024,1,1)
res4.3.conv2.* res4_3branch2b{bn_*,w} (256,) (256,) (256,) (256,) (256,256,3,3)
res4.3.conv3.* res4_3branch2c{bn_*,w} (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.4.conv1.* res4_4branch2a{bn_*,w} (256,) (256,) (256,) (256,) (256,1024,1,1)
res4.4.conv2.* res4_4branch2b{bn_*,w} (256,) (256,) (256,) (256,) (256,256,3,3)
res4.4.conv3.* res4_4branch2c{bn_*,w} (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)

[... weight-matching table and checkpoint warnings truncated; model parameters not found in the checkpoint include:]
backbone.fpn_output2.{bias, weight}
backbone.fpn_output3.{bias, weight}
backbone.fpn_output4.{bias, weight}
backbone.fpn_output5.{bias, weight}
proposal_generator.rpn_head.anchor_deltas.{bias, weight}
proposal_generator.rpn_head.conv.{bias, weight}
proposal_generator.rpn_head.objectness_logits.{bias, weight}
roi_heads.box_head.fc1.{bias, weight}
roi_heads.box_head.fc2.{bias, weight}
roi_heads.box_predictor.bbox_pred.{bias, weight}
roi_heads.box_predictor.cls_score.{bias, weight}
roi_heads.mask_head.deconv.{bias, weight}
roi_heads.mask_head.mask_fcn1.{bias, weight}
roi_heads.mask_head.mask_fcn2.{bias, weight}
roi_heads.mask_head.mask_fcn3.{bias, weight}
roi_heads.mask_head.mask_fcn4.{bias, weight}
roi_heads.mask_head.predictor.{bias, weight}
WARNING [05/23 15:50:04 fvcore.common.checkpoint]: The checkpoint state_dict contains keys that are not used by the model:
  fc1000.{bias, weight}
  stem.conv1.bias
[05/23 15:50:04 d2.engine.train_loop]: Starting training from iteration 0
/rootpath/anaconda3/envs/sgg_torch111_detectron06/lib/python3.10/site-packages/torch/functional.py:568: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:2228.)
return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]

run1:

[05/23 15:50:12 d2.utils.events]: eta: 7:44:48 iter: 19 total_loss: 2.345 loss_cls: 0.5814 loss_box_reg: 0.01275 loss_mask: 0.6936 loss_rpn_cls: 0.6719 loss_rpn_loc: 0.0908 time: 0.3151 data_time: 0.0139 lr: 0.00039962 max_mem: 1481M
[05/23 15:50:19 d2.utils.events]: eta: 8:08:10 iter: 39 total_loss: 1.601 loss_cls: 0.4312 loss_box_reg: 0.04747 loss_mask: 0.6906 loss_rpn_cls: 0.4376 loss_rpn_loc: 0.0764 time: 0.3254 data_time: 0.0026 lr: 0.00079922 max_mem: 1481M
[05/23 15:50:26 d2.utils.events]: eta: 8:17:54 iter: 59 total_loss: 1.641 loss_cls: 0.4153 loss_box_reg: 0.09799 loss_mask: 0.691 loss_rpn_cls: 0.3649 loss_rpn_loc: 0.1253 time: 0.3259 data_time: 0.0028 lr: 0.0011988 max_mem: 1481M
[05/23 15:50:32 d2.utils.events]: eta: 8:20:12 iter: 79 total_loss: 1.439 loss_cls: 0.3282 loss_box_reg: 0.09175 loss_mask: 0.6924 loss_rpn_cls: 0.2477 loss_rpn_loc: 0.05234 time: 0.3288 data_time: 0.0027 lr: 0.0015984 max_mem: 1481M
[05/23 15:50:39 d2.utils.events]: eta: 8:20:06 iter: 99 total_loss: 1.285 loss_cls: 0.2667 loss_box_reg: 0.1191 loss_mask: 0.6891 loss_rpn_cls: 0.154 loss_rpn_loc: 0.05424 time: 0.3274 data_time: 0.0025 lr: 0.001998 max_mem: 1481M
[05/23 15:50:45 d2.utils.events]: eta: 8:15:39 iter: 119 total_loss: 1.52 loss_cls: 0.346 loss_box_reg: 0.1504 loss_mask: 0.6818 loss_rpn_cls: 0.2181 loss_rpn_loc: 0.09391 time: 0.3256 data_time: 0.0025 lr: 0.0023976 max_mem: 1481M
[05/23 15:50:51 d2.utils.events]: eta: 8:12:57 iter: 139 total_loss: 1.546 loss_cls: 0.2511 loss_box_reg: 0.1242 loss_mask: 0.6869 loss_rpn_cls: 0.2738 loss_rpn_loc: 0.04643 time: 0.3242 data_time: 0.0027 lr: 0.0027972 max_mem: 1481M
[05/23 15:50:58 d2.utils.events]: eta: 8:12:51 iter: 159 total_loss: 1.687 loss_cls: 0.3452 loss_box_reg: 0.09927 loss_mask: 0.6778 loss_rpn_cls: 0.2546 loss_rpn_loc: 0.1271 time: 0.3253 data_time: 0.0028 lr: 0.0031968 max_mem: 1481M
[05/23 15:51:05 d2.utils.events]: eta: 8:15:19 iter: 179 total_loss: 1.557 loss_cls: 0.4099 loss_box_reg: 0.1837 loss_mask: 0.6872 loss_rpn_cls: 0.1388 loss_rpn_loc: 0.06568 time: 0.3271 data_time: 0.0027 lr: 0.0035964 max_mem: 1481M
[05/23 15:51:12 d2.utils.events]: eta: 8:16:06 iter: 199 total_loss: 1.931 loss_cls: 0.5021 loss_box_reg: 0.2378 loss_mask: 0.6843 loss_rpn_cls: 0.2495 loss_rpn_loc: 0.1568 time: 0.3284 data_time: 0.0035 lr: 0.003996 max_mem: 1481M


run2:

[05/23 15:52:57 d2.utils.events]: eta: 7:49:54 iter: 19 total_loss: 2.349 loss_cls: 0.5801 loss_box_reg: 0.01275 loss_mask: 0.6936 loss_rpn_cls: 0.6719 loss_rpn_loc: 0.09081 time: 0.3190 data_time: 0.0176 lr: 0.00039962 max_mem: 1481M
[05/23 15:53:04 d2.utils.events]: eta: 8:10:18 iter: 39 total_loss: 1.603 loss_cls: 0.4004 loss_box_reg: 0.04758 loss_mask: 0.6906 loss_rpn_cls: 0.4404 loss_rpn_loc: 0.07629 time: 0.3276 data_time: 0.0025 lr: 0.00079922 max_mem: 1481M
[05/23 15:53:10 d2.utils.events]: eta: 8:19:58 iter: 59 total_loss: 1.646 loss_cls: 0.4176 loss_box_reg: 0.1167 loss_mask: 0.6912 loss_rpn_cls: 0.3633 loss_rpn_loc: 0.1252 time: 0.3274 data_time: 0.0026 lr: 0.0011988 max_mem: 1481M
[05/23 15:53:17 d2.utils.events]: eta: 8:21:51 iter: 79 total_loss: 1.428 loss_cls: 0.299 loss_box_reg: 0.0902 loss_mask: 0.6921 loss_rpn_cls: 0.2449 loss_rpn_loc: 0.05256 time: 0.3296 data_time: 0.0026 lr: 0.0015984 max_mem: 1481M
[05/23 15:53:23 d2.utils.events]: eta: 8:21:44 iter: 99 total_loss: 1.319 loss_cls: 0.2876 loss_box_reg: 0.1062 loss_mask: 0.6898 loss_rpn_cls: 0.1512 loss_rpn_loc: 0.05531 time: 0.3289 data_time: 0.0027 lr: 0.001998 max_mem: 1481M
[05/23 15:53:30 d2.utils.events]: eta: 8:17:13 iter: 119 total_loss: 1.441 loss_cls: 0.28 loss_box_reg: 0.1317 loss_mask: 0.6835 loss_rpn_cls: 0.2149 loss_rpn_loc: 0.09209 time: 0.3274 data_time: 0.0025 lr: 0.0023976 max_mem: 1481M
[05/23 15:53:36 d2.utils.events]: eta: 8:15:03 iter: 139 total_loss: 1.496 loss_cls: 0.272 loss_box_reg: 0.1103 loss_mask: 0.6876 loss_rpn_cls: 0.2564 loss_rpn_loc: 0.04832 time: 0.3262 data_time: 0.0025 lr: 0.0027972 max_mem: 1481M
[05/23 15:53:43 d2.utils.events]: eta: 8:14:56 iter: 159 total_loss: 1.737 loss_cls: 0.3486 loss_box_reg: 0.06897 loss_mask: 0.678 loss_rpn_cls: 0.2603 loss_rpn_loc: 0.1359 time: 0.3266 data_time: 0.0025 lr: 0.0031968 max_mem: 1481M
[05/23 15:53:49 d2.utils.events]: eta: 8:16:21 iter: 179 total_loss: 1.525 loss_cls: 0.3834 loss_box_reg: 0.1672 loss_mask: 0.6877 loss_rpn_cls: 0.1623 loss_rpn_loc: 0.08118 time: 0.3272 data_time: 0.0026 lr: 0.0035964 max_mem: 1481M
[05/23 15:53:56 d2.utils.events]: eta: 8:16:14 iter: 199 total_loss: 1.598 loss_cls: 0.3331 loss_box_reg: 0.1141 loss_mask: 0.6792 loss_rpn_cls: 0.2563 loss_rpn_loc: 0.1831 time: 0.3270 data_time: 0.0026 lr: 0.003996 max_mem: 1481M


run3:

[05/23 15:56:10 d2.utils.events]: eta: 7:45:39 iter: 19 total_loss: 2.348 loss_cls: 0.5763 loss_box_reg: 0.01275 loss_mask: 0.6936 loss_rpn_cls: 0.6719 loss_rpn_loc: 0.0908 time: 0.3167 data_time: 0.0122 lr: 0.00039962 max_mem: 1481M
[05/23 15:56:16 d2.utils.events]: eta: 8:10:26 iter: 39 total_loss: 1.605 loss_cls: 0.3891 loss_box_reg: 0.04755 loss_mask: 0.6906 loss_rpn_cls: 0.4403 loss_rpn_loc: 0.07635 time: 0.3277 data_time: 0.0027 lr: 0.00079922 max_mem: 1481M
[05/23 15:56:23 d2.utils.events]: eta: 8:23:04 iter: 59 total_loss: 1.679 loss_cls: 0.4163 loss_box_reg: 0.1102 loss_mask: 0.6912 loss_rpn_cls: 0.3563 loss_rpn_loc: 0.1251 time: 0.3293 data_time: 0.0031 lr: 0.0011988 max_mem: 1481M
[05/23 15:56:30 d2.utils.events]: eta: 8:21:28 iter: 79 total_loss: 1.433 loss_cls: 0.3133 loss_box_reg: 0.07978 loss_mask: 0.6921 loss_rpn_cls: 0.2468 loss_rpn_loc: 0.05257 time: 0.3303 data_time: 0.0028 lr: 0.0015984 max_mem: 1481M
[05/23 15:56:36 d2.utils.events]: eta: 8:22:50 iter: 99 total_loss: 1.317 loss_cls: 0.2764 loss_box_reg: 0.1469 loss_mask: 0.6895 loss_rpn_cls: 0.1487 loss_rpn_loc: 0.05474 time: 0.3291 data_time: 0.0027 lr: 0.001998 max_mem: 1481M
[05/23 15:56:43 d2.utils.events]: eta: 8:20:03 iter: 119 total_loss: 1.455 loss_cls: 0.3264 loss_box_reg: 0.1456 loss_mask: 0.6827 loss_rpn_cls: 0.209 loss_rpn_loc: 0.09486 time: 0.3281 data_time: 0.0030 lr: 0.0023976 max_mem: 1481M
[05/23 15:56:49 d2.utils.events]: eta: 8:16:57 iter: 139 total_loss: 1.475 loss_cls: 0.2835 loss_box_reg: 0.09706 loss_mask: 0.6861 loss_rpn_cls: 0.2541 loss_rpn_loc: 0.04725 time: 0.3260 data_time: 0.0027 lr: 0.0027972 max_mem: 1481M
[05/23 15:56:56 d2.utils.events]: eta: 8:18:19 iter: 159 total_loss: 1.675 loss_cls: 0.3287 loss_box_reg: 0.1219 loss_mask: 0.6776 loss_rpn_cls: 0.2344 loss_rpn_loc: 0.1299 time: 0.3269 data_time: 0.0028 lr: 0.0031968 max_mem: 1481M
[05/23 15:57:02 d2.utils.events]: eta: 8:19:43 iter: 179 total_loss: 1.568 loss_cls: 0.4459 loss_box_reg: 0.1866 loss_mask: 0.6875 loss_rpn_cls: 0.124 loss_rpn_loc: 0.06825 time: 0.3279 data_time: 0.0027 lr: 0.0035964 max_mem: 1481M
[05/23 15:57:09 d2.utils.events]: eta: 8:19:37 iter: 199 total_loss: 1.803 loss_cls: 0.4938 loss_box_reg: 0.1835 loss_mask: 0.6884 loss_rpn_cls: 0.2585 loss_rpn_loc: 0.1701 time: 0.3281 data_time: 0.0029 lr: 0.003996 max_mem: 1481M


## Expected behavior:

I would expect the losses to be (largely) identical in the default training setup when using an identical machine, code, random seed, and config together with PyTorch's flags for deterministic training.
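To separate weight initialization from forward/backward non-determinism, a minimal sanity check could compare the initial weights of two independently constructed trainers. A sketch using detectron2's `seed_all_rng`; the equality helper below is hypothetical, not a detectron2 API:

    import torch
    from detectron2.utils.env import seed_all_rng

    def state_dicts_identical(m1, m2):
        # True iff both models hold bit-identical parameters and buffers.
        s1, s2 = m1.state_dict(), m2.state_dict()
        return s1.keys() == s2.keys() and all(torch.equal(s1[k], s2[k]) for k in s1)

    # cfg as produced by setup(args) in the script above; re-seed before each
    # construction so both trainers consume identical RNG state:
    # seed_all_rng(42); t1 = DefaultTrainer(cfg)
    # seed_all_rng(42); t2 = DefaultTrainer(cfg)
    # print(state_dicts_identical(t1.model, t2.model))  # expected: True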
jhindel commented 2 years ago

I am facing a very similar issue. Did you find the reason for this behaviour, and do you have any suggestions for how to fix it?

j-rausch commented 2 years ago

I'm still facing the issue. Without having debugged this in more detail, just looking at the losses of the three runs, loss_cls appears to differ the most at the beginning of training.

There have been similar issues in the past that were closed (e.g. https://github.com/facebookresearch/detectron2/issues/2480), pointing to non-determinism in PyTorch. Perhaps revisiting them with the new deterministic-training flags in PyTorch could give new pointers.
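One diagnostic that might help (a sketch, assuming PyTorch >= 1.11): run a few iterations with `warn_only=True`, which reports ops lacking a deterministic implementation as warnings instead of raising, and thereby enumerates the non-deterministic kernels the training loop actually hits:

    import warnings
    import torch

    # Since PyTorch 1.11, warn_only=True downgrades the RuntimeError for ops
    # without deterministic implementations to a warning.
    torch.use_deterministic_algorithms(True, warn_only=True)

    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        # ... run one or two training iterations here, e.g. trainer.train() ...
        pass

    for w in caught:
        print(w.category.__name__, w.message)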

j-rausch commented 2 years ago

Is there any news or advice on possible causes of this issue?