facebookresearch / detectron2

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
https://detectron2.readthedocs.io/en/latest/
Apache License 2.0

No evaluation results printed during multi-gpu training #1022

Closed ghost closed 4 years ago

ghost commented 4 years ago

I am training an object detection model on a custom COCO-format dataset. During multi-GPU training, I periodically run evaluation using cfg.TEST.EVAL_PERIOD. However, I don't get any evaluation results, such as mAP scores or per-category AP scores. This issue is similar to #937, with the only difference that no evaluation results are printed at all. How can I get the evaluation results?

ppwwyyxx commented 4 years ago

If you need help with an unexpected issue, please include details following the issue template.

ghost commented 4 years ago

Instructions To Reproduce the Issue:

  1. what changes you made (git diff) or what code you wrote
    
    I am running the standard plain_train_net.py. I have only changed --num-gpus to 2, NUM_WORKERS to 2, the batch size to 7, the learning rate to 0.02, cfg.TEST.EVAL_PERIOD to 4510, and MAX_ITER to 100000.

Here are the config arguments:

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("Misc/cascade_mask_rcnn_R_50_FPN_3x.yaml"))

cfg.DATASETS.TRAIN = ()

cfg.DATASETS.TRAIN = ("idd_train",)
#cfg.DATASETS.TEST = ()
cfg.DATASETS.TEST = ("idd_val",)
cfg.MODEL.WEIGHTS = "/idd_data_coco/model_final_480dd8.pkl" # Cascade
cfg.OUTPUT_DIR = '/idd_data_coco/models/'
cfg.MODEL.MASK_ON = False
cfg.DATALOADER.NUM_WORKERS = 2
cfg.SOLVER.MAX_ITER = 1900
cfg.SOLVER.CHECKPOINT_PERIOD = 120000
cfg.SOLVER.BASE_LR = 0.02  # pick a good LR
cfg.SOLVER.GAMMA = 0.3
cfg.SOLVER.STEPS = (15000, 20000,)
cfg.TEST.EVAL_PERIOD = 4510
cfg.SOLVER.IMS_PER_BATCH = 7
cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.5
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE =  128 # default: 512
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 15
2. what exact command you run: python plain_train_net.py
3. what you observed (including the full logs):

[03/10 10:23:24] detectron2 INFO: Rank of current process: 0. World size: 2
[03/10 10:23:25] detectron2 INFO: Environment info:


sys.platform              linux
Python                    3.7.6 (default, Jan 8 2020, 19:59:22) [GCC 7.3.0]
numpy                     1.18.1
detectron2                0.1.1 @/home/username/detectron2_v1/detectron2
detectron2 compiler       GCC 5.5
detectron2 CUDA compiler  10.2
detectron2 arch flags     sm_75
DETECTRON2_ENV_MODULE
PyTorch                   1.4.0+cu100 @/home/username/miniconda3/envs/env_det2/lib/python3.7/site-packages/torch
PyTorch debug build       False
CUDA available            True
GPU 0,1                   GeForce RTX 2080 Ti
CUDA_HOME                 /usr/local/cuda
NVCC                      Cuda compilation tools, release 10.2, V10.2.89
Pillow                    6.2.2
torchvision               0.5.0+cu100 @/home/username/miniconda3/envs/env_det2/lib/python3.7/site-packages/torchvision
torchvision arch flags    sm_35, sm_50, sm_60, sm_70, sm_75
cv2                       4.1.2


PyTorch built with:

[03/10 10:23:25] detectron2 INFO: Command line arguments: Namespace(config_file='', dist_url='tcp://127.0.0.1:51012', eval_only=False, machine_rank=0, num_gpus=2, num_machines=1, opts=[], resume=False) [03/10 10:23:25] detectron2 INFO: Running with full config: CUDNN_BENCHMARK: False DATALOADER: ASPECT_RATIO_GROUPING: True FILTER_EMPTY_ANNOTATIONS: True NUM_WORKERS: 2 REPEAT_THRESHOLD: 0.0 SAMPLER_TRAIN: TrainingSampler DATASETS: PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000 PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000 PROPOSAL_FILES_TEST: () PROPOSAL_FILES_TRAIN: () TEST: ('idd_val',) TRAIN: ('idd_train',) GLOBAL: HACK: 1.0 INPUT: CROP: ENABLED: False SIZE: [0.9, 0.9] TYPE: relative_range FORMAT: BGR MASK_FORMAT: polygon MAX_SIZE_TEST: 1333 MAX_SIZE_TRAIN: 1333 MIN_SIZE_TEST: 800 MIN_SIZE_TRAIN: (640, 672, 704, 736, 768, 800) MIN_SIZE_TRAIN_SAMPLING: choice MODEL: ANCHOR_GENERATOR: ANGLES: [[-90, 0, 90]] ASPECT_RATIOS: [[0.5, 1.0, 2.0]] NAME: DefaultAnchorGenerator OFFSET: 0.0 SIZES: [[32], [64], [128], [256], [512]] BACKBONE: FREEZE_AT: 2 NAME: build_resnet_fpn_backbone DEVICE: cuda FPN: FUSE_TYPE: sum IN_FEATURES: ['res2', 'res3', 'res4', 'res5'] NORM: OUT_CHANNELS: 256 KEYPOINT_ON: False LOAD_PROPOSALS: False MASK_ON: False META_ARCHITECTURE: GeneralizedRCNN PANOPTIC_FPN: COMBINE: ENABLED: True INSTANCES_CONFIDENCE_THRESH: 0.5 OVERLAP_THRESH: 0.5 STUFF_AREA_LIMIT: 4096 INSTANCE_LOSS_WEIGHT: 1.0 PIXEL_MEAN: [103.53, 116.28, 123.675] PIXEL_STD: [1.0, 1.0, 1.0] PROPOSAL_GENERATOR: MIN_SIZE: 0 NAME: RPN RESNETS: DEFORM_MODULATED: False DEFORM_NUM_GROUPS: 1 DEFORM_ON_PER_STAGE: [False, False, False, False] DEPTH: 50 NORM: FrozenBN NUM_GROUPS: 1 OUT_FEATURES: ['res2', 'res3', 'res4', 'res5'] RES2_OUT_CHANNELS: 256 RES5_DILATION: 1 STEM_OUT_CHANNELS: 64 STRIDE_IN_1X1: True WIDTH_PER_GROUP: 64 RETINANET: BBOX_REG_WEIGHTS: (1.0, 1.0, 1.0, 1.0) FOCAL_LOSS_ALPHA: 0.25 FOCAL_LOSS_GAMMA: 2.0 IN_FEATURES: ['p3', 'p4', 'p5', 'p6', 'p7'] IOU_LABELS: [0, -1, 1] IOU_THRESHOLDS: [0.4, 0.5] 
NMS_THRESH_TEST: 0.5 NUM_CLASSES: 80 NUM_CONVS: 4 PRIOR_PROB: 0.01 SCORE_THRESH_TEST: 0.05 SMOOTH_L1_LOSS_BETA: 0.1 TOPK_CANDIDATES_TEST: 1000 ROI_BOX_CASCADE_HEAD: BBOX_REG_WEIGHTS: ((10.0, 10.0, 5.0, 5.0), (20.0, 20.0, 10.0, 10.0), (30.0, 30.0, 15.0, 15.0)) IOUS: (0.5, 0.6, 0.7) ROI_BOX_HEAD: BBOX_REG_WEIGHTS: (10.0, 10.0, 5.0, 5.0) CLS_AGNOSTIC_BBOX_REG: True CONV_DIM: 256 FC_DIM: 1024 NAME: FastRCNNConvFCHead NORM: NUM_CONV: 0 NUM_FC: 2 POOLER_RESOLUTION: 7 POOLER_SAMPLING_RATIO: 0 POOLER_TYPE: ROIAlignV2 SMOOTH_L1_BETA: 0.0 TRAIN_ON_PRED_BOXES: False ROI_HEADS: BATCH_SIZE_PER_IMAGE: 128 IN_FEATURES: ['p2', 'p3', 'p4', 'p5'] IOU_LABELS: [0, 1] IOU_THRESHOLDS: [0.5] NAME: CascadeROIHeads NMS_THRESH_TEST: 0.5 NUM_CLASSES: 15 POSITIVE_FRACTION: 0.25 PROPOSAL_APPEND_GT: True SCORE_THRESH_TEST: 0.5 ROI_KEYPOINT_HEAD: CONV_DIMS: (512, 512, 512, 512, 512, 512, 512, 512) LOSS_WEIGHT: 1.0 MIN_KEYPOINTS_PER_IMAGE: 1 NAME: KRCNNConvDeconvUpsampleHead NORMALIZE_LOSS_BY_VISIBLE_KEYPOINTS: True NUM_KEYPOINTS: 17 POOLER_RESOLUTION: 14 POOLER_SAMPLING_RATIO: 0 POOLER_TYPE: ROIAlignV2 ROI_MASK_HEAD: CLS_AGNOSTIC_MASK: False CONV_DIM: 256 NAME: MaskRCNNConvUpsampleHead NORM: NUM_CONV: 4 POOLER_RESOLUTION: 14 POOLER_SAMPLING_RATIO: 0 POOLER_TYPE: ROIAlignV2 RPN: BATCH_SIZE_PER_IMAGE: 256 BBOX_REG_WEIGHTS: (1.0, 1.0, 1.0, 1.0) BOUNDARY_THRESH: -1 HEAD_NAME: StandardRPNHead IN_FEATURES: ['p2', 'p3', 'p4', 'p5', 'p6'] IOU_LABELS: [0, -1, 1] IOU_THRESHOLDS: [0.3, 0.7] LOSS_WEIGHT: 1.0 NMS_THRESH: 0.7 POSITIVE_FRACTION: 0.5 POST_NMS_TOPK_TEST: 1000 POST_NMS_TOPK_TRAIN: 2000 PRE_NMS_TOPK_TEST: 1000 PRE_NMS_TOPK_TRAIN: 2000 SMOOTH_L1_BETA: 0.0 SEM_SEG_HEAD: COMMON_STRIDE: 4 CONVS_DIM: 128 IGNORE_VALUE: 255 IN_FEATURES: ['p2', 'p3', 'p4', 'p5'] LOSS_WEIGHT: 1.0 NAME: SemSegFPNHead NORM: GN NUM_CLASSES: 54 WEIGHTS: /ssd_scratch/cvit/username/idd_data_coco/model_final_480dd8.pkl OUTPUT_DIR: /ssd_scratch/cvit/username/idd_data_coco/models/ SEED: -1 SOLVER: BASE_LR: 0.02 BIAS_LR_FACTOR: 1.0 
CHECKPOINT_PERIOD: 500000 GAMMA: 0.3 IMS_PER_BATCH: 14 LR_SCHEDULER_NAME: WarmupMultiStepLR MAX_ITER: 120000 MOMENTUM: 0.9 STEPS: (80000, 100000) WARMUP_FACTOR: 0.001 WARMUP_ITERS: 1000 WARMUP_METHOD: linear WEIGHT_DECAY: 0.0001 WEIGHT_DECAY_BIAS: 0.0001 WEIGHT_DECAY_NORM: 0.0 TEST: AUG: ENABLED: False FLIP: True MAX_SIZE: 4000 MIN_SIZES: (400, 500, 600, 700, 800, 900, 1000, 1100, 1200) DETECTIONS_PER_IMAGE: 100 EVAL_PERIOD: 4510 EXPECTED_RESULTS: [] KEYPOINT_OKS_SIGMAS: [] PRECISE_BN: ENABLED: False NUM_ITER: 200 VERSION: 2 VIS_PERIOD: 0 [03/10 10:23:25] detectron2 INFO: Full config saved to /ssd_scratch/cvit/username/idd_data_coco/models/config.yaml [03/10 10:23:25] d2.utils.env INFO: Using a generated random seed 25458272 [03/10 10:23:26] detectron2 INFO: Model: GeneralizedRCNN( (backbone): FPN( (fpn_lateral2): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1)) (fpn_output2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (fpn_lateral3): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1)) (fpn_output3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (fpn_lateral4): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1)) (fpn_output4): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (fpn_lateral5): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1)) (fpn_output5): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (top_block): LastLevelMaxPool() (bottom_up): ResNet( (stem): BasicStem( (conv1): Conv2d( 3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) ) (res2): Sequential( (0): BottleneckBlock( (shortcut): Conv2d( 64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv1): Conv2d( 64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv2): Conv2d( 64, 64, kernel_size=(3, 3), stride=(1, 1), 
padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv3): Conv2d( 64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) ) (1): BottleneckBlock( (conv1): Conv2d( 256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv2): Conv2d( 64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv3): Conv2d( 64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) ) (2): BottleneckBlock( (conv1): Conv2d( 256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv2): Conv2d( 64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv3): Conv2d( 64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) ) ) (res3): Sequential( (0): BottleneckBlock( (shortcut): Conv2d( 256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv1): Conv2d( 256, 128, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv2): Conv2d( 128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv3): Conv2d( 128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) ) (1): BottleneckBlock( (conv1): Conv2d( 512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv2): Conv2d( 128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv3): Conv2d( 128, 512, kernel_size=(1, 1), 
stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) ) (2): BottleneckBlock( (conv1): Conv2d( 512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv2): Conv2d( 128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv3): Conv2d( 128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) ) (3): BottleneckBlock( (conv1): Conv2d( 512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv2): Conv2d( 128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv3): Conv2d( 128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) ) ) (res4): Sequential( (0): BottleneckBlock( (shortcut): Conv2d( 512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) (conv1): Conv2d( 512, 256, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (1): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (2): BottleneckBlock( (conv1): 
Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (3): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (4): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (5): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) ) (res5): Sequential( (0): BottleneckBlock( (shortcut): Conv2d( 1024, 2048, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05) ) (conv1): Conv2d( 1024, 512, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=512, 
eps=1e-05) ) (conv2): Conv2d( 512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv3): Conv2d( 512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05) ) ) (1): BottleneckBlock( (conv1): Conv2d( 2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv2): Conv2d( 512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv3): Conv2d( 512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05) ) ) (2): BottleneckBlock( (conv1): Conv2d( 2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv2): Conv2d( 512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv3): Conv2d( 512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05) ) ) ) ) ) (proposal_generator): RPN( (anchor_generator): DefaultAnchorGenerator( (cell_anchors): BufferList() ) (rpn_head): StandardRPNHead( (conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (objectness_logits): Conv2d(256, 3, kernel_size=(1, 1), stride=(1, 1)) (anchor_deltas): Conv2d(256, 12, kernel_size=(1, 1), stride=(1, 1)) ) ) (roi_heads): CascadeROIHeads( (box_pooler): ROIPooler( (level_poolers): ModuleList( (0): ROIAlign(output_size=(7, 7), spatial_scale=0.25, sampling_ratio=0, aligned=True) (1): ROIAlign(output_size=(7, 7), spatial_scale=0.125, sampling_ratio=0, aligned=True) (2): ROIAlign(output_size=(7, 7), spatial_scale=0.0625, sampling_ratio=0, aligned=True) (3): ROIAlign(output_size=(7, 7), spatial_scale=0.03125, sampling_ratio=0, aligned=True) ) ) (box_head): ModuleList( (0): 
FastRCNNConvFCHead( (fc1): Linear(in_features=12544, out_features=1024, bias=True) (fc2): Linear(in_features=1024, out_features=1024, bias=True) ) (1): FastRCNNConvFCHead( (fc1): Linear(in_features=12544, out_features=1024, bias=True) (fc2): Linear(in_features=1024, out_features=1024, bias=True) ) (2): FastRCNNConvFCHead( (fc1): Linear(in_features=12544, out_features=1024, bias=True) (fc2): Linear(in_features=1024, out_features=1024, bias=True) ) ) (box_predictor): ModuleList( (0): FastRCNNOutputLayers( (cls_score): Linear(in_features=1024, out_features=16, bias=True) (bbox_pred): Linear(in_features=1024, out_features=4, bias=True) ) (1): FastRCNNOutputLayers( (cls_score): Linear(in_features=1024, out_features=16, bias=True) (bbox_pred): Linear(in_features=1024, out_features=4, bias=True) ) (2): FastRCNNOutputLayers( (cls_score): Linear(in_features=1024, out_features=16, bias=True) (bbox_pred): Linear(in_features=1024, out_features=4, bias=True) ) ) ) ) [03/10 10:23:26] fvcore.common.checkpoint INFO: Loading checkpoint from /ssd_scratch/cvit/username/idd_data_coco/model_final_480dd8.pkl [03/10 10:23:27] fvcore.common.checkpoint INFO: Reading a file from 'Detectron2 Model Zoo' [03/10 10:23:27] fvcore.common.checkpoint WARNING: 'roi_heads.box_predictor.0.cls_score.weight' has shape (81, 1024) in the checkpoint but (16, 1024) in the model! Skipped. [03/10 10:23:27] fvcore.common.checkpoint WARNING: 'roi_heads.box_predictor.0.cls_score.bias' has shape (81,) in the checkpoint but (16,) in the model! Skipped. [03/10 10:23:27] fvcore.common.checkpoint WARNING: 'roi_heads.box_predictor.1.cls_score.weight' has shape (81, 1024) in the checkpoint but (16, 1024) in the model! Skipped. [03/10 10:23:27] fvcore.common.checkpoint WARNING: 'roi_heads.box_predictor.1.cls_score.bias' has shape (81,) in the checkpoint but (16,) in the model! Skipped. 
[03/10 10:23:27] fvcore.common.checkpoint WARNING: 'roi_heads.box_predictor.2.cls_score.weight' has shape (81, 1024) in the checkpoint but (16, 1024) in the model! Skipped. [03/10 10:23:27] fvcore.common.checkpoint WARNING: 'roi_heads.box_predictor.2.cls_score.bias' has shape (81,) in the checkpoint but (16,) in the model! Skipped. [03/10 10:23:27] fvcore.common.checkpoint INFO: Some model parameters are not in the checkpoint: roi_heads.box_predictor.0.cls_score.{weight, bias} roi_heads.box_predictor.1.cls_score.{weight, bias} roi_heads.box_predictor.2.cls_score.{weight, bias} [03/10 10:23:27] fvcore.common.checkpoint INFO: The checkpoint contains parameters not used by the model: roi_heads.mask_head.mask_fcn1.{weight, bias} roi_heads.mask_head.mask_fcn2.{weight, bias} roi_heads.mask_head.mask_fcn3.{weight, bias} roi_heads.mask_head.mask_fcn4.{weight, bias} roi_heads.mask_head.deconv.{weight, bias} roi_heads.mask_head.predictor.{weight, bias} [03/10 10:23:30] d2.data.datasets.coco INFO: Loading /ssd_scratch/cvit/username/idd_data_coco/idd_train_annotation.json takes 2.07 seconds. [03/10 10:23:30] d2.data.datasets.coco WARNING: Category ids in annotations are not in [1, #categories]! We'll apply a mapping for you.

[03/10 10:23:30] d2.data.datasets.coco INFO: Loaded 31569 images in COCO format from /ssd_scratch/cvit/username/idd_data_coco/idd_train_annotation.json
[03/10 10:23:31] d2.data.build INFO: Removed 0 images with no usable annotations. 31569 images left.
[03/10 10:23:32] d2.data.build INFO: Distribution of instances among all 15 categories:

| category | #instances | category | #instances | category | #instances |
|:---|:---|:---|:---|:---|:---|
| car | 65676 | bus | 13829 | autorickshaw | 24498 |
| vehicle fal.. | 14992 | truck | 20759 | motorcycle | 78119 |
| rider | 73108 | person | 70319 | bicycle | 2573 |
| animal | 4764 | traffic sign | 9916 | train | 47 |
| trailer | 11 | traffic light | 2780 | caravan | 125 |
| total | 381516 | | | | |

[03/10 10:23:32] d2.data.common INFO: Serializing 31569 elements to byte tensors and concatenating them all ...
[03/10 10:23:33] d2.data.common INFO: Serialized dataset takes 18.66 MiB
[03/10 10:23:33] d2.data.detection_utils INFO: TransformGens used in training: [ResizeShortestEdge(short_edge_length=(640, 672, 704, 736, 768, 800), max_size=1333, sample_style='choice'), RandomFlip()]
[03/10 10:23:33] d2.data.build INFO: Using training sampler TrainingSampler
[03/10 10:23:33] detectron2 INFO: Starting training from iteration 0
[03/10 11:34:35] d2.data.datasets.coco WARNING: Category ids in annotations are not in [1, #categories]! We'll apply a mapping for you.

[03/10 11:34:35] d2.data.datasets.coco INFO: Loaded 10225 images in COCO format from /ssd_scratch/cvit/username/idd_data_coco/idd_val_annotation.json
[03/10 11:34:36] d2.data.build INFO: Distribution of instances among all 15 categories:

| category | #instances | category | #instances | category | #instances |
|:---|:---|:---|:---|:---|:---|
| truck | 7078 | person | 18078 | motorcycle | 25489 |
| bus | 4916 | autorickshaw | 7782 | rider | 24518 |
| car | 24844 | vehicle fal.. | 6089 | traffic sign | 4287 |
| bicycle | 569 | animal | 1460 | traffic light | 919 |
| trailer | 7 | caravan | 11 | train | 13 |
| total | 126060 | | | | |

[03/10 11:34:36] d2.data.common INFO: Serializing 10225 elements to byte tensors and concatenating them all ... [03/10 11:34:36] d2.data.common INFO: Serialized dataset takes 6.09 MiB [03/10 11:34:37] d2.evaluation.evaluator INFO: Start inference on 5113 images [03/10 11:34:41] d2.evaluation.evaluator INFO: Inference done 11/5113. 0.0850 s / img. ETA=0:07:22 [03/10 11:34:46] d2.evaluation.evaluator INFO: Inference done 70/5113. 0.0839 s / img. ETA=0:07:13 [03/10 11:34:51] d2.evaluation.evaluator INFO: Inference done 129/5113. 0.0837 s / img. ETA=0:07:07 [03/10 11:34:56] d2.evaluation.evaluator INFO: Inference done 188/5113. 0.0837 s / img. ETA=0:07:02 [03/10 11:35:01] d2.evaluation.evaluator INFO: Inference done 247/5113. 0.0837 s / img. ETA=0:06:57 [03/10 11:35:06] d2.evaluation.evaluator INFO: Inference done 306/5113. 0.0837 s / img. ETA=0:06:52 [03/10 11:35:11] d2.evaluation.evaluator INFO: Inference done 365/5113. 0.0837 s / img. ETA=0:06:47 [03/10 11:35:16] d2.evaluation.evaluator INFO: Inference done 424/5113. 0.0837 s / img. ETA=0:06:42 [03/10 11:35:21] d2.evaluation.evaluator INFO: Inference done 483/5113. 0.0838 s / img. ETA=0:06:37 [03/10 11:35:26] d2.evaluation.evaluator INFO: Inference done 542/5113. 0.0837 s / img. ETA=0:06:32 [03/10 11:35:32] d2.evaluation.evaluator INFO: Inference done 601/5113. 0.0838 s / img. ETA=0:06:27 [03/10 11:35:37] d2.evaluation.evaluator INFO: Inference done 660/5113. 0.0838 s / img. ETA=0:06:22 [03/10 11:35:42] d2.evaluation.evaluator INFO: Inference done 718/5113. 0.0838 s / img. ETA=0:06:17 [03/10 11:35:47] d2.evaluation.evaluator INFO: Inference done 777/5113. 0.0838 s / img. ETA=0:06:12 [03/10 11:35:52] d2.evaluation.evaluator INFO: Inference done 836/5113. 0.0838 s / img. ETA=0:06:07 [03/10 11:35:57] d2.evaluation.evaluator INFO: Inference done 895/5113. 0.0838 s / img. ETA=0:06:02 [03/10 11:36:02] d2.evaluation.evaluator INFO: Inference done 954/5113. 0.0838 s / img. 
ETA=0:05:57 [03/10 11:36:07] d2.evaluation.evaluator INFO: Inference done 1012/5113. 0.0839 s / img. ETA=0:05:52 [03/10 11:36:12] d2.evaluation.evaluator INFO: Inference done 1070/5113. 0.0839 s / img. ETA=0:05:47 [03/10 11:36:17] d2.evaluation.evaluator INFO: Inference done 1127/5113. 0.0840 s / img. ETA=0:05:43 [03/10 11:36:22] d2.evaluation.evaluator INFO: Inference done 1183/5113. 0.0842 s / img. ETA=0:05:39 [03/10 11:36:27] d2.evaluation.evaluator INFO: Inference done 1239/5113. 0.0843 s / img. ETA=0:05:34 [03/10 11:36:32] d2.evaluation.evaluator INFO: Inference done 1296/5113. 0.0844 s / img. ETA=0:05:30 [03/10 11:36:37] d2.evaluation.evaluator INFO: Inference done 1353/5113. 0.0845 s / img. ETA=0:05:25 [03/10 11:36:42] d2.evaluation.evaluator INFO: Inference done 1410/5113. 0.0846 s / img. ETA=0:05:21 [03/10 11:36:47] d2.evaluation.evaluator INFO: Inference done 1467/5113. 0.0846 s / img. ETA=0:05:16 [03/10 11:36:52] d2.evaluation.evaluator INFO: Inference done 1525/5113. 0.0846 s / img. ETA=0:05:11 [03/10 11:36:57] d2.evaluation.evaluator INFO: Inference done 1582/5113. 0.0847 s / img. ETA=0:05:06 [03/10 11:37:02] d2.evaluation.evaluator INFO: Inference done 1639/5113. 0.0848 s / img. ETA=0:05:01 [03/10 11:37:07] d2.evaluation.evaluator INFO: Inference done 1696/5113. 0.0848 s / img. ETA=0:04:57 [03/10 11:37:12] d2.evaluation.evaluator INFO: Inference done 1753/5113. 0.0848 s / img. ETA=0:04:52 [03/10 11:37:18] d2.evaluation.evaluator INFO: Inference done 1810/5113. 0.0849 s / img. ETA=0:04:47 [03/10 11:37:23] d2.evaluation.evaluator INFO: Inference done 1866/5113. 0.0850 s / img. ETA=0:04:42 [03/10 11:37:28] d2.evaluation.evaluator INFO: Inference done 1922/5113. 0.0850 s / img. ETA=0:04:38 [03/10 11:37:33] d2.evaluation.evaluator INFO: Inference done 1978/5113. 0.0851 s / img. ETA=0:04:33 [03/10 11:37:38] d2.evaluation.evaluator INFO: Inference done 2035/5113. 0.0852 s / img. 
ETA=0:04:28 [03/10 11:37:43] d2.evaluation.evaluator INFO: Inference done 2092/5113. 0.0852 s / img. ETA=0:04:24 [03/10 11:37:48] d2.evaluation.evaluator INFO: Inference done 2148/5113. 0.0853 s / img. ETA=0:04:19 [03/10 11:37:53] d2.evaluation.evaluator INFO: Inference done 2204/5113. 0.0854 s / img. ETA=0:04:14 [03/10 11:37:58] d2.evaluation.evaluator INFO: Inference done 2260/5113. 0.0854 s / img. ETA=0:04:09 [03/10 11:38:03] d2.evaluation.evaluator INFO: Inference done 2316/5113. 0.0855 s / img. ETA=0:04:05 [03/10 11:38:08] d2.evaluation.evaluator INFO: Inference done 2372/5113. 0.0855 s / img. ETA=0:04:00 [03/10 11:38:13] d2.evaluation.evaluator INFO: Inference done 2428/5113. 0.0856 s / img. ETA=0:03:55 [03/10 11:38:18] d2.evaluation.evaluator INFO: Inference done 2484/5113. 0.0857 s / img. ETA=0:03:50 [03/10 11:38:23] d2.evaluation.evaluator INFO: Inference done 2540/5113. 0.0857 s / img. ETA=0:03:46 [03/10 11:38:28] d2.evaluation.evaluator INFO: Inference done 2596/5113. 0.0857 s / img. ETA=0:03:41 [03/10 11:38:33] d2.evaluation.evaluator INFO: Inference done 2652/5113. 0.0858 s / img. ETA=0:03:36 [03/10 11:38:38] d2.evaluation.evaluator INFO: Inference done 2708/5113. 0.0858 s / img. ETA=0:03:31 [03/10 11:38:43] d2.evaluation.evaluator INFO: Inference done 2764/5113. 0.0858 s / img. ETA=0:03:26 [03/10 11:38:48] d2.evaluation.evaluator INFO: Inference done 2820/5113. 0.0859 s / img. ETA=0:03:21 [03/10 11:38:53] d2.evaluation.evaluator INFO: Inference done 2876/5113. 0.0859 s / img. ETA=0:03:17 [03/10 11:38:58] d2.evaluation.evaluator INFO: Inference done 2932/5113. 0.0859 s / img. ETA=0:03:12 [03/10 11:39:03] d2.evaluation.evaluator INFO: Inference done 2988/5113. 0.0860 s / img. ETA=0:03:07 [03/10 11:39:08] d2.evaluation.evaluator INFO: Inference done 3044/5113. 0.0860 s / img. ETA=0:03:02 [03/10 11:39:13] d2.evaluation.evaluator INFO: Inference done 3100/5113. 0.0860 s / img. 
ETA=0:02:57 [03/10 11:39:18] d2.evaluation.evaluator INFO: Inference done 3156/5113. 0.0861 s / img. ETA=0:02:52 [03/10 11:39:23] d2.evaluation.evaluator INFO: Inference done 3212/5113. 0.0861 s / img. ETA=0:02:47 [03/10 11:39:29] d2.evaluation.evaluator INFO: Inference done 3268/5113. 0.0861 s / img. ETA=0:02:42 [03/10 11:39:34] d2.evaluation.evaluator INFO: Inference done 3324/5113. 0.0862 s / img. ETA=0:02:38 [03/10 11:39:39] d2.evaluation.evaluator INFO: Inference done 3380/5113. 0.0862 s / img. ETA=0:02:33 [03/10 11:39:44] d2.evaluation.evaluator INFO: Inference done 3436/5113. 0.0862 s / img. ETA=0:02:28 [03/10 11:39:49] d2.evaluation.evaluator INFO: Inference done 3492/5113. 0.0862 s / img. ETA=0:02:23 [03/10 11:39:54] d2.evaluation.evaluator INFO: Inference done 3548/5113. 0.0863 s / img. ETA=0:02:18 [03/10 11:39:59] d2.evaluation.evaluator INFO: Inference done 3605/5113. 0.0863 s / img. ETA=0:02:13 [03/10 11:40:04] d2.evaluation.evaluator INFO: Inference done 3660/5113. 0.0863 s / img. ETA=0:02:08 [03/10 11:40:09] d2.evaluation.evaluator INFO: Inference done 3715/5113. 0.0864 s / img. ETA=0:02:03 [03/10 11:40:14] d2.evaluation.evaluator INFO: Inference done 3771/5113. 0.0864 s / img. ETA=0:01:58 [03/10 11:40:19] d2.evaluation.evaluator INFO: Inference done 3827/5113. 0.0864 s / img. ETA=0:01:53 [03/10 11:40:24] d2.evaluation.evaluator INFO: Inference done 3881/5113. 0.0865 s / img. ETA=0:01:49 [03/10 11:40:29] d2.evaluation.evaluator INFO: Inference done 3936/5113. 0.0865 s / img. ETA=0:01:44 [03/10 11:40:34] d2.evaluation.evaluator INFO: Inference done 3992/5113. 0.0865 s / img. ETA=0:01:39 [03/10 11:40:39] d2.evaluation.evaluator INFO: Inference done 4048/5113. 0.0865 s / img. ETA=0:01:34 [03/10 11:40:44] d2.evaluation.evaluator INFO: Inference done 4102/5113. 0.0866 s / img. ETA=0:01:29 [03/10 11:40:49] d2.evaluation.evaluator INFO: Inference done 4157/5113. 0.0866 s / img. 
ETA=0:01:24 [03/10 11:40:54] d2.evaluation.evaluator INFO: Inference done 4213/5113. 0.0866 s / img. ETA=0:01:19 [03/10 11:40:59] d2.evaluation.evaluator INFO: Inference done 4269/5113. 0.0867 s / img. ETA=0:01:14 [03/10 11:41:04] d2.evaluation.evaluator INFO: Inference done 4325/5113. 0.0867 s / img. ETA=0:01:10 [03/10 11:41:09] d2.evaluation.evaluator INFO: Inference done 4381/5113. 0.0867 s / img. ETA=0:01:05 [03/10 11:41:14] d2.evaluation.evaluator INFO: Inference done 4437/5113. 0.0867 s / img. ETA=0:01:00 [03/10 11:41:19] d2.evaluation.evaluator INFO: Inference done 4493/5113. 0.0867 s / img. ETA=0:00:55 [03/10 11:41:25] d2.evaluation.evaluator INFO: Inference done 4549/5113. 0.0867 s / img. ETA=0:00:50 [03/10 11:41:30] d2.evaluation.evaluator INFO: Inference done 4605/5113. 0.0868 s / img. ETA=0:00:45 [03/10 11:41:35] d2.evaluation.evaluator INFO: Inference done 4660/5113. 0.0868 s / img. ETA=0:00:40 [03/10 11:41:40] d2.evaluation.evaluator INFO: Inference done 4716/5113. 0.0868 s / img. ETA=0:00:35 [03/10 11:41:45] d2.evaluation.evaluator INFO: Inference done 4772/5113. 0.0868 s / img. ETA=0:00:30 [03/10 11:41:50] d2.evaluation.evaluator INFO: Inference done 4828/5113. 0.0868 s / img. ETA=0:00:25 [03/10 11:41:55] d2.evaluation.evaluator INFO: Inference done 4883/5113. 0.0869 s / img. ETA=0:00:20 [03/10 11:42:00] d2.evaluation.evaluator INFO: Inference done 4939/5113. 0.0869 s / img. ETA=0:00:15 [03/10 11:42:05] d2.evaluation.evaluator INFO: Inference done 4995/5113. 0.0869 s / img. ETA=0:00:10 [03/10 11:42:10] d2.evaluation.evaluator INFO: Inference done 5051/5113. 0.0869 s / img. ETA=0:00:05 [03/10 11:42:15] d2.evaluation.evaluator INFO: Inference done 5107/5113. 0.0869 s / img. ETA=0:00:00


4. please also simplify the steps as much as possible so they do not require additional resources to
     run, such as a private dataset.

Running plain_train_net.py generates two log files, [log.txt](https://github.com/facebookresearch/detectron2/files/4311199/log.txt) and log.txt.rank1, which are identical. However, neither of them contains COCO-format evaluation results such as per-category AP scores and IoUs. To reproduce, use any COCO-format dataset containing more than one category for evaluation.
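
As a sanity check on the numbers in this report, here is a minimal sketch of when periodic evaluation would fire and how many validation images each process would handle. It assumes a trigger of the form `iteration % EVAL_PERIOD == 0` that skips the final iteration, and an even split of the validation set across processes. Both assumptions, and the helper names `eval_iterations` and `images_per_rank`, are for illustration only and are not copied from plain_train_net.py.

```python
import math


def eval_iterations(max_iter, eval_period):
    """Iterations at which periodic evaluation would run (assumed trigger)."""
    if eval_period <= 0:
        return []
    return [
        i for i in range(1, max_iter + 1)
        if i % eval_period == 0 and i != max_iter
    ]


def images_per_rank(num_images, world_size):
    """Upper bound on images per process if the set is split evenly (assumption)."""
    return math.ceil(num_images / world_size)


# MAX_ITER from the pasted snippet (1900) vs. the logged full config (120000).
print(eval_iterations(1900, 4510))         # empty: the period exceeds MAX_ITER
print(eval_iterations(120000, 4510)[:3])   # first three evaluation points
print(images_per_rank(10225, 2))           # validation images per process
```

Under these assumptions, the logged config gives a first evaluation at iteration 4510, and the per-process count of 5113 matches the "Start inference on 5113 images" line in the log above.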

## Expected behavior:
Get COCO-format based evaluation results such as AP scores for each category and IoUs during multi-gpu training.

## Environment:

Run `python -m detectron2.utils.collect_env` in the environment where you observed the issue, and paste the output.

ppwwyyxx commented 4 years ago

Could you share full logs? The log you provide does not seem to be complete.

ghost commented 4 years ago

Those are actually the full logs. The single-GPU run printed the training loss and mAP scores in its log file after the point where the multi-GPU log.txt (above) ends.

andriilitvynchuk commented 4 years ago

I found the following solution to this problem: in the file detectron2/utils/events.py, find line 307 and replace the function latest_with_smoothing_hint with this one:

    def latest_with_smoothing_hint(self, window_size=20):
        """
        Similar to :meth:`latest`, but the returned values
        are either the un-smoothed original latest value,
        or a median of the given window_size,
        depend on whether the smoothing_hint is True.
        This provides a default behavior that other writers can use.
        """
        result = {}
        # for k, v in self._latest_scalars.items():
        #     result[k] = self._history[k].median(window_size) if self._smoothing_hints[k] else v
        for k, v in self._history.items():
            result[k] = self._history[k].median(window_size) if self._smoothing_hints[k] else v.latest()
        return result

The problem is that evaluation metrics cannot be accessed via self._latest_scalars.items()
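
A toy model of the behaviour described above may help. The sketch below is not detectron2's EventStorage; it is a hypothetical `ToyStorage` that assumes the dictionary of latest scalars is emptied on every `step()` while the full history is kept. Under that assumption, a metric stored during an earlier iteration is still visible through the history but no longer through the latest-scalars dictionary:

```python
from collections import defaultdict


class ToyStorage:
    """Hypothetical stand-in for an event storage (not the detectron2 class)."""

    def __init__(self):
        self._history = defaultdict(list)  # every value ever stored, per key
        self._latest_scalars = {}          # values stored in the current iteration only

    def put_scalar(self, name, value):
        self._history[name].append(value)
        self._latest_scalars[name] = value

    def step(self):
        # Assumption: moving to the next iteration discards the "latest" view.
        self._latest_scalars = {}

    def latest(self):
        return dict(self._latest_scalars)

    def latest_from_history(self):
        # Same idea as the patch above: read the last value of every key.
        return {k: v[-1] for k, v in self._history.items() if v}


storage = ToyStorage()
storage.put_scalar("bbox/AP", 41.5)    # evaluation result stored at iteration N
storage.step()                         # training moves on to iteration N + 1
storage.put_scalar("total_loss", 0.73)

print(storage.latest())                # contains only total_loss
print(storage.latest_from_history())   # contains total_loss and bbox/AP
```

Whether the real storage behaves this way depends on the detectron2 version in use, so the class above should be read as an illustration of the reported symptom only.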

ppwwyyxx commented 4 years ago

[03/10 10:23:33] detectron2 INFO: Starting training from iteration 0
[03/10 11:34:35] d2.data.datasets.coco WARNING:

Are you saying that there is one hour of nothing printed on the screen? I don't think that's what the script would do unless you made other modifications.

ghost commented 4 years ago

Sorry, I didn't know you were looking for that. At plain_train_net.py#L179 I print the losses every 4510 iterations (1 epoch) instead of every 20 iterations, which is why no values were printed in between. Also, this is the log-singlegpu.txt file, where I get the training loss at iteration 4510 and the evaluation results after that.
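
The effect of the print period can be checked without running any training. The sketch below reuses the kind of condition found in plain training loops (`iteration - start_iter > 5 and (iteration % period == 0 or iteration == max_iter)`); `write_iterations` is a hypothetical helper, and the iteration counts are illustrative only.

```python
def write_iterations(start_iter, max_iter, period):
    """List the iterations at which the writers would be called."""
    return [
        it
        for it in range(start_iter + 1, max_iter + 1)
        if it - start_iter > 5 and (it % period == 0 or it == max_iter)
    ]


# With a period of 20, metrics are written 50 times in the first 1000 iterations.
print(len(write_iterations(0, 1000, 20)))

# With a period of 4510, nothing is written before iteration 4510.
print(write_iterations(0, 10000, 4510))
```

A long period therefore explains a long silent stretch in the log by itself, independently of whether evaluation runs.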

ppwwyyxx commented 4 years ago

I cannot effectively investigate the issue since you seem to have written a lot of your own code and use your own dataset, neither of which I have access to.

You can verify that

python tools/plain_train_net.py --num-gpus 2 --config-file configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_1x.yaml SOLVER.IMS_PER_BATCH 2 DATALOADER.NUM_WORKERS 2 SOLVER.MAX_ITER 1000 TEST.EVAL_PERIOD 100

with no modifications to code, does run the evaluation properly.

ghost commented 4 years ago

@ppwwyyxx, I cloned the repo at this tree. After that, I ran balloon_train_net_experiment.py with the balloon dataset. Although I get the evaluation results, the code breaks because the do_test(cfg, model) method returns an empty results dictionary. Please look for the ####### Changes ###### tags, which mark the parts of the code where I have made changes.
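
One way to avoid the crash is to make the best-model bookkeeping tolerate an empty result. The sketch below assumes, as reported in this comment, that `do_test` can hand back an empty dictionary on some ranks. `is_better` is a hypothetical helper, not part of detectron2, the metric keys follow the `val_dict['bbox']['AP']` layout used in the script, and the numbers are made up for illustration.

```python
def is_better(val_dict, best, task="bbox", keys=("AP", "AP50", "AP75")):
    """Return True if every metric in `keys` improves on `best`.

    An empty or incomplete `val_dict` (for example on a non-main rank)
    is treated as "no improvement" instead of raising KeyError.
    """
    metrics = val_dict.get(task) if val_dict else None
    if not metrics:
        return False
    return all(k in metrics and metrics[k] > best.get(k, 0.0) for k in keys)


best = {"AP": 0.0, "AP50": 0.0, "AP75": 0.0}

rank1_result = {}  # what a non-main rank is assumed to receive
rank0_result = {"bbox": {"AP": 35.2, "AP50": 60.1, "AP75": 37.8}}  # illustrative values

for result in (rank1_result, rank0_result):
    if is_better(result, best):
        best.update({k: result["bbox"][k] for k in best})

print(best)
```

In the training loop, a check like this would replace the direct `val_dict['bbox']['AP']` lookups; saving the checkpoint only when `comm.is_main_process()` is true is a related option.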

Instructions To Reproduce the Issue:

  1. what changes you made (git diff) or what code you wrote
    
    Run balloon_train_net_experiment.py (below) on the above repo using balloon dataset.

# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved
"""
Detectron2 training script with a plain training loop.

This script reads a given config file and runs the training or evaluation.
It is an entry point that is able to train standard models in detectron2.

In order to let one script support training of many models,
this script contains logic that is specific to these built-in models and therefore
may not be suitable for your own project.
For example, your research project perhaps only needs a single "evaluator".

Therefore, we recommend you to use detectron2 as a library and take
this file as an example of how to use the library.
You may want to write your own script with your datasets and other customizations.

Compared to "train_net.py", this script supports fewer default features.
It also includes fewer abstractions, therefore is easier to add custom logic.
"""

# You may need to restart your runtime prior to this, to let your installation take effect
# Some basic setup
# Setup detectron2 logger
import detectron2

# import some common libraries
import numpy as np
import cv2
import random
import json
from detectron2.structures import BoxMode

# from google.colab.patches import cv2_imshow
import logging
import os
from collections import OrderedDict
import torch
from torch.nn.parallel import DistributedDataParallel

from detectron2 import model_zoo
from detectron2.data.datasets import register_coco_instances
from detectron2.data import MetadataCatalog, DatasetCatalog
import detectron2.utils.comm as comm
from detectron2.checkpoint import DetectionCheckpointer, PeriodicCheckpointer
from detectron2.config import get_cfg
from detectron2.data import (
    MetadataCatalog,
    build_detection_test_loader,
    build_detection_train_loader,
)
from detectron2.engine import default_argument_parser, default_setup, launch
from detectron2.evaluation import (
    CityscapesEvaluator,
    COCOEvaluator,
    COCOPanopticEvaluator,
    DatasetEvaluators,
    LVISEvaluator,
    PascalVOCDetectionEvaluator,
    SemSegEvaluator,
    inference_on_dataset,
    print_csv_format,
)
from detectron2.modeling import build_model
from detectron2.solver import build_lr_scheduler, build_optimizer
from detectron2.utils.events import (
    CommonMetricPrinter,
    EventStorage,
    JSONWriter,
    TensorboardXWriter,
)

logger = logging.getLogger("detectron2")

def get_evaluator(cfg, dataset_name, output_folder=None):
    """
    Create evaluator(s) for a given dataset.
    This uses the special metadata "evaluator_type" associated with each builtin dataset.
    For your own dataset, you can simply create an evaluator manually in your
    script and do not have to worry about the hacky if-else logic here.
    """
    if output_folder is None:
        output_folder = os.path.join(cfg.OUTPUT_DIR, "inference")
    evaluator_list = []
    evaluator_type = MetadataCatalog.get(dataset_name).evaluator_type
    if evaluator_type in ["sem_seg", "coco_panoptic_seg"]:
        evaluator_list.append(
            SemSegEvaluator(
                dataset_name,
                distributed=True,
                num_classes=cfg.MODEL.SEM_SEG_HEAD.NUM_CLASSES,
                ignore_label=cfg.MODEL.SEM_SEG_HEAD.IGNORE_VALUE,
                output_dir=output_folder,
            )
        )
    if evaluator_type in ["coco", "coco_panoptic_seg"]:
        evaluator_list.append(COCOEvaluator(dataset_name, cfg, True, output_folder))
    if evaluator_type == "coco_panoptic_seg":
        evaluator_list.append(COCOPanopticEvaluator(dataset_name, output_folder))
    if evaluator_type == "cityscapes":
        assert (
            torch.cuda.device_count() >= comm.get_rank()
        ), "CityscapesEvaluator currently do not work with multiple machines."
        return CityscapesEvaluator(dataset_name)
    if evaluator_type == "pascal_voc":
        return PascalVOCDetectionEvaluator(dataset_name)
    if evaluator_type == "lvis":
        return LVISEvaluator(dataset_name, cfg, True, output_folder)
    if len(evaluator_list) == 0:
        raise NotImplementedError(
            "no Evaluator for the dataset {} with the type {}".format(dataset_name, evaluator_type)
        )
    if len(evaluator_list) == 1:
        return evaluator_list[0]
    return DatasetEvaluators(evaluator_list)

def do_test(cfg, model):
    results = OrderedDict()
    for dataset_name in cfg.DATASETS.TEST:
        data_loader = build_detection_test_loader(cfg, dataset_name)
        evaluator = get_evaluator(
            cfg, dataset_name, os.path.join(cfg.OUTPUT_DIR, "inference", dataset_name)
        )
        results_i = inference_on_dataset(model, data_loader, evaluator)
        results[dataset_name] = results_i

        ####### Changes (Print statements to debug) ########
        print("Before ", comm.get_rank())

        if comm.is_main_process():
            print("First ", comm.get_rank())
            logger.info("Evaluation results for {} in csv format:".format(dataset_name))
            print_csv_format(results_i)

    print("Second ", comm.get_rank())

    if len(results) == 1:
        results = list(results.values())[0]
    if results == {} or results_i == {}:
        print("Third ", comm.get_rank())
    return results

def do_train(cfg, model, resume=False):

    ####### Changes ########
    default_val_AP = 0
    default_val_AP50 = 0
    default_val_AP75 = 0

    best_model_dict = {}

    model.train()
    optimizer = build_optimizer(cfg, model)
    scheduler = build_lr_scheduler(cfg, optimizer)

    checkpointer = DetectionCheckpointer(
        model, cfg.OUTPUT_DIR, optimizer=optimizer, scheduler=scheduler
    )
    start_iter = (
        checkpointer.resume_or_load(cfg.MODEL.WEIGHTS, resume=resume).get("iteration", -1) + 1
    )
    max_iter = cfg.SOLVER.MAX_ITER

    periodic_checkpointer = PeriodicCheckpointer(
        checkpointer, cfg.SOLVER.CHECKPOINT_PERIOD, max_iter=max_iter
    )

    ####### Changes ########
    writers = (
        [
            JSONWriter(os.path.join(cfg.OUTPUT_DIR, "metrics.json")),
            TensorboardXWriter(cfg.OUTPUT_DIR),
        ]
        if comm.is_main_process()
        else []
    )
    ####### Changes ########
    terminal_writer = [CommonMetricPrinter(max_iter)] if comm.is_main_process() else []
    # compared to "train_net.py", we do not support accurate timing and
    # precise BN here, because they are not trivial to implement
    data_loader = build_detection_train_loader(cfg)
    logger.info("Starting training from iteration {}".format(start_iter))
    with EventStorage(start_iter) as storage:
        for data, iteration in zip(data_loader, range(start_iter, max_iter)):

            iteration = iteration + 1
            storage.step()

            loss_dict = model(data)
            losses = sum(loss for loss in loss_dict.values())
            assert torch.isfinite(losses).all(), loss_dict

            loss_dict_reduced = {k: v.item() for k, v in comm.reduce_dict(loss_dict).items()}
            losses_reduced = sum(loss for loss in loss_dict_reduced.values())
            if comm.is_main_process():
                storage.put_scalars(total_loss=losses_reduced, **loss_dict_reduced)

            optimizer.zero_grad()
            losses.backward()
            optimizer.step()
            storage.put_scalar("lr", optimizer.param_groups[0]["lr"], smoothing_hint=False)
            scheduler.step()

            if (
                cfg.TEST.EVAL_PERIOD > 0
                and iteration % cfg.TEST.EVAL_PERIOD == 0
                and iteration != max_iter
            ):
                val_dict = do_test(cfg, model)

                ####### Changes (save best model) ########

                print("Val dict value  ", val_dict)

                if (val_dict['bbox']['AP'] > default_val_AP and val_dict['bbox']['AP50'] > default_val_AP50 and val_dict['bbox']['AP75'] > default_val_AP75):

                    default_val_AP = val_dict['bbox']['AP']
                    default_val_AP50 = val_dict['bbox']['AP50']
                    default_val_AP75 = val_dict['bbox']['AP75']

                    best_model_dict = {}

                    best_model_dict["model"] = model.state_dict()
                    best_model_dict["optimizer"] = optimizer.state_dict()
                    best_model_dict["scheduler"] = scheduler.state_dict()
                    torch.save(best_model_dict, cfg.OUTPUT_DIR+'model_dict'+str(iteration)+'.pth')

                # Compared to "train_net.py", the test results are not dumped to EventStorage
                comm.synchronize()

            ####### Changes (log values to tensorboard and json every 400th iteration)  ########
            if iteration - start_iter > 5 and (iteration % 400 == 0 or iteration == max_iter):
                for writer in writers:
                    writer.write()
            ####### Changes (print on terminal every 20th iteration)  ########
            if iteration - start_iter > 5 and (iteration % 20 == 0 or iteration == max_iter):
                for wrt in terminal_writer:
                    wrt.write()
            periodic_checkpointer.step(iteration)

def get_balloon_dicts(img_dir):
    json_file = os.path.join(img_dir, "via_region_data.json")
    with open(json_file) as f:
        imgs_anns = json.load(f)

    dataset_dicts = []
    for idx, v in enumerate(imgs_anns.values()):
        record = {}

        filename = os.path.join(img_dir, v["filename"])
        height, width = cv2.imread(filename).shape[:2]

        record["file_name"] = filename
        record["image_id"] = idx
        record["height"] = height
        record["width"] = width

        annos = v["regions"]
        objs = []
        for _, anno in annos.items():
            assert not anno["region_attributes"]
            anno = anno["shape_attributes"]
            px = anno["all_points_x"]
            py = anno["all_points_y"]
            poly = [(x + 0.5, y + 0.5) for x, y in zip(px, py)]
            poly = [p for x in poly for p in x]

            obj = {
                "bbox": [np.min(px), np.min(py), np.max(px), np.max(py)],
                "bbox_mode": BoxMode.XYXY_ABS,
                "segmentation": [poly],
                "category_id": 0,
                "iscrowd": 0
            }
            objs.append(obj)
        record["annotations"] = objs
        dataset_dicts.append(record)
    return dataset_dicts

def setup(args):
    """
    Create configs and perform basic setups.
    """

    for d in ["train", "val"]:
        DatasetCatalog.register("balloon_" + d, lambda d=d: get_balloon_dicts("balloon/" + d))
        MetadataCatalog.get("balloon_" + d).set(thing_classes=["balloon"])

    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file("Misc/cascade_mask_rcnn_R_50_FPN_3x.yaml"))
    cfg.DATASETS.TRAIN = ("balloon_train",)
    cfg.MODEL.MASK_ON = False
    cfg.MODEL.ROI_HEADS.SCORE_THRESH_TEST = 0.7   # set the testing threshold for this model
    cfg.DATASETS.TEST = ("balloon_val",)
    MetadataCatalog.get("balloon_val").evaluator_type = "coco"
    #cfg.DATASETS.TEST = ()
    cfg.DATALOADER.NUM_WORKERS = 1
    cfg.OUTPUT_DIR = '/balloon/models/'
    cfg.MODEL.WEIGHTS = "/balloon/model_final_480dd8.pkl"
    cfg.SOLVER.IMS_PER_BATCH = 14
    cfg.SOLVER.BASE_LR = 0.00025  # pick a good LR
    cfg.SOLVER.MAX_ITER = 1200
    cfg.TEST.EVAL_PERIOD = 400
    cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128   # faster, and good enough for this toy dataset (default: 512)
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # only has one class (balloon)

    os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)

    cfg.merge_from_list(args.opts)
    cfg.freeze()
    default_setup(
        cfg, args
    )  # if you don't like any of the default setup, write your own setup code

    return cfg

def main(args):

    cfg = setup(args)

    model = build_model(cfg)
    logger.info("Model:\n{}".format(model))
    if args.eval_only:
        DetectionCheckpointer(model, save_dir=cfg.OUTPUT_DIR).resume_or_load(
            cfg.MODEL.WEIGHTS, resume=args.resume
        )
        return do_test(cfg, model)

    distributed = comm.get_world_size() > 1
    if distributed:
        model = DistributedDataParallel(
            model, device_ids=[comm.get_local_rank()], broadcast_buffers=False
        )

    do_train(cfg, model)
    print("Done")


if __name__ == "__main__":
    args = default_argument_parser().parse_args()
    print("Command Line Args:", args)
    launch(
        main,
        args.num_gpus,
        num_machines=args.num_machines,
        machine_rank=args.machine_rank,
        dist_url=args.dist_url,
        args=(args,),
    )

2. what exact command you run: python \path\balloon_train_net_experiment.py --num-gpus=2 --dist-url="auto"
3. what you observed (including __full logs__):

==========================================
SLURM_JOB_ID = 110821
SLURM_NODELIST = gnode03
SLURM_JOB_GPUS = 2,3

Obtaining file:///home/username/detectron2_repo_trial
Requirement already satisfied: termcolor>=1.1 in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from detectron2==0.1.1) (1.1.0)
Requirement already satisfied: Pillow in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from detectron2==0.1.1) (6.2.2)
Requirement already satisfied: yacs>=0.1.6 in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from detectron2==0.1.1) (0.1.6)
Requirement already satisfied: tabulate in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from detectron2==0.1.1) (0.8.6)
Requirement already satisfied: cloudpickle in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from detectron2==0.1.1) (1.2.2)
Requirement already satisfied: matplotlib in ./miniconda3/envs/det_trial/lib/python3.7/site-packages/matplotlib-3.2.0rc1-py3.7-linux-x86_64.egg (from detectron2==0.1.1) (3.2.0rc1)
Requirement already satisfied: tqdm>4.29.0 in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from detectron2==0.1.1) (4.41.1)
Requirement already satisfied: tensorboard in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from detectron2==0.1.1) (2.1.0)
Requirement already satisfied: fvcore in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from detectron2==0.1.1) (0.1.dev200114)
Requirement already satisfied: future in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from detectron2==0.1.1) (0.18.2)
Requirement already satisfied: pydot in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from detectron2==0.1.1) (1.4.1)
Requirement already satisfied: PyYAML in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from yacs>=0.1.6->detectron2==0.1.1) (5.1)
Requirement already satisfied: cycler>=0.10 in ./miniconda3/envs/det_trial/lib/python3.7/site-packages/cycler-0.10.0-py3.7.egg (from matplotlib->detectron2==0.1.1) (0.10.0)
Requirement already satisfied: kiwisolver>=1.0.1 in ./miniconda3/envs/det_trial/lib/python3.7/site-packages/kiwisolver-1.1.0-py3.7-linux-x86_64.egg (from matplotlib->detectron2==0.1.1) (1.1.0)
Requirement already satisfied: numpy>=1.11 in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from matplotlib->detectron2==0.1.1) (1.18.1)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in ./miniconda3/envs/det_trial/lib/python3.7/site-packages/pyparsing-2.4.6-py3.7.egg (from matplotlib->detectron2==0.1.1) (2.4.6)
Requirement already satisfied: python-dateutil>=2.1 in ./miniconda3/envs/det_trial/lib/python3.7/site-packages/python_dateutil-2.8.1-py3.7.egg (from matplotlib->detectron2==0.1.1) (2.8.1)
Requirement already satisfied: setuptools>=41.0.0 in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from tensorboard->detectron2==0.1.1) (44.0.0.post20200106)
Requirement already satisfied: markdown>=2.6.8 in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from tensorboard->detectron2==0.1.1) (3.1.1)
Requirement already satisfied: requests<3,>=2.21.0 in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from tensorboard->detectron2==0.1.1) (2.22.0)
Requirement already satisfied: wheel>=0.26; python_version >= "3" in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from tensorboard->detectron2==0.1.1) (0.33.6)
Requirement already satisfied: google-auth<2,>=1.6.3 in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from tensorboard->detectron2==0.1.1) (1.11.0)
Requirement already satisfied: absl-py>=0.4 in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from tensorboard->detectron2==0.1.1) (0.9.0)
Requirement already satisfied: grpcio>=1.24.3 in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from tensorboard->detectron2==0.1.1) (1.26.0)
Requirement already satisfied: werkzeug>=0.11.15 in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from tensorboard->detectron2==0.1.1) (0.16.0)
Requirement already satisfied: google-auth-oauthlib<0.5,>=0.4.1 in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from tensorboard->detectron2==0.1.1) (0.4.1)
Requirement already satisfied: protobuf>=3.6.0 in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from tensorboard->detectron2==0.1.1) (3.11.2)
Requirement already satisfied: six>=1.10.0 in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from tensorboard->detectron2==0.1.1) (1.14.0)
Requirement already satisfied: portalocker in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from fvcore->detectron2==0.1.1) (1.5.2)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard->detectron2==0.1.1) (3.0.4)
Requirement already satisfied: idna<2.9,>=2.5 in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard->detectron2==0.1.1) (2.8)
Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard->detectron2==0.1.1) (1.25.8)
Requirement already satisfied: certifi>=2017.4.17 in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from requests<3,>=2.21.0->tensorboard->detectron2==0.1.1) (2019.11.28)
Requirement already satisfied: pyasn1-modules>=0.2.1 in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from google-auth<2,>=1.6.3->tensorboard->detectron2==0.1.1) (0.2.8)
Requirement already satisfied: rsa<4.1,>=3.1.4 in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from google-auth<2,>=1.6.3->tensorboard->detectron2==0.1.1) (4.0)
Requirement already satisfied: cachetools<5.0,>=2.0.0 in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from google-auth<2,>=1.6.3->tensorboard->detectron2==0.1.1) (4.0.0)
Requirement already satisfied: requests-oauthlib>=0.7.0 in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from google-auth-oauthlib<0.5,>=0.4.1->tensorboard->detectron2==0.1.1) (1.3.0)
Requirement already satisfied: pyasn1<0.5.0,>=0.4.6 in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from pyasn1-modules>=0.2.1->google-auth<2,>=1.6.3->tensorboard->detectron2==0.1.1) (0.4.8)
Requirement already satisfied: oauthlib>=3.0.0 in ./miniconda3/envs/det_trial/lib/python3.7/site-packages (from requests-oauthlib>=0.7.0->google-auth-oauthlib<0.5,>=0.4.1->tensorboard->detectron2==0.1.1) (3.1.0)
Installing collected packages: detectron2
Found existing installation: detectron2 0.1.1
Uninstalling detectron2-0.1.1:
Successfully uninstalled detectron2-0.1.1
Running setup.py develop for detectron2
Successfully installed detectron2
Command Line Args: Namespace(config_file='', dist_url='auto', eval_only=False, machine_rank=0, num_gpus=2, num_machines=1, opts=[], resume=False)
[03/11 21:46:11 detectron2]: Rank of current process: 0. World size: 2
[03/11 21:46:15 detectron2]: Environment info:


sys.platform              linux
Python                    3.7.6 (default, Jan 8 2020, 19:59:22) [GCC 7.3.0]
numpy                     1.18.1
detectron2                0.1.1 @/home/username/detectron2_repo_trial/detectron2
detectron2 compiler       GCC 5.5
detectron2 CUDA compiler  10.2
detectron2 arch flags     sm_61
DETECTRON2_ENV_MODULE
PyTorch                   1.4.0+cu100 @/home/username/miniconda3/envs/det_trial/lib/python3.7/site-packages/torch
PyTorch debug build       False
CUDA available            True
GPU 0,1                   GeForce GTX 1080 Ti
CUDA_HOME                 /usr/local/cuda
NVCC                      Cuda compilation tools, release 10.2, V10.2.89
Pillow                    6.2.2
torchvision               0.5.0+cu100 @/home/username/miniconda3/envs/det_trial/lib/python3.7/site-packages/torchvision
torchvision arch flags    sm_35, sm_50, sm_60, sm_70, sm_75
cv2                       4.1.2


PyTorch built with:

[03/11 21:46:15 detectron2]: Command line arguments: Namespace(config_file='', dist_url='auto', eval_only=False, machine_rank=0, num_gpus=2, num_machines=1, opts=[], resume=False) [03/11 21:46:15 detectron2]: Running with full config: CUDNN_BENCHMARK: False DATALOADER: ASPECT_RATIO_GROUPING: True FILTER_EMPTY_ANNOTATIONS: True NUM_WORKERS: 2 REPEAT_THRESHOLD: 0.0 SAMPLER_TRAIN: TrainingSampler DATASETS: PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000 PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000 PROPOSAL_FILES_TEST: () PROPOSAL_FILES_TRAIN: () TEST: ('balloon_val',) TRAIN: ('balloon_train',) GLOBAL: HACK: 1.0 INPUT: CROP: ENABLED: False SIZE: [0.9, 0.9] TYPE: relative_range FORMAT: BGR MASK_FORMAT: polygon MAX_SIZE_TEST: 1333 MAX_SIZE_TRAIN: 1333 MIN_SIZE_TEST: 800 MIN_SIZE_TRAIN: (640, 672, 704, 736, 768, 800) MIN_SIZE_TRAIN_SAMPLING: choice MODEL: ANCHOR_GENERATOR: ANGLES: [[-90, 0, 90]] ASPECT_RATIOS: [[0.5, 1.0, 2.0]] NAME: DefaultAnchorGenerator OFFSET: 0.0 SIZES: [[32], [64], [128], [256], [512]] BACKBONE: FREEZE_AT: 2 NAME: build_resnet_fpn_backbone DEVICE: cuda FPN: FUSE_TYPE: sum IN_FEATURES: ['res2', 'res3', 'res4', 'res5'] NORM: OUT_CHANNELS: 256 KEYPOINT_ON: False LOAD_PROPOSALS: False MASK_ON: False META_ARCHITECTURE: GeneralizedRCNN PANOPTIC_FPN: COMBINE: ENABLED: True INSTANCES_CONFIDENCE_THRESH: 0.5 OVERLAP_THRESH: 0.5 STUFF_AREA_LIMIT: 4096 INSTANCE_LOSS_WEIGHT: 1.0 PIXEL_MEAN: [103.53, 116.28, 123.675] PIXEL_STD: [1.0, 1.0, 1.0] PROPOSAL_GENERATOR: MIN_SIZE: 0 NAME: RPN RESNETS: DEFORM_MODULATED: False DEFORM_NUM_GROUPS: 1 DEFORM_ON_PER_STAGE: [False, False, False, False] DEPTH: 50 NORM: FrozenBN NUM_GROUPS: 1 OUT_FEATURES: ['res2', 'res3', 'res4', 'res5'] RES2_OUT_CHANNELS: 256 RES5_DILATION: 1 STEM_OUT_CHANNELS: 64 STRIDE_IN_1X1: True WIDTH_PER_GROUP: 64 RETINANET: BBOX_REG_WEIGHTS: (1.0, 1.0, 1.0, 1.0) FOCAL_LOSS_ALPHA: 0.25 FOCAL_LOSS_GAMMA: 2.0 IN_FEATURES: ['p3', 'p4', 'p5', 'p6', 'p7'] IOU_LABELS: [0, -1, 1] IOU_THRESHOLDS: [0.4, 0.5] NMS_THRESH_TEST: 0.5 
NUM_CLASSES: 80 NUM_CONVS: 4 PRIOR_PROB: 0.01 SCORE_THRESH_TEST: 0.05 SMOOTH_L1_LOSS_BETA: 0.1 TOPK_CANDIDATES_TEST: 1000 ROI_BOX_CASCADE_HEAD: BBOX_REG_WEIGHTS: ((10.0, 10.0, 5.0, 5.0), (20.0, 20.0, 10.0, 10.0), (30.0, 30.0, 15.0, 15.0)) IOUS: (0.5, 0.6, 0.7) ROI_BOX_HEAD: BBOX_REG_WEIGHTS: (10.0, 10.0, 5.0, 5.0) CLS_AGNOSTIC_BBOX_REG: True CONV_DIM: 256 FC_DIM: 1024 NAME: FastRCNNConvFCHead NORM: NUM_CONV: 0 NUM_FC: 2 POOLER_RESOLUTION: 7 POOLER_SAMPLING_RATIO: 0 POOLER_TYPE: ROIAlignV2 SMOOTH_L1_BETA: 0.0 TRAIN_ON_PRED_BOXES: False ROI_HEADS: BATCH_SIZE_PER_IMAGE: 128 IN_FEATURES: ['p2', 'p3', 'p4', 'p5'] IOU_LABELS: [0, 1] IOU_THRESHOLDS: [0.5] NAME: CascadeROIHeads NMS_THRESH_TEST: 0.5 NUM_CLASSES: 1 POSITIVE_FRACTION: 0.25 PROPOSAL_APPEND_GT: True SCORE_THRESH_TEST: 0.7 ROI_KEYPOINT_HEAD: CONV_DIMS: (512, 512, 512, 512, 512, 512, 512, 512) LOSS_WEIGHT: 1.0 MIN_KEYPOINTS_PER_IMAGE: 1 NAME: KRCNNConvDeconvUpsampleHead NORMALIZE_LOSS_BY_VISIBLE_KEYPOINTS: True NUM_KEYPOINTS: 17 POOLER_RESOLUTION: 14 POOLER_SAMPLING_RATIO: 0 POOLER_TYPE: ROIAlignV2 ROI_MASK_HEAD: CLS_AGNOSTIC_MASK: False CONV_DIM: 256 NAME: MaskRCNNConvUpsampleHead NORM: NUM_CONV: 4 POOLER_RESOLUTION: 14 POOLER_SAMPLING_RATIO: 0 POOLER_TYPE: ROIAlignV2 RPN: BATCH_SIZE_PER_IMAGE: 256 BBOX_REG_WEIGHTS: (1.0, 1.0, 1.0, 1.0) BOUNDARY_THRESH: -1 HEAD_NAME: StandardRPNHead IN_FEATURES: ['p2', 'p3', 'p4', 'p5', 'p6'] IOU_LABELS: [0, -1, 1] IOU_THRESHOLDS: [0.3, 0.7] LOSS_WEIGHT: 1.0 NMS_THRESH: 0.7 POSITIVE_FRACTION: 0.5 POST_NMS_TOPK_TEST: 1000 POST_NMS_TOPK_TRAIN: 2000 PRE_NMS_TOPK_TEST: 1000 PRE_NMS_TOPK_TRAIN: 2000 SMOOTH_L1_BETA: 0.0 SEM_SEG_HEAD: COMMON_STRIDE: 4 CONVS_DIM: 128 IGNORE_VALUE: 255 IN_FEATURES: ['p2', 'p3', 'p4', 'p5'] LOSS_WEIGHT: 1.0 NAME: SemSegFPNHead NORM: GN NUM_CLASSES: 54 WEIGHTS: /ssd_scratch/cvit/username/balloon/model_final_480dd8.pkl OUTPUT_DIR: /ssd_scratch/cvit/username/balloon/models/ SEED: -1 SOLVER: BASE_LR: 0.00025 BIAS_LR_FACTOR: 1.0 CHECKPOINT_PERIOD: 5000 
CLIP_GRADIENTS: CLIP_TYPE: value CLIP_VALUE: 1.0 ENABLED: False NORM_TYPE: 2.0 GAMMA: 0.1 IMS_PER_BATCH: 14 LR_SCHEDULER_NAME: WarmupMultiStepLR MAX_ITER: 1200 MOMENTUM: 0.9 STEPS: (210000, 250000) WARMUP_FACTOR: 0.001 WARMUP_ITERS: 1000 WARMUP_METHOD: linear WEIGHT_DECAY: 0.0001 WEIGHT_DECAY_BIAS: 0.0001 WEIGHT_DECAY_NORM: 0.0 TEST: AUG: ENABLED: False FLIP: True MAX_SIZE: 4000 MIN_SIZES: (400, 500, 600, 700, 800, 900, 1000, 1100, 1200) DETECTIONS_PER_IMAGE: 100 EVAL_PERIOD: 400 EXPECTED_RESULTS: [] KEYPOINT_OKS_SIGMAS: [] PRECISE_BN: ENABLED: False NUM_ITER: 200 VERSION: 2 VIS_PERIOD: 0 [03/11 21:46:15 detectron2]: Full config saved to /ssd_scratch/cvit/username/balloon/models/config.yaml [03/11 21:46:15 d2.utils.env]: Using a generated random seed 15491967 [03/11 21:46:16 detectron2]: Model: GeneralizedRCNN( (backbone): FPN( (fpn_lateral2): Conv2d(256, 256, kernel_size=(1, 1), stride=(1, 1)) (fpn_output2): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (fpn_lateral3): Conv2d(512, 256, kernel_size=(1, 1), stride=(1, 1)) (fpn_output3): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (fpn_lateral4): Conv2d(1024, 256, kernel_size=(1, 1), stride=(1, 1)) (fpn_output4): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (fpn_lateral5): Conv2d(2048, 256, kernel_size=(1, 1), stride=(1, 1)) (fpn_output5): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (top_block): LastLevelMaxPool() (bottom_up): ResNet( (stem): BasicStem( (conv1): Conv2d( 3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) ) (res2): Sequential( (0): BottleneckBlock( (shortcut): Conv2d( 64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv1): Conv2d( 64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv2): Conv2d( 64, 64, 
kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv3): Conv2d( 64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) ) (1): BottleneckBlock( (conv1): Conv2d( 256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv2): Conv2d( 64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv3): Conv2d( 64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) ) (2): BottleneckBlock( (conv1): Conv2d( 256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv2): Conv2d( 64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv3): Conv2d( 64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) ) ) (res3): Sequential( (0): BottleneckBlock( (shortcut): Conv2d( 256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv1): Conv2d( 256, 128, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv2): Conv2d( 128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv3): Conv2d( 128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) ) (1): BottleneckBlock( (conv1): Conv2d( 512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv2): Conv2d( 128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv3): Conv2d( 
128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) ) (2): BottleneckBlock( (conv1): Conv2d( 512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv2): Conv2d( 128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv3): Conv2d( 128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) ) (3): BottleneckBlock( (conv1): Conv2d( 512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv2): Conv2d( 128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv3): Conv2d( 128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) ) ) (res4): Sequential( (0): BottleneckBlock( (shortcut): Conv2d( 512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) (conv1): Conv2d( 512, 256, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (1): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) 
(2): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (3): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (4): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (5): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) ) (res5): Sequential( (0): BottleneckBlock( (shortcut): Conv2d( 1024, 2048, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05) ) (conv1): Conv2d( 1024, 512, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): 
FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv2): Conv2d( 512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv3): Conv2d( 512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05) ) ) (1): BottleneckBlock( (conv1): Conv2d( 2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv2): Conv2d( 512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv3): Conv2d( 512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05) ) ) (2): BottleneckBlock( (conv1): Conv2d( 2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv2): Conv2d( 512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv3): Conv2d( 512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05) ) ) ) ) ) (proposal_generator): RPN( (anchor_generator): DefaultAnchorGenerator( (cell_anchors): BufferList() ) (rpn_head): StandardRPNHead( (conv): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (objectness_logits): Conv2d(256, 3, kernel_size=(1, 1), stride=(1, 1)) (anchor_deltas): Conv2d(256, 12, kernel_size=(1, 1), stride=(1, 1)) ) ) (roi_heads): CascadeROIHeads( (box_pooler): ROIPooler( (level_poolers): ModuleList( (0): ROIAlign(output_size=(7, 7), spatial_scale=0.25, sampling_ratio=0, aligned=True) (1): ROIAlign(output_size=(7, 7), spatial_scale=0.125, sampling_ratio=0, aligned=True) (2): ROIAlign(output_size=(7, 7), spatial_scale=0.0625, sampling_ratio=0, aligned=True) (3): ROIAlign(output_size=(7, 7), spatial_scale=0.03125, sampling_ratio=0, aligned=True) ) ) 
(box_head): ModuleList( (0): FastRCNNConvFCHead( (fc1): Linear(in_features=12544, out_features=1024, bias=True) (fc2): Linear(in_features=1024, out_features=1024, bias=True) ) (1): FastRCNNConvFCHead( (fc1): Linear(in_features=12544, out_features=1024, bias=True) (fc2): Linear(in_features=1024, out_features=1024, bias=True) ) (2): FastRCNNConvFCHead( (fc1): Linear(in_features=12544, out_features=1024, bias=True) (fc2): Linear(in_features=1024, out_features=1024, bias=True) ) ) (box_predictor): ModuleList( (0): FastRCNNOutputLayers( (cls_score): Linear(in_features=1024, out_features=2, bias=True) (bbox_pred): Linear(in_features=1024, out_features=4, bias=True) ) (1): FastRCNNOutputLayers( (cls_score): Linear(in_features=1024, out_features=2, bias=True) (bbox_pred): Linear(in_features=1024, out_features=4, bias=True) ) (2): FastRCNNOutputLayers( (cls_score): Linear(in_features=1024, out_features=2, bias=True) (bbox_pred): Linear(in_features=1024, out_features=4, bias=True) ) ) ) ) [03/11 21:46:16 fvcore.common.checkpoint]: Loading checkpoint from /ssd_scratch/cvit/username/balloon/model_final_480dd8.pkl [03/11 21:46:17 fvcore.common.checkpoint]: Reading a file from 'Detectron2 Model Zoo' WARNING [03/11 21:46:17 fvcore.common.checkpoint]: 'roi_heads.box_predictor.0.cls_score.weight' has shape (81, 1024) in the checkpoint but (2, 1024) in the model! Skipped. WARNING [03/11 21:46:17 fvcore.common.checkpoint]: 'roi_heads.box_predictor.0.cls_score.bias' has shape (81,) in the checkpoint but (2,) in the model! Skipped. WARNING [03/11 21:46:17 fvcore.common.checkpoint]: 'roi_heads.box_predictor.1.cls_score.weight' has shape (81, 1024) in the checkpoint but (2, 1024) in the model! Skipped. WARNING [03/11 21:46:17 fvcore.common.checkpoint]: 'roi_heads.box_predictor.1.cls_score.bias' has shape (81,) in the checkpoint but (2,) in the model! Skipped. 
WARNING [03/11 21:46:17 fvcore.common.checkpoint]: 'roi_heads.box_predictor.2.cls_score.weight' has shape (81, 1024) in the checkpoint but (2, 1024) in the model! Skipped. WARNING [03/11 21:46:17 fvcore.common.checkpoint]: 'roi_heads.box_predictor.2.cls_score.bias' has shape (81,) in the checkpoint but (2,) in the model! Skipped. [03/11 21:46:17 fvcore.common.checkpoint]: Some model parameters are not in the checkpoint: roi_heads.box_predictor.0.cls_score.{weight, bias} roi_heads.box_predictor.1.cls_score.{weight, bias} roi_heads.box_predictor.2.cls_score.{weight, bias} [03/11 21:46:17 fvcore.common.checkpoint]: The checkpoint contains parameters not used by the model: roi_heads.mask_head.mask_fcn1.{weight, bias} roi_heads.mask_head.mask_fcn2.{weight, bias} roi_heads.mask_head.mask_fcn3.{weight, bias} roi_heads.mask_head.mask_fcn4.{weight, bias} roi_heads.mask_head.deconv.{weight, bias} roi_heads.mask_head.predictor.{weight, bias} [03/11 21:46:22 d2.data.build]: Removed 0 images with no usable annotations. 61 images left. [03/11 21:46:22 d2.data.build]: Distribution of instances among all 1 categories:  category #instances
balloon 255

[03/11 21:46:22 d2.data.common]: Serializing 61 elements to byte tensors and concatenating them all ... [03/11 21:46:22 d2.data.common]: Serialized dataset takes 0.18 MiB [03/11 21:46:22 d2.data.detection_utils]: TransformGens used in training: [ResizeShortestEdge(short_edge_length=(640, 672, 704, 736, 768, 800), max_size=1333, sample_style='choice'), RandomFlip()] [03/11 21:46:22 d2.data.build]: Using training sampler TrainingSampler [03/11 21:46:22 detectron2]: Starting training from iteration 0 [03/11 21:46:52 d2.utils.events]:  eta: N/A iter: 20 total_loss: 3.262 loss_box_reg_stage0: 0.195 loss_box_reg_stage1: 0.309 loss_box_reg_stage2: 0.389 loss_cls_stage0: 0.796 loss_cls_stage1: 0.744 loss_cls_stage2: 0.762 loss_rpn_cls: 0.043 loss_rpn_loc: 0.014 lr: 0.000005 max_mem: 8398M [03/11 21:47:15 d2.utils.events]:  eta: N/A iter: 40 total_loss: 3.172 loss_box_reg_stage0: 0.191 loss_box_reg_stage1: 0.309 loss_box_reg_stage2: 0.400 loss_cls_stage0: 0.765 loss_cls_stage1: 0.717 loss_cls_stage2: 0.742 loss_rpn_cls: 0.045 loss_rpn_loc: 0.015 lr: 0.000010 max_mem: 8517M [03/11 21:47:38 d2.utils.events]:  eta: N/A iter: 60 total_loss: 2.940 loss_box_reg_stage0: 0.185 loss_box_reg_stage1: 0.299 loss_box_reg_stage2: 0.388 loss_cls_stage0: 0.698 loss_cls_stage1: 0.670 loss_cls_stage2: 0.692 loss_rpn_cls: 0.047 loss_rpn_loc: 0.014 lr: 0.000015 max_mem: 8517M [03/11 21:48:01 d2.utils.events]:  eta: N/A iter: 80 total_loss: 2.749 loss_box_reg_stage0: 0.187 loss_box_reg_stage1: 0.283 loss_box_reg_stage2: 0.384 loss_cls_stage0: 0.621 loss_cls_stage1: 0.602 loss_cls_stage2: 0.613 loss_rpn_cls: 0.035 loss_rpn_loc: 0.016 lr: 0.000020 max_mem: 8517M [03/11 21:48:24 d2.utils.events]:  eta: N/A iter: 100 total_loss: 2.526 loss_box_reg_stage0: 0.172 loss_box_reg_stage1: 0.278 loss_box_reg_stage2: 0.360 loss_cls_stage0: 0.552 loss_cls_stage1: 0.540 loss_cls_stage2: 0.553 loss_rpn_cls: 0.040 loss_rpn_loc: 0.014 lr: 0.000025 max_mem: 8517M [03/11 21:48:47 d2.utils.events]:  eta: N/A iter: 
120 total_loss: 2.308 loss_box_reg_stage0: 0.165 loss_box_reg_stage1: 0.286 loss_box_reg_stage2: 0.360 loss_cls_stage0: 0.489 loss_cls_stage1: 0.477 loss_cls_stage2: 0.497 loss_rpn_cls: 0.029 loss_rpn_loc: 0.013 lr: 0.000030 max_mem: 8517M [03/11 21:49:10 d2.utils.events]:  eta: N/A iter: 140 total_loss: 2.145 loss_box_reg_stage0: 0.152 loss_box_reg_stage1: 0.272 loss_box_reg_stage2: 0.361 loss_cls_stage0: 0.437 loss_cls_stage1: 0.427 loss_cls_stage2: 0.441 loss_rpn_cls: 0.034 loss_rpn_loc: 0.013 lr: 0.000035 max_mem: 8576M [03/11 21:49:33 d2.utils.events]:  eta: N/A iter: 160 total_loss: 1.995 loss_box_reg_stage0: 0.154 loss_box_reg_stage1: 0.268 loss_box_reg_stage2: 0.369 loss_cls_stage0: 0.396 loss_cls_stage1: 0.381 loss_cls_stage2: 0.398 loss_rpn_cls: 0.028 loss_rpn_loc: 0.012 lr: 0.000040 max_mem: 8576M [03/11 21:49:55 d2.utils.events]:  eta: N/A iter: 180 total_loss: 1.831 loss_box_reg_stage0: 0.151 loss_box_reg_stage1: 0.257 loss_box_reg_stage2: 0.374 loss_cls_stage0: 0.357 loss_cls_stage1: 0.339 loss_cls_stage2: 0.355 loss_rpn_cls: 0.026 loss_rpn_loc: 0.011 lr: 0.000045 max_mem: 8576M [03/11 21:50:18 d2.utils.events]:  eta: N/A iter: 200 total_loss: 1.779 loss_box_reg_stage0: 0.153 loss_box_reg_stage1: 0.256 loss_box_reg_stage2: 0.359 loss_cls_stage0: 0.321 loss_cls_stage1: 0.302 loss_cls_stage2: 0.317 loss_rpn_cls: 0.030 loss_rpn_loc: 0.013 lr: 0.000050 max_mem: 8576M [03/11 21:50:41 d2.utils.events]:  eta: N/A iter: 220 total_loss: 1.564 loss_box_reg_stage0: 0.137 loss_box_reg_stage1: 0.224 loss_box_reg_stage2: 0.325 loss_cls_stage0: 0.291 loss_cls_stage1: 0.269 loss_cls_stage2: 0.287 loss_rpn_cls: 0.024 loss_rpn_loc: 0.012 lr: 0.000055 max_mem: 8576M [03/11 21:51:04 d2.utils.events]:  eta: N/A iter: 240 total_loss: 1.571 loss_box_reg_stage0: 0.149 loss_box_reg_stage1: 0.247 loss_box_reg_stage2: 0.361 loss_cls_stage0: 0.270 loss_cls_stage1: 0.246 loss_cls_stage2: 0.262 loss_rpn_cls: 0.026 loss_rpn_loc: 0.012 lr: 0.000060 max_mem: 8576M [03/11 21:51:27 
d2.utils.events]:  eta: N/A iter: 260 total_loss: 1.429 loss_box_reg_stage0: 0.140 loss_box_reg_stage1: 0.234 loss_box_reg_stage2: 0.362 loss_cls_stage0: 0.240 loss_cls_stage1: 0.214 loss_cls_stage2: 0.230 loss_rpn_cls: 0.025 loss_rpn_loc: 0.012 lr: 0.000065 max_mem: 8576M [03/11 21:51:50 d2.utils.events]:  eta: N/A iter: 280 total_loss: 1.324 loss_box_reg_stage0: 0.133 loss_box_reg_stage1: 0.218 loss_box_reg_stage2: 0.312 loss_cls_stage0: 0.220 loss_cls_stage1: 0.191 loss_cls_stage2: 0.207 loss_rpn_cls: 0.022 loss_rpn_loc: 0.011 lr: 0.000070 max_mem: 8576M [03/11 21:52:12 d2.utils.events]:  eta: N/A iter: 300 total_loss: 1.313 loss_box_reg_stage0: 0.137 loss_box_reg_stage1: 0.248 loss_box_reg_stage2: 0.345 loss_cls_stage0: 0.199 loss_cls_stage1: 0.170 loss_cls_stage2: 0.183 loss_rpn_cls: 0.022 loss_rpn_loc: 0.011 lr: 0.000075 max_mem: 8576M [03/11 21:52:35 d2.utils.events]:  eta: N/A iter: 320 total_loss: 1.197 loss_box_reg_stage0: 0.128 loss_box_reg_stage1: 0.222 loss_box_reg_stage2: 0.309 loss_cls_stage0: 0.182 loss_cls_stage1: 0.153 loss_cls_stage2: 0.163 loss_rpn_cls: 0.017 loss_rpn_loc: 0.010 lr: 0.000080 max_mem: 8576M [03/11 21:52:58 d2.utils.events]:  eta: N/A iter: 340 total_loss: 1.174 loss_box_reg_stage0: 0.135 loss_box_reg_stage1: 0.216 loss_box_reg_stage2: 0.314 loss_cls_stage0: 0.169 loss_cls_stage1: 0.142 loss_cls_stage2: 0.152 loss_rpn_cls: 0.018 loss_rpn_loc: 0.011 lr: 0.000085 max_mem: 8576M [03/11 21:53:21 d2.utils.events]:  eta: N/A iter: 360 total_loss: 1.101 loss_box_reg_stage0: 0.131 loss_box_reg_stage1: 0.209 loss_box_reg_stage2: 0.299 loss_cls_stage0: 0.155 loss_cls_stage1: 0.123 loss_cls_stage2: 0.137 loss_rpn_cls: 0.022 loss_rpn_loc: 0.011 lr: 0.000090 max_mem: 8576M [03/11 21:53:44 d2.utils.events]:  eta: N/A iter: 380 total_loss: 1.046 loss_box_reg_stage0: 0.127 loss_box_reg_stage1: 0.202 loss_box_reg_stage2: 0.300 loss_cls_stage0: 0.147 loss_cls_stage1: 0.120 loss_cls_stage2: 0.127 loss_rpn_cls: 0.018 loss_rpn_loc: 0.011 lr: 0.000095 
max_mem: 8576M [03/11 21:54:06 d2.data.build]: Distribution of instances among all 1 categories:  category #instances
balloon 50

[03/11 21:54:06 d2.data.common]: Serializing 13 elements to byte tensors and concatenating them all ... [03/11 21:54:06 d2.data.common]: Serialized dataset takes 0.04 MiB WARNING [03/11 21:54:06 d2.evaluation.coco_evaluation]: json_file was not found in MetaDataCatalog for 'balloon_val'. Trying to convert it to COCO format ... WARNING [03/11 21:54:07 d2.data.datasets.coco]: Using previously cached COCO format annotations at '/ssd_scratch/cvit/username/balloon/models/inference/balloon_val/balloon_val_coco_format.json'. You need to clear the cache file if your dataset has been modified. [03/11 21:54:07 d2.evaluation.evaluator]: Start inference on 7 images [03/11 21:54:17 d2.evaluation.evaluator]: Total inference time: 0:00:00.353311 (0.176655 s / img per device, on 2 devices) [03/11 21:54:17 d2.evaluation.evaluator]: Total inference pure compute time: 0:00:00 (0.092158 s / img per device, on 2 devices) [03/11 21:54:17 d2.evaluation.coco_evaluation]: Preparing results for COCO format ... [03/11 21:54:17 d2.evaluation.coco_evaluation]: Saving results to /ssd_scratch/cvit/username/balloon/models/inference/balloon_val/coco_instances_results.json [03/11 21:54:17 d2.evaluation.coco_evaluation]: Evaluating predictions ... Loading and preparing results... DONE (t=0.00s) creating index... index created! Running per image evaluation... Evaluate annotation type bbox DONE (t=0.01s). Accumulating evaluation results... DONE (t=0.01s). 
Average Precision (AP) @[ IoU=0.50:0.95 area= all maxDets=100 ] = 0.722 Average Precision (AP) @[ IoU=0.50 area= all maxDets=100 ] = 0.778 Average Precision (AP) @[ IoU=0.75 area= all maxDets=100 ] = 0.778 Average Precision (AP) @[ IoU=0.50:0.95 area= small maxDets=100 ] = 0.000 Average Precision (AP) @[ IoU=0.50:0.95 area=medium maxDets=100 ] = 0.494 Average Precision (AP) @[ IoU=0.50:0.95 area= large maxDets=100 ] = 0.909 Average Recall (AR) @[ IoU=0.50:0.95 area= all maxDets= 1 ] = 0.254 Average Recall (AR) @[ IoU=0.50:0.95 area= all maxDets= 10 ] = 0.730 Average Recall (AR) @[ IoU=0.50:0.95 area= all maxDets=100 ] = 0.730 Average Recall (AR) @[ IoU=0.50:0.95 area= small maxDets=100 ] = 0.000 Average Recall (AR) @[ IoU=0.50:0.95 area=medium maxDets=100 ] = 0.524 Average Recall (AR) @[ IoU=0.50:0.95 area= large maxDets=100 ] = 0.920 [03/11 21:54:17 d2.evaluation.coco_evaluation]: Evaluation results for bbox: AP AP50 AP75 APs APm APl
72.213 77.848 77.848 0.000 49.406 90.933

Before 0 First 0 [03/11 21:54:17 detectron2]: Evaluation results for balloon_val in csv format: [03/11 21:54:17 d2.evaluation.testing]: copypaste: Task: bbox [03/11 21:54:17 d2.evaluation.testing]: copypaste: AP,AP50,AP75,APs,APm,APl [03/11 21:54:17 d2.evaluation.testing]: copypaste: 72.2132,77.8478,77.8478,0.0000,49.4059,90.9329 Before 1 Second 1 Third 1 {} Traceback (most recent call last): File "/home/username/detectron2_repo_trial/tools/balloon_train_net_experiment.py", line 344, in args=(args,), File "/home/username/detectron2_repo_trial/detectron2/engine/launch.py", line 49, in launch daemon=False, File "/home/username/miniconda3/envs/det_trial/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn while not spawn_context.join(): File "/home/username/miniconda3/envs/det_trial/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 118, in join raise Exception(msg) Exception:

-- Process 1 terminated with the following error: Traceback (most recent call last): File "/home/username/miniconda3/envs/det_trial/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap fn(i, args) File "/home/username/detectron2_repo_trial/detectron2/engine/launch.py", line 84, in _distributed_worker main_func(args) File "/home/username/detectron2_repo_trial/tools/balloon_train_net_experiment.py", line 332, in main do_train(cfg, model) File "/home/username/detectron2_repo_trial/tools/balloon_train_net_experiment.py", line 212, in do_train if (val_dict['bbox']['AP'] > default_val_AP and val_dict['bbox']['AP50'] > default_val_AP50 and val_dict['bbox']['AP75'] > default_val_AP75): KeyError: 'bbox'

/home/username/miniconda3/envs/det_trial/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 14 leaked semaphores to clean up at shutdown len(cache))

## Expected behavior:

do_test should run only once per evaluation period, and the evaluation results should be printed each time during multi-GPU training.

## Environment:

Provide your environment information using the following command:

wget -nc -q https://github.com/facebookresearch/detectron2/raw/master/detectron2/utils/collect_env.py && python collect_env.py

ppwwyyxx commented 4 years ago

That's because only one GPU does evaluation and it's by design.

You need to guard the code that reads the results with if comm.is_main_process():

Closing as the issue is solved.

ghost commented 4 years ago

So is there a way to keep the results dict from being {} and to run do_test only once?

ppwwyyxx commented 4 years ago

I don't understand that question. There is no need to have the same evaluation results on other GPUs.

ghost commented 4 years ago

Correct. But in my case, do_test runs a second time. What can I do to run do_test on only one GPU?

ppwwyyxx commented 4 years ago

All GPUs have to go through do_test to make predictions together. Only one GPU evaluates the predictions.
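
To illustrate the point above, here is a toy, single-process model of that flow. It is only a sketch under stated assumptions, not detectron2's code: simulated_do_test and predict_shard are hypothetical names, and the "ranks" are plain loop indices rather than real processes. Every rank predicts on its own shard of the dataset, the predictions are gathered, and only rank 0 produces a non-empty results dict.

```python
from typing import Dict, List


def predict_shard(image_ids: List[int]) -> List[str]:
    """Hypothetical stand-in for running the model on one rank's shard."""
    return [f"pred-{i}" for i in image_ids]


def simulated_do_test(shards: List[List[int]]) -> Dict[int, Dict[str, int]]:
    """Toy model: all ranks predict, only rank 0 evaluates."""
    # Phase 1: every rank runs inference on its own part of the dataset.
    per_rank = [predict_shard(shard) for shard in shards]

    # Phase 2: predictions from all ranks are gathered in one place.
    gathered = [p for preds in per_rank for p in preds]

    # Phase 3: rank 0 evaluates; every other rank gets an empty dict.
    results: Dict[int, Dict[str, int]] = {}
    for rank in range(len(shards)):
        if rank == 0:
            results[rank] = {"num_predictions": len(gathered)}
        else:
            results[rank] = {}
    return results


# 13 validation images split across 2 ranks (7 and 6), as in the log above.
out = simulated_do_test([list(range(7)), list(range(7, 13))])
print(out)
```

In this model rank 1 ends up with {}, which matches the empty dict that process 1 printed just before the KeyError in the traceback.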

ghost commented 4 years ago

My bad! Since I am using 2 GPUs and each of them goes through do_test once, in the second process results_i = inference_on_dataset(model, data_loader, evaluator) returns an empty dict ({}). Is there a way to handle this?

ppwwyyxx commented 4 years ago

I don't know exactly how you want to handle this, but you can use if comm.is_main_process(): as I said above.
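
A minimal sketch of such a guard, assuming the comparison from the traceback (AP, AP50 and AP75 against stored best values). pick_best_bbox is a hypothetical helper, not part of detectron2. To keep the example runnable without detectron2 installed, it takes is_main as a plain boolean; in a real training script that value would come from comm.is_main_process() in detectron2.utils.comm.

```python
from typing import Dict, Mapping, Optional

METRIC_KEYS = ("AP", "AP50", "AP75")


def pick_best_bbox(
    results: Mapping[str, Mapping[str, float]],
    best: Mapping[str, float],
    is_main: bool,
) -> Optional[Dict[str, float]]:
    """Return the new best bbox metrics, or None if nothing should change."""
    # Non-main processes receive {} from evaluation, so skip them up front.
    # The extra key check also covers a main process without bbox results.
    if not is_main or "bbox" not in results:
        return None

    bbox = results["bbox"]
    # Same condition as in the traceback: all three metrics must improve.
    if all(bbox[k] > best[k] for k in METRIC_KEYS):
        return {k: bbox[k] for k in METRIC_KEYS}
    return None


best_so_far = {"AP": 0.0, "AP50": 0.0, "AP75": 0.0}
main_results = {"bbox": {"AP": 72.2132, "AP50": 77.8478, "AP75": 77.8478}}

# Main process: metrics improved, so the new best values come back.
print(pick_best_bbox(main_results, best_so_far, is_main=True))
# Other process: empty results, no KeyError, nothing to update.
print(pick_best_bbox({}, best_so_far, is_main=False))
```

With this shape, every process still calls do_test, but only the main process reads val_dict['bbox'], so the empty dict on the other GPU no longer raises KeyError.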