Closed Wei-i closed 3 years ago
It seems that something may be going wrong at 'Loading checkpoint from detectron2://ImageNetPretrained/MSRA/R-50.pkl'?
Could you check the version of portalocker in your environment? Then run the following snippet to verify whether portalocker has the attribute Lock:
>>> import portalocker
>>> portalocker.__version__
>>> portalocker.Lock
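The same check can be written so that it does not raise when an attribute is missing. This is a minimal sketch; inspect_module is a hypothetical helper name, and it relies only on the standard library plus whatever module is being inspected.

```python
import importlib


def inspect_module(module_name, attribute):
    """Report whether a module imports, its version (if exposed), and whether it has an attribute."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return {"importable": False, "version": None, "has_attribute": False}
    return {
        "importable": True,
        # Not every package defines __version__, so fall back to None instead of raising.
        "version": getattr(module, "__version__", None),
        "has_attribute": hasattr(module, attribute),
    }


if __name__ == "__main__":
    print(inspect_module("portalocker", "Lock"))
```

A result with version None but has_attribute True would mean the package is usable even though it does not expose __version__.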
Sorry, there must be something wrong with my portalocker?
(yolof) cw@MAC-3DGroup:~$ python
Python 3.6.13 | packaged by conda-forge | (default, Feb 19 2021, 05:36:01)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import fvcore
>>> import portalocker
>>>
>>> portalocker.__version__
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: module 'portalocker' has no attribute '__version__'
Try to re-install portalocker?
First pip uninstall portalocker, then conda install portalocker, and the bug will be fixed.
[03/26 08:50:52 d2.engine.hooks]: Total training time: 0:00:01 (0:00:00 on hooks)
[03/26 08:50:52 d2.utils.events]: iter: 0 lr: N/A max_mem: 622M
Traceback (most recent call last):
File "./tools/train_net.py", line 234, in <module>
args=(args,),
File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/launch.py", line 62, in launch
main_func(*args)
File "./tools/train_net.py", line 221, in main
return trainer.train()
File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/defaults.py", line 431, in train
super().train(self.start_iter, self.max_iter)
File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 140, in train
self.run_step()
File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/defaults.py", line 441, in run_step
self._trainer.run_step()
File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 234, in run_step
loss_dict = self.model(data)
File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/cw/YOLOF/yolof/modeling/yolof.py", line 294, in forward
pred_logits, pred_anchor_deltas)
File "/home/cw/YOLOF/yolof/modeling/yolof.py", line 387, in losses
dist.all_reduce(num_foreground)
File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 953, in all_reduce
_check_default_pg()
File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 211, in _check_default_pg
"Default process group is not initialized"
AssertionError: Default process group is not initialized
How do I turn off DDP?
I think this is the last bug before I can train YOLOF...
Comment out the lines with dist in the yolof.py file.
BTW, when you train with only one GPU, you should adjust the learning rate and batch size. Refer to this response.
Support training with one GPU in this commit.
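Instead of commenting the calls out, a collective such as dist.all_reduce can be guarded so that it only runs when a process group exists. The sketch below is an assumption about how such a guard could look, not the code from the commit. all_reduce_if_distributed is a hypothetical helper, and its dist argument is expected to be the torch.distributed module (passed in so the function itself has no import-time dependency).

```python
def all_reduce_if_distributed(tensor, dist):
    """Sum `tensor` across workers when a process group exists; otherwise leave it untouched.

    `dist` is expected to be the torch.distributed module. Returns the number of
    workers that contributed, which is 1 for single-process training.
    """
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(tensor)  # in-place sum across all ranks
        return dist.get_world_size()
    return 1


# Usage inside the loss (requires PyTorch; num_foreground is the tensor named in the traceback):
#   import torch.distributed as dist
#   world_size = all_reduce_if_distributed(num_foreground, dist)
```

Returning the worker count lets the caller divide by it when an average is wanted, and the single-GPU path simply divides by 1.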
Thanks!
[03/26 09:10:09 d2.engine.hooks]: Overall training speed: 57 iterations in 0:00:18 (0.3159 s / it)
[03/26 09:10:09 d2.engine.hooks]: Total training time: 0:00:18 (0:00:00 on hooks)
[03/26 09:10:09 d2.utils.events]: eta: 2:01:11 iter: 59 total_loss: 2.067 loss_cls: 1.342 loss_box_reg: 0.7438 time: 0.3139 data_time: 0.0022 lr: 3.9308e-06 max_mem: 1076M
Traceback (most recent call last):
File "tools/train_net.py", line 234, in
I'm really sorry, author. My skill level is low, and now a new bug has appeared. How should I fix this one?
Could you give more details about what command you use?
Command: CUDA_VISIBLE_DEVICES=1 python tools/train_net.py --num-gpus 1 --config-file ./configs/yolof_R_50_C5_1x.yaml
yaml:
MODEL:
  META_ARCHITECTURE: "YOLOF"
  BACKBONE:
    NAME: "build_resnet_backbone"
  RESNETS:
    OUT_FEATURES: ["res5"]
DATASETS:
  TRAIN: ("coco_2017_train",)
  TEST: ("coco_2017_val",)
DATALOADER:
  NUM_WORKERS: 4
SOLVER:
  IMS_PER_BATCH: 2
  BASE_LR: 0.00001
  WARMUP_FACTOR: 0.00066667
  WARMUP_ITERS: 1500
  STEPS: (480000, 640000)
  MAX_ITER: 720000
  CHECKPOINT_PERIOD: 2500
INPUT:
  MIN_SIZE_TRAIN: (800,)
Can you try this setting?
IMS_PER_BATCH: 8
BASE_LR: 0.03
WARMUP_FACTOR: 0.00066667
WARMUP_ITERS: 1500
STEPS: (120000, 160000)
MAX_ITER: 180000
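The relationship between this setting and the config posted earlier can be sketched with the linear scaling rule: when the batch size changes by some factor, the learning rate is multiplied by that factor and the iteration counts are divided by it. This is a minimal sketch under that assumption; rescale_schedule is a hypothetical helper, not part of detectron2 or YOLOF.

```python
def rescale_schedule(batch_size, base_lr, steps, max_iter, new_batch_size):
    """Rescale a schedule so that the number of training images seen stays the same."""
    ratio = new_batch_size / batch_size
    return {
        "IMS_PER_BATCH": new_batch_size,
        # Linear scaling rule: an assumption that usually works, not a guarantee.
        "BASE_LR": base_lr * ratio,
        "STEPS": tuple(round(step / ratio) for step in steps),
        "MAX_ITER": round(max_iter / ratio),
    }


if __name__ == "__main__":
    # Scale the batch-8 values suggested above down to batch size 2.
    print(rescale_schedule(8, 0.03, (120000, 160000), 180000, 2))
```

Scaling the batch-8 values down to batch size 2 gives back STEPS (480000, 640000) and MAX_ITER 720000, the iteration counts in the config posted earlier.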
I've left the lab and will try it tomorrow. Thanks a lot, from the bottom of my heart!
Good morning! I tried your setting, but the same bug remains:
[03/26 17:38:53 d2.engine.train_loop]: Starting training from iteration 0
/opt/conda/conda-bld/pytorch_1607370169888/work/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [11,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
ERROR [03/26 17:39:00 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 140, in train
self.run_step()
File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/defaults.py", line 441, in run_step
self._trainer.run_step()
File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 234, in run_step
loss_dict = self.model(data)
File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/cw/YOLOF/yolof/modeling/yolof.py", line 295, in forward
pred_logits, pred_anchor_deltas)
File "/home/cw/YOLOF/yolof/modeling/yolof.py", line 397, in losses
pred_class_logits[valid_idxs],
RuntimeError: CUDA error: device-side assert triggered
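CUDA kernels run asynchronously, so the Python line shown in a device-side assert traceback is not always the line that triggered it. A common way to narrow this down is to rerun with the CUDA_LAUNCH_BLOCKING=1 environment variable, which makes kernel launches synchronous. The sketch below is one way to do that from Python; both helper names are hypothetical.

```python
import os
import subprocess
import sys


def blocking_cuda_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA kernel launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_with_blocking_cuda(command):
    """Run `command` (a list of arguments) with CUDA_LAUNCH_BLOCKING=1 and return its exit code."""
    return subprocess.run(command, env=blocking_cuda_env()).returncode


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python this_script.py python tools/train_net.py --num-gpus 1 --config-file <config>
    sys.exit(run_with_blocking_cuda(sys.argv[1:]))
```

Training is slower in this mode, so it is only meant for locating the failing operation.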
Sorry, the BASE_LR should be 0.015. But I can run with one GPU with an initial learning rate of either 0.03 or 0.015. I cannot reproduce your error on my side.
Try to warm up more iterations, e.g.,
WARMUP_FACTOR: 0.0002
WARMUP_ITERS: 5000
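For intuition about what these two values control: detectron2's linear warm-up interpolates the learning-rate multiplier from WARMUP_FACTOR up to 1 over WARMUP_ITERS iterations. A minimal sketch, assuming the linear warm-up method and ignoring the later step decay and any per-parameter multipliers; warmup_lr is a hypothetical helper.

```python
def warmup_lr(base_lr, iteration, warmup_factor, warmup_iters):
    """Learning rate at `iteration` under linear warm-up."""
    if iteration >= warmup_iters:
        return base_lr
    alpha = iteration / warmup_iters
    # Multiplier goes from warmup_factor (alpha = 0) to 1 (alpha = 1).
    return base_lr * (warmup_factor * (1 - alpha) + alpha)


if __name__ == "__main__":
    for it in (0, 1000, 2500, 5000):
        print(it, warmup_lr(0.015, it, 0.0002, 5000))
```

A smaller WARMUP_FACTOR and a larger WARMUP_ITERS both keep the learning rate low for longer at the start of training.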
Thanks. It still does not work...
Could you upload your training log file?
[03/26 21:44:54] detectron2 INFO: Rank of current process: 0. World size: 2
[03/26 21:44:55] detectron2 INFO: Environment info:
sys.platform linux
Python 3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 02:25:08) [GCC 7.5.0]
numpy 1.19.1
detectron2 0.4 @/home/cw/detectron2/detectron2
Compiler GCC 5.4
CUDA compiler CUDA 9.0
detectron2 arch flags 6.1
DETECTRON2_ENV_MODULE
PyTorch built with:
[03/26 21:44:55] detectron2 INFO: Command line arguments: Namespace(config_file='./configs/yolof_R_50_C5_1x.yaml', dist_url='tcp://127.0.0.1:50159', eval_only=False, machine_rank=0, num_gpus=2, num_machines=1, opts=['OUTPUT_DIR', '/hdd2/wh/cw/train/yolof/R_50_C5_1x/'], resume=False)
[03/26 21:44:55] detectron2 INFO: Contents of args.config_file=./configs/yolof_R_50_C5_1x.yaml:
_BASE_: "Base-YOLOF.yaml"
MODEL:
  WEIGHTS: "detectron2://ImageNetPretrained/MSRA/R-50.pkl"
  RESNETS:
    DEPTH: 50
OUTPUT_DIR: "output/yolof/R_50_C5_1x"
[03/26 21:44:55] detectron2 INFO: Running with full config:
CUDNN_BENCHMARK: False
DATALOADER:
  ASPECT_RATIO_GROUPING: True
  FILTER_EMPTY_ANNOTATIONS: True
  NUM_WORKERS: 8
  REPEAT_THRESHOLD: 0.0
  SAMPLER_TRAIN: TrainingSampler
DATASETS:
  PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000
  PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000
  PROPOSAL_FILES_TEST: ()
  PROPOSAL_FILES_TRAIN: ()
  TEST: ('coco_2017_val',)
  TRAIN: ('coco_2017_train',)
GLOBAL:
  HACK: 1.0
INPUT:
  CROP:
    ENABLED: False
    SIZE: [0.9, 0.9]
    TYPE: relative_range
  DISTORTION:
    ENABLED: False
    EXPOSURE: 1.5
    HUE: 0.1
    SATURATION: 1.5
  FORMAT: BGR
  JITTER_CROP:
    ENABLED: False
    JITTER_RATIO: 0.3
  MASK_FORMAT: polygon
  MAX_SIZE_TEST: 1333
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_TEST: 800
  MIN_SIZE_TRAIN: (800,)
  MIN_SIZE_TRAIN_SAMPLING: choice
  MOSAIC:
    ENABLED: False
    MIN_OFFSET: 0.2
    MOSAIC_HEIGHT: 640
    MOSAIC_WIDTH: 640
    NUM_IMAGES: 4
    POOL_CAPACITY: 1000
  RANDOM_FLIP: horizontal
  RESIZE:
    ENABLED: False
    SCALE_JITTER: (0.8, 1.2)
    SHAPE: (640, 640)
    TEST_SHAPE: (608, 608)
  SHIFT:
    SHIFT_PIXELS: 32
MODEL:
  ANCHOR_GENERATOR:
    ANGLES: [[-90, 0, 90]]
    ASPECT_RATIOS: [[1.0]]
    NAME: DefaultAnchorGenerator
    OFFSET: 0.0
    SIZES: [[32, 64, 128, 256, 512]]
  BACKBONE:
    FREEZE_AT: 2
    NAME: build_resnet_backbone
  DARKNET:
    DEPTH: 53
    NORM: BN
    OUT_FEATURES: ['res5']
    RES5_DILATION: 1
    WITH_CSP: True
  DEVICE: cuda
  FPN:
    FUSE_TYPE: sum
    IN_FEATURES: []
    NORM:
    OUT_CHANNELS: 256
  KEYPOINT_ON: False
  LOAD_PROPOSALS: False
  MASK_ON: False
  META_ARCHITECTURE: YOLOF
  PANOPTIC_FPN:
    COMBINE:
      ENABLED: True
      INSTANCES_CONFIDENCE_THRESH: 0.5
      OVERLAP_THRESH: 0.5
      STUFF_AREA_LIMIT: 4096
    INSTANCE_LOSS_WEIGHT: 1.0
  PIXEL_MEAN: [103.53, 116.28, 123.675]
  PIXEL_STD: [1.0, 1.0, 1.0]
  PROPOSAL_GENERATOR:
    MIN_SIZE: 0
    NAME: RPN
  RESNETS:
    DEFORM_MODULATED: False
    DEFORM_NUM_GROUPS: 1
    DEFORM_ON_PER_STAGE: [False, False, False, False]
    DEPTH: 50
    NORM: FrozenBN
    NUM_GROUPS: 1
    OUT_FEATURES: ['res5']
    RES2_OUT_CHANNELS: 256
    RES5_DILATION: 1
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: True
    WIDTH_PER_GROUP: 64
  RETINANET:
    BBOX_REG_LOSS_TYPE: smooth_l1
    BBOX_REG_WEIGHTS: (1.0, 1.0, 1.0, 1.0)
    FOCAL_LOSS_ALPHA: 0.25
    FOCAL_LOSS_GAMMA: 2.0
    IN_FEATURES: ['p3', 'p4', 'p5', 'p6', 'p7']
    IOU_LABELS: [0, -1, 1]
    IOU_THRESHOLDS: [0.4, 0.5]
    NMS_THRESH_TEST: 0.5
    NORM:
    NUM_CLASSES: 80
    NUM_CONVS: 4
    PRIOR_PROB: 0.01
    SCORE_THRESH_TEST: 0.05
    SMOOTH_L1_LOSS_BETA: 0.1
    TOPK_CANDIDATES_TEST: 1000
  ROI_BOX_CASCADE_HEAD:
    BBOX_REG_WEIGHTS: ((10.0, 10.0, 5.0, 5.0), (20.0, 20.0, 10.0, 10.0), (30.0, 30.0, 15.0, 15.0))
    IOUS: (0.5, 0.6, 0.7)
  ROI_BOX_HEAD:
    BBOX_REG_LOSS_TYPE: smooth_l1
    BBOX_REG_LOSS_WEIGHT: 1.0
    BBOX_REG_WEIGHTS: (10.0, 10.0, 5.0, 5.0)
    CLS_AGNOSTIC_BBOX_REG: False
    CONV_DIM: 256
    FC_DIM: 1024
    NAME:
    NORM:
    NUM_CONV: 0
    NUM_FC: 0
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
    SMOOTH_L1_BETA: 0.0
    TRAIN_ON_PRED_BOXES: False
  ROI_HEADS:
    BATCH_SIZE_PER_IMAGE: 512
    IN_FEATURES: ['res4']
    IOU_LABELS: [0, 1]
    IOU_THRESHOLDS: [0.5]
    NAME: Res5ROIHeads
    NMS_THRESH_TEST: 0.5
    NUM_CLASSES: 80
    POSITIVE_FRACTION: 0.25
    PROPOSAL_APPEND_GT: True
    SCORE_THRESH_TEST: 0.05
  ROI_KEYPOINT_HEAD:
    CONV_DIMS: (512, 512, 512, 512, 512, 512, 512, 512)
    LOSS_WEIGHT: 1.0
    MIN_KEYPOINTS_PER_IMAGE: 1
    NAME: KRCNNConvDeconvUpsampleHead
    NORMALIZE_LOSS_BY_VISIBLE_KEYPOINTS: True
    NUM_KEYPOINTS: 17
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
  ROI_MASK_HEAD:
    CLS_AGNOSTIC_MASK: False
    CONV_DIM: 256
    NAME: MaskRCNNConvUpsampleHead
    NORM:
    NUM_CONV: 0
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
  RPN:
    BATCH_SIZE_PER_IMAGE: 256
    BBOX_REG_LOSS_TYPE: smooth_l1
    BBOX_REG_LOSS_WEIGHT: 1.0
    BBOX_REG_WEIGHTS: (1.0, 1.0, 1.0, 1.0)
    BOUNDARY_THRESH: -1
    HEAD_NAME: StandardRPNHead
    IN_FEATURES: ['res4']
    IOU_LABELS: [0, -1, 1]
    IOU_THRESHOLDS: [0.3, 0.7]
    LOSS_WEIGHT: 1.0
    NMS_THRESH: 0.7
    POSITIVE_FRACTION: 0.5
    POST_NMS_TOPK_TEST: 1000
    POST_NMS_TOPK_TRAIN: 2000
    PRE_NMS_TOPK_TEST: 6000
    PRE_NMS_TOPK_TRAIN: 12000
    SMOOTH_L1_BETA: 0.0
  SEM_SEG_HEAD:
    COMMON_STRIDE: 4
    CONVS_DIM: 128
    IGNORE_VALUE: 255
    IN_FEATURES: ['p2', 'p3', 'p4', 'p5']
    LOSS_WEIGHT: 1.0
    NAME: SemSegFPNHead
    NORM: GN
    NUM_CLASSES: 54
  WEIGHTS: detectron2://ImageNetPretrained/MSRA/R-50.pkl
  YOLOF:
    BOX_TRANSFORM:
      ADD_CTR_CLAMP: True
      BBOX_REG_WEIGHTS: (1.0, 1.0, 1.0, 1.0)
      CTR_CLAMP: 32
    DECODER:
      ACTIVATION: ReLU
      CLS_NUM_CONVS: 2
      IN_CHANNELS: 512
      NORM: BN
      NUM_ANCHORS: 5
      NUM_CLASSES: 80
      PRIOR_PROB: 0.01
      REG_NUM_CONVS: 4
    DETECTIONS_PER_IMAGE: 100
    ENCODER:
      ACTIVATION: ReLU
      BACKBONE_LEVEL: res5
      BLOCK_DILATIONS: [2, 4, 6, 8]
      BLOCK_MID_CHANNELS: 128
      IN_CHANNELS: 2048
      NORM: BN
      NUM_CHANNELS: 512
      NUM_RESIDUAL_BLOCKS: 4
    LOSSES:
      BBOX_REG_LOSS_TYPE: giou
      FOCAL_LOSS_ALPHA: 0.25
      FOCAL_LOSS_GAMMA: 2.0
    MATCHER:
      TOPK: 4
    NEG_IGNORE_THRESHOLD: 0.7
    NMS_THRESH_TEST: 0.6
    POS_IGNORE_THRESHOLD: 0.15
    SCORE_THRESH_TEST: 0.05
    TOPK_CANDIDATES_TEST: 1000
OUTPUT_DIR: /hdd2/wh/cw/train/yolof/R_50_C5_1x/
SEED: -1
SOLVER:
  AMP:
    ENABLED: False
  BACKBONE_MULTIPLIER: 0.334
  BASE_LR: 0.003
  BIAS_LR_FACTOR: 1.0
  CHECKPOINT_PERIOD: 2500
  CLIP_GRADIENTS:
    CLIP_TYPE: value
    CLIP_VALUE: 1.0
    ENABLED: False
    NORM_TYPE: 2.0
  GAMMA: 0.1
  IMS_PER_BATCH: 16
  LR_SCHEDULER_NAME: WarmupMultiStepLR
  MAX_ITER: 90000
  MOMENTUM: 0.9
  NESTEROV: False
  REFERENCE_WORLD_SIZE: 0
  STEPS: (60000, 80000)
  WARMUP_FACTOR: 0.0002
  WARMUP_ITERS: 5000
  WARMUP_METHOD: linear
  WEIGHT_DECAY: 0.0001
  WEIGHT_DECAY_BIAS: 0.0001
  WEIGHT_DECAY_NORM: 0.0
TEST:
  AUG:
    ENABLED: False
    FLIP: True
    MAX_SIZE: 4000
    MIN_SIZES: (400, 500, 600, 700, 800, 900, 1000, 1100, 1200)
  DETECTIONS_PER_IMAGE: 100
  EVAL_PERIOD: 0
  EXPECTED_RESULTS: []
  KEYPOINT_OKS_SIGMAS: []
  PRECISE_BN:
    ENABLED: False
    NUM_ITER: 200
VERSION: 2
VIS_PERIOD: 0
[03/26 21:44:55] detectron2 INFO: Full config saved to /hdd2/wh/cw/train/yolof/R_50_C5_1x/config.yaml
[03/26 21:44:55] d2.utils.env INFO: Using a generated random seed 55931624
[03/26 21:44:56] d2.engine.defaults INFO: Model:
YOLOF( (backbone): ResNet( (stem): BasicStem( (conv1): Conv2d( 3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) ) (res2):
Sequential( (0): BottleneckBlock( (shortcut): Conv2d( 64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv1): Conv2d( 64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv2): Conv2d( 64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv3): Conv2d( 64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) ) (1): BottleneckBlock( (conv1): Conv2d( 256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv2): Conv2d( 64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv3): Conv2d( 64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) ) (2): BottleneckBlock( (conv1): Conv2d( 256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv2): Conv2d( 64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv3): Conv2d( 64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) ) ) (res3): Sequential( (0): BottleneckBlock( (shortcut): Conv2d( 256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv1): Conv2d( 256, 128, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv2): Conv2d( 128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv3): Conv2d( 128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) ) 
(1): BottleneckBlock( (conv1): Conv2d( 512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv2): Conv2d( 128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv3): Conv2d( 128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) ) (2): BottleneckBlock( (conv1): Conv2d( 512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv2): Conv2d( 128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv3): Conv2d( 128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) ) (3): BottleneckBlock( (conv1): Conv2d( 512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv2): Conv2d( 128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv3): Conv2d( 128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) ) ) (res4): Sequential( (0): BottleneckBlock( (shortcut): Conv2d( 512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) (conv1): Conv2d( 512, 256, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (1): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): 
FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (2): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (3): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (4): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (5): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), 
bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) ) (res5): Sequential( (0): BottleneckBlock( (shortcut): Conv2d( 1024, 2048, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05) ) (conv1): Conv2d( 1024, 512, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv2): Conv2d( 512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv3): Conv2d( 512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05) ) ) (1): BottleneckBlock( (conv1): Conv2d( 2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv2): Conv2d( 512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv3): Conv2d( 512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05) ) ) (2): BottleneckBlock( (conv1): Conv2d( 2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv2): Conv2d( 512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv3): Conv2d( 512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05) ) ) ) ) (encoder): DilatedEncoder( (lateral_conv): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1)) (lateral_norm): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (fpn_conv): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (fpn_norm): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (dilated_encoder_blocks): Sequential( (0): Bottleneck( (conv1): Sequential( (0): Conv2d(512, 128, 
kernel_size=(1, 1), stride=(1, 1)) (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) (conv2): Sequential( (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2), dilation=(2, 2)) (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) (conv3): Sequential( (0): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1)) (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) ) (1): Bottleneck( (conv1): Sequential( (0): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1)) (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) (conv2): Sequential( (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(4, 4), dilation=(4, 4)) (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) (conv3): Sequential( (0): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1)) (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) ) (2): Bottleneck( (conv1): Sequential( (0): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1)) (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) (conv2): Sequential( (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(6, 6), dilation=(6, 6)) (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) (conv3): Sequential( (0): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1)) (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) ) (3): Bottleneck( (conv1): Sequential( (0): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1)) (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): 
ReLU(inplace=True) ) (conv2): Sequential( (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(8, 8), dilation=(8, 8)) (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) (conv3): Sequential( (0): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1)) (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) ) ) ) (decoder): Decoder( (cls_subnet): Sequential( (0): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) (3): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (4): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (5): ReLU(inplace=True) ) (bbox_subnet): Sequential( (0): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) (3): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (4): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (5): ReLU(inplace=True) (6): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (7): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (8): ReLU(inplace=True) (9): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (10): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (11): ReLU(inplace=True) ) (cls_score): Conv2d(512, 400, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (bbox_pred): Conv2d(512, 20, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (object_pred): Conv2d(512, 5, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) (anchor_generator): DefaultAnchorGenerator( (cell_anchors): BufferList() ) (anchor_matcher): UniformMatcher() ) [03/26 21:45:18] 
d2.data.datasets.coco INFO: Loading datasets/coco/annotations/instances_train2017.json takes 21.19 seconds.
[03/26 21:45:19] d2.data.datasets.coco INFO: Loaded 118287 images in COCO format from datasets/coco/annotations/instances_train2017.json
[03/26 21:45:31] d2.data.build INFO: Removed 1021 images with no usable annotations. 117266 images left.
[03/26 21:45:38] d2.data.build INFO: Distribution of instances among all 80 categories:
category | #instances | category | #instances | category | #instances | |
---|---|---|---|---|---|---|
person | 257253 | bicycle | 7056 | car | 43533 | |
motorcycle | 8654 | airplane | 5129 | bus | 6061 | |
train | 4570 | truck | 9970 | boat | 10576 | |
traffic light | 12842 | fire hydrant | 1865 | stop sign | 1983 | |
parking meter | 1283 | bench | 9820 | bird | 10542 | |
cat | 4766 | dog | 5500 | horse | 6567 | |
sheep | 9223 | cow | 8014 | elephant | 5484 | |
bear | 1294 | zebra | 5269 | giraffe | 5128 | |
backpack | 8714 | umbrella | 11265 | handbag | 12342 | |
tie | 6448 | suitcase | 6112 | frisbee | 2681 | |
skis | 6623 | snowboard | 2681 | sports ball | 6299 | |
kite | 8802 | baseball bat | 3273 | baseball gl.. | 3747 | |
skateboard | 5536 | surfboard | 6095 | tennis racket | 4807 | |
bottle | 24070 | wine glass | 7839 | cup | 20574 | |
fork | 5474 | knife | 7760 | spoon | 6159 | |
bowl | 14323 | banana | 9195 | apple | 5776 | |
sandwich | 4356 | orange | 6302 | broccoli | 7261 | |
carrot | 7758 | hot dog | 2884 | pizza | 5807 | |
donut | 7005 | cake | 6296 | chair | 38073 | |
couch | 5779 | potted plant | 8631 | bed | 4192 | |
dining table | 15695 | toilet | 4149 | tv | 5803 | |
laptop | 4960 | mouse | 2261 | remote | 5700 | |
keyboard | 2854 | cell phone | 6422 | microwave | 1672 | |
oven | 3334 | toaster | 225 | sink | 5609 | |
refrigerator | 2634 | book | 24077 | clock | 6320 | |
vase | 6577 | scissors | 1464 | teddy bear | 4729 | |
hair drier | 198 | toothbrush | 1945 | | | |
total | 849949 | | | | | |
[03/26 21:45:38] d2.data.build INFO: Using training sampler TrainingSampler
[03/26 21:45:40] d2.data.common INFO: Serializing 117266 elements to byte tensors and concatenating them all ...
[03/26 21:45:46] d2.data.common INFO: Serialized dataset takes 451.21 MiB
[03/26 21:45:54] fvcore.common.checkpoint INFO: Loading checkpoint from detectron2://ImageNetPretrained/MSRA/R-50.pkl
[03/26 21:45:54] d2.checkpoint.c2_model_loading INFO: Renaming Caffe2 weights ......
[03/26 21:45:54] d2.checkpoint.c2_model_loading INFO: Following weights matched with submodule backbone:
Names in Model | Names in Checkpoint | Shapes | |
---|---|---|---|
res2.0.conv1.* | res2_0branch2a{bn_*,w} | (64,) (64,) (64,) (64,) (64,64,1,1) | |
res2.0.conv2.* | res2_0branch2b{bn_*,w} | (64,) (64,) (64,) (64,) (64,64,3,3) | |
res2.0.conv3.* | res2_0branch2c{bn_*,w} | (256,) (256,) (256,) (256,) (256,64,1,1) | |
res2.0.shortcut.* | res2_0branch1{bn_*,w} | (256,) (256,) (256,) (256,) (256,64,1,1) | |
res2.1.conv1.* | res2_1branch2a{bn_*,w} | (64,) (64,) (64,) (64,) (64,256,1,1) | |
res2.1.conv2.* | res2_1branch2b{bn_*,w} | (64,) (64,) (64,) (64,) (64,64,3,3) | |
res2.1.conv3.* | res2_1branch2c{bn_*,w} | (256,) (256,) (256,) (256,) (256,64,1,1) | |
res2.2.conv1.* | res2_2branch2a{bn_*,w} | (64,) (64,) (64,) (64,) (64,256,1,1) | |
res2.2.conv2.* | res2_2branch2b{bn_*,w} | (64,) (64,) (64,) (64,) (64,64,3,3) | |
res2.2.conv3.* | res2_2branch2c{bn_*,w} | (256,) (256,) (256,) (256,) (256,64,1,1) | |
res3.0.conv1.* | res3_0branch2a{bn_*,w} | (128,) (128,) (128,) (128,) (128,256,1,1) | |
res3.0.conv2.* | res3_0branch2b{bn_*,w} | (128,) (128,) (128,) (128,) (128,128,3,3) | |
res3.0.conv3.* | res3_0branch2c{bn_*,w} | (512,) (512,) (512,) (512,) (512,128,1,1) | |
res3.0.shortcut.* | res3_0branch1{bn_*,w} | (512,) (512,) (512,) (512,) (512,256,1,1) | |
res3.1.conv1.* | res3_1branch2a{bn_*,w} | (128,) (128,) (128,) (128,) (128,512,1,1) | |
res3.1.conv2.* | res3_1branch2b{bn_*,w} | (128,) (128,) (128,) (128,) (128,128,3,3) | |
res3.1.conv3.* | res3_1branch2c{bn_*,w} | (512,) (512,) (512,) (512,) (512,128,1,1) | |
res3.2.conv1.* | res3_2branch2a{bn_*,w} | (128,) (128,) (128,) (128,) (128,512,1,1) | |
res3.2.conv2.* | res3_2branch2b{bn_*,w} | (128,) (128,) (128,) (128,) (128,128,3,3) | |
res3.2.conv3.* | res3_2branch2c{bn_*,w} | (512,) (512,) (512,) (512,) (512,128,1,1) | |
res3.3.conv1.* | res3_3branch2a{bn_*,w} | (128,) (128,) (128,) (128,) (128,512,1,1) | |
res3.3.conv2.* | res3_3branch2b{bn_*,w} | (128,) (128,) (128,) (128,) (128,128,3,3) | |
res3.3.conv3.* | res3_3branch2c{bn_*,w} | (512,) (512,) (512,) (512,) (512,128,1,1) | |
res4.0.conv1.* | res4_0branch2a{bn_*,w} | (256,) (256,) (256,) (256,) (256,512,1,1) | |
res4.0.conv2.* | res4_0branch2b{bn_*,w} | (256,) (256,) (256,) (256,) (256,256,3,3) | |
res4.0.conv3.* | res4_0branch2c{bn_*,w} | (1024,) (1024,) (1024,) (1024,) (1024,256,1,1) | |
res4.0.shortcut.* | res4_0branch1{bn_*,w} | (1024,) (1024,) (1024,) (1024,) (1024,512,1,1) | |
res4.1.conv1.* | res4_1branch2a{bn_*,w} | (256,) (256,) (256,) (256,) (256,1024,1,1) | |
res4.1.conv2.* | res4_1branch2b{bn_*,w} | (256,) (256,) (256,) (256,) (256,256,3,3) | |
res4.1.conv3.* | res4_1branch2c{bn_*,w} | (1024,) (1024,) (1024,) (1024,) (1024,256,1,1) | |
res4.2.conv1.* | res4_2branch2a{bn_*,w} | (256,) (256,) (256,) (256,) (256,1024,1,1) | |
res4.2.conv2.* | res4_2branch2b{bn_*,w} | (256,) (256,) (256,) (256,) (256,256,3,3) | |
res4.2.conv3.* | res4_2branch2c{bn_*,w} | (1024,) (1024,) (1024,) (1024,) (1024,256,1,1) | |
res4.3.conv1.* | res4_3branch2a{bn_*,w} | (256,) (256,) (256,) (256,) (256,1024,1,1) | |
res4.3.conv2.* | res4_3branch2b{bn_*,w} | (256,) (256,) (256,) (256,) (256,256,3,3) | |
res4.3.conv3.* | res4_3branch2c{bn_*,w} | (1024,) (1024,) (1024,) (1024,) (1024,256,1,1) | |
res4.4.conv1.* | res4_4branch2a{bn_*,w} | (256,) (256,) (256,) (256,) (256,1024,1,1) | |
res4.4.conv2.* | res4_4branch2b{bn_*,w} | (256,) (256,) (256,) (256,) (256,256,3,3) | |
res4.4.conv3.* | res4_4branch2c{bn_*,w} | (1024,) (1024,) (1024,) (1024,) (1024,256,1,1) | |
res4.5.conv1.* | res4_5branch2a{bn_*,w} | (256,) (256,) (256,) (256,) (256,1024,1,1) | |
res4.5.conv2.* | res4_5branch2b{bn_*,w} | (256,) (256,) (256,) (256,) (256,256,3,3) | |
res4.5.conv3.* | res4_5branch2c{bn_*,w} | (1024,) (1024,) (1024,) (1024,) (1024,256,1,1) | |
res5.0.conv1.* | res5_0branch2a{bn_*,w} | (512,) (512,) (512,) (512,) (512,1024,1,1) | |
res5.0.conv2.* | res5_0branch2b{bn_*,w} | (512,) (512,) (512,) (512,) (512,512,3,3) | |
res5.0.conv3.* | res5_0branch2c{bn_*,w} | (2048,) (2048,) (2048,) (2048,) (2048,512,1,1) | |
res5.0.shortcut.* | res5_0branch1{bn_*,w} | (2048,) (2048,) (2048,) (2048,) (2048,1024,1,1) | |
res5.1.conv1.* | res5_1branch2a{bn_*,w} | (512,) (512,) (512,) (512,) (512,2048,1,1) | |
res5.1.conv2.* | res5_1branch2b{bn_*,w} | (512,) (512,) (512,) (512,) (512,512,3,3) | |
res5.1.conv3.* | res5_1branch2c{bn_*,w} | (2048,) (2048,) (2048,) (2048,) (2048,512,1,1) | |
res5.2.conv1.* | res5_2branch2a{bn_*,w} | (512,) (512,) (512,) (512,) (512,2048,1,1) | |
res5.2.conv2.* | res5_2branch2b{bn_*,w} | (512,) (512,) (512,) (512,) (512,512,3,3) | |
res5.2.conv3.* | res5_2branch2c{bn_*,w} | (2048,) (2048,) (2048,) (2048,) (2048,512,1,1) | |
stem.conv1.norm.* | res_conv1bn* | (64,) (64,) (64,) (64,) | |
stem.conv1.weight | conv1_w | (64, 3, 7, 7) |
[03/26 21:45:54] fvcore.common.checkpoint INFO: Some model parameters or buffers are not found in the checkpoint:
anchor_generator.cell_anchors.0
decoder.bbox_pred.{bias, weight}
decoder.bbox_subnet.0.{bias, weight}
decoder.bbox_subnet.1.{bias, running_mean, running_var, weight}
decoder.bbox_subnet.10.{bias, running_mean, running_var, weight}
decoder.bbox_subnet.3.{bias, weight}
decoder.bbox_subnet.4.{bias, running_mean, running_var, weight}
decoder.bbox_subnet.6.{bias, weight}
decoder.bbox_subnet.7.{bias, running_mean, running_var, weight}
decoder.bbox_subnet.9.{bias, weight}
decoder.cls_score.{bias, weight}
decoder.cls_subnet.0.{bias, weight}
decoder.cls_subnet.1.{bias, running_mean, running_var, weight}
decoder.cls_subnet.3.{bias, weight}
decoder.cls_subnet.4.{bias, running_mean, running_var, weight}
decoder.object_pred.{bias, weight}
encoder.dilated_encoder_blocks.0.conv1.0.{bias, weight}
encoder.dilated_encoder_blocks.0.conv1.1.{bias, running_mean, running_var, weight}
encoder.dilated_encoder_blocks.0.conv2.0.{bias, weight}
encoder.dilated_encoder_blocks.0.conv2.1.{bias, running_mean, running_var, weight}
encoder.dilated_encoder_blocks.0.conv3.0.{bias, weight}
encoder.dilated_encoder_blocks.0.conv3.1.{bias, running_mean, running_var, weight}
encoder.dilated_encoder_blocks.1.conv1.0.{bias, weight}
encoder.dilated_encoder_blocks.1.conv1.1.{bias, running_mean, running_var, weight}
encoder.dilated_encoder_blocks.1.conv2.0.{bias, weight}
encoder.dilated_encoder_blocks.1.conv2.1.{bias, running_mean, running_var, weight}
encoder.dilated_encoder_blocks.1.conv3.0.{bias, weight}
encoder.dilated_encoder_blocks.1.conv3.1.{bias, running_mean, running_var, weight}
encoder.dilated_encoder_blocks.2.conv1.0.{bias, weight}
encoder.dilated_encoder_blocks.2.conv1.1.{bias, running_mean, running_var, weight}
encoder.dilated_encoder_blocks.2.conv2.0.{bias, weight}
encoder.dilated_encoder_blocks.2.conv2.1.{bias, running_mean, running_var, weight}
encoder.dilated_encoder_blocks.2.conv3.0.{bias, weight}
encoder.dilated_encoder_blocks.2.conv3.1.{bias, running_mean, running_var, weight}
encoder.dilated_encoder_blocks.3.conv1.0.{bias, weight}
encoder.dilated_encoder_blocks.3.conv1.1.{bias, running_mean, running_var, weight}
encoder.dilated_encoder_blocks.3.conv2.0.{bias, weight}
encoder.dilated_encoder_blocks.3.conv2.1.{bias, running_mean, running_var, weight}
encoder.dilated_encoder_blocks.3.conv3.0.{bias, weight}
encoder.dilated_encoder_blocks.3.conv3.1.{bias, running_mean, running_var, weight}
encoder.fpn_conv.{bias, weight}
encoder.fpn_norm.{bias, running_mean, running_var, weight}
encoder.lateral_conv.{bias, weight}
encoder.lateral_norm.{bias, running_mean, running_var, weight}
[03/26 21:45:54] fvcore.common.checkpoint INFO: The checkpoint state_dict contains keys that are not used by the model:
fc1000.{bias, weight}
stem.conv1.bias
[03/26 21:45:54] d2.engine.train_loop INFO: Starting training from iteration 0
[03/26 21:45:54 d2.engine.train_loop]: Starting training from iteration 0
/opt/conda/conda-bld/pytorch_1595629408163/work/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [120,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1595629408163/work/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [61,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1595629408163/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f4c7325377d in /home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xb5d (0x7f4c734a3d9d in /home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f4c7323fb1d in /home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #3:
Traceback (most recent call last):
File "./tools/train_net.py", line 234, in <module>
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
fn(i, *args)
File "/home/cw/detectron2/detectron2/engine/launch.py", line 94, in _distributed_worker
main_func(*args)
File "/home/cw/YOLOF/tools/train_net.py", line 221, in main
return trainer.train()
File "/home/cw/detectron2/detectron2/engine/defaults.py", line 431, in train
super().train(self.start_iter, self.max_iter)
File "/home/cw/detectron2/detectron2/engine/train_loop.py", line 140, in train
self.run_step()
File "/home/cw/detectron2/detectron2/engine/defaults.py", line 441, in run_step
self._trainer.run_step()
File "/home/cw/detectron2/detectron2/engine/train_loop.py", line 234, in run_step
loss_dict = self.model(data)
File "/home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 511, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/cw/YOLOF/yolof/modeling/yolof.py", line 295, in forward
pred_logits, pred_anchor_deltas)
File "/home/cw/YOLOF/yolof/modeling/yolof.py", line 394, in losses
pred_class_logits[valid_idxs],
RuntimeError: copy_if failed to synchronize: device-side assert triggered
command :
python ./tools/train_net.py --num-gpus 2 --config-file ./configs/yolof_R_50_C5_1x.yaml OUTPUT_DIR /hdd2/wh/cw/train/yolof/R_50_C5_1x/
yaml:
MODEL:
  META_ARCHITECTURE: "YOLOF"
  BACKBONE:
    NAME: "build_resnet_backbone"
  RESNETS:
    OUT_FEATURES: ["res5"]
DATASETS:
  TRAIN: ("coco_2017_train",)
  TEST: ("coco_2017_val",)
DATALOADER:
  NUM_WORKERS: 8
SOLVER:
  IMS_PER_BATCH: 16
  BASE_LR: 0.03
  WARMUP_FACTOR: 0.0002  # 0.00066667
  WARMUP_ITERS: 5000  # 1500
  STEPS: (60000, 80000)
  MAX_ITER: 90000
  CHECKPOINT_PERIOD: 2500
INPUT:
  MIN_SIZE_TRAIN: (800,)
OUTPUT_DIR: '/hdd2/wh/cw/train/yolof/R_50_C5_1x'
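Hello, my rough guess is that this is a data-related error, perhaps an array index going out of bounds? But my dataset is COCO 2017, so this error should not occur. I have not changed anything else, and I also upgraded my existing detectron2 installation. I found that some input images do not trigger the error: the network can run a few iterations and print the loss.
Judging from the error, an index went out of bounds during indexing, but that is strange: I have run this many times and never hit this problem. Skimming the log file, I did not find anything obviously wrong either, so it does not make much sense. Does this error occur every single time you run it?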
pred_class_logits[valid_idxs].size()
Yes, it goes out of bounds on every run. I did some debugging at the failing location in yolof.py:
print("gt_classes >= 0", gt_classes[gt_classes >= 0].size())
gt_class.size() torch.Size([38000])
gt_classes >= 0 torch.Size([37818])
valid_idxs torch.Size([38000])
pred_class_logits.size() torch.Size([38000, 80])
pred_class_logits[valid_idxs] torch.Size([37818, 80])

gt_class.size() torch.Size([42000])
gt_classes >= 0 torch.Size([41866])
valid_idxs torch.Size([42000])
pred_class_logits.size() torch.Size([42000, 80])
pred_class_logits[valid_idxs] torch.Size([41866, 80])

gt_class.size() torch.Size([38000])
On this last iteration, print("gt_classes >= 0", gt_classes[gt_classes >= 0].size()) produced no output before the crash:
/opt/conda/conda-bld/pytorch_1595629408163/work/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [9,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1595629408163/work/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [11,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
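The assert says that some index fell outside the dimension it was used to index. Below is a minimal sketch of a CPU-side range check that could be placed just before the failing line. It assumes (this is not confirmed anywhere in the thread) that the labels should only take values from -1 (ignored) up to `num_classes` (background). `find_out_of_range` is a hypothetical helper, not part of YOLOF or detectron2.

```python
def find_out_of_range(labels, low, high):
    """Return (position, value) pairs whose value lies outside [low, high]."""
    return [(i, v) for i, v in enumerate(labels) if not low <= v <= high]


# Hypothetical usage inside the loss, on a CPU copy of the label tensor:
#   bad = find_out_of_range(gt_classes.tolist(), -1, num_classes)
#   if bad:
#       print("unexpected labels (position, value):", bad[:10])
```

Running the check on a plain Python list keeps it independent of the GPU, so it still reports something useful when the CUDA context is already in an error state.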
I don't think the error should be at this line. I suggest running
CUDA_LAUNCH_BLOCKING=1 python ./tools/train_net.py --num-gpus 2 --config-file ./configs/yolof_R_50_C5_1x.yaml OUTPUT_DIR /hdd2/wh/cw/train/yolof/R_50_C5_1x/
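to see exactly where it fails. Alternatively, set up the environment from scratch on another machine and try again; in principle the code should run out of the box.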
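CUDA kernels are launched asynchronously, so the Python line shown in the traceback is not necessarily the one that triggered the device-side assert. `CUDA_LAUNCH_BLOCKING=1` forces synchronous launches so that the traceback lines up with the failing call. If prefixing the command is inconvenient, the variable can also be set from Python. This is a minimal sketch, assuming it is placed at the very top of the entry script before `import torch`; `enable_sync_cuda_errors` is a hypothetical helper name.

```python
import os


def enable_sync_cuda_errors(env=os.environ):
    """Set CUDA_LAUNCH_BLOCKING=1 unless the caller already chose a value."""
    env.setdefault("CUDA_LAUNCH_BLOCKING", "1")
    return env["CUDA_LAUNCH_BLOCKING"]


# Assumption: this runs before the first CUDA call, so the safest place
# is above `import torch` in the training script.
enable_sync_cuda_errors()
```

Synchronous launches slow training down noticeably, so this is meant for a short debugging run only.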
It still fails, so I may have to try another machine. I am not sure whether my CUDA 9.0 has anything to do with this.
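Hello, I switched to a machine with CUDA 10.1 and repeated the same steps (essentially a direct install, with the dataset copied straight from the original server), and the problem went away. Strange. For now I can only attribute it to my earlier setup, CUDA 9.0 with cudatoolkit 9.2, not working well with your code? (AdelaiDet still runs fine for me there. I had not updated it for over half a year, updated it today, made a small change to train.py, and had no problems.)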
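One more small thing that I think you could consider changing: the mish-cuda package that you recommend installing does not seem to build out of the box.
After git clone, you should move mish-cuda/external/CUDAApplyUtils.cuh to csrc/ before running python setup.py build install.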
Thanks for sharing your great work. I ran into an error when running
python ./tools/train_net.py --num-gpus 1 --config-file ./configs/yolof_R_50_C5_1x.yaml
The log is below:
[03/26 07:38:03 d2.data.build]: Using training sampler TrainingSampler
[03/26 07:38:03 d2.data.common]: Serializing 117266 elements to byte tensors and concatenating them all ...
[03/26 07:38:10 d2.data.common]: Serialized dataset takes 451.21 MiB
[03/26 07:38:15 fvcore.common.checkpoint]: Loading checkpoint from detectron2://ImageNetPretrained/MSRA/R-50.pkl
Traceback (most recent call last):
File "./tools/train_net.py", line 234, in <module>
args=(args,),
File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/launch.py", line 62, in launch
main_func(*args)
File "./tools/train_net.py", line 215, in main
trainer.resume_or_load(resume=args.resume)
File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/defaults.py", line 353, in resume_or_load
checkpoint = self.checkpointer.resume_or_load(self.cfg.MODEL.WEIGHTS, resume=resume)
File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/fvcore/common/checkpoint.py", line 215, in resume_or_load
return self.load(path, checkpointables=[])
File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/fvcore/common/checkpoint.py", line 140, in load
path = self.path_manager.get_local_path(path)
File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/iopath/common/file_io.py", line 1100, in get_local_path
path, force=force, **kwargs
File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/utils/file_io.py", line 29, in _get_local_path
return PathManager.get_local_path(self.S3_DETECTRON2_PREFIX + name, **kwargs)
File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/iopath/common/file_io.py", line 1100, in get_local_path
path, force=force, **kwargs
File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/iopath/common/file_io.py", line 755, in _get_local_path
with file_lock(cached):
File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/iopath/common/file_io.py", line 82, in file_lock
return portalocker.Lock(path + ".lock", timeout=3600) # type: ignore
AttributeError: module 'portalocker' has no attribute 'Lock'
I would be grateful if you could give me some advice. Thanks.
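An `AttributeError` on `portalocker.Lock` means that the module Python imported under that name does not expose a `Lock` attribute. One possible explanation, not confirmed in the thread, is a broken or shadowed install. The sketch below uses only the standard library to report where a module is loaded from and which attributes it exposes. `describe_module` is a hypothetical helper, not part of portalocker or iopath.

```python
import importlib
import importlib.util


def describe_module(name, attrs=("__version__", "Lock")):
    """Report where `name` is imported from and which of `attrs` it exposes."""
    if importlib.util.find_spec(name) is None:
        return {"installed": False}
    module = importlib.import_module(name)
    return {
        "installed": True,
        # Path of the file that was actually imported (None for built-ins).
        "origin": getattr(module, "__file__", None),
        "attrs": {attr: hasattr(module, attr) for attr in attrs},
    }


if __name__ == "__main__":
    print(describe_module("portalocker"))
```

If the reported origin points outside the environment's `site-packages`, or `Lock` shows up as missing, reinstalling the package is a reasonable next step.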