chensnathan / YOLOF

You Only Look One-level Feature (YOLOF), CVPR2021, Detectron2
MIT License

AttributeError: module 'portalocker' has no attribute 'Lock' #4

Closed Wei-i closed 3 years ago

Wei-i commented 3 years ago

Thanks for sharing your great work. Unfortunately, I hit a bug when I run python ./tools/train_net.py --num-gpus 1 --config-file ./configs/yolof_R_50_C5_1x.yaml

The bug log is below:

[03/26 07:38:03 d2.data.build]: Using training sampler TrainingSampler
[03/26 07:38:03 d2.data.common]: Serializing 117266 elements to byte tensors and concatenating them all ...
[03/26 07:38:10 d2.data.common]: Serialized dataset takes 451.21 MiB
[03/26 07:38:15 fvcore.common.checkpoint]: Loading checkpoint from detectron2://ImageNetPretrained/MSRA/R-50.pkl
Traceback (most recent call last):
  File "./tools/train_net.py", line 234, in <module>
    args=(args,),
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/launch.py", line 62, in launch
    main_func(*args)
  File "./tools/train_net.py", line 215, in main
    trainer.resume_or_load(resume=args.resume)
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/defaults.py", line 353, in resume_or_load
    checkpoint = self.checkpointer.resume_or_load(self.cfg.MODEL.WEIGHTS, resume=resume)
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/fvcore/common/checkpoint.py", line 215, in resume_or_load
    return self.load(path, checkpointables=[])
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/fvcore/common/checkpoint.py", line 140, in load
    path = self.path_manager.get_local_path(path)
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/iopath/common/file_io.py", line 1100, in get_local_path
    path, force=force, **kwargs
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/utils/file_io.py", line 29, in _get_local_path
    return PathManager.get_local_path(self.S3_DETECTRON2_PREFIX + name, **kwargs)
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/iopath/common/file_io.py", line 1100, in get_local_path
    path, force=force, **kwargs
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/iopath/common/file_io.py", line 755, in _get_local_path
    with file_lock(cached):
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/iopath/common/file_io.py", line 82, in file_lock
    return portalocker.Lock(path + ".lock", timeout=3600)  # type: ignore
AttributeError: module 'portalocker' has no attribute 'Lock'

I would be grateful if you could give me some advice. Thanks.

Wei-i commented 3 years ago

It seems that something may be going wrong with 'Loading checkpoint from detectron2://ImageNetPretrained/MSRA/R-50.pkl'?

chensnathan commented 3 years ago

Could you check the version of portalocker in your environment? Also, run the following snippet to verify whether portalocker has the Lock attribute:

>>> import portalocker
>>> portalocker.__version__
>>> portalocker.Lock
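
For context, iopath's checkpoint download acquires a file lock roughly like this (a simplified sketch based on the traceback above, not the exact iopath source), which is why a portalocker installation that lacks Lock fails before the weights can even be fetched:

import portalocker  # iopath relies on portalocker.Lock for its file locks

def file_lock(path: str):
    # Simplified equivalent of iopath.common.file_io.file_lock: guard the cached
    # checkpoint file with a .lock file so concurrent processes do not clobber it.
    return portalocker.Lock(path + ".lock", timeout=3600)

# Usage (hypothetical path):
# with file_lock("/tmp/R-50.pkl"):
#     ...  # download or read the cached checkpoint
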
Wei-i commented 3 years ago

Sorry, it seems there is indeed something wrong with my portalocker:

(yolof) cw@MAC-3DGroup:~$ python
Python 3.6.13 | packaged by conda-forge | (default, Feb 19 2021, 05:36:01) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import fvcore
>>> import portalocker
>>> 
>>> portalocker.__version__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'portalocker' has no attribute '__version__'
chensnathan commented 3 years ago

Try to re-install portalocker?

Wei-i commented 3 years ago

First pip uninstall portalocker, then conda install portalocker; after that the bug is fixed.
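
After reinstalling, the earlier check should succeed; a minimal sanity check (the exact version string depends on what conda installs):

import portalocker

# Both of these should now succeed instead of raising AttributeError.
print(portalocker.__version__)
print(portalocker.Lock)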

Wei-i commented 3 years ago

[03/26 08:50:52 d2.engine.hooks]: Total training time: 0:00:01 (0:00:00 on hooks)
[03/26 08:50:52 d2.utils.events]: iter: 0 lr: N/A max_mem: 622M
Traceback (most recent call last):
  File "./tools/train_net.py", line 234, in <module>
    args=(args,),
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/launch.py", line 62, in launch
    main_func(*args)
  File "./tools/train_net.py", line 221, in main
    return trainer.train()
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/defaults.py", line 431, in train
    super().train(self.start_iter, self.max_iter)
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 140, in train
    self.run_step()
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/defaults.py", line 441, in run_step
    self._trainer.run_step()
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 234, in run_step
    loss_dict = self.model(data)
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cw/YOLOF/yolof/modeling/yolof.py", line 294, in forward
    pred_logits, pred_anchor_deltas)
  File "/home/cw/YOLOF/yolof/modeling/yolof.py", line 387, in losses
    dist.all_reduce(num_foreground)
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 953, in all_reduce
    _check_default_pg()
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 211, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized

How can I disable DDP?

Wei-i commented 3 years ago

I think this is the last bug before I can train YOLOF...

chensnathan commented 3 years ago

Comment out the lines with dist in the yolof.py file.

BTW, when you train with only one GPU, you should adjust the learning rate and batch size. Refer to this response.
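
For reference, instead of deleting the dist calls outright, a guard like the following also avoids the assertion when no process group is initialized (a minimal sketch, not necessarily the exact change in the commit mentioned below):

import torch.distributed as dist

def maybe_all_reduce(num_foreground):
    # Only all-reduce when a default process group exists (i.e. multi-GPU DDP
    # training); with a single GPU, keep the local foreground count as-is.
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(num_foreground)
    return num_foreground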

chensnathan commented 3 years ago

Support training with one GPU in this commit.

Wei-i commented 3 years ago

Thanks!

[03/26 09:10:09 d2.engine.hooks]: Overall training speed: 57 iterations in 0:00:18 (0.3159 s / it)
[03/26 09:10:09 d2.engine.hooks]: Total training time: 0:00:18 (0:00:00 on hooks)
[03/26 09:10:09 d2.utils.events]: eta: 2:01:11 iter: 59 total_loss: 2.067 loss_cls: 1.342 loss_box_reg: 0.7438 time: 0.3139 data_time: 0.0022 lr: 3.9308e-06 max_mem: 1076M
Traceback (most recent call last):
  File "tools/train_net.py", line 234, in <module>
    args=(args,),
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/launch.py", line 62, in launch
    main_func(*args)
  File "tools/train_net.py", line 221, in main
    return trainer.train()
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/defaults.py", line 431, in train
    super().train(self.start_iter, self.max_iter)
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 140, in train
    self.run_step()
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/defaults.py", line 441, in run_step
    self._trainer.run_step()
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 234, in run_step
    loss_dict = self.model(data)
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cw/YOLOF/yolof/modeling/yolof.py", line 295, in forward
    pred_logits, pred_anchor_deltas)
  File "/home/cw/YOLOF/yolof/modeling/yolof.py", line 397, in losses
    pred_class_logits[valid_idxs],
RuntimeError: CUDA error: device-side assert triggered

Wei-i commented 3 years ago

I am really sorry, author; because of my limited skill, yet another new bug has appeared. How should I fix this one?

chensnathan commented 3 years ago

Could you give more details about the command you used?

Wei-i commented 3 years ago

Command: CUDA_VISIBLE_DEVICES=1 python tools/train_net.py --num-gpus 1 --config-file ./configs/yolof_R_50_C5_1x.yaml

yaml:

MODEL:
  META_ARCHITECTURE: "YOLOF"
  BACKBONE:
    NAME: "build_resnet_backbone"
  RESNETS:
    OUT_FEATURES: ["res5"]
DATASETS:
  TRAIN: ("coco_2017_train",)
  TEST: ("coco_2017_val",)
DATALOADER:
  # NUM_WORKERS: 8
  NUM_WORKERS: 4
SOLVER:
  # IMS_PER_BATCH: 64
  IMS_PER_BATCH: 2
  # BASE_LR: 0.12
  BASE_LR: 0.00001
  WARMUP_FACTOR: 0.00066667
  WARMUP_ITERS: 1500
  # STEPS: (15000, 20000)
  STEPS: (480000, 640000)
  # MAX_ITER: 22500
  MAX_ITER: 720000
  CHECKPOINT_PERIOD: 2500
INPUT:
  MIN_SIZE_TRAIN: (800,)

chensnathan commented 3 years ago

Can you try this setting?

IMS_PER_BATCH: 8
BASE_LR: 0.03
WARMUP_FACTOR: 0.00066667
WARMUP_ITERS: 1500
STEPS: (120000, 160000)
MAX_ITER: 180000
Wei-i commented 3 years ago

I have left the lab and will try it tomorrow. Thanks a lot from the bottom of my heart!

Wei-i commented 3 years ago

Good morning! When I tried your setting, I still get the same bug:

[03/26 17:38:53 d2.engine.train_loop]: Starting training from iteration 0

/opt/conda/conda-bld/pytorch_1607370169888/work/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [11,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.

ERROR [03/26 17:39:00 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 140, in train
    self.run_step()
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/defaults.py", line 441, in run_step
    self._trainer.run_step()
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 234, in run_step
    loss_dict = self.model(data)
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cw/YOLOF/yolof/modeling/yolof.py", line 295, in forward
    pred_logits, pred_anchor_deltas)
  File "/home/cw/YOLOF/yolof/modeling/yolof.py", line 397, in losses
    pred_class_logits[valid_idxs],
RuntimeError: CUDA error: device-side assert triggered

chensnathan commented 3 years ago

Sorry, the BASE_LR should be 0.015. That said, I can train with one GPU using an initial learning rate of either 0.03 or 0.015; I cannot reproduce your error on my side.
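
For reference, 0.015 is what the usual linear scaling rule gives when going from the default batch size of 64 (BASE_LR 0.12) down to 8; a quick check, assuming the standard Detectron2-style proportional scaling:

# Linear scaling rule: keep BASE_LR / IMS_PER_BATCH constant.
reference_batch, reference_lr = 64, 0.12
new_batch = 8
print(reference_lr * new_batch / reference_batch)  # 0.015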

Try to warm up more iterations, e.g.,

WARMUP_FACTOR: 0.0002
WARMUP_ITERS: 5000
Wei-i commented 3 years ago

Thanks. It still does not work...

chensnathan commented 3 years ago

Could you upload your training log file?

Wei-i commented 3 years ago

[03/26 21:44:54] detectron2 INFO: Rank of current process: 0. World size: 2
[03/26 21:44:55] detectron2 INFO: Environment info:


sys.platform             linux
Python                   3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 02:25:08) [GCC 7.5.0]
numpy                    1.19.1
detectron2               0.4 @/home/cw/detectron2/detectron2
Compiler                 GCC 5.4
CUDA compiler            CUDA 9.0
detectron2 arch flags    6.1
DETECTRON2_ENV_MODULE
PyTorch                  1.6.0 @/home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch
PyTorch debug build      False
GPU available            True
GPU 0,1,2                GeForce GTX 1080 Ti (arch=6.1)
CUDA_HOME                /usr/local/cuda-9.0
Pillow                   7.2.0
torchvision              0.7.0 @/home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torchvision
torchvision arch flags   3.5, 5.0, 6.0, 7.0
fvcore                   0.1.5.post20210327
iopath                   0.1.7
cv2                      4.4.0

PyTorch built with:

[03/26 21:44:55] detectron2 INFO: Command line arguments: Namespace(config_file='./configs/yolof_R_50_C5_1x.yaml', dist_url='tcp://127.0.0.1:50159', eval_only=False, machine_rank=0, num_gpus=2, num_machines=1, opts=['OUTPUT_DIR', '/hdd2/wh/cw/train/yolof/R_50_C5_1x/'], resume=False) [03/26 21:44:55] detectron2 INFO: Contents of args.config_file=./configs/yolof_R_50_C5_1x.yaml: BASE: "Base-YOLOF.yaml" MODEL: WEIGHTS: "detectron2://ImageNetPretrained/MSRA/R-50.pkl" RESNETS: DEPTH: 50 OUTPUT_DIR: "output/yolof/R_50_C5_1x"

[03/26 21:44:55] detectron2 INFO: Running with full config: CUDNN_BENCHMARK: False DATALOADER: ASPECT_RATIO_GROUPING: True FILTER_EMPTY_ANNOTATIONS: True NUM_WORKERS: 8 REPEAT_THRESHOLD: 0.0 SAMPLER_TRAIN: TrainingSampler DATASETS: PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000 PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000 PROPOSAL_FILES_TEST: () PROPOSAL_FILES_TRAIN: () TEST: ('coco_2017_val',) TRAIN: ('coco_2017_train',) GLOBAL: HACK: 1.0 INPUT: CROP: ENABLED: False SIZE: [0.9, 0.9] TYPE: relative_range DISTORTION: ENABLED: False EXPOSURE: 1.5 HUE: 0.1 SATURATION: 1.5 FORMAT: BGR JITTER_CROP: ENABLED: False JITTER_RATIO: 0.3 MASK_FORMAT: polygon MAX_SIZE_TEST: 1333 MAX_SIZE_TRAIN: 1333 MIN_SIZE_TEST: 800 MIN_SIZE_TRAIN: (800,) MIN_SIZE_TRAIN_SAMPLING: choice MOSAIC: ENABLED: False MIN_OFFSET: 0.2 MOSAIC_HEIGHT: 640 MOSAIC_WIDTH: 640 NUM_IMAGES: 4 POOL_CAPACITY: 1000 RANDOM_FLIP: horizontal RESIZE: ENABLED: False SCALE_JITTER: (0.8, 1.2) SHAPE: (640, 640) TEST_SHAPE: (608, 608) SHIFT: SHIFT_PIXELS: 32 MODEL: ANCHOR_GENERATOR: ANGLES: [[-90, 0, 90]] ASPECT_RATIOS: [[1.0]] NAME: DefaultAnchorGenerator OFFSET: 0.0 SIZES: [[32, 64, 128, 256, 512]] BACKBONE: FREEZE_AT: 2 NAME: build_resnet_backbone DARKNET: DEPTH: 53 NORM: BN OUT_FEATURES: ['res5'] RES5_DILATION: 1 WITH_CSP: True DEVICE: cuda FPN: FUSE_TYPE: sum IN_FEATURES: [] NORM: OUT_CHANNELS: 256 KEYPOINT_ON: False LOAD_PROPOSALS: False MASK_ON: False META_ARCHITECTURE: YOLOF PANOPTIC_FPN: COMBINE: ENABLED: True INSTANCES_CONFIDENCE_THRESH: 0.5 OVERLAP_THRESH: 0.5 STUFF_AREA_LIMIT: 4096 INSTANCE_LOSS_WEIGHT: 1.0 PIXEL_MEAN: [103.53, 116.28, 123.675] PIXEL_STD: [1.0, 1.0, 1.0] PROPOSAL_GENERATOR: MIN_SIZE: 0 NAME: RPN RESNETS: DEFORM_MODULATED: False DEFORM_NUM_GROUPS: 1 DEFORM_ON_PER_STAGE: [False, False, False, False] DEPTH: 50 NORM: FrozenBN NUM_GROUPS: 1 OUT_FEATURES: ['res5'] RES2_OUT_CHANNELS: 256 RES5_DILATION: 1 STEM_OUT_CHANNELS: 64 STRIDE_IN_1X1: True WIDTH_PER_GROUP: 64 RETINANET: BBOX_REG_LOSS_TYPE: smooth_l1 BBOX_REG_WEIGHTS: (1.0, 1.0, 1.0, 1.0) FOCAL_LOSS_ALPHA: 0.25 FOCAL_LOSS_GAMMA: 2.0 IN_FEATURES: ['p3', 'p4', 'p5', 'p6', 'p7'] IOU_LABELS: [0, -1, 1] IOU_THRESHOLDS: [0.4, 0.5] NMS_THRESH_TEST: 0.5 NORM: NUM_CLASSES: 80 NUM_CONVS: 4 PRIOR_PROB: 0.01 SCORE_THRESH_TEST: 0.05 SMOOTH_L1_LOSS_BETA: 0.1 TOPK_CANDIDATES_TEST: 1000 ROI_BOX_CASCADE_HEAD: BBOX_REG_WEIGHTS: ((10.0, 10.0, 5.0, 5.0), (20.0, 20.0, 10.0, 10.0), (30.0, 30.0, 15.0, 15.0)) IOUS: (0.5, 0.6, 0.7) ROI_BOX_HEAD: BBOX_REG_LOSS_TYPE: smooth_l1 BBOX_REG_LOSS_WEIGHT: 1.0 BBOX_REG_WEIGHTS: (10.0, 10.0, 5.0, 5.0) CLS_AGNOSTIC_BBOX_REG: False CONV_DIM: 256 FC_DIM: 1024 NAME: NORM: NUM_CONV: 0 NUM_FC: 0 POOLER_RESOLUTION: 14 POOLER_SAMPLING_RATIO: 0 POOLER_TYPE: ROIAlignV2 SMOOTH_L1_BETA: 0.0 TRAIN_ON_PRED_BOXES: False ROI_HEADS: BATCH_SIZE_PER_IMAGE: 512 IN_FEATURES: ['res4'] IOU_LABELS: [0, 1] IOU_THRESHOLDS: [0.5] NAME: Res5ROIHeads NMS_THRESH_TEST: 0.5 NUM_CLASSES: 80 POSITIVE_FRACTION: 0.25 PROPOSAL_APPEND_GT: True SCORE_THRESH_TEST: 0.05 ROI_KEYPOINT_HEAD: CONV_DIMS: (512, 512, 512, 512, 512, 512, 512, 512) LOSS_WEIGHT: 1.0 MIN_KEYPOINTS_PER_IMAGE: 1 NAME: KRCNNConvDeconvUpsampleHead NORMALIZE_LOSS_BY_VISIBLE_KEYPOINTS: True NUM_KEYPOINTS: 17 POOLER_RESOLUTION: 14 POOLER_SAMPLING_RATIO: 0 POOLER_TYPE: ROIAlignV2 ROI_MASK_HEAD: CLS_AGNOSTIC_MASK: False CONV_DIM: 256 NAME: MaskRCNNConvUpsampleHead NORM: NUM_CONV: 0 POOLER_RESOLUTION: 14 POOLER_SAMPLING_RATIO: 0 POOLER_TYPE: ROIAlignV2 RPN: BATCH_SIZE_PER_IMAGE: 256 BBOX_REG_LOSS_TYPE: smooth_l1 BBOX_REG_LOSS_WEIGHT: 1.0 
BBOX_REG_WEIGHTS: (1.0, 1.0, 1.0, 1.0) BOUNDARY_THRESH: -1 HEAD_NAME: StandardRPNHead IN_FEATURES: ['res4'] IOU_LABELS: [0, -1, 1] IOU_THRESHOLDS: [0.3, 0.7] LOSS_WEIGHT: 1.0 NMS_THRESH: 0.7 POSITIVE_FRACTION: 0.5 POST_NMS_TOPK_TEST: 1000 POST_NMS_TOPK_TRAIN: 2000 PRE_NMS_TOPK_TEST: 6000 PRE_NMS_TOPK_TRAIN: 12000 SMOOTH_L1_BETA: 0.0 SEM_SEG_HEAD: COMMON_STRIDE: 4 CONVS_DIM: 128 IGNORE_VALUE: 255 IN_FEATURES: ['p2', 'p3', 'p4', 'p5'] LOSS_WEIGHT: 1.0 NAME: SemSegFPNHead NORM: GN NUM_CLASSES: 54 WEIGHTS: detectron2://ImageNetPretrained/MSRA/R-50.pkl YOLOF: BOX_TRANSFORM: ADD_CTR_CLAMP: True BBOX_REG_WEIGHTS: (1.0, 1.0, 1.0, 1.0) CTR_CLAMP: 32 DECODER: ACTIVATION: ReLU CLS_NUM_CONVS: 2 IN_CHANNELS: 512 NORM: BN NUM_ANCHORS: 5 NUM_CLASSES: 80 PRIOR_PROB: 0.01 REG_NUM_CONVS: 4 DETECTIONS_PER_IMAGE: 100 ENCODER: ACTIVATION: ReLU BACKBONE_LEVEL: res5 BLOCK_DILATIONS: [2, 4, 6, 8] BLOCK_MID_CHANNELS: 128 IN_CHANNELS: 2048 NORM: BN NUM_CHANNELS: 512 NUM_RESIDUAL_BLOCKS: 4 LOSSES: BBOX_REG_LOSS_TYPE: giou FOCAL_LOSS_ALPHA: 0.25 FOCAL_LOSS_GAMMA: 2.0 MATCHER: TOPK: 4 NEG_IGNORE_THRESHOLD: 0.7 NMS_THRESH_TEST: 0.6 POS_IGNORE_THRESHOLD: 0.15 SCORE_THRESH_TEST: 0.05 TOPK_CANDIDATES_TEST: 1000 OUTPUT_DIR: /hdd2/wh/cw/train/yolof/R_50_C5_1x/ SEED: -1 SOLVER: AMP: ENABLED: False BACKBONE_MULTIPLIER: 0.334 BASE_LR: 0.003 BIAS_LR_FACTOR: 1.0 CHECKPOINT_PERIOD: 2500 CLIP_GRADIENTS: CLIP_TYPE: value CLIP_VALUE: 1.0 ENABLED: False NORM_TYPE: 2.0 GAMMA: 0.1 IMS_PER_BATCH: 16 LR_SCHEDULER_NAME: WarmupMultiStepLR MAX_ITER: 90000 MOMENTUM: 0.9 NESTEROV: False REFERENCE_WORLD_SIZE: 0 STEPS: (60000, 80000) WARMUP_FACTOR: 0.0002 WARMUP_ITERS: 5000 WARMUP_METHOD: linear WEIGHT_DECAY: 0.0001 WEIGHT_DECAY_BIAS: 0.0001 WEIGHT_DECAY_NORM: 0.0 TEST: AUG: ENABLED: False FLIP: True MAX_SIZE: 4000 MIN_SIZES: (400, 500, 600, 700, 800, 900, 1000, 1100, 1200) DETECTIONS_PER_IMAGE: 100 EVAL_PERIOD: 0 EXPECTED_RESULTS: [] KEYPOINT_OKS_SIGMAS: [] PRECISE_BN: ENABLED: False NUM_ITER: 200 VERSION: 2 VIS_PERIOD: 0 [03/26 21:44:55] detectron2 INFO: Full config saved to /hdd2/wh/cw/train/yolof/R_50_C5_1x/config.yaml [03/26 21:44:55] d2.utils.env INFO: Using a generated random seed 55931624 [03/26 21:44:56] d2.engine.defaults INFO: Model: YOLOF( (backbone): ResNet( (stem): BasicStem( (conv1): Conv2d( 3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) ) (res2): Sequential( (0): BottleneckBlock( (shortcut): Conv2d( 64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv1): Conv2d( 64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv2): Conv2d( 64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv3): Conv2d( 64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) ) (1): BottleneckBlock( (conv1): Conv2d( 256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv2): Conv2d( 64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv3): Conv2d( 64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) ) (2): BottleneckBlock( (conv1): Conv2d( 256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, 
eps=1e-05) ) (conv2): Conv2d( 64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv3): Conv2d( 64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) ) ) (res3): Sequential( (0): BottleneckBlock( (shortcut): Conv2d( 256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv1): Conv2d( 256, 128, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv2): Conv2d( 128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv3): Conv2d( 128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) ) (1): BottleneckBlock( (conv1): Conv2d( 512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv2): Conv2d( 128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv3): Conv2d( 128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) ) (2): BottleneckBlock( (conv1): Conv2d( 512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv2): Conv2d( 128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv3): Conv2d( 128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) ) (3): BottleneckBlock( (conv1): Conv2d( 512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv2): Conv2d( 128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv3): Conv2d( 128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) ) ) (res4): Sequential( (0): BottleneckBlock( (shortcut): Conv2d( 512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) (conv1): Conv2d( 512, 256, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (1): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (2): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), 
bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (3): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (4): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (5): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) ) (res5): Sequential( (0): BottleneckBlock( (shortcut): Conv2d( 1024, 2048, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05) ) (conv1): Conv2d( 1024, 512, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv2): Conv2d( 512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv3): Conv2d( 512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05) ) ) (1): BottleneckBlock( (conv1): Conv2d( 2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv2): Conv2d( 512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv3): Conv2d( 512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05) ) ) (2): BottleneckBlock( (conv1): Conv2d( 2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv2): Conv2d( 512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv3): Conv2d( 512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05) ) ) ) ) (encoder): DilatedEncoder( (lateral_conv): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1)) (lateral_norm): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (fpn_conv): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (fpn_norm): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (dilated_encoder_blocks): Sequential( (0): Bottleneck( (conv1): Sequential( (0): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1)) (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) (conv2): Sequential( (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2), dilation=(2, 2)) 
(1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) (conv3): Sequential( (0): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1)) (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) ) (1): Bottleneck( (conv1): Sequential( (0): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1)) (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) (conv2): Sequential( (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(4, 4), dilation=(4, 4)) (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) (conv3): Sequential( (0): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1)) (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) ) (2): Bottleneck( (conv1): Sequential( (0): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1)) (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) (conv2): Sequential( (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(6, 6), dilation=(6, 6)) (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) (conv3): Sequential( (0): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1)) (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) ) (3): Bottleneck( (conv1): Sequential( (0): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1)) (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) (conv2): Sequential( (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(8, 8), dilation=(8, 8)) (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) (conv3): Sequential( (0): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1)) (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) ) ) ) (decoder): Decoder( (cls_subnet): Sequential( (0): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) (3): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (4): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (5): ReLU(inplace=True) ) (bbox_subnet): Sequential( (0): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) (3): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (4): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (5): ReLU(inplace=True) (6): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (7): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (8): ReLU(inplace=True) (9): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (10): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (11): ReLU(inplace=True) ) (cls_score): Conv2d(512, 400, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (bbox_pred): Conv2d(512, 20, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (object_pred): Conv2d(512, 5, 
kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) (anchor_generator): DefaultAnchorGenerator( (cell_anchors): BufferList() ) (anchor_matcher): UniformMatcher() ) [03/26 21:45:18] d2.data.datasets.coco INFO: Loading datasets/coco/annotations/instances_train2017.json takes 21.19 seconds. [03/26 21:45:19] d2.data.datasets.coco INFO: Loaded 118287 images in COCO format from datasets/coco/annotations/instances_train2017.json [03/26 21:45:31] d2.data.build INFO: Removed 1021 images with no usable annotations. 117266 images left. [03/26 21:45:38] d2.data.build INFO: Distribution of instances among all 80 categories:  category #instances category #instances category #instances
person 257253 bicycle 7056 car 43533
motorcycle 8654 airplane 5129 bus 6061
train 4570 truck 9970 boat 10576
traffic light 12842 fire hydrant 1865 stop sign 1983
parking meter 1283 bench 9820 bird 10542
cat 4766 dog 5500 horse 6567
sheep 9223 cow 8014 elephant 5484
bear 1294 zebra 5269 giraffe 5128
backpack 8714 umbrella 11265 handbag 12342
tie 6448 suitcase 6112 frisbee 2681
skis 6623 snowboard 2681 sports ball 6299
kite 8802 baseball bat 3273 baseball gl.. 3747
skateboard 5536 surfboard 6095 tennis racket 4807
bottle 24070 wine glass 7839 cup 20574
fork 5474 knife 7760 spoon 6159
bowl 14323 banana 9195 apple 5776
sandwich 4356 orange 6302 broccoli 7261
carrot 7758 hot dog 2884 pizza 5807
donut 7005 cake 6296 chair 38073
couch 5779 potted plant 8631 bed 4192
dining table 15695 toilet 4149 tv 5803
laptop 4960 mouse 2261 remote 5700
keyboard 2854 cell phone 6422 microwave 1672
oven 3334 toaster 225 sink 5609
refrigerator 2634 book 24077 clock 6320
vase 6577 scissors 1464 teddy bear 4729
hair drier 198 toothbrush 1945
total 849949 
[03/26 21:45:38] d2.data.build INFO: Using training sampler TrainingSampler [03/26 21:45:40] d2.data.common INFO: Serializing 117266 elements to byte tensors and concatenating them all ... [03/26 21:45:46] d2.data.common INFO: Serialized dataset takes 451.21 MiB [03/26 21:45:54] fvcore.common.checkpoint INFO: Loading checkpoint from detectron2://ImageNetPretrained/MSRA/R-50.pkl [03/26 21:45:54] d2.checkpoint.c2_model_loading INFO: Renaming Caffe2 weights ...... [03/26 21:45:54] d2.checkpoint.c2_model_loading INFO: Following weights matched with submodule backbone: Names in Model Names in Checkpoint Shapes
res2.0.conv1.* res2_0branch2a{bn_*,w} (64,) (64,) (64,) (64,) (64,64,1,1)
res2.0.conv2.* res2_0branch2b{bn_*,w} (64,) (64,) (64,) (64,) (64,64,3,3)
res2.0.conv3.* res2_0branch2c{bn_*,w} (256,) (256,) (256,) (256,) (256,64,1,1)
res2.0.shortcut.* res2_0branch1{bn_*,w} (256,) (256,) (256,) (256,) (256,64,1,1)
res2.1.conv1.* res2_1branch2a{bn_*,w} (64,) (64,) (64,) (64,) (64,256,1,1)
res2.1.conv2.* res2_1branch2b{bn_*,w} (64,) (64,) (64,) (64,) (64,64,3,3)
res2.1.conv3.* res2_1branch2c{bn_*,w} (256,) (256,) (256,) (256,) (256,64,1,1)
res2.2.conv1.* res2_2branch2a{bn_*,w} (64,) (64,) (64,) (64,) (64,256,1,1)
res2.2.conv2.* res2_2branch2b{bn_*,w} (64,) (64,) (64,) (64,) (64,64,3,3)
res2.2.conv3.* res2_2branch2c{bn_*,w} (256,) (256,) (256,) (256,) (256,64,1,1)
res3.0.conv1.* res3_0branch2a{bn_*,w} (128,) (128,) (128,) (128,) (128,256,1,1)
res3.0.conv2.* res3_0branch2b{bn_*,w} (128,) (128,) (128,) (128,) (128,128,3,3)
res3.0.conv3.* res3_0branch2c{bn_*,w} (512,) (512,) (512,) (512,) (512,128,1,1)
res3.0.shortcut.* res3_0branch1{bn_*,w} (512,) (512,) (512,) (512,) (512,256,1,1)
res3.1.conv1.* res3_1branch2a{bn_*,w} (128,) (128,) (128,) (128,) (128,512,1,1)
res3.1.conv2.* res3_1branch2b{bn_*,w} (128,) (128,) (128,) (128,) (128,128,3,3)
res3.1.conv3.* res3_1branch2c{bn_*,w} (512,) (512,) (512,) (512,) (512,128,1,1)
res3.2.conv1.* res3_2branch2a{bn_*,w} (128,) (128,) (128,) (128,) (128,512,1,1)
res3.2.conv2.* res3_2branch2b{bn_*,w} (128,) (128,) (128,) (128,) (128,128,3,3)
res3.2.conv3.* res3_2branch2c{bn_*,w} (512,) (512,) (512,) (512,) (512,128,1,1)
res3.3.conv1.* res3_3branch2a{bn_*,w} (128,) (128,) (128,) (128,) (128,512,1,1)
res3.3.conv2.* res3_3branch2b{bn_*,w} (128,) (128,) (128,) (128,) (128,128,3,3)
res3.3.conv3.* res3_3branch2c{bn_*,w} (512,) (512,) (512,) (512,) (512,128,1,1)
res4.0.conv1.* res4_0branch2a{bn_*,w} (256,) (256,) (256,) (256,) (256,512,1,1)
res4.0.conv2.* res4_0branch2b{bn_*,w} (256,) (256,) (256,) (256,) (256,256,3,3)
res4.0.conv3.* res4_0branch2c{bn_*,w} (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.0.shortcut.* res4_0branch1{bn_*,w} (1024,) (1024,) (1024,) (1024,) (1024,512,1,1)
res4.1.conv1.* res4_1branch2a{bn_*,w} (256,) (256,) (256,) (256,) (256,1024,1,1)
res4.1.conv2.* res4_1branch2b{bn_*,w} (256,) (256,) (256,) (256,) (256,256,3,3)
res4.1.conv3.* res4_1branch2c{bn_*,w} (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.2.conv1.* res4_2branch2a{bn_*,w} (256,) (256,) (256,) (256,) (256,1024,1,1)
res4.2.conv2.* res4_2branch2b{bn_*,w} (256,) (256,) (256,) (256,) (256,256,3,3)
res4.2.conv3.* res4_2branch2c{bn_*,w} (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.3.conv1.* res4_3branch2a{bn_*,w} (256,) (256,) (256,) (256,) (256,1024,1,1)
res4.3.conv2.* res4_3branch2b{bn_*,w} (256,) (256,) (256,) (256,) (256,256,3,3)
res4.3.conv3.* res4_3branch2c{bn_*,w} (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.4.conv1.* res4_4branch2a{bn_*,w} (256,) (256,) (256,) (256,) (256,1024,1,1)
res4.4.conv2.* res4_4branch2b{bn_*,w} (256,) (256,) (256,) (256,) (256,256,3,3)
res4.4.conv3.* res4_4branch2c{bn_*,w} (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.5.conv1.* res4_5branch2a{bn_*,w} (256,) (256,) (256,) (256,) (256,1024,1,1)
res4.5.conv2.* res4_5branch2b{bn_*,w} (256,) (256,) (256,) (256,) (256,256,3,3)
res4.5.conv3.* res4_5branch2c{bn_*,w} (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res5.0.conv1.* res5_0branch2a{bn_*,w} (512,) (512,) (512,) (512,) (512,1024,1,1)
res5.0.conv2.* res5_0branch2b{bn_*,w} (512,) (512,) (512,) (512,) (512,512,3,3)
res5.0.conv3.* res5_0branch2c{bn_*,w} (2048,) (2048,) (2048,) (2048,) (2048,512,1,1)
res5.0.shortcut.* res5_0branch1{bn_*,w} (2048,) (2048,) (2048,) (2048,) (2048,1024,1,1)
res5.1.conv1.* res5_1branch2a{bn_*,w} (512,) (512,) (512,) (512,) (512,2048,1,1)
res5.1.conv2.* res5_1branch2b{bn_*,w} (512,) (512,) (512,) (512,) (512,512,3,3)
res5.1.conv3.* res5_1branch2c{bn_*,w} (2048,) (2048,) (2048,) (2048,) (2048,512,1,1)
res5.2.conv1.* res5_2branch2a{bn_*,w} (512,) (512,) (512,) (512,) (512,2048,1,1)
res5.2.conv2.* res5_2branch2b{bn_*,w} (512,) (512,) (512,) (512,) (512,512,3,3)
res5.2.conv3.* res5_2branch2c{bn_*,w} (2048,) (2048,) (2048,) (2048,) (2048,512,1,1)
stem.conv1.norm.* res_conv1bn* (64,) (64,) (64,) (64,)
stem.conv1.weight conv1_w (64, 3, 7, 7)

[03/26 21:45:54] fvcore.common.checkpoint INFO: Some model parameters or buffers are not found in the checkpoint: anchor_generator.cell_anchors.0 decoder.bbox_pred.{bias, weight} decoder.bbox_subnet.0.{bias, weight} decoder.bbox_subnet.1.{bias, running_mean, running_var, weight} decoder.bbox_subnet.10.{bias, running_mean, running_var, weight} decoder.bbox_subnet.3.{bias, weight} decoder.bbox_subnet.4.{bias, running_mean, running_var, weight} decoder.bbox_subnet.6.{bias, weight} decoder.bbox_subnet.7.{bias, running_mean, running_var, weight} decoder.bbox_subnet.9.{bias, weight} decoder.cls_score.{bias, weight} decoder.cls_subnet.0.{bias, weight} decoder.cls_subnet.1.{bias, running_mean, running_var, weight} decoder.cls_subnet.3.{bias, weight} decoder.cls_subnet.4.{bias, running_mean, running_var, weight} decoder.object_pred.{bias, weight} encoder.dilated_encoder_blocks.0.conv1.0.{bias, weight} encoder.dilated_encoder_blocks.0.conv1.1.{bias, running_mean, running_var, weight} encoder.dilated_encoder_blocks.0.conv2.0.{bias, weight} encoder.dilated_encoder_blocks.0.conv2.1.{bias, running_mean, running_var, weight} encoder.dilated_encoder_blocks.0.conv3.0.{bias, weight} encoder.dilated_encoder_blocks.0.conv3.1.{bias, running_mean, running_var, weight} encoder.dilated_encoder_blocks.1.conv1.0.{bias, weight} encoder.dilated_encoder_blocks.1.conv1.1.{bias, running_mean, running_var, weight} encoder.dilated_encoder_blocks.1.conv2.0.{bias, weight} encoder.dilated_encoder_blocks.1.conv2.1.{bias, running_mean, running_var, weight} encoder.dilated_encoder_blocks.1.conv3.0.{bias, weight} encoder.dilated_encoder_blocks.1.conv3.1.{bias, running_mean, running_var, weight} encoder.dilated_encoder_blocks.2.conv1.0.{bias, weight} encoder.dilated_encoder_blocks.2.conv1.1.{bias, running_mean, running_var, weight} encoder.dilated_encoder_blocks.2.conv2.0.{bias, weight} encoder.dilated_encoder_blocks.2.conv2.1.{bias, running_mean, running_var, weight} encoder.dilated_encoder_blocks.2.conv3.0.{bias, weight} encoder.dilated_encoder_blocks.2.conv3.1.{bias, running_mean, running_var, weight} encoder.dilated_encoder_blocks.3.conv1.0.{bias, weight} encoder.dilated_encoder_blocks.3.conv1.1.{bias, running_mean, running_var, weight} encoder.dilated_encoder_blocks.3.conv2.0.{bias, weight} encoder.dilated_encoder_blocks.3.conv2.1.{bias, running_mean, running_var, weight} encoder.dilated_encoder_blocks.3.conv3.0.{bias, weight} encoder.dilated_encoder_blocks.3.conv3.1.{bias, running_mean, running_var, weight} encoder.fpn_conv.{bias, weight} encoder.fpn_norm.{bias, running_mean, running_var, weight} encoder.lateral_conv.{bias, weight} encoder.lateral_norm.{bias, running_mean, running_var, weight} [03/26 21:45:54] fvcore.common.checkpoint INFO: The checkpoint state_dict contains keys that are not used by the model: fc1000.{bias, weight} stem.conv1.bias [03/26 21:45:54] d2.engine.train_loop INFO: Starting training from iteration 0

Wei-i commented 3 years ago

[03/26 21:45:54 d2.engine.train_loop]: Starting training from iteration 0 /opt/conda/conda-bld/pytorch_1595629408163/work/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [120,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1595629408163/work/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [61,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. terminate called after throwing an instance of 'c10::Error' what(): CUDA error: device-side assert triggered Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1595629408163/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f4c7325377d in /home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/lib/libc10.so) frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xb5d (0x7f4c734a3d9d in /home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/lib/libc10_cuda.so) frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f4c7323fb1d in /home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/lib/libc10.so) frame #3: + 0x53f0ea (0x7f4ca5ba30ea in /home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/lib/libtorch_python.so) frame #4: + 0x1809da (0x5589712629da in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #5: + 0xfc039 (0x5589711de039 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #6: + 0xfa678 (0x5589711dc678 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #7: + 0xfa938 (0x5589711dc938 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #8: + 0xfa938 (0x5589711dc938 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #9: + 0xfa348 (0x5589711dc348 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #10: + 0xfadd8 (0x5589711dcdd8 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #11: + 0xfadec (0x5589711dcdec in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #12: + 0xfadec (0x5589711dcdec in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #13: + 0xfadec (0x5589711dcdec in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #14: + 0xfadec (0x5589711dcdec in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #15: + 0xfadec (0x5589711dcdec in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #16: + 0xfadec (0x5589711dcdec in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #17: + 0xfadec (0x5589711dcdec in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #18: + 0xfadec (0x5589711dcdec in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #19: + 0xfb238 (0x5589711dd238 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #20: + 0xfb2db (0x5589711dd2db in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #21: + 0x1dc923 (0x5589712be923 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #22: _PyEval_EvalFrameDefault + 0x27b8 (0x5589712a6ea8 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #23: _PyFunction_FastCallKeywords + 0x187 (0x55897121a767 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #24: + 0x17f335 (0x558971261335 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #25: _PyEval_EvalFrameDefault + 0x611 (0x5589712a4d01 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #26: _PyFunction_FastCallKeywords + 0x187 (0x55897121a767 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #27: _PyEval_EvalFrameDefault + 0x3f5 (0x5589712a4ae5 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame 
#28: _PyEval_EvalCodeWithName + 0x252 (0x5589711fadb2 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #29: _PyFunction_FastCallKeywords + 0x583 (0x55897121ab63 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #30: + 0x17f335 (0x558971261335 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #31: _PyEval_EvalFrameDefault + 0x13fe (0x5589712a5aee in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #32: _PyEval_EvalCodeWithName + 0x252 (0x5589711fadb2 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #33: PyEval_EvalCode + 0x23 (0x5589711fc1e3 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #34: + 0x2271d2 (0x5589713091d2 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #35: PyRun_StringFlags + 0x7a (0x55897131417a in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #36: PyRun_SimpleStringFlags + 0x3c (0x5589713141dc in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #37: + 0x2322d9 (0x5589713142d9 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #38: _Py_UnixMain + 0x3c (0x55897131467c in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #39: __libc_start_main + 0xf0 (0x7f4cbbde3830 in /lib/x86_64-linux-gnu/libc.so.6) frame #40: + 0x1d7101 (0x5589712b9101 in /home/cw/miniconda3/envs/py_dt2/bin/python)

Traceback (most recent call last):
  File "./tools/train_net.py", line 234, in <module>
    args=(args,),
  File "/home/cw/detectron2/detectron2/engine/launch.py", line 59, in launch
    daemon=False,
  File "/home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/cw/detectron2/detectron2/engine/launch.py", line 94, in _distributed_worker
    main_func(*args)
  File "/home/cw/YOLOF/tools/train_net.py", line 221, in main
    return trainer.train()
  File "/home/cw/detectron2/detectron2/engine/defaults.py", line 431, in train
    super().train(self.start_iter, self.max_iter)
  File "/home/cw/detectron2/detectron2/engine/train_loop.py", line 140, in train
    self.run_step()
  File "/home/cw/detectron2/detectron2/engine/defaults.py", line 441, in run_step
    self._trainer.run_step()
  File "/home/cw/detectron2/detectron2/engine/train_loop.py", line 234, in run_step
    loss_dict = self.model(data)
  File "/home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 511, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cw/YOLOF/yolof/modeling/yolof.py", line 295, in forward
    pred_logits, pred_anchor_deltas)
  File "/home/cw/YOLOF/yolof/modeling/yolof.py", line 394, in losses
    pred_class_logits[valid_idxs],
RuntimeError: copy_if failed to synchronize: device-side assert triggered

Wei-i commented 3 years ago

Command: python ./tools/train_net.py --num-gpus 2 --config-file ./configs/yolof_R_50_C5_1x.yaml OUTPUT_DIR /hdd2/wh/cw/train/yolof/R_50_C5_1x/

yaml:

MODEL:
  META_ARCHITECTURE: "YOLOF"
  BACKBONE:
    NAME: "build_resnet_backbone"
  RESNETS:
    OUT_FEATURES: ["res5"]
DATASETS:
  TRAIN: ("coco_2017_train",)
  TEST: ("coco_2017_val",)
DATALOADER:
  NUM_WORKERS: 8
SOLVER:
  # IMS_PER_BATCH: 64
  IMS_PER_BATCH: 16
  # BASE_LR: 0.12
  BASE_LR: 0.03
  WARMUP_FACTOR: 0.0002  # 0.00066667
  WARMUP_ITERS: 5000  # 1500
  # STEPS: (15000, 20000)
  STEPS: (60000, 80000)
  # MAX_ITER: 22500
  MAX_ITER: 90000
  CHECKPOINT_PERIOD: 2500
INPUT:
  MIN_SIZE_TRAIN: (800,)
OUTPUT_DIR: '/hdd2/wh/cw/train/yolof/R_50_C5_1x'

Wei-i commented 3 years ago

Hello author, my rough guess is that this is a data-related error, perhaps an index going out of bounds somewhere? But my dataset is COCO 2017, so this error should not happen; I have not changed anything else, and I also re-installed/upgraded my original detectron2. I noticed that some input images are fine: the network can still iterate a few times and print the loss.

chensnathan commented 3 years ago

From the error, an index goes out of bounds during indexing, but it is strange: I have run this many times and never hit this problem. A quick look at the log file does not show anything obviously wrong either, so it does not make much sense to me... Does this error show up every single time you run it?

Wei-i commented 3 years ago

Yes, it goes out of bounds on every run... I added some debugging around the failing line in yolof.py, e.g. printing pred_class_logits[valid_idxs].size().

Wei-i commented 3 years ago

print("gt_classes >= 0", gt_classes[gt_classes >= 0].size())

Wei-i commented 3 years ago

1.

gt_class.size() torch.Size([38000])
gt_classes >= 0 torch.Size([37818])
valid_idxs torch.Size([38000])
pred_class_logits.size() torch.Size([38000, 80])
pred_class_logits[valid_idxs] torch.Size([37818, 80])

2.

gt_class.size() torch.Size([42000])
gt_classes >= 0 torch.Size([41866])
valid_idxs torch.Size([42000])
pred_class_logits.size() torch.Size([42000, 80])
pred_class_logits[valid_idxs] torch.Size([41866, 80])

3.

gt_class.size() torch.Size([38000]); the print("gt_classes >= 0", gt_classes[gt_classes >= 0].size()) line produced no output before the crash:

/opt/conda/conda-bld/pytorch_1595629408163/work/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [9,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1595629408163/work/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [11,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.

chensnathan commented 3 years ago

I do not think this part should go wrong. I suggest running

CUDA_LAUNCH_BLOCKING=1 python ./tools/train_net.py --num-gpus 2 --config-file ./configs/yolof_R_50_C5_1x.yaml OUTPUT_DIR /hdd2/wh/cw/train/yolof/R_50_C5_1x/

to see exactly where the error occurs. Alternatively, set up the environment again on another machine and try running there; in principle it should work out of the box.

Wei-i commented 3 years ago

It still does not work. I may have to try it on another machine; I am not sure whether my CUDA 9.0 is causing the problem.

Wei-i commented 3 years ago

Hello author, I switched to a machine with CUDA 10.1 and repeated my previous steps (basically a fresh install, with the dataset transferred directly from the original server). Now everything works... strange. For the moment I can only attribute it to my earlier CUDA 9.0 / cudatoolkit 9.2 not working well with your code? (AdelaiDet still runs fine on the old machine; it had not been updated for about half a year, I updated it today, made a small change to train.py, and it works without problems.)

Wei-i commented 3 years ago

One more small thing you might consider fixing: the mish-cuda package you recommend installing does not seem to build directly. After git clone, you should move mish-cuda/external/CUDAApplyUtils.cuh to csrc/ before running python setup.py build install. A rough helper for that move is sketched after the link below.

link-issue
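
A minimal way to do that move from Python (paths assume the default mish-cuda checkout layout; adjust if your clone lives elsewhere):

import shutil

# Move the CUDA header into csrc/ so `python setup.py build install` can find it.
shutil.move("mish-cuda/external/CUDAApplyUtils.cuh", "mish-cuda/csrc/CUDAApplyUtils.cuh")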