chensnathan / YOLOF

You Only Look One-level Feature (YOLOF), CVPR2021, Detectron2
MIT License

AttributeError: module 'portalocker' has no attribute 'Lock' #4

Closed Wei-i closed 3 years ago

Wei-i commented 3 years ago

Thanks for sharing your great work. Unfortunately, I hit a bug when I run python ./tools/train_net.py --num-gpus 1 --config-file ./configs/yolof_R_50_C5_1x.yaml

The bug log is below:

[03/26 07:38:03 d2.data.build]: Using training sampler TrainingSampler
[03/26 07:38:03 d2.data.common]: Serializing 117266 elements to byte tensors and concatenating them all ...
[03/26 07:38:10 d2.data.common]: Serialized dataset takes 451.21 MiB
[03/26 07:38:15 fvcore.common.checkpoint]: Loading checkpoint from detectron2://ImageNetPretrained/MSRA/R-50.pkl
Traceback (most recent call last):
  File "./tools/train_net.py", line 234, in <module>
    args=(args,),
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/launch.py", line 62, in launch
    main_func(*args)
  File "./tools/train_net.py", line 215, in main
    trainer.resume_or_load(resume=args.resume)
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/defaults.py", line 353, in resume_or_load
    checkpoint = self.checkpointer.resume_or_load(self.cfg.MODEL.WEIGHTS, resume=resume)
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/fvcore/common/checkpoint.py", line 215, in resume_or_load
    return self.load(path, checkpointables=[])
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/fvcore/common/checkpoint.py", line 140, in load
    path = self.path_manager.get_local_path(path)
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/iopath/common/file_io.py", line 1100, in get_local_path
    path, force=force, **kwargs
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/utils/file_io.py", line 29, in _get_local_path
    return PathManager.get_local_path(self.S3_DETECTRON2_PREFIX + name, **kwargs)
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/iopath/common/file_io.py", line 1100, in get_local_path
    path, force=force, **kwargs
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/iopath/common/file_io.py", line 755, in _get_local_path
    with file_lock(cached):
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/iopath/common/file_io.py", line 82, in file_lock
    return portalocker.Lock(path + ".lock", timeout=3600)  # type: ignore
AttributeError: module 'portalocker' has no attribute 'Lock'

I would be grateful if you could give me some advice. Thanks.

Wei-i commented 3 years ago

It seems that something may be going wrong with 'Loading checkpoint from detectron2://ImageNetPretrained/MSRA/R-50.pkl'?

chensnathan commented 3 years ago

Could you check the version of portalocker in your environment? Also, run the following snippet to verify whether portalocker has the Lock attribute:

>>> import portalocker
>>> portalocker.__version__
>>> portalocker.Lock
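
For context, iopath's checkpoint download acquires a file lock roughly like this (a simplified sketch based on the traceback above, not the exact iopath source), which is why a portalocker installation that lacks Lock fails before the weights can even be fetched:

import portalocker  # iopath relies on portalocker.Lock for its file locks

def file_lock(path: str):
    # Simplified equivalent of iopath.common.file_io.file_lock: guard the cached
    # checkpoint file with a .lock file so concurrent processes do not clobber it.
    return portalocker.Lock(path + ".lock", timeout=3600)

# Usage (hypothetical path):
# with file_lock("/tmp/R-50.pkl"):
#     ...  # download or read the cached checkpoint
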
Wei-i commented 3 years ago

Sorry, it seems there is indeed something wrong with my portalocker:

(yolof) cw@MAC-3DGroup:~$ python
Python 3.6.13 | packaged by conda-forge | (default, Feb 19 2021, 05:36:01) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import fvcore
>>> import portalocker
>>> 
>>> portalocker.__version__
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'portalocker' has no attribute '__version__'
chensnathan commented 3 years ago

Try to re-install portalocker?

Wei-i commented 3 years ago

First pip uninstall portalocker, then conda install portalocker; after that the bug is fixed.
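
After reinstalling, the earlier check should succeed; a minimal sanity check (the exact version string depends on what conda installs):

import portalocker

# Both of these should now succeed instead of raising AttributeError.
print(portalocker.__version__)
print(portalocker.Lock)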

Wei-i commented 3 years ago

[03/26 08:50:52 d2.engine.hooks]: Total training time: 0:00:01 (0:00:00 on hooks)
[03/26 08:50:52 d2.utils.events]: iter: 0 lr: N/A max_mem: 622M
Traceback (most recent call last):
  File "./tools/train_net.py", line 234, in <module>
    args=(args,),
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/launch.py", line 62, in launch
    main_func(*args)
  File "./tools/train_net.py", line 221, in main
    return trainer.train()
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/defaults.py", line 431, in train
    super().train(self.start_iter, self.max_iter)
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 140, in train
    self.run_step()
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/defaults.py", line 441, in run_step
    self._trainer.run_step()
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 234, in run_step
    loss_dict = self.model(data)
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cw/YOLOF/yolof/modeling/yolof.py", line 294, in forward
    pred_logits, pred_anchor_deltas)
  File "/home/cw/YOLOF/yolof/modeling/yolof.py", line 387, in losses
    dist.all_reduce(num_foreground)
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 953, in all_reduce
    _check_default_pg()
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 211, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized

How can I disable DDP?

Wei-i commented 3 years ago

I think this is the last bug before I can train YOLOF...

chensnathan commented 3 years ago

Comment out the lines with dist in the yolof.py file.

BTW, when you train with only one GPU, you should adjust the learning rate and batch size. Refer to this response.
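
For reference, instead of deleting the dist calls outright, a guard like the following also avoids the assertion when no process group is initialized (a minimal sketch, not necessarily the exact change in the commit mentioned below):

import torch.distributed as dist

def maybe_all_reduce(num_foreground):
    # Only all-reduce when a default process group exists (i.e. multi-GPU DDP
    # training); with a single GPU, keep the local foreground count as-is.
    if dist.is_available() and dist.is_initialized():
        dist.all_reduce(num_foreground)
    return num_foreground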

chensnathan commented 3 years ago

Support training with one GPU in this commit.

Wei-i commented 3 years ago

Thanks!

[03/26 09:10:09 d2.engine.hooks]: Overall training speed: 57 iterations in 0:00:18 (0.3159 s / it)
[03/26 09:10:09 d2.engine.hooks]: Total training time: 0:00:18 (0:00:00 on hooks)
[03/26 09:10:09 d2.utils.events]: eta: 2:01:11 iter: 59 total_loss: 2.067 loss_cls: 1.342 loss_box_reg: 0.7438 time: 0.3139 data_time: 0.0022 lr: 3.9308e-06 max_mem: 1076M
Traceback (most recent call last):
  File "tools/train_net.py", line 234, in <module>
    args=(args,),
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/launch.py", line 62, in launch
    main_func(*args)
  File "tools/train_net.py", line 221, in main
    return trainer.train()
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/defaults.py", line 431, in train
    super().train(self.start_iter, self.max_iter)
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 140, in train
    self.run_step()
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/defaults.py", line 441, in run_step
    self._trainer.run_step()
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 234, in run_step
    loss_dict = self.model(data)
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cw/YOLOF/yolof/modeling/yolof.py", line 295, in forward
    pred_logits, pred_anchor_deltas)
  File "/home/cw/YOLOF/yolof/modeling/yolof.py", line 397, in losses
    pred_class_logits[valid_idxs],
RuntimeError: CUDA error: device-side assert triggered

Wei-i commented 3 years ago

I am really sorry, author; because of my limited skill, yet another new bug has appeared. How should I fix this one?

chensnathan commented 3 years ago

Could you give more details about the command you used?

Wei-i commented 3 years ago

Command: CUDA_VISIBLE_DEVICES=1 python tools/train_net.py --num-gpus 1 --config-file ./configs/yolof_R_50_C5_1x.yaml

yaml:

MODEL:
  META_ARCHITECTURE: "YOLOF"
  BACKBONE:
    NAME: "build_resnet_backbone"
  RESNETS:
    OUT_FEATURES: ["res5"]
DATASETS:
  TRAIN: ("coco_2017_train",)
  TEST: ("coco_2017_val",)
DATALOADER:
  # NUM_WORKERS: 8
  NUM_WORKERS: 4
SOLVER:
  # IMS_PER_BATCH: 64
  IMS_PER_BATCH: 2
  # BASE_LR: 0.12
  BASE_LR: 0.00001
  WARMUP_FACTOR: 0.00066667
  WARMUP_ITERS: 1500
  # STEPS: (15000, 20000)
  STEPS: (480000, 640000)
  # MAX_ITER: 22500
  MAX_ITER: 720000
  CHECKPOINT_PERIOD: 2500
INPUT:
  MIN_SIZE_TRAIN: (800,)

chensnathan commented 3 years ago

Can you try this setting?

IMS_PER_BATCH: 8
BASE_LR: 0.03
WARMUP_FACTOR: 0.00066667
WARMUP_ITERS: 1500
STEPS: (120000, 160000)
MAX_ITER: 180000
Wei-i commented 3 years ago

I have left the lab and will try it tomorrow. Thanks a lot from the bottom of my heart!

Wei-i commented 3 years ago

Good morning! When I tried your setting, I still get the same bug:

[03/26 17:38:53 d2.engine.train_loop]: Starting training from iteration 0

/opt/conda/conda-bld/pytorch_1607370169888/work/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [11,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.

ERROR [03/26 17:39:00 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 140, in train
    self.run_step()
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/defaults.py", line 441, in run_step
    self._trainer.run_step()
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/detectron2/engine/train_loop.py", line 234, in run_step
    loss_dict = self.model(data)
  File "/home/cw/miniconda3/envs/yolof/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cw/YOLOF/yolof/modeling/yolof.py", line 295, in forward
    pred_logits, pred_anchor_deltas)
  File "/home/cw/YOLOF/yolof/modeling/yolof.py", line 397, in losses
    pred_class_logits[valid_idxs],
RuntimeError: CUDA error: device-side assert triggered

chensnathan commented 3 years ago

Sorry, the BASE_LR should be 0.015. That said, I can train with one GPU using an initial learning rate of either 0.03 or 0.015; I cannot reproduce your error on my side.
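
For reference, 0.015 is what the usual linear scaling rule gives when going from the default batch size of 64 (BASE_LR 0.12) down to 8; a quick check, assuming the standard Detectron2-style proportional scaling:

# Linear scaling rule: keep BASE_LR / IMS_PER_BATCH constant.
reference_batch, reference_lr = 64, 0.12
new_batch = 8
print(reference_lr * new_batch / reference_batch)  # 0.015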

Try to warm up more iterations, e.g.,

WARMUP_FACTOR: 0.0002
WARMUP_ITERS: 5000
Wei-i commented 3 years ago

Thanks. It still does not work...

chensnathan commented 3 years ago

Could you upload your training log file?

Wei-i commented 3 years ago

[03/26 21:44:54] detectron2 INFO: Rank of current process: 0. World size: 2
[03/26 21:44:55] detectron2 INFO: Environment info:


sys.platform             linux
Python                   3.7.8 | packaged by conda-forge | (default, Jul 31 2020, 02:25:08) [GCC 7.5.0]
numpy                    1.19.1
detectron2               0.4 @/home/cw/detectron2/detectron2
Compiler                 GCC 5.4
CUDA compiler            CUDA 9.0
detectron2 arch flags    6.1
DETECTRON2_ENV_MODULE
PyTorch                  1.6.0 @/home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch
PyTorch debug build      False
GPU available            True
GPU 0,1,2                GeForce GTX 1080 Ti (arch=6.1)
CUDA_HOME                /usr/local/cuda-9.0
Pillow                   7.2.0
torchvision              0.7.0 @/home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torchvision
torchvision arch flags   3.5, 5.0, 6.0, 7.0
fvcore                   0.1.5.post20210327
iopath                   0.1.7
cv2                      4.4.0

PyTorch built with:

[03/26 21:44:55] detectron2 INFO: Command line arguments: Namespace(config_file='./configs/yolof_R_50_C5_1x.yaml', dist_url='tcp://127.0.0.1:50159', eval_only=False, machine_rank=0, num_gpus=2, num_machines=1, opts=['OUTPUT_DIR', '/hdd2/wh/cw/train/yolof/R_50_C5_1x/'], resume=False) [03/26 21:44:55] detectron2 INFO: Contents of args.config_file=./configs/yolof_R_50_C5_1x.yaml: BASE: "Base-YOLOF.yaml" MODEL: WEIGHTS: "detectron2://ImageNetPretrained/MSRA/R-50.pkl" RESNETS: DEPTH: 50 OUTPUT_DIR: "output/yolof/R_50_C5_1x"

[03/26 21:44:55] detectron2 INFO: Running with full config: CUDNN_BENCHMARK: False DATALOADER: ASPECT_RATIO_GROUPING: True FILTER_EMPTY_ANNOTATIONS: True NUM_WORKERS: 8 REPEAT_THRESHOLD: 0.0 SAMPLER_TRAIN: TrainingSampler DATASETS: PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000 PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000 PROPOSAL_FILES_TEST: () PROPOSAL_FILES_TRAIN: () TEST: ('coco_2017_val',) TRAIN: ('coco_2017_train',) GLOBAL: HACK: 1.0 INPUT: CROP: ENABLED: False SIZE: [0.9, 0.9] TYPE: relative_range DISTORTION: ENABLED: False EXPOSURE: 1.5 HUE: 0.1 SATURATION: 1.5 FORMAT: BGR JITTER_CROP: ENABLED: False JITTER_RATIO: 0.3 MASK_FORMAT: polygon MAX_SIZE_TEST: 1333 MAX_SIZE_TRAIN: 1333 MIN_SIZE_TEST: 800 MIN_SIZE_TRAIN: (800,) MIN_SIZE_TRAIN_SAMPLING: choice MOSAIC: ENABLED: False MIN_OFFSET: 0.2 MOSAIC_HEIGHT: 640 MOSAIC_WIDTH: 640 NUM_IMAGES: 4 POOL_CAPACITY: 1000 RANDOM_FLIP: horizontal RESIZE: ENABLED: False SCALE_JITTER: (0.8, 1.2) SHAPE: (640, 640) TEST_SHAPE: (608, 608) SHIFT: SHIFT_PIXELS: 32 MODEL: ANCHOR_GENERATOR: ANGLES: [[-90, 0, 90]] ASPECT_RATIOS: [[1.0]] NAME: DefaultAnchorGenerator OFFSET: 0.0 SIZES: [[32, 64, 128, 256, 512]] BACKBONE: FREEZE_AT: 2 NAME: build_resnet_backbone DARKNET: DEPTH: 53 NORM: BN OUT_FEATURES: ['res5'] RES5_DILATION: 1 WITH_CSP: True DEVICE: cuda FPN: FUSE_TYPE: sum IN_FEATURES: [] NORM: OUT_CHANNELS: 256 KEYPOINT_ON: False LOAD_PROPOSALS: False MASK_ON: False META_ARCHITECTURE: YOLOF PANOPTIC_FPN: COMBINE: ENABLED: True INSTANCES_CONFIDENCE_THRESH: 0.5 OVERLAP_THRESH: 0.5 STUFF_AREA_LIMIT: 4096 INSTANCE_LOSS_WEIGHT: 1.0 PIXEL_MEAN: [103.53, 116.28, 123.675] PIXEL_STD: [1.0, 1.0, 1.0] PROPOSAL_GENERATOR: MIN_SIZE: 0 NAME: RPN RESNETS: DEFORM_MODULATED: False DEFORM_NUM_GROUPS: 1 DEFORM_ON_PER_STAGE: [False, False, False, False] DEPTH: 50 NORM: FrozenBN NUM_GROUPS: 1 OUT_FEATURES: ['res5'] RES2_OUT_CHANNELS: 256 RES5_DILATION: 1 STEM_OUT_CHANNELS: 64 STRIDE_IN_1X1: True WIDTH_PER_GROUP: 64 RETINANET: BBOX_REG_LOSS_TYPE: smooth_l1 BBOX_REG_WEIGHTS: (1.0, 1.0, 1.0, 1.0) FOCAL_LOSS_ALPHA: 0.25 FOCAL_LOSS_GAMMA: 2.0 IN_FEATURES: ['p3', 'p4', 'p5', 'p6', 'p7'] IOU_LABELS: [0, -1, 1] IOU_THRESHOLDS: [0.4, 0.5] NMS_THRESH_TEST: 0.5 NORM: NUM_CLASSES: 80 NUM_CONVS: 4 PRIOR_PROB: 0.01 SCORE_THRESH_TEST: 0.05 SMOOTH_L1_LOSS_BETA: 0.1 TOPK_CANDIDATES_TEST: 1000 ROI_BOX_CASCADE_HEAD: BBOX_REG_WEIGHTS: ((10.0, 10.0, 5.0, 5.0), (20.0, 20.0, 10.0, 10.0), (30.0, 30.0, 15.0, 15.0)) IOUS: (0.5, 0.6, 0.7) ROI_BOX_HEAD: BBOX_REG_LOSS_TYPE: smooth_l1 BBOX_REG_LOSS_WEIGHT: 1.0 BBOX_REG_WEIGHTS: (10.0, 10.0, 5.0, 5.0) CLS_AGNOSTIC_BBOX_REG: False CONV_DIM: 256 FC_DIM: 1024 NAME: NORM: NUM_CONV: 0 NUM_FC: 0 POOLER_RESOLUTION: 14 POOLER_SAMPLING_RATIO: 0 POOLER_TYPE: ROIAlignV2 SMOOTH_L1_BETA: 0.0 TRAIN_ON_PRED_BOXES: False ROI_HEADS: BATCH_SIZE_PER_IMAGE: 512 IN_FEATURES: ['res4'] IOU_LABELS: [0, 1] IOU_THRESHOLDS: [0.5] NAME: Res5ROIHeads NMS_THRESH_TEST: 0.5 NUM_CLASSES: 80 POSITIVE_FRACTION: 0.25 PROPOSAL_APPEND_GT: True SCORE_THRESH_TEST: 0.05 ROI_KEYPOINT_HEAD: CONV_DIMS: (512, 512, 512, 512, 512, 512, 512, 512) LOSS_WEIGHT: 1.0 MIN_KEYPOINTS_PER_IMAGE: 1 NAME: KRCNNConvDeconvUpsampleHead NORMALIZE_LOSS_BY_VISIBLE_KEYPOINTS: True NUM_KEYPOINTS: 17 POOLER_RESOLUTION: 14 POOLER_SAMPLING_RATIO: 0 POOLER_TYPE: ROIAlignV2 ROI_MASK_HEAD: CLS_AGNOSTIC_MASK: False CONV_DIM: 256 NAME: MaskRCNNConvUpsampleHead NORM: NUM_CONV: 0 POOLER_RESOLUTION: 14 POOLER_SAMPLING_RATIO: 0 POOLER_TYPE: ROIAlignV2 RPN: BATCH_SIZE_PER_IMAGE: 256 BBOX_REG_LOSS_TYPE: smooth_l1 BBOX_REG_LOSS_WEIGHT: 1.0 
BBOX_REG_WEIGHTS: (1.0, 1.0, 1.0, 1.0) BOUNDARY_THRESH: -1 HEAD_NAME: StandardRPNHead IN_FEATURES: ['res4'] IOU_LABELS: [0, -1, 1] IOU_THRESHOLDS: [0.3, 0.7] LOSS_WEIGHT: 1.0 NMS_THRESH: 0.7 POSITIVE_FRACTION: 0.5 POST_NMS_TOPK_TEST: 1000 POST_NMS_TOPK_TRAIN: 2000 PRE_NMS_TOPK_TEST: 6000 PRE_NMS_TOPK_TRAIN: 12000 SMOOTH_L1_BETA: 0.0 SEM_SEG_HEAD: COMMON_STRIDE: 4 CONVS_DIM: 128 IGNORE_VALUE: 255 IN_FEATURES: ['p2', 'p3', 'p4', 'p5'] LOSS_WEIGHT: 1.0 NAME: SemSegFPNHead NORM: GN NUM_CLASSES: 54 WEIGHTS: detectron2://ImageNetPretrained/MSRA/R-50.pkl YOLOF: BOX_TRANSFORM: ADD_CTR_CLAMP: True BBOX_REG_WEIGHTS: (1.0, 1.0, 1.0, 1.0) CTR_CLAMP: 32 DECODER: ACTIVATION: ReLU CLS_NUM_CONVS: 2 IN_CHANNELS: 512 NORM: BN NUM_ANCHORS: 5 NUM_CLASSES: 80 PRIOR_PROB: 0.01 REG_NUM_CONVS: 4 DETECTIONS_PER_IMAGE: 100 ENCODER: ACTIVATION: ReLU BACKBONE_LEVEL: res5 BLOCK_DILATIONS: [2, 4, 6, 8] BLOCK_MID_CHANNELS: 128 IN_CHANNELS: 2048 NORM: BN NUM_CHANNELS: 512 NUM_RESIDUAL_BLOCKS: 4 LOSSES: BBOX_REG_LOSS_TYPE: giou FOCAL_LOSS_ALPHA: 0.25 FOCAL_LOSS_GAMMA: 2.0 MATCHER: TOPK: 4 NEG_IGNORE_THRESHOLD: 0.7 NMS_THRESH_TEST: 0.6 POS_IGNORE_THRESHOLD: 0.15 SCORE_THRESH_TEST: 0.05 TOPK_CANDIDATES_TEST: 1000 OUTPUT_DIR: /hdd2/wh/cw/train/yolof/R_50_C5_1x/ SEED: -1 SOLVER: AMP: ENABLED: False BACKBONE_MULTIPLIER: 0.334 BASE_LR: 0.003 BIAS_LR_FACTOR: 1.0 CHECKPOINT_PERIOD: 2500 CLIP_GRADIENTS: CLIP_TYPE: value CLIP_VALUE: 1.0 ENABLED: False NORM_TYPE: 2.0 GAMMA: 0.1 IMS_PER_BATCH: 16 LR_SCHEDULER_NAME: WarmupMultiStepLR MAX_ITER: 90000 MOMENTUM: 0.9 NESTEROV: False REFERENCE_WORLD_SIZE: 0 STEPS: (60000, 80000) WARMUP_FACTOR: 0.0002 WARMUP_ITERS: 5000 WARMUP_METHOD: linear WEIGHT_DECAY: 0.0001 WEIGHT_DECAY_BIAS: 0.0001 WEIGHT_DECAY_NORM: 0.0 TEST: AUG: ENABLED: False FLIP: True MAX_SIZE: 4000 MIN_SIZES: (400, 500, 600, 700, 800, 900, 1000, 1100, 1200) DETECTIONS_PER_IMAGE: 100 EVAL_PERIOD: 0 EXPECTED_RESULTS: [] KEYPOINT_OKS_SIGMAS: [] PRECISE_BN: ENABLED: False NUM_ITER: 200 VERSION: 2 VIS_PERIOD: 0 [03/26 21:44:55] detectron2 INFO: Full config saved to /hdd2/wh/cw/train/yolof/R_50_C5_1x/config.yaml [03/26 21:44:55] d2.utils.env INFO: Using a generated random seed 55931624 [03/26 21:44:56] d2.engine.defaults INFO: Model: YOLOF( (backbone): ResNet( (stem): BasicStem( (conv1): Conv2d( 3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) ) (res2): Sequential( (0): BottleneckBlock( (shortcut): Conv2d( 64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv1): Conv2d( 64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv2): Conv2d( 64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv3): Conv2d( 64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) ) (1): BottleneckBlock( (conv1): Conv2d( 256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv2): Conv2d( 64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv3): Conv2d( 64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) ) (2): BottleneckBlock( (conv1): Conv2d( 256, 64, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, 
eps=1e-05) ) (conv2): Conv2d( 64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=64, eps=1e-05) ) (conv3): Conv2d( 64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) ) ) (res3): Sequential( (0): BottleneckBlock( (shortcut): Conv2d( 256, 512, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv1): Conv2d( 256, 128, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv2): Conv2d( 128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv3): Conv2d( 128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) ) (1): BottleneckBlock( (conv1): Conv2d( 512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv2): Conv2d( 128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv3): Conv2d( 128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) ) (2): BottleneckBlock( (conv1): Conv2d( 512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv2): Conv2d( 128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv3): Conv2d( 128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) ) (3): BottleneckBlock( (conv1): Conv2d( 512, 128, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv2): Conv2d( 128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=128, eps=1e-05) ) (conv3): Conv2d( 128, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) ) ) (res4): Sequential( (0): BottleneckBlock( (shortcut): Conv2d( 512, 1024, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) (conv1): Conv2d( 512, 256, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (1): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (2): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), 
bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (3): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (4): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) (5): BottleneckBlock( (conv1): Conv2d( 1024, 256, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv2): Conv2d( 256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=256, eps=1e-05) ) (conv3): Conv2d( 256, 1024, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=1024, eps=1e-05) ) ) ) (res5): Sequential( (0): BottleneckBlock( (shortcut): Conv2d( 1024, 2048, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05) ) (conv1): Conv2d( 1024, 512, kernel_size=(1, 1), stride=(2, 2), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv2): Conv2d( 512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv3): Conv2d( 512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05) ) ) (1): BottleneckBlock( (conv1): Conv2d( 2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv2): Conv2d( 512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv3): Conv2d( 512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05) ) ) (2): BottleneckBlock( (conv1): Conv2d( 2048, 512, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv2): Conv2d( 512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=512, eps=1e-05) ) (conv3): Conv2d( 512, 2048, kernel_size=(1, 1), stride=(1, 1), bias=False (norm): FrozenBatchNorm2d(num_features=2048, eps=1e-05) ) ) ) ) (encoder): DilatedEncoder( (lateral_conv): Conv2d(2048, 512, kernel_size=(1, 1), stride=(1, 1)) (lateral_norm): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (fpn_conv): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (fpn_norm): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (dilated_encoder_blocks): Sequential( (0): Bottleneck( (conv1): Sequential( (0): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1)) (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) (conv2): Sequential( (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2), dilation=(2, 2)) 
(1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) (conv3): Sequential( (0): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1)) (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) ) (1): Bottleneck( (conv1): Sequential( (0): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1)) (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) (conv2): Sequential( (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(4, 4), dilation=(4, 4)) (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) (conv3): Sequential( (0): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1)) (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) ) (2): Bottleneck( (conv1): Sequential( (0): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1)) (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) (conv2): Sequential( (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(6, 6), dilation=(6, 6)) (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) (conv3): Sequential( (0): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1)) (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) ) (3): Bottleneck( (conv1): Sequential( (0): Conv2d(512, 128, kernel_size=(1, 1), stride=(1, 1)) (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) (conv2): Sequential( (0): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(8, 8), dilation=(8, 8)) (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) (conv3): Sequential( (0): Conv2d(128, 512, kernel_size=(1, 1), stride=(1, 1)) (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) ) ) ) ) (decoder): Decoder( (cls_subnet): Sequential( (0): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) (3): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (4): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (5): ReLU(inplace=True) ) (bbox_subnet): Sequential( (0): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (1): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (2): ReLU(inplace=True) (3): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (4): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (5): ReLU(inplace=True) (6): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (7): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (8): ReLU(inplace=True) (9): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (10): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True) (11): ReLU(inplace=True) ) (cls_score): Conv2d(512, 400, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (bbox_pred): Conv2d(512, 20, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) (object_pred): Conv2d(512, 5, 
kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)) ) (anchor_generator): DefaultAnchorGenerator( (cell_anchors): BufferList() ) (anchor_matcher): UniformMatcher() ) [03/26 21:45:18] d2.data.datasets.coco INFO: Loading datasets/coco/annotations/instances_train2017.json takes 21.19 seconds. [03/26 21:45:19] d2.data.datasets.coco INFO: Loaded 118287 images in COCO format from datasets/coco/annotations/instances_train2017.json [03/26 21:45:31] d2.data.build INFO: Removed 1021 images with no usable annotations. 117266 images left. [03/26 21:45:38] d2.data.build INFO: Distribution of instances among all 80 categories:  category #instances category #instances category #instances
person 257253 bicycle 7056 car 43533
motorcycle 8654 airplane 5129 bus 6061
train 4570 truck 9970 boat 10576
traffic light 12842 fire hydrant 1865 stop sign 1983
parking meter 1283 bench 9820 bird 10542
cat 4766 dog 5500 horse 6567
sheep 9223 cow 8014 elephant 5484
bear 1294 zebra 5269 giraffe 5128
backpack 8714 umbrella 11265 handbag 12342
tie 6448 suitcase 6112 frisbee 2681
skis 6623 snowboard 2681 sports ball 6299
kite 8802 baseball bat 3273 baseball gl.. 3747
skateboard 5536 surfboard 6095 tennis racket 4807
bottle 24070 wine glass 7839 cup 20574
fork 5474 knife 7760 spoon 6159
bowl 14323 banana 9195 apple 5776
sandwich 4356 orange 6302 broccoli 7261
carrot 7758 hot dog 2884 pizza 5807
donut 7005 cake 6296 chair 38073
couch 5779 potted plant 8631 bed 4192
dining table 15695 toilet 4149 tv 5803
laptop 4960 mouse 2261 remote 5700
keyboard 2854 cell phone 6422 microwave 1672
oven 3334 toaster 225 sink 5609
refrigerator 2634 book 24077 clock 6320
vase 6577 scissors 1464 teddy bear 4729
hair drier 198 toothbrush 1945
total 849949 
[03/26 21:45:38] d2.data.build INFO: Using training sampler TrainingSampler [03/26 21:45:40] d2.data.common INFO: Serializing 117266 elements to byte tensors and concatenating them all ... [03/26 21:45:46] d2.data.common INFO: Serialized dataset takes 451.21 MiB [03/26 21:45:54] fvcore.common.checkpoint INFO: Loading checkpoint from detectron2://ImageNetPretrained/MSRA/R-50.pkl [03/26 21:45:54] d2.checkpoint.c2_model_loading INFO: Renaming Caffe2 weights ...... [03/26 21:45:54] d2.checkpoint.c2_model_loading INFO: Following weights matched with submodule backbone: Names in Model Names in Checkpoint Shapes
res2.0.conv1.* res2_0branch2a{bn_*,w} (64,) (64,) (64,) (64,) (64,64,1,1)
res2.0.conv2.* res2_0branch2b{bn_*,w} (64,) (64,) (64,) (64,) (64,64,3,3)
res2.0.conv3.* res2_0branch2c{bn_*,w} (256,) (256,) (256,) (256,) (256,64,1,1)
res2.0.shortcut.* res2_0branch1{bn_*,w} (256,) (256,) (256,) (256,) (256,64,1,1)
res2.1.conv1.* res2_1branch2a{bn_*,w} (64,) (64,) (64,) (64,) (64,256,1,1)
res2.1.conv2.* res2_1branch2b{bn_*,w} (64,) (64,) (64,) (64,) (64,64,3,3)
res2.1.conv3.* res2_1branch2c{bn_*,w} (256,) (256,) (256,) (256,) (256,64,1,1)
res2.2.conv1.* res2_2branch2a{bn_*,w} (64,) (64,) (64,) (64,) (64,256,1,1)
res2.2.conv2.* res2_2branch2b{bn_*,w} (64,) (64,) (64,) (64,) (64,64,3,3)
res2.2.conv3.* res2_2branch2c{bn_*,w} (256,) (256,) (256,) (256,) (256,64,1,1)
res3.0.conv1.* res3_0branch2a{bn_*,w} (128,) (128,) (128,) (128,) (128,256,1,1)
res3.0.conv2.* res3_0branch2b{bn_*,w} (128,) (128,) (128,) (128,) (128,128,3,3)
res3.0.conv3.* res3_0branch2c{bn_*,w} (512,) (512,) (512,) (512,) (512,128,1,1)
res3.0.shortcut.* res3_0branch1{bn_*,w} (512,) (512,) (512,) (512,) (512,256,1,1)
res3.1.conv1.* res3_1branch2a{bn_*,w} (128,) (128,) (128,) (128,) (128,512,1,1)
res3.1.conv2.* res3_1branch2b{bn_*,w} (128,) (128,) (128,) (128,) (128,128,3,3)
res3.1.conv3.* res3_1branch2c{bn_*,w} (512,) (512,) (512,) (512,) (512,128,1,1)
res3.2.conv1.* res3_2branch2a{bn_*,w} (128,) (128,) (128,) (128,) (128,512,1,1)
res3.2.conv2.* res3_2branch2b{bn_*,w} (128,) (128,) (128,) (128,) (128,128,3,3)
res3.2.conv3.* res3_2branch2c{bn_*,w} (512,) (512,) (512,) (512,) (512,128,1,1)
res3.3.conv1.* res3_3branch2a{bn_*,w} (128,) (128,) (128,) (128,) (128,512,1,1)
res3.3.conv2.* res3_3branch2b{bn_*,w} (128,) (128,) (128,) (128,) (128,128,3,3)
res3.3.conv3.* res3_3branch2c{bn_*,w} (512,) (512,) (512,) (512,) (512,128,1,1)
res4.0.conv1.* res4_0branch2a{bn_*,w} (256,) (256,) (256,) (256,) (256,512,1,1)
res4.0.conv2.* res4_0branch2b{bn_*,w} (256,) (256,) (256,) (256,) (256,256,3,3)
res4.0.conv3.* res4_0branch2c{bn_*,w} (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.0.shortcut.* res4_0branch1{bn_*,w} (1024,) (1024,) (1024,) (1024,) (1024,512,1,1)
res4.1.conv1.* res4_1branch2a{bn_*,w} (256,) (256,) (256,) (256,) (256,1024,1,1)
res4.1.conv2.* res4_1branch2b{bn_*,w} (256,) (256,) (256,) (256,) (256,256,3,3)
res4.1.conv3.* res4_1branch2c{bn_*,w} (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.2.conv1.* res4_2branch2a{bn_*,w} (256,) (256,) (256,) (256,) (256,1024,1,1)
res4.2.conv2.* res4_2branch2b{bn_*,w} (256,) (256,) (256,) (256,) (256,256,3,3)
res4.2.conv3.* res4_2branch2c{bn_*,w} (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.3.conv1.* res4_3branch2a{bn_*,w} (256,) (256,) (256,) (256,) (256,1024,1,1)
res4.3.conv2.* res4_3branch2b{bn_*,w} (256,) (256,) (256,) (256,) (256,256,3,3)
res4.3.conv3.* res4_3branch2c{bn_*,w} (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.4.conv1.* res4_4branch2a{bn_*,w} (256,) (256,) (256,) (256,) (256,1024,1,1)
res4.4.conv2.* res4_4branch2b{bn_*,w} (256,) (256,) (256,) (256,) (256,256,3,3)
res4.4.conv3.* res4_4branch2c{bn_*,w} (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res4.5.conv1.* res4_5branch2a{bn_*,w} (256,) (256,) (256,) (256,) (256,1024,1,1)
res4.5.conv2.* res4_5branch2b{bn_*,w} (256,) (256,) (256,) (256,) (256,256,3,3)
res4.5.conv3.* res4_5branch2c{bn_*,w} (1024,) (1024,) (1024,) (1024,) (1024,256,1,1)
res5.0.conv1.* res5_0branch2a{bn_*,w} (512,) (512,) (512,) (512,) (512,1024,1,1)
res5.0.conv2.* res5_0branch2b{bn_*,w} (512,) (512,) (512,) (512,) (512,512,3,3)
res5.0.conv3.* res5_0branch2c{bn_*,w} (2048,) (2048,) (2048,) (2048,) (2048,512,1,1)
res5.0.shortcut.* res5_0branch1{bn_*,w} (2048,) (2048,) (2048,) (2048,) (2048,1024,1,1)
res5.1.conv1.* res5_1branch2a{bn_*,w} (512,) (512,) (512,) (512,) (512,2048,1,1)
res5.1.conv2.* res5_1branch2b{bn_*,w} (512,) (512,) (512,) (512,) (512,512,3,3)
res5.1.conv3.* res5_1branch2c{bn_*,w} (2048,) (2048,) (2048,) (2048,) (2048,512,1,1)
res5.2.conv1.* res5_2branch2a{bn_*,w} (512,) (512,) (512,) (512,) (512,2048,1,1)
res5.2.conv2.* res5_2branch2b{bn_*,w} (512,) (512,) (512,) (512,) (512,512,3,3)
res5.2.conv3.* res5_2branch2c{bn_*,w} (2048,) (2048,) (2048,) (2048,) (2048,512,1,1)
stem.conv1.norm.* res_conv1bn* (64,) (64,) (64,) (64,)
stem.conv1.weight conv1_w (64, 3, 7, 7)

[03/26 21:45:54] fvcore.common.checkpoint INFO: Some model parameters or buffers are not found in the checkpoint: anchor_generator.cell_anchors.0 decoder.bbox_pred.{bias, weight} decoder.bbox_subnet.0.{bias, weight} decoder.bbox_subnet.1.{bias, running_mean, running_var, weight} decoder.bbox_subnet.10.{bias, running_mean, running_var, weight} decoder.bbox_subnet.3.{bias, weight} decoder.bbox_subnet.4.{bias, running_mean, running_var, weight} decoder.bbox_subnet.6.{bias, weight} decoder.bbox_subnet.7.{bias, running_mean, running_var, weight} decoder.bbox_subnet.9.{bias, weight} decoder.cls_score.{bias, weight} decoder.cls_subnet.0.{bias, weight} decoder.cls_subnet.1.{bias, running_mean, running_var, weight} decoder.cls_subnet.3.{bias, weight} decoder.cls_subnet.4.{bias, running_mean, running_var, weight} decoder.object_pred.{bias, weight} encoder.dilated_encoder_blocks.0.conv1.0.{bias, weight} encoder.dilated_encoder_blocks.0.conv1.1.{bias, running_mean, running_var, weight} encoder.dilated_encoder_blocks.0.conv2.0.{bias, weight} encoder.dilated_encoder_blocks.0.conv2.1.{bias, running_mean, running_var, weight} encoder.dilated_encoder_blocks.0.conv3.0.{bias, weight} encoder.dilated_encoder_blocks.0.conv3.1.{bias, running_mean, running_var, weight} encoder.dilated_encoder_blocks.1.conv1.0.{bias, weight} encoder.dilated_encoder_blocks.1.conv1.1.{bias, running_mean, running_var, weight} encoder.dilated_encoder_blocks.1.conv2.0.{bias, weight} encoder.dilated_encoder_blocks.1.conv2.1.{bias, running_mean, running_var, weight} encoder.dilated_encoder_blocks.1.conv3.0.{bias, weight} encoder.dilated_encoder_blocks.1.conv3.1.{bias, running_mean, running_var, weight} encoder.dilated_encoder_blocks.2.conv1.0.{bias, weight} encoder.dilated_encoder_blocks.2.conv1.1.{bias, running_mean, running_var, weight} encoder.dilated_encoder_blocks.2.conv2.0.{bias, weight} encoder.dilated_encoder_blocks.2.conv2.1.{bias, running_mean, running_var, weight} encoder.dilated_encoder_blocks.2.conv3.0.{bias, weight} encoder.dilated_encoder_blocks.2.conv3.1.{bias, running_mean, running_var, weight} encoder.dilated_encoder_blocks.3.conv1.0.{bias, weight} encoder.dilated_encoder_blocks.3.conv1.1.{bias, running_mean, running_var, weight} encoder.dilated_encoder_blocks.3.conv2.0.{bias, weight} encoder.dilated_encoder_blocks.3.conv2.1.{bias, running_mean, running_var, weight} encoder.dilated_encoder_blocks.3.conv3.0.{bias, weight} encoder.dilated_encoder_blocks.3.conv3.1.{bias, running_mean, running_var, weight} encoder.fpn_conv.{bias, weight} encoder.fpn_norm.{bias, running_mean, running_var, weight} encoder.lateral_conv.{bias, weight} encoder.lateral_norm.{bias, running_mean, running_var, weight} [03/26 21:45:54] fvcore.common.checkpoint INFO: The checkpoint state_dict contains keys that are not used by the model: fc1000.{bias, weight} stem.conv1.bias [03/26 21:45:54] d2.engine.train_loop INFO: Starting training from iteration 0

Wei-i commented 3 years ago

[03/26 21:45:54 d2.engine.train_loop]: Starting training from iteration 0 /opt/conda/conda-bld/pytorch_1595629408163/work/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [120,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. /opt/conda/conda-bld/pytorch_1595629408163/work/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [61,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed. terminate called after throwing an instance of 'c10::Error' what(): CUDA error: device-side assert triggered Exception raised from create_event_internal at /opt/conda/conda-bld/pytorch_1595629408163/work/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x4d (0x7f4c7325377d in /home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/lib/libc10.so) frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xb5d (0x7f4c734a3d9d in /home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/lib/libc10_cuda.so) frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f4c7323fb1d in /home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/lib/libc10.so) frame #3: + 0x53f0ea (0x7f4ca5ba30ea in /home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/lib/libtorch_python.so) frame #4: + 0x1809da (0x5589712629da in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #5: + 0xfc039 (0x5589711de039 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #6: + 0xfa678 (0x5589711dc678 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #7: + 0xfa938 (0x5589711dc938 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #8: + 0xfa938 (0x5589711dc938 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #9: + 0xfa348 (0x5589711dc348 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #10: + 0xfadd8 (0x5589711dcdd8 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #11: + 0xfadec (0x5589711dcdec in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #12: + 0xfadec (0x5589711dcdec in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #13: + 0xfadec (0x5589711dcdec in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #14: + 0xfadec (0x5589711dcdec in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #15: + 0xfadec (0x5589711dcdec in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #16: + 0xfadec (0x5589711dcdec in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #17: + 0xfadec (0x5589711dcdec in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #18: + 0xfadec (0x5589711dcdec in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #19: + 0xfb238 (0x5589711dd238 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #20: + 0xfb2db (0x5589711dd2db in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #21: + 0x1dc923 (0x5589712be923 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #22: _PyEval_EvalFrameDefault + 0x27b8 (0x5589712a6ea8 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #23: _PyFunction_FastCallKeywords + 0x187 (0x55897121a767 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #24: + 0x17f335 (0x558971261335 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #25: _PyEval_EvalFrameDefault + 0x611 (0x5589712a4d01 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #26: _PyFunction_FastCallKeywords + 0x187 (0x55897121a767 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #27: _PyEval_EvalFrameDefault + 0x3f5 (0x5589712a4ae5 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame 
#28: _PyEval_EvalCodeWithName + 0x252 (0x5589711fadb2 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #29: _PyFunction_FastCallKeywords + 0x583 (0x55897121ab63 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #30: + 0x17f335 (0x558971261335 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #31: _PyEval_EvalFrameDefault + 0x13fe (0x5589712a5aee in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #32: _PyEval_EvalCodeWithName + 0x252 (0x5589711fadb2 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #33: PyEval_EvalCode + 0x23 (0x5589711fc1e3 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #34: + 0x2271d2 (0x5589713091d2 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #35: PyRun_StringFlags + 0x7a (0x55897131417a in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #36: PyRun_SimpleStringFlags + 0x3c (0x5589713141dc in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #37: + 0x2322d9 (0x5589713142d9 in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #38: _Py_UnixMain + 0x3c (0x55897131467c in /home/cw/miniconda3/envs/py_dt2/bin/python) frame #39: __libc_start_main + 0xf0 (0x7f4cbbde3830 in /lib/x86_64-linux-gnu/libc.so.6) frame #40: + 0x1d7101 (0x5589712b9101 in /home/cw/miniconda3/envs/py_dt2/bin/python)

Traceback (most recent call last):
  File "./tools/train_net.py", line 234, in <module>
    args=(args,),
  File "/home/cw/detectron2/detectron2/engine/launch.py", line 59, in launch
    daemon=False,
  File "/home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 200, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 158, in start_processes
    while not context.join():
  File "/home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 119, in join
    raise Exception(msg)
Exception:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 20, in _wrap
    fn(i, *args)
  File "/home/cw/detectron2/detectron2/engine/launch.py", line 94, in _distributed_worker
    main_func(*args)
  File "/home/cw/YOLOF/tools/train_net.py", line 221, in main
    return trainer.train()
  File "/home/cw/detectron2/detectron2/engine/defaults.py", line 431, in train
    super().train(self.start_iter, self.max_iter)
  File "/home/cw/detectron2/detectron2/engine/train_loop.py", line 140, in train
    self.run_step()
  File "/home/cw/detectron2/detectron2/engine/defaults.py", line 441, in run_step
    self._trainer.run_step()
  File "/home/cw/detectron2/detectron2/engine/train_loop.py", line 234, in run_step
    loss_dict = self.model(data)
  File "/home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 511, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/cw/miniconda3/envs/py_dt2/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/cw/YOLOF/yolof/modeling/yolof.py", line 295, in forward
    pred_logits, pred_anchor_deltas)
  File "/home/cw/YOLOF/yolof/modeling/yolof.py", line 394, in losses
    pred_class_logits[valid_idxs],
RuntimeError: copy_if failed to synchronize: device-side assert triggered

Wei-i commented 3 years ago

Command: python ./tools/train_net.py --num-gpus 2 --config-file ./configs/yolof_R_50_C5_1x.yaml OUTPUT_DIR /hdd2/wh/cw/train/yolof/R_50_C5_1x/

yaml:

MODEL:
  META_ARCHITECTURE: "YOLOF"
  BACKBONE:
    NAME: "build_resnet_backbone"
  RESNETS:
    OUT_FEATURES: ["res5"]
DATASETS:
  TRAIN: ("coco_2017_train",)
  TEST: ("coco_2017_val",)
DATALOADER:
  NUM_WORKERS: 8
SOLVER:
  # IMS_PER_BATCH: 64
  IMS_PER_BATCH: 16
  # BASE_LR: 0.12
  BASE_LR: 0.03
  WARMUP_FACTOR: 0.0002  # 0.00066667
  WARMUP_ITERS: 5000  # 1500
  # STEPS: (15000, 20000)
  STEPS: (60000, 80000)
  # MAX_ITER: 22500
  MAX_ITER: 90000
  CHECKPOINT_PERIOD: 2500
INPUT:
  MIN_SIZE_TRAIN: (800,)
OUTPUT_DIR: '/hdd2/wh/cw/train/yolof/R_50_C5_1x'

Wei-i commented 3 years ago

Hello author, my rough guess is that this is a data-related error, perhaps an index going out of bounds somewhere? But my dataset is COCO 2017, so this error should not happen; I have not changed anything else, and I also re-installed/upgraded my original detectron2. I noticed that some input images are fine: the network can still iterate a few times and print the loss.

chensnathan commented 3 years ago

From the error, an index goes out of bounds during indexing, but it is strange: I have run this many times and never hit this problem. A quick look at the log file does not show anything obviously wrong either, so it does not make much sense to me... Does this error show up every single time you run it?

Wei-i commented 3 years ago

Yes, it goes out of bounds on every run... I added some debugging around the failing line in yolof.py, e.g. printing pred_class_logits[valid_idxs].size().

Wei-i commented 3 years ago

print("gt_classes >= 0", gt_classes[gt_classes >= 0].size())

Wei-i commented 3 years ago

1.

gt_class.size() torch.Size([38000])
gt_classes >= 0 torch.Size([37818])
valid_idxs torch.Size([38000])
pred_class_logits.size() torch.Size([38000, 80])
pred_class_logits[valid_idxs] torch.Size([37818, 80])

2.

gt_class.size() torch.Size([42000])
gt_classes >= 0 torch.Size([41866])
valid_idxs torch.Size([42000])
pred_class_logits.size() torch.Size([42000, 80])
pred_class_logits[valid_idxs] torch.Size([41866, 80])

3.

gt_class.size() torch.Size([38000]); the print("gt_classes >= 0", gt_classes[gt_classes >= 0].size()) line produced no output before the crash:

/opt/conda/conda-bld/pytorch_1595629408163/work/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [9,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/opt/conda/conda-bld/pytorch_1595629408163/work/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [11,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.

chensnathan commented 3 years ago

I do not think this part should go wrong. I suggest running

CUDA_LAUNCH_BLOCKING=1 python ./tools/train_net.py --num-gpus 2 --config-file ./configs/yolof_R_50_C5_1x.yaml OUTPUT_DIR /hdd2/wh/cw/train/yolof/R_50_C5_1x/

to see exactly where the error occurs. Alternatively, set up the environment again on another machine and try running there; in principle it should work out of the box.

Wei-i commented 3 years ago

It still does not work. I may have to try it on another machine; I am not sure whether my CUDA 9.0 is causing the problem.

Wei-i commented 3 years ago

Hello author, I switched to a machine with CUDA 10.1 and repeated my previous steps (basically a fresh install, with the dataset transferred directly from the original server). Now everything works... strange. For the moment I can only attribute it to my earlier CUDA 9.0 / cudatoolkit 9.2 not working well with your code? (AdelaiDet still runs fine on the old machine; it had not been updated for about half a year, I updated it today, made a small change to train.py, and it works without problems.)

Wei-i commented 3 years ago

One more small thing you might consider fixing: the mish-cuda package you recommend installing does not seem to build directly. After git clone, you should move mish-cuda/external/CUDAApplyUtils.cuh to csrc/ before running python setup.py build install. A rough helper for that move is sketched after the link below.

link-issue
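
A minimal way to do that move from Python (paths assume the default mish-cuda checkout layout; adjust if your clone lives elsewhere):

import shutil

# Move the CUDA header into csrc/ so `python setup.py build install` can find it.
shutil.move("mish-cuda/external/CUDAApplyUtils.cuh", "mish-cuda/csrc/CUDAApplyUtils.cuh")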