Iamal1 opened this issue 5 years ago (status: Open)
How did you change the backbone from ResNet 50 to ResNext101?
Hey, I also got the same problem, but with ResNet-50:
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
Change the YAML file in configs/panet like this, and point the command at it: python tools/train_net_step.py --dataset coco2017 --cfg configs/panet/e2e_panet_R-101-FPN_2x_mask.yaml
MODEL:
  TYPE: generalized_rcnn
  CONV_BODY: FPN.fpn_ResNet101_conv5_body
  FASTER_RCNN: True
  MASK_ON: True
NUM_GPUS: 1
SOLVER:
  WEIGHT_DECAY: 0.0001
  LR_POLICY: steps_with_decay
  BASE_LR: 0.02
  GAMMA: 0.1
  MAX_ITER: 20000
  STEPS: [0, 120000, 160000]
FPN:
  FPN_ON: True
  MULTILEVEL_ROIS: True
  MULTILEVEL_RPN: True
  USE_GN: True  # Note: use GN on the FPN-specific layers
RESNETS:
  IMAGENET_PRETRAINED_WEIGHTS: 'data/pretrained_model/resnet101_caffe.pth'
FAST_RCNN:
  ROI_BOX_HEAD: fast_rcnn_heads.roi_Xconv1fc_gn_head_panet  # Note: this is a Conv GN head
  ROI_XFORM_METHOD: RoIAlign
  ROI_XFORM_RESOLUTION: 7
  ROI_XFORM_SAMPLING_RATIO: 2
MRCNN:
  ROI_MASK_HEAD: mask_rcnn_heads.mask_rcnn_fcn_head_v1up4convs_gn_adp_ff  # Note: this is a GN mask head
  RESOLUTION: 28  # (output mask resolution) default 14
  ROI_XFORM_METHOD: RoIAlign
  ROI_XFORM_RESOLUTION: 14  # default 7
  ROI_XFORM_SAMPLING_RATIO: 2  # default 0
  DILATION: 1  # default 2
  CONV_INIT: MSRAFill  # default GaussianFill
TRAIN:
  SCALES: (1200, 1200, 1000, 800, 600, 400)
  MAX_SIZE: 1400
  BATCH_SIZE_PER_IM: 64
  RPN_PRE_NMS_TOP_N: 2000  # Per FPN level
TEST:
  SCALE: 1000
  MAX_SIZE: 1400
  NMS: 0.5
  RPN_PRE_NMS_TOP_N: 1000  # Per FPN level
  RPN_POST_NMS_TOP_N: 1000
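One thing worth checking in the SOLVER section above: MAX_ITER is 20000, while the non-zero STEPS entries are 120000 and 160000. The sketch below is a simplified reading of a steps_with_decay schedule, assuming the learning rate is multiplied by GAMMA each time an entry in STEPS is passed, and ignoring warm-up. The function name `lr_at` is made up for illustration and is not part of the repository.

```python
from bisect import bisect_right

# Values copied from the SOLVER section of the config above.
BASE_LR = 0.02
GAMMA = 0.1
MAX_ITER = 20000
STEPS = [0, 120000, 160000]


def lr_at(cur_iter, base_lr=BASE_LR, gamma=GAMMA, steps=STEPS):
    """Simplified steps_with_decay: decay by gamma at each step boundary.

    Assumption: warm-up is ignored and steps[0] == 0, so the index of the
    current bracket is the number of boundaries passed minus one.
    """
    ind = bisect_right(steps, cur_iter) - 1
    return base_lr * gamma ** ind


# Boundaries that fall inside the training run (the leading 0 is excluded).
reachable = [s for s in STEPS[1:] if s < MAX_ITER]

print("LR at the last iteration:", lr_at(MAX_ITER - 1))
print("Decay steps reached before MAX_ITER:", reachable)
```

Under these assumptions no decay step is reached before training stops, so the learning rate would stay at BASE_LR for the whole run. If that is not intended, either MAX_ITER or STEPS probably needs adjusting.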
Hi all, first, I managed to train PANet with ResNet-50 with batch_size = 8 on 8 GTX 1080 GPUs. But when I changed the backbone from R-50 to ResNeXt101, I ran into an out-of-memory problem. It trains for several steps, but always at the edge of OOM, and eventually fails with OOM. Could I use 4 GPUs to do this? I tried to change the code to make it work, but ran into other problems. The code uses all GPUs by default, so I commented that part out.
After that, I get an index out of range error. Has anyone succeeded with this?
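For restricting the run to 4 GPUs without editing the training code, one common approach is to hide the other devices through the CUDA_VISIBLE_DEVICES environment variable, so that code which uses all visible GPUs only sees the ones you list. A minimal sketch follows; the helper `restrict_visible_gpus` is a hypothetical name, the GPU ids are only an example, and whether this avoids the index out of range error here is untested.

```python
import os


def restrict_visible_gpus(gpu_ids, env=None):
    """Return a copy of the environment that exposes only the given GPUs."""
    new_env = dict(os.environ if env is None else env)
    # CUDA renumbers the listed devices as 0..N-1 inside the child process.
    new_env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return new_env


# Example: expose the first four GPUs only.
env = restrict_visible_gpus([0, 1, 2, 3])

# Training command from this thread; swap in your own config file.
cmd = [
    "python", "tools/train_net_step.py",
    "--dataset", "coco2017",
    "--cfg", "configs/panet/e2e_panet_R-101-FPN_2x_mask.yaml",
]

print("CUDA_VISIBLE_DEVICES =", env["CUDA_VISIBLE_DEVICES"])
print("Command:", " ".join(cmd))
# To launch for real, uncomment the next two lines:
# import subprocess
# subprocess.run(cmd, env=env, check=True)
```

The shell equivalent is to prefix the command with CUDA_VISIBLE_DEVICES=0,1,2,3. Note that using fewer GPUs does not lower the memory needed per GPU, and the batch size may need to be chosen so that it still divides evenly across the visible devices.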