@cool-xuan indeed, 10 AP is quite a big difference. Could you please try to compensate for the smaller batch size by increasing the total number of iterations, to see whether the results improve? You ran about 3 epochs fewer than the standard schedule. The results in the model zoo are expected to be reproducible, with a standard deviation of about 0.1 in the DP AP GPSm metric.
Thanks for your advice. I'll increase the total number of iterations and decrease the learning rate as you said. Could you tell me how many epochs are needed to reproduce the results?
@cool-xuan it's hard to give an exact recipe for adjusting your training schedule to match the one from the model zoo. In terms of pure image counts, you ran 130000 iterations with 15 images per batch, which is 130000 images fewer than the standard schedule (16 images per batch). So you need to add 8667 iterations to match the amount of data. You'll then also need to adjust the schedule, notably the learning rate (`SOLVER.BASE_LR`) and the warmup factor (`SOLVER.WARMUP_FACTOR`).
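The image-count arithmetic in the comment above can be checked with a few lines of Python. This is only a sketch: the helper name `extra_iterations` is made up for illustration, and the numbers (130000 iterations, 16 versus 15 images per batch) are the ones quoted in this thread.

```python
import math


def extra_iterations(max_iter, reference_batch, actual_batch):
    """Iterations to add so a run with a smaller batch sees as many
    images as the reference schedule (hypothetical helper)."""
    # Images the reference run sees that the smaller-batch run does not.
    missing_images = max_iter * (reference_batch - actual_batch)
    # Each added iteration consumes `actual_batch` images; round up.
    return math.ceil(missing_images / actual_batch)


missing = 130000 * (16 - 15)
extra = extra_iterations(130000, 16, 15)
print(missing, extra)  # 130000 8667
```

This only equalizes the amount of data seen; it says nothing about how the learning rate or warmup should change.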
Did you train the models in the model zoo using just the provided config (`MAX_ITER: 130000`, `STEPS: (100000, 120000)`)?
I don't think the results will become reproducible just by adding 8667 iterations.
Are there any other suggestions for my problem? I just copied the conda env from another server to mine, because the download speed on my server is very slow. Could this cause a problem?
@cool-xuan yes, I train the models on 1 machine with 8 GPUs, using the exact config from the model zoo. Changing the batch size requires readjusting the training schedule (e.g. the learning rate curve). Copying the conda env should not be an issue.
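One common heuristic for readjusting a schedule after a batch-size change is the linear scaling rule: scale the learning rate by the batch-size ratio and stretch the iteration counts by its inverse. The sketch below assumes that rule; it is not a recipe given in this thread, and it may not be sufficient to match the model zoo numbers. The function name `rescale_schedule` is hypothetical. The input values (`BASE_LR: 0.01`, `MAX_ITER: 130000`, `STEPS: (100000, 120000)`) come from the config posted in this issue, and the reference batch size of 16 from the discussion.

```python
def rescale_schedule(base_lr, max_iter, steps, reference_batch, new_batch):
    """Rescale solver settings with the linear scaling rule.

    Assumption: learning rate is proportional to batch size and the
    number of iterations is inversely proportional to it. This is a
    heuristic, not a guarantee of identical results.
    """
    scale = new_batch / reference_batch
    return {
        "BASE_LR": base_lr * scale,
        "MAX_ITER": round(max_iter / scale),
        "STEPS": tuple(round(s / scale) for s in steps),
    }


# Reference schedule: 16 images per batch; new run: 15 images per batch.
new_solver = rescale_schedule(0.01, 130000, (100000, 120000), 16, 15)
print(new_solver)
```

The resulting values could then be set under `SOLVER` in the config file. The warmup settings are left unchanged here, which is also an assumption.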
@cool-xuan I tried relaunching the training on my side (8 GPUs, 16 images per batch) and got results similar to yours:

| bbox AP | dp AP GPS | dp AP GPSm |
|---|---|---|
| 58.6741 | 54.7701 | 58.4207 |

This is obviously too low and unexpected. I'm going to investigate and post an update here. Thank you for flagging the issue!
@vkhalidov Thank you for relaunching the training. I am a beginner in DensePose and not very familiar with detectron2, but I'm trying to find the bug too. Waiting for your update. This is really impressive and massive work. Thanks a lot again.
@cool-xuan while I'm investigating the issue, you can use commit aef142769953de9b8c117138d15e146633949ea2, for which the scores match the ones reported in the model zoo.
@cool-xuan commit 4921a51f10ca196fb9741e91878da4ccb20d511f should have fixed the issue; all baselines from the model zoo should now be reproducible.
@vkhalidov Thanks again for your work.
If you do not know the root cause of the problem, and wish someone to help you, please post according to this template:
Instructions To Reproduce the Issue:
Check https://stackoverflow.com/help/minimal-reproducible-example for how to ask good questions. Simplify the steps to reproduce the issue using suggestions from the above link, and provide them below:
Changes made (`git diff`):
OUTPUT_DIR: "/raid/zyx/detectron2_torch1.5/output/densepose_rcnn_R_50_FPN_s1x"
VERSION: 2
full config: BOOTSTRAP_DATASETS: [] BOOTSTRAP_MODEL: DEVICE: cuda WEIGHTS: CUDNN_BENCHMARK: False DATALOADER: ASPECT_RATIO_GROUPING: True FILTER_EMPTY_ANNOTATIONS: True NUM_WORKERS: 4 REPEAT_THRESHOLD: 0.0 SAMPLER_TRAIN: TrainingSampler DATASETS: CATEGORY_MAPS:
PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000 PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000 PROPOSAL_FILES_TEST: () PROPOSAL_FILES_TRAIN: () TEST: ('densepose_coco_2014_minival',) TRAIN: ('densepose_coco_2014_train', 'densepose_coco_2014_valminusminival') WHITELISTED_CATEGORIES:
GLOBAL: HACK: 1.0 INPUT: CROP: ENABLED: False SIZE: [0.9, 0.9] TYPE: relative_range FORMAT: BGR MASK_FORMAT: polygon MAX_SIZE_TEST: 1333 MAX_SIZE_TRAIN: 1333 MIN_SIZE_TEST: 800 MIN_SIZE_TRAIN: (640, 672, 704, 736, 768, 800) MIN_SIZE_TRAIN_SAMPLING: choice ROTATION_ANGLES: [0] MODEL: ANCHOR_GENERATOR: ANGLES: [[-90, 0, 90]] ASPECT_RATIOS: [[0.5, 1.0, 2.0]] NAME: DefaultAnchorGenerator OFFSET: 0.0 SIZES: [[32], [64], [128], [256], [512]] BACKBONE: FREEZE_AT: 2 NAME: build_resnet_fpn_backbone DENSEPOSE_ON: True DEVICE: cuda FPN: FUSE_TYPE: sum IN_FEATURES: ['res2', 'res3', 'res4', 'res5'] NORM: OUT_CHANNELS: 256 HRNET: HRFPN: OUT_CHANNELS: 256 STAGE2: BLOCK: BASIC FUSE_METHOD: SUM NUM_BLOCKS: [4, 4] NUM_BRANCHES: 2 NUM_CHANNELS: [32, 64] NUM_MODULES: 1 STAGE3: BLOCK: BASIC FUSE_METHOD: SUM NUM_BLOCKS: [4, 4, 4] NUM_BRANCHES: 3 NUM_CHANNELS: [32, 64, 128] NUM_MODULES: 4 STAGE4: BLOCK: BASIC FUSE_METHOD: SUM NUM_BLOCKS: [4, 4, 4, 4] NUM_BRANCHES: 4 NUM_CHANNELS: [32, 64, 128, 256] NUM_MODULES: 3 STEM_INPLANES: 64 KEYPOINT_ON: False LOAD_PROPOSALS: False MASK_ON: False META_ARCHITECTURE: GeneralizedRCNN PANOPTIC_FPN: COMBINE: ENABLED: True INSTANCES_CONFIDENCE_THRESH: 0.5 OVERLAP_THRESH: 0.5 STUFF_AREA_LIMIT: 4096 INSTANCE_LOSS_WEIGHT: 1.0 PIXEL_MEAN: [103.53, 116.28, 123.675] PIXEL_STD: [1.0, 1.0, 1.0] PROPOSAL_GENERATOR: MIN_SIZE: 0 NAME: RPN RESNETS: DEFORM_MODULATED: False DEFORM_NUM_GROUPS: 1 DEFORM_ON_PER_STAGE: [False, False, False, False] DEPTH: 50 NORM: FrozenBN NUM_GROUPS: 1 OUT_FEATURES: ['res2', 'res3', 'res4', 'res5'] RES2_OUT_CHANNELS: 256 RES5_DILATION: 1 STEM_OUT_CHANNELS: 64 STRIDE_IN_1X1: True WIDTH_PER_GROUP: 64 RETINANET: BBOX_REG_WEIGHTS: (1.0, 1.0, 1.0, 1.0) FOCAL_LOSS_ALPHA: 0.25 FOCAL_LOSS_GAMMA: 2.0 IN_FEATURES: ['p3', 'p4', 'p5', 'p6', 'p7'] IOU_LABELS: [0, -1, 1] IOU_THRESHOLDS: [0.4, 0.5] NMS_THRESH_TEST: 0.5 NUM_CLASSES: 80 NUM_CONVS: 4 PRIOR_PROB: 0.01 SCORE_THRESH_TEST: 0.05 SMOOTH_L1_LOSS_BETA: 0.1 TOPK_CANDIDATES_TEST: 1000 
ROI_BOX_CASCADE_HEAD: BBOX_REG_WEIGHTS: ((10.0, 10.0, 5.0, 5.0), (20.0, 20.0, 10.0, 10.0), (30.0, 30.0, 15.0, 15.0)) IOUS: (0.5, 0.6, 0.7) ROI_BOX_HEAD: BBOX_REG_LOSS_TYPE: smooth_l1 BBOX_REG_LOSS_WEIGHT: 1.0 BBOX_REG_WEIGHTS: (10.0, 10.0, 5.0, 5.0) CLS_AGNOSTIC_BBOX_REG: False CONV_DIM: 256 FC_DIM: 1024 NAME: FastRCNNConvFCHead NORM: NUM_CONV: 0 NUM_FC: 2 POOLER_RESOLUTION: 7 POOLER_SAMPLING_RATIO: 2 POOLER_TYPE: ROIAlign SMOOTH_L1_BETA: 0.0 TRAIN_ON_PRED_BOXES: False ROI_DENSEPOSE_HEAD: COARSE_SEGM_TRAINED_BY_MASKS: False CONV_HEAD_DIM: 512 CONV_HEAD_KERNEL: 3 DECODER_COMMON_STRIDE: 4 DECODER_CONV_DIMS: 256 DECODER_NORM: DECODER_NUM_CLASSES: 256 DECODER_ON: True DECONV_KERNEL: 4 DEEPLAB: NONLOCAL_ON: 0 NORM: GN FG_IOU_THRESHOLD: 0.7 HEATMAP_SIZE: 112 INDEX_WEIGHTS: 5.0 NAME: DensePoseV1ConvXHead NUM_COARSE_SEGM_CHANNELS: 2 NUM_PATCHES: 24 NUM_STACKED_CONVS: 8 PART_WEIGHTS: 1.0 POINT_REGRESSION_WEIGHTS: 0.01 POOLER_RESOLUTION: 28 POOLER_SAMPLING_RATIO: 2 POOLER_TYPE: ROIAlign SEGM_CONFIDENCE: ENABLED: False EPSILON: 0.01 UP_SCALE: 2 UV_CONFIDENCE: ENABLED: False EPSILON: 0.01 TYPE: iid_iso ROI_HEADS: BATCH_SIZE_PER_IMAGE: 512 IN_FEATURES: ['p2', 'p3', 'p4', 'p5'] IOU_LABELS: [0, 1] IOU_THRESHOLDS: [0.5] NAME: DensePoseROIHeads NMS_THRESH_TEST: 0.5 NUM_CLASSES: 1 POSITIVE_FRACTION: 0.25 PROPOSAL_APPEND_GT: True SCORE_THRESH_TEST: 0.05 ROI_KEYPOINT_HEAD: CONV_DIMS: (512, 512, 512, 512, 512, 512, 512, 512) LOSS_WEIGHT: 1.0 MIN_KEYPOINTS_PER_IMAGE: 1 NAME: KRCNNConvDeconvUpsampleHead NORMALIZE_LOSS_BY_VISIBLE_KEYPOINTS: True NUM_KEYPOINTS: 17 POOLER_RESOLUTION: 14 POOLER_SAMPLING_RATIO: 0 POOLER_TYPE: ROIAlignV2 ROI_MASK_HEAD: CLS_AGNOSTIC_MASK: False CONV_DIM: 256 NAME: MaskRCNNConvUpsampleHead NORM: NUM_CONV: 0 POOLER_RESOLUTION: 14 POOLER_SAMPLING_RATIO: 0 POOLER_TYPE: ROIAlignV2 RPN: BATCH_SIZE_PER_IMAGE: 256 BBOX_REG_LOSS_TYPE: smooth_l1 BBOX_REG_LOSS_WEIGHT: 1.0 BBOX_REG_WEIGHTS: (1.0, 1.0, 1.0, 1.0) BOUNDARY_THRESH: -1 HEAD_NAME: StandardRPNHead IN_FEATURES: 
['p2', 'p3', 'p4', 'p5', 'p6'] IOU_LABELS: [0, -1, 1] IOU_THRESHOLDS: [0.3, 0.7] LOSS_WEIGHT: 1.0 NMS_THRESH: 0.7 POSITIVE_FRACTION: 0.5 POST_NMS_TOPK_TEST: 1000 POST_NMS_TOPK_TRAIN: 1000 PRE_NMS_TOPK_TEST: 1000 PRE_NMS_TOPK_TRAIN: 2000 SMOOTH_L1_BETA: 0.0 SEM_SEG_HEAD: COMMON_STRIDE: 4 CONVS_DIM: 128 IGNORE_VALUE: 255 IN_FEATURES: ['p2', 'p3', 'p4', 'p5'] LOSS_WEIGHT: 1.0 NAME: SemSegFPNHead NORM: GN NUM_CLASSES: 54 WEIGHTS: pretrained_weights/ImageNetPretrained/R-50.pkl OUTPUT_DIR: /raid/zyx/detectron2/output/densepose_rcnn_R_50_FPN_s1x SEED: -1 SOLVER: BASE_LR: 0.01 BIAS_LR_FACTOR: 1.0 CHECKPOINT_PERIOD: 10000 CLIP_GRADIENTS: CLIP_TYPE: value CLIP_VALUE: 1.0 ENABLED: False NORM_TYPE: 2.0 GAMMA: 0.1 IMS_PER_BATCH: 18 LR_SCHEDULER_NAME: WarmupMultiStepLR MAX_ITER: 130000 MOMENTUM: 0.9 NESTEROV: False REFERENCE_WORLD_SIZE: 0 STEPS: (100000, 120000) WARMUP_FACTOR: 0.1 WARMUP_ITERS: 1000 WARMUP_METHOD: linear WEIGHT_DECAY: 0.0001 WEIGHT_DECAY_BIAS: 0.0001 WEIGHT_DECAY_NORM: 0.0 TEST: AUG: ENABLED: False FLIP: True MAX_SIZE: 4000 MIN_SIZES: (400, 500, 600, 700, 800, 900, 1000, 1100, 1200) ROTATION_ANGLES: () DETECTIONS_PER_IMAGE: 100 EVAL_PERIOD: 0 EXPECTED_RESULTS: [] KEYPOINT_OKS_SIGMAS: [] PRECISE_BN: ENABLED: False NUM_ITER: 200 VERSION: 2 VIS_PERIOD: 0
sys.platform linux
Python 3.7.7 (default, May 7 2020, 21:25:33) [GCC 7.3.0]
numpy 1.19.1
detectron2 0.2.1 @/home/zhouyixuan/detectron2_torch1.5/detectron2
Compiler GCC 7.4
CUDA compiler CUDA 10.1
detectron2 arch flags sm_70
DETECTRON2_ENV_MODULE
PyTorch 1.5.0 @/home/zhouyixuan/anaconda3/envs/detectron2_torch1.5/lib/python3.7/site-packages/torch
PyTorch debug build False
GPU available True
GPU 0,1,2,3,4,5,6,7 Tesla V100-SXM2-32GB
CUDA_HOME /home/zhouyixuan/cuda-10.1
Pillow 7.2.0
torchvision 0.6.0a0+82fd1c8 @/home/zhouyixuan/anaconda3/envs/detectron2_torch1.5/lib/python3.7/site-packages/torchvision
torchvision arch flags sm_35, sm_50, sm_60, sm_70, sm_75
fvcore 0.1.1.post20200716
cv2 4.3.0
PyTorch built with:
When I train the DensePose model with your provided config (batch size changed from 16 to 15), I get pretty low precision.
Please give me some suggestions; I am really confused.