aim-uofa / AdelaiDet

AdelaiDet is an open source toolbox for multiple instance-level detection and recognition tasks.
https://git.io/AdelaiDet

BlendMask RT trained AP got 0 in BN #65

Closed lucasjinreal closed 4 years ago

lucasjinreal commented 4 years ago

Here are the train and eval commands I am using:

python3 tools/train_net.py \
    --config-file configs/BlendMask/RT_R_50_4x_bn.yaml \
    --num-gpus 3 --eval-only \
    MODEL.WEIGHTS output/blendmask/RT_R_50_4x/model_0294999.pth

python3 tools/train_net.py \
    --config-file configs/BlendMask/RT_R_50_4x_bn.yaml \
    --num-gpus 3

I am training on 3 GPUs, and I have changed the learning rate along with the batch size:

cat configs/BlendMask/Base-BlendMask.yaml 
MODEL:
  META_ARCHITECTURE: "BlendMask"
  MASK_ON: True
  BACKBONE:
    NAME: "build_fcos_resnet_fpn_backbone"
  RESNETS:
    OUT_FEATURES: ["res3", "res4", "res5"]
  FPN:
    IN_FEATURES: ["res3", "res4", "res5"]
  PROPOSAL_GENERATOR:
    NAME: "FCOS"
  BASIS_MODULE:
    LOSS_ON: True
  PANOPTIC_FPN:
    COMBINE:
      ENABLED: False
  FCOS:
    THRESH_WITH_CTR: True
    USE_SCALE: False
DATASETS:
  TRAIN: ("coco_2017_train",)
  TEST: ("coco_2017_val",)
SOLVER:
  IMS_PER_BATCH: 6
  BASE_LR: 0.005  # Note that RetinaNet uses a different default learning rate
  STEPS: (60000, 80000)
  MAX_ITER: 90000
INPUT:
  MIN_SIZE_TRAIN: (640, 672, 704, 736, 768, 800)
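
As an aside, detectron2's linear scaling rule says `BASE_LR` should scale proportionally with `IMS_PER_BATCH`. A minimal sketch, assuming the common reference point of 0.01 at batch size 16 (these reference numbers are an assumption here, so check your own base config):

```python
# Linear scaling rule sketch: scale the learning rate with total batch size.
# Assumed reference point: BASE_LR 0.01 at IMS_PER_BATCH 16 (verify against
# the upstream base config before relying on these numbers).
REFERENCE_LR = 0.01
REFERENCE_BATCH = 16

def scaled_lr(ims_per_batch: int) -> float:
    """Return the learning rate scaled linearly with the total batch size."""
    return REFERENCE_LR * ims_per_batch / REFERENCE_BATCH

print(scaled_lr(6))  # roughly 0.00375 for 3 GPUs x 2 images
```

Under this assumption, `IMS_PER_BATCH: 6` would correspond to roughly 0.00375, slightly below the 0.005 used above.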

The model config I changed:

_BASE_: "Base-550.yaml"
INPUT:
  MIN_SIZE_TRAIN: (256, 288, 320, 352, 384, 416, 448, 480, 512, 544, 576, 608)
  MAX_SIZE_TRAIN: 900
  MAX_SIZE_TEST: 736
  MIN_SIZE_TEST: 512
MODEL:
  WEIGHTS: "detectron2://ImageNetPretrained/MSRA/R-50.pkl"
  RESNETS:
    DEPTH: 50
    NORM: "BN"
  BACKBONE:
    FREEZE_AT: -1
SOLVER:
  STEPS: (300000, 340000)
  MAX_ITER: 360000
OUTPUT_DIR: "output/blendmask/RT_R_50_4x"

Does BlendMask RT really work?

stan-haochen commented 4 years ago

First of all, I do not think you understand what BN head means in this discussion: https://github.com/aim-uofa/AdelaiDet/issues/43.

The part that needs modification is the HEAD, not the backbone.

For the training problem you have met, I suggest you first try training a BN head FCOS. It will help you figure out what's wrong.

AP will not be zero unless the checkpoint was not loaded correctly. But in any case, you need either a larger batch size or a longer training schedule to reach the performance we reported.
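
One quick sanity check for the checkpoint-loading theory is to diff the state-dict keys in the weights file against the model's own keys. A rough sketch with a hypothetical helper (`diff_state_dicts` is not part of detectron2; its `DetectionCheckpointer` already logs missing/unexpected keys when loading):

```python
# Hypothetical helper: diff checkpoint keys against model keys to catch a
# silently failed load (which would explain an AP of exactly 0).
def diff_state_dicts(model_keys, ckpt_keys):
    model_keys, ckpt_keys = set(model_keys), set(ckpt_keys)
    return {
        "missing": sorted(model_keys - ckpt_keys),     # params left at init
        "unexpected": sorted(ckpt_keys - model_keys),  # params never loaded
    }

# Illustrative key names only, not the real BlendMask parameter names.
report = diff_state_dicts(
    ["backbone.res3.conv1.weight", "fcos_head.cls_logits.weight"],
    ["backbone.res3.conv1.weight"],
)
print(report["missing"])  # ['fcos_head.cls_logits.weight']
```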

p.s. BN with batch size 2 per GPU can be unstable. I am not sure, since we never tried that or claimed it would work. I suggest using SyncBN, which is the same as BN for export and migration.
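
If the head is built with plain `BatchNorm2d` layers, PyTorch can swap them for SyncBN with `torch.nn.SyncBatchNorm.convert_sync_batchnorm`. A minimal sketch on a toy module (in AdelaiDet configs you would normally just set the head's `NORM` to `"SyncBN"` instead):

```python
import torch.nn as nn

# Toy stand-in for a detection head that uses per-GPU BatchNorm.
head = nn.Sequential(
    nn.Conv2d(256, 256, 3, padding=1),
    nn.BatchNorm2d(256),
    nn.ReLU(),
)

# Replace every BatchNorm*d in the module tree with SyncBatchNorm, which
# aggregates batch statistics across GPUs (needs torch.distributed at runtime).
head = nn.SyncBatchNorm.convert_sync_batchnorm(head)
print(type(head[1]).__name__)  # SyncBatchNorm
```

With 3 GPUs at 2 images each, SyncBN normalizes over the full batch of 6 rather than per-GPU batches of 2, which avoids the instability mentioned above.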

lucasjinreal commented 4 years ago

Then how do I alter the head BN? Converting the model to ONNX requires eliminating all GN, including in the backbone. Isn't batch size 2 per GPU the same as batch size 16 across 8 GPUs?

I think the AP of 0 here comes from a wrong combination of these configurations. Would you suggest a working BlendMask RT model config (with 3 GPUs, perhaps)?

stan-haochen commented 4 years ago

Use the config I linked in the last comment, BN head FCOS. GN is not in the backbone but in the FCOS head.

Use SyncBN instead of BN.

lucasjinreal commented 4 years ago

_BASE_: "Base-BlendMask.yaml"
MODEL:
  FCOS:
    TOP_LEVELS: 1
    IN_FEATURES: ["p3", "p4", "p5", "p6"]
    FPN_STRIDES: [8, 16, 32, 64]
    SIZES_OF_INTEREST: [64, 128, 256]
    NUM_SHARE_CONVS: 3
    NUM_CLS_CONVS: 0
    NUM_BOX_CONVS: 0
    NORM: "SyncBN"  # <- the line I changed
  BASIS_MODULE:
    NUM_CONVS: 2
INPUT:
  MIN_SIZE_TRAIN: (440, 462, 484, 506, 528, 550)
  MAX_SIZE_TRAIN: 916
  MIN_SIZE_TEST: 550
  MAX_SIZE_TEST: 916

I am sorry, do you mean this one?

stan-haochen commented 4 years ago

Yes. For information related to ONNX export, please refer to this page: https://github.com/aim-uofa/AdelaiDet/tree/master/onnx

engrjav commented 2 years ago

@jinfagang did you happen to solve this?