JosephKJ / OWOD

(CVPR 2021 Oral) Open World Object Detection
https://josephkj.in
Apache License 2.0

ValueError: All failure and censoring times must be greater than zero. #14

Closed. kendyChina closed this issue 3 years ago

kendyChina commented 3 years ago

Hello, dear author. When I use train.yaml to train the model and then use val.yaml to train the EBUI component on the validation set, the following error is raised in Fit_Weibull_3P:

ValueError: All failure and censoring times must be greater than zero.

I printed my unk and known variables and found that they do contain values less than zero. As I am not familiar with the Helmholtz free energy formulation, I hope to get your help. Thank you!
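For readers unfamiliar with the term, here is a minimal sketch of how an energy score can end up on either side of zero. It assumes the form commonly used in energy-based models, E = -T * log(sum(exp(logit_i / T))), and borrows T = 1.5 from the TEMPERATURE setting in the configs. The function name and the logit values are made up for illustration, and nothing here is taken from the repository's code.

```python
import math

def free_energy(logits, temperature=1.5):
    """Assumed textbook form: E = -T * log(sum(exp(logit / T)))."""
    scaled = [value / temperature for value in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return -temperature * log_sum_exp

# Illustrative logits only, not values from this issue.
confident = free_energy([8.0, -2.0, -3.0])   # one large positive logit
uncertain = free_energy([-4.0, -5.0, -6.0])  # every logit is negative

print(confident < 0, uncertain > 0)
```

Under this assumed form the sign of the score depends on the logits, so negative values among the samples are not surprising in themselves. A fitter that only accepts strictly positive samples would reject them, which matches the wording of the error message.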

JosephKJ commented 3 years ago

I have never faced this issue. Can you post the complete log from running the python tools/train_net.py command, along with the command itself?

kendyChina commented 3 years ago

Yes, of course. This is the trimmed log from running python tools/train_net.py --num-gpus 2 --config-file ./configs/CGSH_train.yaml

[03/18 11:45:32] detectron2 INFO: Rank of current process: 0. World size: 2
[03/18 11:45:32] detectron2 INFO: Environment info:
----------------------  -------------------------------------------------------------------------------------------------------------
sys.platform            linux
Python                  3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19) [GCC 7.2.0]
numpy                   1.19.5
detectron2              0.2.1 @/home/ma-user/anaconda3/lib/python3.6/site-packages/detectron2-0.2.1-py3.6-linux-x86_64.egg/detectron2
Compiler                GCC 5.4
CUDA compiler           CUDA 10.1
detectron2 arch flags   7.0
DETECTRON2_ENV_MODULE   <not set>
PyTorch                 1.8.0 @/home/ma-user/anaconda3/lib/python3.6/site-packages/torch
PyTorch debug build     False
GPU available           True
GPU 0,1                 Tesla V100-PCIE-32GB (arch=7.0)
CUDA_HOME               /usr/local/cuda
Pillow                  8.1.2
torchvision             0.9.0 @/home/ma-user/anaconda3/lib/python3.6/site-packages/torchvision
torchvision arch flags  3.5, 5.0, 6.0, 7.0, 7.5
fvcore                  0.1.3.post20210311
cv2                     3.4.0
----------------------  -------------------------------------------------------------------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70
  - CuDNN 7.6.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

[03/18 11:45:32] detectron2 INFO: Command line arguments: Namespace(config_file='./configs/CGSH_train.yaml', dist_url='tcp://127.0.0.1:50152', eval_only=False, machine_rank=0, num_gpus=2, num_machines=1, opts=[], resume=False)
[03/18 11:45:32] detectron2 INFO: Running with full config:
CUDNN_BENCHMARK: True
DATALOADER:
  ASPECT_RATIO_GROUPING: True
  FILTER_EMPTY_ANNOTATIONS: False
  NUM_WORKERS: 16
  REPEAT_THRESHOLD: 0.0
  SAMPLER_TRAIN: TrainingSampler
DATASETS:
  PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000
  PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000
  PROPOSAL_FILES_TEST: ()
  PROPOSAL_FILES_TRAIN: ()
  TEST: ('cgsh_test',)
  TRAIN: ('cgsh_train',)
GLOBAL:
  HACK: 1.0
INPUT:
  CROP:
    ENABLED: False
    SIZE: [0.9, 0.9]
    TYPE: relative_range
  FORMAT: BGR
  MASK_FORMAT: polygon
  MAX_SIZE_TEST: 1333
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_TEST: 800
  MIN_SIZE_TRAIN: (480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800)
  MIN_SIZE_TRAIN_SAMPLING: choice
  RANDOM_FLIP: horizontal
MODEL:
  ANCHOR_GENERATOR:
    ANGLES: [[-90, 0, 90]]
    ASPECT_RATIOS: [[0.5, 1.0, 2.0]]
    NAME: DefaultAnchorGenerator
    OFFSET: 0.0
    SIZES: [[32, 64, 128, 256, 512]]
  BACKBONE:
    FREEZE_AT: 2
    NAME: build_resnet_backbone
  DEVICE: cuda
  FPN:
    FUSE_TYPE: sum
    IN_FEATURES: []
    NORM: 
    OUT_CHANNELS: 256
  KEYPOINT_ON: False
  LOAD_PROPOSALS: False
  MASK_ON: False
  META_ARCHITECTURE: GeneralizedRCNN
  PANOPTIC_FPN:
    COMBINE:
      ENABLED: True
      INSTANCES_CONFIDENCE_THRESH: 0.5
      OVERLAP_THRESH: 0.5
      STUFF_AREA_LIMIT: 4096
    INSTANCE_LOSS_WEIGHT: 1.0
  PIXEL_MEAN: [0, 0, 0]
  PIXEL_STD: [1, 1, 1]
  PROPOSAL_GENERATOR:
    MIN_SIZE: 0
    NAME: RPN
  RESNETS:
    DEFORM_MODULATED: False
    DEFORM_NUM_GROUPS: 1
    DEFORM_ON_PER_STAGE: [False, False, False, False]
    DEPTH: 50
    NORM: FrozenBN
    NUM_GROUPS: 1
    OUT_FEATURES: ['res4']
    RES2_OUT_CHANNELS: 256
    RES5_DILATION: 1
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: True
    WIDTH_PER_GROUP: 64
  RETINANET:
    BBOX_REG_LOSS_TYPE: smooth_l1
    BBOX_REG_WEIGHTS: (1.0, 1.0, 1.0, 1.0)
    FOCAL_LOSS_ALPHA: 0.25
    FOCAL_LOSS_GAMMA: 2.0
    IN_FEATURES: ['p3', 'p4', 'p5', 'p6', 'p7']
    IOU_LABELS: [0, -1, 1]
    IOU_THRESHOLDS: [0.4, 0.5]
    NMS_THRESH_TEST: 0.5
    NORM: 
    NUM_CLASSES: 80
    NUM_CONVS: 4
    PRIOR_PROB: 0.01
    SCORE_THRESH_TEST: 0.05
    SMOOTH_L1_LOSS_BETA: 0.1
    TOPK_CANDIDATES_TEST: 1000
  ROI_BOX_CASCADE_HEAD:
    BBOX_REG_WEIGHTS: ((10.0, 10.0, 5.0, 5.0), (20.0, 20.0, 10.0, 10.0), (30.0, 30.0, 15.0, 15.0))
    IOUS: (0.5, 0.6, 0.7)
  ROI_BOX_HEAD:
    BBOX_REG_LOSS_TYPE: smooth_l1
    BBOX_REG_LOSS_WEIGHT: 1.0
    BBOX_REG_WEIGHTS: (10.0, 10.0, 5.0, 5.0)
    CLS_AGNOSTIC_BBOX_REG: False
    CONV_DIM: 256
    FC_DIM: 1024
    NAME: 
    NORM: 
    NUM_CONV: 0
    NUM_FC: 0
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
    SMOOTH_L1_BETA: 0.0
    TRAIN_ON_PRED_BOXES: False
  ROI_HEADS:
    BATCH_SIZE_PER_IMAGE: 512
    IN_FEATURES: ['res4']
    IOU_LABELS: [0, 1]
    IOU_THRESHOLDS: [0.5]
    NAME: Res5ROIHeads
    NMS_THRESH_TEST: 0.5
    NUM_CLASSES: 10
    POSITIVE_FRACTION: 0.25
    PROPOSAL_APPEND_GT: True
    SCORE_THRESH_TEST: 0.05
  ROI_KEYPOINT_HEAD:
    CONV_DIMS: (512, 512, 512, 512, 512, 512, 512, 512)
    LOSS_WEIGHT: 1.0
    MIN_KEYPOINTS_PER_IMAGE: 1
    NAME: KRCNNConvDeconvUpsampleHead
    NORMALIZE_LOSS_BY_VISIBLE_KEYPOINTS: True
    NUM_KEYPOINTS: 17
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
  ROI_MASK_HEAD:
    CLS_AGNOSTIC_MASK: False
    CONV_DIM: 256
    NAME: MaskRCNNConvUpsampleHead
    NORM: 
    NUM_CONV: 0
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
  RPN:
    BATCH_SIZE_PER_IMAGE: 256
    BBOX_REG_LOSS_TYPE: smooth_l1
    BBOX_REG_LOSS_WEIGHT: 1.0
    BBOX_REG_WEIGHTS: (1.0, 1.0, 1.0, 1.0)
    BOUNDARY_THRESH: -1
    HEAD_NAME: StandardRPNHead
    IN_FEATURES: ['res4']
    IOU_LABELS: [0, -1, 1]
    IOU_THRESHOLDS: [0.3, 0.7]
    LOSS_WEIGHT: 1.0
    NMS_THRESH: 0.7
    POSITIVE_FRACTION: 0.5
    POST_NMS_TOPK_TEST: 1000
    POST_NMS_TOPK_TRAIN: 2000
    PRE_NMS_TOPK_TEST: 6000
    PRE_NMS_TOPK_TRAIN: 12000
    SMOOTH_L1_BETA: 0.0
  SEM_SEG_HEAD:
    COMMON_STRIDE: 4
    CONVS_DIM: 128
    IGNORE_VALUE: 255
    IN_FEATURES: ['p2', 'p3', 'p4', 'p5']
    LOSS_WEIGHT: 1.0
    NAME: SemSegFPNHead
    NORM: GN
    NUM_CLASSES: 54
  WEIGHTS: detectron2/model_zoo/R-50.pkl
OUTPUT_DIR: ./output/1_unk
OWOD:
  CLUSTERING:
    ITEMS_PER_CLASS: 20
    MARGIN: 10.0
    MOMENTUM: 0.99
    START_ITER: 1000
    UPDATE_MU_ITER: 3000
    Z_DIMENSION: 128
  COMPUTE_ENERGY: False
  CUR_INTRODUCED_CLS: 1
  ENABLE_CLUSTERING: True
  ENABLE_THRESHOLD_AUTOLABEL_UNK: True
  ENABLE_UNCERTAINITY_AUTOLABEL_UNK: False
  ENERGY_SAVE_PATH: energy
  FEATURE_STORE_SAVE_PATH: feature_store
  NUM_UNK_PER_IMAGE: 1
  PREV_INTRODUCED_CLS: 0
  SKIP_TRAINING_WHILE_EVAL: False
  TEMPERATURE: 1.5
SEED: -1
SOLVER:
  BASE_LR: 0.02
  BIAS_LR_FACTOR: 1.0
  CHECKPOINT_PERIOD: 5000
  CLIP_GRADIENTS:
    CLIP_TYPE: value
    CLIP_VALUE: 1.0
    ENABLED: False
    NORM_TYPE: 2.0
  GAMMA: 0.1
  IMS_PER_BATCH: 16
  LR_SCHEDULER_NAME: WarmupMultiStepLR
  MAX_ITER: 18000
  MOMENTUM: 0.9
  NESTEROV: False
  REFERENCE_WORLD_SIZE: 0
  STEPS: (12000, 16000)
  WARMUP_FACTOR: 0.001
  WARMUP_ITERS: 1000
  WARMUP_METHOD: linear
  WEIGHT_DECAY: 0.0001
  WEIGHT_DECAY_BIAS: 0.0001
  WEIGHT_DECAY_NORM: 0.0
TEST:
  AUG:
    ENABLED: False
    FLIP: True
    MAX_SIZE: 4000
    MIN_SIZES: (400, 500, 600, 700, 800, 900, 1000, 1100, 1200)
  DETECTIONS_PER_IMAGE: 100
  EVAL_PERIOD: 0
  EXPECTED_RESULTS: []
  KEYPOINT_OKS_SIGMAS: []
  PRECISE_BN:
    ENABLED: False
    NUM_ITER: 200
VERSION: 2
VIS_PERIOD: 0
[03/18 11:45:32] detectron2 INFO: Full config saved to ./output/1_unk/config.yaml
[03/18 11:45:32] d2.utils.env INFO: Using a generated random seed 33050046
[03/18 11:45:33] d2.modeling.roi_heads.fast_rcnn INFO: Invalid class range: [1, 2, 3, 4, 5, 6, 7, 8]
[03/18 11:45:33] d2.modeling.roi_heads.fast_rcnn INFO: Feature store not found in ./output/1_unk/feature_store/feat.pt. Creating new feature store.
[03/18 11:45:33] d2.engine.defaults INFO: Model:
GeneralizedRCNN(
  (backbone): ResNet(
    (box_predictor): FastRCNNOutputLayers(
      (cls_score): Linear(in_features=2048, out_features=11, bias=True)
      (bbox_pred): Linear(in_features=2048, out_features=40, bias=True)
      (hingeloss): HingeEmbeddingLoss()
    )
  )
)
[03/18 11:45:34] d2.data.build INFO: Valid classes: range(0, 1)
[03/18 11:45:34] d2.data.build INFO: Removing earlier seen class objects and the unknown objects...
[03/18 11:45:34] d2.data.build INFO: Distribution of instances among all 2 categories:
|  category  | #instances   |  category  | #instances   |
|:----------:|:-------------|:----------:|:-------------|
|   hawker   | 9810         |  unknown   | 0            |
|            |              |            |              |
|   total    | 9810         |            |              |
[03/18 11:45:34] d2.data.build INFO: Number of datapoints: 8704
[03/18 11:45:34] d2.data.common INFO: Serializing 8704 elements to byte tensors and concatenating them all ...
[03/18 11:45:34] d2.data.common INFO: Serialized dataset takes 3.65 MiB
[03/18 11:45:34] d2.data.dataset_mapper INFO: Augmentations used in training: [ResizeShortestEdge(short_edge_length=(480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800), max_size=1333, sample_style='choice'), RandomFlip()]
[03/18 11:45:34] d2.data.build INFO: Using training sampler TrainingSampler
[03/18 11:45:34] fvcore.common.checkpoint INFO: Loading checkpoint from detectron2/model_zoo/R-50.pkl
[03/18 11:45:35] d2.checkpoint.c2_model_loading INFO: Remapping C2 weights ......
[03/18 11:45:35] d2.checkpoint.c2_model_loading INFO: backbone.res2.0.conv1.norm.bias                      loaded from res2_0_branch2a_bn_beta           of shape (64,)
[03/18 11:45:35] d2.checkpoint.c2_model_loading INFO: backbone.res2.0.conv1.norm.running_mean              loaded from res2_0_branch2a_bn_running_mean   of shape (64,)
[03/18 11:45:35] d2.checkpoint.c2_model_loading INFO: backbone.res2.0.conv1.norm.running_var               loaded from res2_0_branch2a_bn_running_var    of shape (64,)
[03/18 11:45:35] d2.checkpoint.c2_model_loading INFO: backbone.res2.0.conv1.norm.weight                    loaded from res2_0_branch2a_bn_gamma          of shape (64,)
[03/18 11:45:35] d2.checkpoint.c2_model_loading INFO: roi_heads.res5.2.conv3.norm.running_mean             loaded from res5_2_branch2c_bn_running_mean   of shape (2048,)
[03/18 11:45:35] d2.checkpoint.c2_model_loading INFO: roi_heads.res5.2.conv3.norm.running_var              loaded from res5_2_branch2c_bn_running_var    of shape (2048,)
[03/18 11:45:35] d2.checkpoint.c2_model_loading INFO: roi_heads.res5.2.conv3.norm.weight                   loaded from res5_2_branch2c_bn_gamma          of shape (2048,)
[03/18 11:45:35] d2.checkpoint.c2_model_loading INFO: roi_heads.res5.2.conv3.weight                        loaded from res5_2_branch2c_w                 of shape (2048, 512, 1, 1)
[03/18 11:45:35] d2.checkpoint.c2_model_loading INFO: Some model parameters or buffers are not found in the checkpoint:
pixel_mean
pixel_std
proposal_generator.anchor_generator.cell_anchors.0
proposal_generator.rpn_head.anchor_deltas.{bias, weight}
proposal_generator.rpn_head.conv.{bias, weight}
proposal_generator.rpn_head.objectness_logits.{bias, weight}
roi_heads.box_predictor.bbox_pred.{bias, weight}
roi_heads.box_predictor.cls_score.{bias, weight}
[03/18 11:45:35] d2.checkpoint.c2_model_loading INFO: The checkpoint state_dict contains keys that are not used by the model:
  fc1000_b
  fc1000_w
  conv1_b
[03/18 11:45:35] d2.engine.train_loop INFO: Starting training from iteration 0
[03/18 11:47:10] d2.utils.events INFO:  eta: 11:23:51  iter: 19  total_loss: 1.29  loss_cls: 0.2751  loss_box_reg: 0.2749  loss_clustering: 0  loss_rpn_cls: 0.6791  loss_rpn_loc: 0.04721  time: 3.1832  data_time: 1.2863  lr: 0.00039962  max_mem: 19392M
[03/18 23:24:48] d2.utils.events INFO:  eta: 0:00:46  iter: 17979  total_loss: 0.1125  loss_cls: 0.02595  loss_box_reg: 0.05373  loss_clustering: 0.01532  loss_rpn_cls: 0.004229  loss_rpn_loc: 0.01234  time: 2.3307  data_time: 0.0148  lr: 0.0002  max_mem: 25820M
[03/18 23:25:34] d2.modeling.roi_heads.fast_rcnn INFO: Saving image store at iteration 17999 to ./output/1_unk/feature_store/feat.pt
[03/18 23:25:35] fvcore.common.checkpoint INFO: Saving checkpoint to ./output/1_unk/model_final.pth
[03/18 23:25:36] d2.utils.events INFO:  eta: 0:00:00  iter: 17999  total_loss: 0.1131  loss_cls: 0.02487  loss_box_reg: 0.05478  loss_clustering: 0.01534  loss_rpn_cls: 0.006022  loss_rpn_loc: 0.01257  time: 2.3307  data_time: 0.0153  lr: 0.0002  max_mem: 25820M
[03/18 23:25:36] d2.engine.hooks INFO: Overall training speed: 17998 iterations in 11:39:08 (2.3307 s / it)
[03/18 23:25:36] d2.engine.hooks INFO: Total training time: 11:39:23 (0:00:15 on hooks)
[03/18 23:25:36] d2.data.build INFO: Known classes: range(0, 1)
[03/18 23:25:36] d2.data.build INFO: Labelling known instances the corresponding label, and unknown instances as unknown...
[03/18 23:25:36] d2.data.build INFO: Distribution of instances among all 2 categories:
|  category  | #instances   |  category  | #instances   |
|:----------:|:-------------|:----------:|:-------------|
|   hawker   | 1624         |  unknown   | 0            |
|            |              |            |              |
|   total    | 1624         |            |              |
[03/18 23:25:36] d2.data.build INFO: Number of datapoints: 1440
[03/18 23:25:36] d2.data.common INFO: Serializing 1440 elements to byte tensors and concatenating them all ...
[03/18 23:25:36] d2.data.common INFO: Serialized dataset takes 0.61 MiB
[03/18 23:25:36] d2.data.dataset_mapper INFO: Augmentations used in training: [ResizeShortestEdge(short_edge_length=(800, 800), max_size=1333, sample_style='choice')]
[03/18 23:25:36] d2.evaluation.pascal_voc_evaluation INFO: Energy distribution is not found at ./output/1_unk/energy_dist_1.pkl
[03/18 23:25:36] d2.evaluation.evaluator INFO: Start inference on 720 images

My CGSH_train.yaml is:

CUDNN_BENCHMARK: True
OUTPUT_DIR: "./output/1_unk"
MODEL:
  PIXEL_MEAN: [0, 0, 0]
  PIXEL_STD: [1, 1, 1]
  META_ARCHITECTURE: "GeneralizedRCNN"
  WEIGHTS: "detectron2/model_zoo/R-50.pkl"
  RPN:
    PRE_NMS_TOPK_TEST: 6000
    POST_NMS_TOPK_TEST: 1000
  ROI_HEADS:
    NUM_CLASSES: 10
    NAME: "Res5ROIHeads"
  MASK_ON: False
  RESNETS:
    DEPTH: 50
INPUT:
  MIN_SIZE_TRAIN: (480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800)
  MIN_SIZE_TEST: 800
DATASETS:
  TRAIN: ("cgsh_train",)
  TEST: ("cgsh_test",)
DATALOADER:
  FILTER_EMPTY_ANNOTATIONS: False
  NUM_WORKERS: 16
SOLVER:
  # gpus 2
  IMS_PER_BATCH: 16
  # 0.02 * bs / 16
  BASE_LR: 0.02
  STEPS: (12000, 16000)
  GAMMA: 0.1
  MAX_ITER: 18000
  LR_SCHEDULER_NAME: WarmupMultiStepLR
  WARMUP_FACTOR: 0.001
  WARMUP_ITERS: 1000
  WARMUP_METHOD: linear
TEST:
  EXPECTED_RESULTS: []
VERSION: 2
OWOD:
  ENABLE_THRESHOLD_AUTOLABEL_UNK: True
  NUM_UNK_PER_IMAGE: 1
  ENABLE_UNCERTAINITY_AUTOLABEL_UNK: False
  ENABLE_CLUSTERING: True
  FEATURE_STORE_SAVE_PATH: 'feature_store'
  SKIP_TRAINING_WHILE_EVAL: False

  PREV_INTRODUCED_CLS: 0
  CUR_INTRODUCED_CLS: 1
  COMPUTE_ENERGY: False
  ENERGY_SAVE_PATH: 'energy'
  TEMPERATURE: 1.5

  CLUSTERING:
    ITEMS_PER_CLASS: 20
    START_ITER: 1000
    UPDATE_MU_ITER: 3000
    MOMENTUM: 0.99
    Z_DIMENSION: 128

And this is the abridged log from running python tools/train_net.py --num-gpus 2 --config-file ./configs/CGSH_val.yaml

[03/19 08:59:19] detectron2 INFO: Rank of current process: 0. World size: 2
[03/19 08:59:20] detectron2 INFO: Environment info:
----------------------  -------------------------------------------------------------------------------------------------------------
sys.platform            linux
Python                  3.6.4 |Anaconda, Inc.| (default, Jan 16 2018, 18:10:19) [GCC 7.2.0]
numpy                   1.19.5
detectron2              0.2.1 @/home/ma-user/anaconda3/lib/python3.6/site-packages/detectron2-0.2.1-py3.6-linux-x86_64.egg/detectron2
Compiler                GCC 5.4
CUDA compiler           CUDA 10.1
detectron2 arch flags   7.0
DETECTRON2_ENV_MODULE   <not set>
PyTorch                 1.8.0 @/home/ma-user/anaconda3/lib/python3.6/site-packages/torch
PyTorch debug build     False
GPU available           True
GPU 0,1                 Tesla V100-PCIE-32GB (arch=7.0)
CUDA_HOME               /usr/local/cuda
Pillow                  8.1.2
torchvision             0.9.0 @/home/ma-user/anaconda3/lib/python3.6/site-packages/torchvision
torchvision arch flags  3.5, 5.0, 6.0, 7.0, 7.5
fvcore                  0.1.3.post20210311
cv2                     3.4.0
----------------------  -------------------------------------------------------------------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - C++ Version: 201402
  - Intel(R) Math Kernel Library Version 2020.0.0 Product Build 20191122 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v1.7.0 (Git Hash 7aed236906b1f7a05c0917e5257a1af05e9ff683)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CPU capability usage: AVX2
  - CUDA Runtime 10.2
  - NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70
  - CuDNN 7.6.5
  - Magma 2.5.2
  - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=10.2, CUDNN_VERSION=7.6.5, CXX_COMPILER=/opt/rh/devtoolset-7/root/usr/bin/c++, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_VERSION=1.8.0, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, 

[03/19 08:59:20] detectron2 INFO: Command line arguments: Namespace(config_file='./configs/CGSH_val.yaml', dist_url='tcp://127.0.0.1:50152', eval_only=False, machine_rank=0, num_gpus=2, num_machines=1, opts=[], resume=False)
[03/19 08:59:20] detectron2 INFO: Running with full config:
CUDNN_BENCHMARK: True
DATALOADER:
  ASPECT_RATIO_GROUPING: True
  FILTER_EMPTY_ANNOTATIONS: False
  NUM_WORKERS: 16
  REPEAT_THRESHOLD: 0.0
  SAMPLER_TRAIN: TrainingSampler
DATASETS:
  PRECOMPUTED_PROPOSAL_TOPK_TEST: 1000
  PRECOMPUTED_PROPOSAL_TOPK_TRAIN: 2000
  PROPOSAL_FILES_TEST: ()
  PROPOSAL_FILES_TRAIN: ()
  TEST: ('cgsh_val',)
  TRAIN: ('cgsh_val',)
GLOBAL:
  HACK: 1.0
INPUT:
  CROP:
    ENABLED: False
    SIZE: [0.9, 0.9]
    TYPE: relative_range
  FORMAT: BGR
  MASK_FORMAT: polygon
  MAX_SIZE_TEST: 1333
  MAX_SIZE_TRAIN: 1333
  MIN_SIZE_TEST: 800
  MIN_SIZE_TRAIN: (480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800)
  MIN_SIZE_TRAIN_SAMPLING: choice
  RANDOM_FLIP: horizontal
MODEL:
  ANCHOR_GENERATOR:
    ANGLES: [[-90, 0, 90]]
    ASPECT_RATIOS: [[0.5, 1.0, 2.0]]
    NAME: DefaultAnchorGenerator
    OFFSET: 0.0
    SIZES: [[32, 64, 128, 256, 512]]
  BACKBONE:
    FREEZE_AT: 2
    NAME: build_resnet_backbone
  DEVICE: cuda
  FPN:
    FUSE_TYPE: sum
    IN_FEATURES: []
    NORM: 
    OUT_CHANNELS: 256
  KEYPOINT_ON: False
  LOAD_PROPOSALS: False
  MASK_ON: False
  META_ARCHITECTURE: GeneralizedRCNN
  PANOPTIC_FPN:
    COMBINE:
      ENABLED: True
      INSTANCES_CONFIDENCE_THRESH: 0.5
      OVERLAP_THRESH: 0.5
      STUFF_AREA_LIMIT: 4096
    INSTANCE_LOSS_WEIGHT: 1.0
  PIXEL_MEAN: [0, 0, 0]
  PIXEL_STD: [1, 1, 1]
  PROPOSAL_GENERATOR:
    MIN_SIZE: 0
    NAME: RPN
  RESNETS:
    DEFORM_MODULATED: False
    DEFORM_NUM_GROUPS: 1
    DEFORM_ON_PER_STAGE: [False, False, False, False]
    DEPTH: 50
    NORM: FrozenBN
    NUM_GROUPS: 1
    OUT_FEATURES: ['res4']
    RES2_OUT_CHANNELS: 256
    RES5_DILATION: 1
    STEM_OUT_CHANNELS: 64
    STRIDE_IN_1X1: True
    WIDTH_PER_GROUP: 64
  RETINANET:
    BBOX_REG_LOSS_TYPE: smooth_l1
    BBOX_REG_WEIGHTS: (1.0, 1.0, 1.0, 1.0)
    FOCAL_LOSS_ALPHA: 0.25
    FOCAL_LOSS_GAMMA: 2.0
    IN_FEATURES: ['p3', 'p4', 'p5', 'p6', 'p7']
    IOU_LABELS: [0, -1, 1]
    IOU_THRESHOLDS: [0.4, 0.5]
    NMS_THRESH_TEST: 0.5
    NORM: 
    NUM_CLASSES: 80
    NUM_CONVS: 4
    PRIOR_PROB: 0.01
    SCORE_THRESH_TEST: 0.05
    SMOOTH_L1_LOSS_BETA: 0.1
    TOPK_CANDIDATES_TEST: 1000
  ROI_BOX_CASCADE_HEAD:
    BBOX_REG_WEIGHTS: ((10.0, 10.0, 5.0, 5.0), (20.0, 20.0, 10.0, 10.0), (30.0, 30.0, 15.0, 15.0))
    IOUS: (0.5, 0.6, 0.7)
  ROI_BOX_HEAD:
    BBOX_REG_LOSS_TYPE: smooth_l1
    BBOX_REG_LOSS_WEIGHT: 1.0
    BBOX_REG_WEIGHTS: (10.0, 10.0, 5.0, 5.0)
    CLS_AGNOSTIC_BBOX_REG: False
    CONV_DIM: 256
    FC_DIM: 1024
    NAME: 
    NORM: 
    NUM_CONV: 0
    NUM_FC: 0
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
    SMOOTH_L1_BETA: 0.0
    TRAIN_ON_PRED_BOXES: False
  ROI_HEADS:
    BATCH_SIZE_PER_IMAGE: 512
    IN_FEATURES: ['res4']
    IOU_LABELS: [0, 1]
    IOU_THRESHOLDS: [0.5]
    NAME: Res5ROIHeads
    NMS_THRESH_TEST: 0.5
    NUM_CLASSES: 10
    POSITIVE_FRACTION: 0.25
    PROPOSAL_APPEND_GT: True
    SCORE_THRESH_TEST: 0.05
  ROI_KEYPOINT_HEAD:
    CONV_DIMS: (512, 512, 512, 512, 512, 512, 512, 512)
    LOSS_WEIGHT: 1.0
    MIN_KEYPOINTS_PER_IMAGE: 1
    NAME: KRCNNConvDeconvUpsampleHead
    NORMALIZE_LOSS_BY_VISIBLE_KEYPOINTS: True
    NUM_KEYPOINTS: 17
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
  ROI_MASK_HEAD:
    CLS_AGNOSTIC_MASK: False
    CONV_DIM: 256
    NAME: MaskRCNNConvUpsampleHead
    NORM: 
    NUM_CONV: 0
    POOLER_RESOLUTION: 14
    POOLER_SAMPLING_RATIO: 0
    POOLER_TYPE: ROIAlignV2
  RPN:
    BATCH_SIZE_PER_IMAGE: 256
    BBOX_REG_LOSS_TYPE: smooth_l1
    BBOX_REG_LOSS_WEIGHT: 1.0
    BBOX_REG_WEIGHTS: (1.0, 1.0, 1.0, 1.0)
    BOUNDARY_THRESH: -1
    HEAD_NAME: StandardRPNHead
    IN_FEATURES: ['res4']
    IOU_LABELS: [0, -1, 1]
    IOU_THRESHOLDS: [0.3, 0.7]
    LOSS_WEIGHT: 1.0
    NMS_THRESH: 0.7
    POSITIVE_FRACTION: 0.5
    POST_NMS_TOPK_TEST: 1000
    POST_NMS_TOPK_TRAIN: 2000
    PRE_NMS_TOPK_TEST: 6000
    PRE_NMS_TOPK_TRAIN: 12000
    SMOOTH_L1_BETA: 0.0
  SEM_SEG_HEAD:
    COMMON_STRIDE: 4
    CONVS_DIM: 128
    IGNORE_VALUE: 255
    IN_FEATURES: ['p2', 'p3', 'p4', 'p5']
    LOSS_WEIGHT: 1.0
    NAME: SemSegFPNHead
    NORM: GN
    NUM_CLASSES: 54
  WEIGHTS: ./output/1_unk/model_final.pth
OUTPUT_DIR: ./output/1_unk_val
OWOD:
  CLUSTERING:
    ITEMS_PER_CLASS: 20
    MARGIN: 10.0
    MOMENTUM: 0.99
    START_ITER: 1000
    UPDATE_MU_ITER: 3000
    Z_DIMENSION: 128
  COMPUTE_ENERGY: True
  CUR_INTRODUCED_CLS: 1
  ENABLE_CLUSTERING: False
  ENABLE_THRESHOLD_AUTOLABEL_UNK: True
  ENABLE_UNCERTAINITY_AUTOLABEL_UNK: False
  ENERGY_SAVE_PATH: energy
  FEATURE_STORE_SAVE_PATH: feature_store
  NUM_UNK_PER_IMAGE: 1
  PREV_INTRODUCED_CLS: 0
  SKIP_TRAINING_WHILE_EVAL: False
  TEMPERATURE: 1.5
SEED: -1
SOLVER:
  BASE_LR: 0.02
  BIAS_LR_FACTOR: 1.0
  CHECKPOINT_PERIOD: 5000
  CLIP_GRADIENTS:
    CLIP_TYPE: value
    CLIP_VALUE: 1.0
    ENABLED: False
    NORM_TYPE: 2.0
  GAMMA: 0.1
  IMS_PER_BATCH: 16
  LR_SCHEDULER_NAME: WarmupMultiStepLR
  MAX_ITER: 700
  MOMENTUM: 0.9
  NESTEROV: False
  REFERENCE_WORLD_SIZE: 0
  STEPS: (12000, 16000)
  WARMUP_FACTOR: 0.001
  WARMUP_ITERS: 0
  WARMUP_METHOD: linear
  WEIGHT_DECAY: 0.0001
  WEIGHT_DECAY_BIAS: 0.0001
  WEIGHT_DECAY_NORM: 0.0
TEST:
  AUG:
    ENABLED: False
    FLIP: True
    MAX_SIZE: 4000
    MIN_SIZES: (400, 500, 600, 700, 800, 900, 1000, 1100, 1200)
  DETECTIONS_PER_IMAGE: 100
  EVAL_PERIOD: 0
  EXPECTED_RESULTS: []
  KEYPOINT_OKS_SIGMAS: []
  PRECISE_BN:
    ENABLED: False
    NUM_ITER: 200
VERSION: 2
VIS_PERIOD: 0
[03/19 08:59:20] detectron2 INFO: Full config saved to ./output/1_unk_val/config.yaml
[03/19 08:59:20] d2.utils.env INFO: Using a generated random seed 20213114
[03/19 08:59:20] d2.modeling.roi_heads.fast_rcnn INFO: Invalid class range: [1, 2, 3, 4, 5, 6, 7, 8]
[03/19 08:59:20] d2.modeling.roi_heads.fast_rcnn INFO: Feature store not found in ./output/1_unk_val/feature_store/feat.pt. Creating new feature store.
[03/19 08:59:20] d2.engine.defaults INFO: Model:
GeneralizedRCNN(
  (backbone): ResNet(
    (box_predictor): FastRCNNOutputLayers(
      (cls_score): Linear(in_features=2048, out_features=11, bias=True)
      (bbox_pred): Linear(in_features=2048, out_features=40, bias=True)
      (hingeloss): HingeEmbeddingLoss()
    )
  )
)
[03/19 08:59:20] d2.data.build INFO: Known classes: range(0, 1)
[03/19 08:59:20] d2.data.build INFO: Labelling known instances the corresponding label, and unknown instances as unknown...
[03/19 08:59:20] d2.data.build INFO: Distribution of instances among all 2 categories:
|  category  | #instances   |  category  | #instances   |
|:----------:|:-------------|:----------:|:-------------|
|   hawker   | 1324         |  unknown   | 0            |
|            |              |            |              |
|   total    | 1324         |            |              |
[03/19 08:59:20] d2.data.build INFO: Number of datapoints: 422
[03/19 08:59:20] d2.data.common INFO: Serializing 422 elements to byte tensors and concatenating them all ...
[03/19 08:59:20] d2.data.common INFO: Serialized dataset takes 0.20 MiB
[03/19 08:59:20] d2.data.dataset_mapper INFO: Augmentations used in training: [ResizeShortestEdge(short_edge_length=(480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800), max_size=1333, sample_style='choice'), RandomFlip()]
[03/19 08:59:20] d2.data.build INFO: Using training sampler TrainingSampler
[03/19 08:59:20] fvcore.common.checkpoint INFO: Loading checkpoint from ./output/1_unk/model_final.pth
[03/19 08:59:21] d2.engine.train_loop INFO: Starting training from iteration 0
[03/19 09:00:57] d2.utils.events INFO:  eta: 0:24:44  iter: 19  total_loss: 0.7832  loss_cls: 0.2363  loss_box_reg: 0.3492  loss_clustering: 0  loss_rpn_cls: 0.1057  loss_rpn_loc: 0.05144  time: 3.1343  data_time: 1.2971  lr: 0.02  max_mem: 19360M
[03/19 09:25:13] d2.utils.events INFO:  eta: 0:00:43  iter: 679  total_loss: 0.1585  loss_cls: 0.0311  loss_box_reg: 0.0772  loss_clustering: 0  loss_rpn_cls: 0.01072  loss_rpn_loc: 0.03215  time: 2.2289  data_time: 0.0163  lr: 0.02  max_mem: 19360M
[03/19 09:25:56] fvcore.common.checkpoint INFO: Saving checkpoint to ./output/1_unk_val/model_final.pth
[03/19 09:25:57] d2.utils.events INFO:  eta: 0:00:00  iter: 699  total_loss: 0.165  loss_cls: 0.03248  loss_box_reg: 0.08219  loss_clustering: 0  loss_rpn_cls: 0.01096  loss_rpn_loc: 0.03482  time: 2.2276  data_time: 0.0149  lr: 0.02  max_mem: 19360M
[03/19 09:25:57] d2.engine.train_loop INFO: Going to analyse the energy files...
[03/19 09:25:57] d2.engine.train_loop INFO: Temperature value: 1.5
[03/19 09:25:57] d2.engine.train_loop INFO: Analysing 0 / 1400
[03/19 09:26:48] d2.engine.train_loop INFO: Analysing 100 / 1400
[03/19 09:27:35] d2.engine.train_loop INFO: Analysing 200 / 1400
[03/19 09:28:17] d2.engine.train_loop INFO: Analysing 300 / 1400
[03/19 09:29:10] d2.engine.train_loop INFO: Analysing 400 / 1400
[03/19 09:29:57] d2.engine.train_loop INFO: Analysing 500 / 1400
[03/19 09:30:49] d2.engine.train_loop INFO: Analysing 600 / 1400
[03/19 09:31:36] d2.engine.train_loop INFO: Analysing 700 / 1400
[03/19 09:32:26] d2.engine.train_loop INFO: Analysing 800 / 1400
[03/19 09:33:15] d2.engine.train_loop INFO: Analysing 900 / 1400
[03/19 09:34:04] d2.engine.train_loop INFO: Analysing 1000 / 1400
[03/19 09:34:58] d2.engine.train_loop INFO: Analysing 1100 / 1400
[03/19 09:35:46] d2.engine.train_loop INFO: Analysing 1200 / 1400
[03/19 09:36:30] d2.engine.train_loop INFO: Analysing 1300 / 1400

My CGSH_val.yaml is:

CUDNN_BENCHMARK: True
OUTPUT_DIR: "./output/1_unk_val"
MODEL:
  PIXEL_MEAN: [0, 0, 0]
  PIXEL_STD: [1, 1, 1]
  META_ARCHITECTURE: "GeneralizedRCNN"
  WEIGHTS: "./output/1_unk/model_final.pth"
  RPN:
    PRE_NMS_TOPK_TEST: 6000
    POST_NMS_TOPK_TEST: 1000
  ROI_HEADS:
    #    NUM_CLASSES: 2 # 0~30 Known class; 31 -> Unknown; 32 -> Background.
    NUM_CLASSES: 10
    NAME: "Res5ROIHeads"
  MASK_ON: False
  RESNETS:
    DEPTH: 50
INPUT:
  MIN_SIZE_TRAIN: (480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800)
  MIN_SIZE_TEST: 800
DATASETS:
  TRAIN: ("cgsh_val",)
  TEST: ("cgsh_val",)
DATALOADER:
  FILTER_EMPTY_ANNOTATIONS: False
  NUM_WORKERS: 16
SOLVER:
  # gpus 2
  IMS_PER_BATCH: 16
  # 0.02 * bs / 16
  BASE_LR: 0.02
  STEPS: (12000, 16000)
  GAMMA: 0.1
  MAX_ITER: 700
  LR_SCHEDULER_NAME: WarmupMultiStepLR
  WARMUP_FACTOR: 0.001
  WARMUP_ITERS: 0
  WARMUP_METHOD: linear
TEST:
  EXPECTED_RESULTS: []
VERSION: 2
OWOD:
  ENABLE_THRESHOLD_AUTOLABEL_UNK: True
  NUM_UNK_PER_IMAGE: 1
  ENABLE_UNCERTAINITY_AUTOLABEL_UNK: False
  ENABLE_CLUSTERING: False
  FEATURE_STORE_SAVE_PATH: 'feature_store'
  SKIP_TRAINING_WHILE_EVAL: False

  PREV_INTRODUCED_CLS: 0
  CUR_INTRODUCED_CLS: 1
  COMPUTE_ENERGY: True
  ENERGY_SAVE_PATH: 'energy'
  TEMPERATURE: 1.5

  CLUSTERING:
    ITEMS_PER_CLASS: 20
    START_ITER: 1000
    UPDATE_MU_ITER: 3000
    MOMENTUM: 0.99
    Z_DIMENSION: 128
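
For anyone debugging the same failure, a pre-check along these lines might help show which sample list a positive-only fitter would reject before the fit is attempted. This is a hypothetical sketch: `summarize_samples` and `require_positive` are illustrative names, not functions from OWOD or from the fitting library, and the numbers are invented. The empty-list check is included because the dataset summary in the log above reports 0 unknown instances. Whether that is related to this error is not established here.

```python
def summarize_samples(name, values):
    """Collect the statistics that matter for a fitter needing values > 0."""
    non_positive = [v for v in values if v <= 0]
    return {
        "name": name,
        "count": len(values),
        "min": min(values) if values else None,
        "non_positive": len(non_positive),
    }

def require_positive(name, values):
    """Raise a descriptive error instead of passing bad samples onward."""
    report = summarize_samples(name, values)
    if report["count"] == 0:
        raise ValueError(f"{name}: no samples were collected")
    if report["non_positive"] > 0:
        raise ValueError(
            f"{name}: {report['non_positive']} of {report['count']} values "
            f"are <= 0 (min={report['min']})"
        )
    return values

# Invented numbers, for illustration only.
print(summarize_samples("known", [2.5, -0.3, 1.0]))
print(summarize_samples("unk", []))
```

Calling `require_positive` on each list before fitting would make the failing input explicit in the traceback rather than leaving it to the fitter's generic message.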

JosephKJ commented 3 years ago

@kendyChina: Were you able to resolve the issue? I think it is related to #16.

zxiaoran commented 2 years ago

Hi, did you work it out?

liangbingzhao commented 2 years ago

Hi, can you share how you solved it? I ran into exactly the same issue.

luckychay commented 2 years ago

@kendyChina Hello, did you resolve it?