facebookresearch / detectron2

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
Apache License 2.0

Normal at the beginning but raise the KeyError during training at iter 29979 #1153

Closed holyYodu closed 4 years ago

holyYodu commented 4 years ago

If you do not know the root cause of the problem / bug, and wish someone to help you, please post according to this template:

Instructions To Reproduce the Issue:

  1. what changes you made (git diff) or what code you wrote
    # register the custom dataset with detectron2
    train_json = '/train/coco2017/annotations/instances_train2017_obj.json'
    val_json = '/train/coco2017/annotations/instances_val2017_obj.json'
    train_img = '/train/coco2017/train2017'
    val_img = '/train/coco2017/val2017'

    register_coco_instances("coco_obj_train", {}, train_json, train_img)
    register_coco_instances("coco_obj_val", {}, val_json, val_img)

    def set_config():
        cfg = get_cfg()
        cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
        cfg.DATASETS.TRAIN = ("coco_obj_train",)
        cfg.DATASETS.TEST = ("coco_obj_val")
        # Let training initialize from model zoo
        cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
        cfg.SOLVER.BASE_LR = 0.00025  # pick a good LR
        cfg.SOLVER.MAX_ITER = 30000   # 300 iterations seems good enough for this toy dataset; you may need to train longer for a practical dataset
        # cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128   # faster, and good enough for this toy dataset (default: 512)
        cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # only has one class
        os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
        return cfg

    # start train
    def train(cfg):
        trainer = DefaultTrainer(cfg)
        trainer.resume_or_load(resume=True)
        trainer.train()
        evaluator = COCOEvaluator("coco_obj_val", cfg, False, output_dir="./output/")
        val_loader = build_detection_test_loader(cfg, "coco_obj_val")
        inference_on_dataset(trainer.model, val_loader, evaluator)

2. what exact command you run:
I just ran the code above to train on the modified COCO 2017 dataset. If I set `cfg.DATASETS.TEST = ()` as in the Colab tutorial, the code runs normally. If I set `cfg.DATASETS.TEST = ("coco_obj_val")`, an error is raised when the training phase finishes and evaluation is about to start.
3. what you observed (including __full logs__):

    [04/05 06:15:40 d2.utils.events]: eta: 0:00:20 iter: 29979 total_loss: 0.667 loss_cls: 0.129 loss_box_reg: 0.203 loss_mask: 0.264 loss_rpn_cls: 0.016 loss_rpn_loc: 0.052 time: 0.9709 data_time: 0.0188 lr: 0.000250 max_mem: 11079M
    ERROR [04/05 06:16:01 d2.engine.train_loop]: Exception during training:
    Traceback (most recent call last):
      File "/home/mhy/anaconda3/envs/detectron/lib/python3.8/site-packages/detectron2/data/catalog.py", line 55, in get
        f = DatasetCatalog._REGISTERED[name]
    KeyError: 'c'

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/mhy/anaconda3/envs/detectron/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 133, in train
        self.after_step()
      File "/home/mhy/anaconda3/envs/detectron/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 153, in after_step
        h.after_step()
      File "/home/mhy/anaconda3/envs/detectron/lib/python3.8/site-packages/detectron2/engine/hooks.py", line 347, in after_step
        self._do_eval()
      File "/home/mhy/anaconda3/envs/detectron/lib/python3.8/site-packages/detectron2/engine/hooks.py", line 321, in _do_eval
        results = self._func()
      File "/home/mhy/anaconda3/envs/detectron/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 331, in test_and_save_results
        self._last_eval_results = self.test(self.cfg, self.model)
      File "/home/mhy/anaconda3/envs/detectron/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 479, in test
        data_loader = cls.build_test_loader(cfg, dataset_name)
      File "/home/mhy/anaconda3/envs/detectron/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 441, in build_test_loader
        return build_detection_test_loader(cfg, dataset_name)
      File "/home/mhy/anaconda3/envs/detectron/lib/python3.8/site-packages/detectron2/data/build.py", line 359, in build_detection_test_loader
        dataset_dicts = get_detection_dataset_dicts(
      File "/home/mhy/anaconda3/envs/detectron/lib/python3.8/site-packages/detectron2/data/build.py", line 223, in get_detection_dataset_dicts
        dataset_dicts = [DatasetCatalog.get(dataset_name) for dataset_name in dataset_names]
      File "/home/mhy/anaconda3/envs/detectron/lib/python3.8/site-packages/detectron2/data/build.py", line 223, in <listcomp>
        dataset_dicts = [DatasetCatalog.get(dataset_name) for dataset_name in dataset_names]
      File "/home/mhy/anaconda3/envs/detectron/lib/python3.8/site-packages/detectron2/data/catalog.py", line 57, in get
        raise KeyError(
    KeyError: "Dataset 'c' is not registered! Available datasets are: coco_2014_train, coco_2014_val, coco_2014_minival, coco_2014_minival_100, coco_2014_valminusminival, coco_2017_train, coco_2017_val, coco_2017_test, coco_2017_test-dev, coco_2017_val_100, keypoints_coco_2014_train, keypoints_coco_2014_val, keypoints_coco_2014_minival, keypoints_coco_2014_valminusminival, keypoints_coco_2014_minival_100, keypoints_coco_2017_train, keypoints_coco_2017_val, keypoints_coco_2017_val_100, coco_2017_train_panoptic_separated, coco_2017_train_panoptic_stuffonly, coco_2017_val_panoptic_separated, coco_2017_val_panoptic_stuffonly, coco_2017_val_100_panoptic_separated, coco_2017_val_100_panoptic_stuffonly, lvis_v0.5_train, lvis_v0.5_val, lvis_v0.5_val_rand_100, lvis_v0.5_test, cityscapes_fine_instance_seg_train, cityscapes_fine_sem_seg_train, cityscapes_fine_instance_seg_val, cityscapes_fine_sem_seg_val, cityscapes_fine_instance_seg_test, cityscapes_fine_sem_seg_test, voc_2007_trainval, voc_2007_train, voc_2007_val, voc_2007_test, voc_2012_trainval, voc_2012_train, voc_2012_val, coco_obj_train, coco_obj_val"

4. please also simplify the steps as much as possible so they do not require additional resources to
     run, such as a private dataset.

## Expected behavior:

If there is no obvious error in "what you observed" provided above,
please tell us the expected behavior.

If you expect the model to converge / work better, note that we do not give suggestions
on how to train a new model.
We will help only under one of these two conditions:
(1) You're unable to reproduce the results in detectron2 model zoo.
(2) It indicates a detectron2 bug.

## Environment:

Provide your environment information using the following command:

    sys.platform              linux
    Python                    3.8.1 packaged by conda-forge (default, Jan 29 2020, 14:55:04) [GCC 7.3.0]
    numpy                     1.18.1
    detectron2                0.1 @/home/mhy/anaconda3/envs/detectron/lib/python3.8/site-packages/detectron2
    detectron2 compiler       GCC 7.3
    detectron2 CUDA compiler  10.1
    detectron2 arch flags     sm_35, sm_37, sm_50, sm_52, sm_60, sm_61, sm_70, sm_75
    DETECTRON2_ENV_MODULE
    PyTorch                   1.4.0 @/home/mhy/anaconda3/envs/detectron/lib/python3.8/site-packages/torch
    PyTorch debug build       False
    CUDA available            True
    GPU 0,1                   TITAN RTX
    CUDA_HOME                 /usr/local/cuda
    NVCC                      Cuda compilation tools, release 10.0, V10.0.326
    Pillow                    7.0.0
    torchvision               0.5.0 @/home/mhy/anaconda3/envs/detectron/lib/python3.8/site-packages/torchvision
    torchvision arch flags    sm_35, sm_50, sm_60, sm_70, sm_75
    cv2                       4.2.0

PyTorch built with:

If your issue looks like an installation issue / environment issue, please first try to solve it yourself with the instructions in https://github.com/facebookresearch/detectron2/blob/master/INSTALL.md#common-installation-issues

ppwwyyxx commented 4 years ago
> `cfg.DATASETS.TEST = ("coco_obj_val")`

The value should be a list or a tuple. See reference in docs: https://detectron2.readthedocs.io/modules/config.html
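
For anyone hitting the same `KeyError: 'c'`, here is a minimal sketch in plain Python (no detectron2 required) of what most likely happens. Parentheses without a trailing comma do not make a tuple, so `("coco_obj_val")` is just a string, and the list comprehension in the traceback then iterates over its characters, starting with `'c'`. The helper `as_dataset_tuple` is hypothetical and only for illustration; it is not part of detectron2.

```python
# Parentheses alone do not create a tuple; the trailing comma does.
wrong = ("coco_obj_val")    # a plain string
right = ("coco_obj_val",)   # a one-element tuple

print(type(wrong).__name__)  # str
print(type(right).__name__)  # tuple

# Iterating over the string yields single characters, which matches the
# dataset name 'c' reported in the traceback.
first_wrong = next(iter(wrong))
first_right = next(iter(right))
print(first_wrong)
print(first_right)


def as_dataset_tuple(names):
    """Hypothetical helper (not a detectron2 API): always return a tuple of names."""
    if isinstance(names, str):
        # A bare string is treated as a single dataset name.
        return (names,)
    return tuple(names)
```

In the config from this issue, writing `cfg.DATASETS.TEST = ("coco_obj_val",)` (note the comma) gives the tuple form the reply above asks for.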