facebookresearch / detectron2

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
https://detectron2.readthedocs.io/en/latest/
Apache License 2.0

Problem with register_coco_instances while registering a COCO dataset #253

Closed: sayakpaul closed this issue 5 years ago

sayakpaul commented 5 years ago

Hi, I am following this getting started Colab notebook. I am trying to train a custom model using the TACO dataset, which comes as a COCO-formatted dataset.

I prepared this Colab notebook for the experiments with the dataset. After registering the dataset using register_coco_instances, I am not able to start the training process, and the error I get is:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/content/detectron2_repo/detectron2/data/catalog.py in get(name)
     51         try:
---> 52             f = DatasetCatalog._REGISTERED[name]
     53         except KeyError:

KeyError: 'd'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
6 frames
/content/detectron2_repo/detectron2/data/catalog.py in get(name)
     54             raise KeyError(
     55                 "Dataset '{}' is not registered! Available datasets are: {}".format(
---> 56                     name, ", ".join(DatasetCatalog._REGISTERED.keys())
     57                 )
     58             )

KeyError: "Dataset 'd' is not registered! Available datasets are: coco_2014_train, coco_2014_val, coco_2014_minival, coco_2014_minival_100, coco_2014_valminusminival, coco_2017_train, coco_2017_val, coco_2017_val_100, keypoints_coco_2014_train, keypoints_coco_2014_val, keypoints_coco_2014_minival, keypoints_coco_2014_valminusminival, keypoints_coco_2014_minival_100, keypoints_coco_2017_train, keypoints_coco_2017_val, keypoints_coco_2017_val_100, coco_2017_train_panoptic_separated, coco_2017_train_panoptic_stuffonly, coco_2017_val_panoptic_separated, coco_2017_val_panoptic_stuffonly, coco_2017_val_100_panoptic_separated, coco_2017_val_100_panoptic_stuffonly, lvis_v0.5_train, lvis_v0.5_val, lvis_v0.5_val_rand_100, lvis_v0.5_test, cityscapes_fine_instance_seg_train, cityscapes_fine_sem_seg_train, cityscapes_fine_instance_seg_val, cityscapes_fine_sem_seg_val, cityscapes_fine_instance_seg_test, cityscapes_fine_sem_seg_test, voc_2007_trainval, voc_2007_train, voc_2007_val, voc_2007_test, voc_2012_trainval, voc_2012_train, voc_2012_val, my_dataset, taco_dataset"

The above-mentioned notebook can be used to reproduce the issue.

ppwwyyxx commented 5 years ago

cfg.DATASETS.TRAIN should be a tuple of strings; however, yours is a string.
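In other words, a minimal sketch (assuming a dataset registered as "taco_dataset", as in the notebook; the paths are placeholders):

from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances

# Hypothetical registration; the annotation and image paths are placeholders.
register_coco_instances("taco_dataset", {}, "data/annotations.json", "data/images")

cfg = get_cfg()
# Wrong: a plain string is iterated character by character, so the first
# lookup is the string's first character (the KeyError: 'd' above).
# cfg.DATASETS.TRAIN = "some_string"
# Right: a one-element tuple of registered dataset names.
cfg.DATASETS.TRAIN = ("taco_dataset",)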

AvivSham commented 5 years ago

I'm facing the same error, and according to your tutorial it should be a string (path). Here are the lines from your code:

cfg = get_cfg()
cfg.merge_from_file("./detectron2_repo/configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("balloon/train",)
cfg.DATASETS.TEST = ()   # no metrics implemented for this dataset
cfg.DATALOADER.NUM_WORKERS = 2
cfg.MODEL.WEIGHTS = "detectron2://COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x/137849600/model_final_f10217.pkl"  # initialize from model zoo
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 300    # 300 iterations seems good enough, but you can certainly train longer
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128   # faster, and good enough for this toy dataset
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # only has one class (balloon)

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg) 
trainer.resume_or_load(resume=False)
trainer.train()

ppwwyyxx commented 5 years ago

No. If you run it with Python, ("balloon/train",) is not a string. It's a tuple.

AvivSham commented 5 years ago

Oh sorry, I missed the , after the string!

sayakpaul commented 5 years ago

@ppwwyyxx I updated it to a tuple, but it still does not help:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/content/detectron2_repo/detectron2/data/catalog.py in get(name)
     51         try:
---> 52             f = DatasetCatalog._REGISTERED[name]
     53         except KeyError:

KeyError: 'data/'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
6 frames
/content/detectron2_repo/detectron2/data/catalog.py in get(name)
     54             raise KeyError(
     55                 "Dataset '{}' is not registered! Available datasets are: {}".format(
---> 56                     name, ", ".join(DatasetCatalog._REGISTERED.keys())
     57                 )
     58             )

KeyError: "Dataset 'data/' is not registered! Available datasets are: coco_2014_train, coco_2014_val, coco_2014_minival, coco_2014_minival_100, coco_2014_valminusminival, coco_2017_train, coco_2017_val, coco_2017_test, coco_2017_test-dev, coco_2017_val_100, keypoints_coco_2014_train, keypoints_coco_2014_val, keypoints_coco_2014_minival, keypoints_coco_2014_valminusminival, keypoints_coco_2014_minival_100, keypoints_coco_2017_train, keypoints_coco_2017_val, keypoints_coco_2017_val_100, coco_2017_train_panoptic_separated, coco_2017_train_panoptic_stuffonly, coco_2017_val_panoptic_separated, coco_2017_val_panoptic_stuffonly, coco_2017_val_100_panoptic_separated, coco_2017_val_100_panoptic_stuffonly, lvis_v0.5_train, lvis_v0.5_val, lvis_v0.5_val_rand_100, lvis_v0.5_test, cityscapes_fine_instance_seg_train, cityscapes_fine_sem_seg_train, cityscapes_fine_instance_seg_val, cityscapes_fine_sem_seg_val, cityscapes_fine_instance_seg_test, cityscapes_fine_sem_seg_test, voc_2007_trainval, voc_2007_train, voc_2007_val, voc_2007_test, voc_2012_trainval, voc_2012_train, voc_2012_val, taco_dataset"

The Colab notebook's been updated.

ppwwyyxx commented 5 years ago

cfg.DATASETS.TRAIN should contain the names of your datasets as you registered them, not the directory.
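Concretely, a sketch ("taco_dataset" matches the name in the registered-dataset list in the error above; the paths are placeholders):

from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances

# The first argument is the name the config must refer to;
# the two paths only say where the data is stored.
register_coco_instances("taco_dataset", {}, "data/annotations.json", "data/images")

cfg = get_cfg()
cfg.DATASETS.TRAIN = ("taco_dataset",)  # the registered name
# not cfg.DATASETS.TRAIN = ("data/",)   # a directory raises KeyError: 'data/'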

sayakpaul commented 5 years ago

Thanks for your help throughout @ppwwyyxx. I was able to get the model to train:

[11/10 09:26:37 d2.engine.train_loop]: Starting training from iteration 0
[11/10 09:27:05 d2.utils.events]: eta: 0:06:23  iter: 19  total_loss: 5.656  loss_cls: 4.131  loss_box_reg: 0.188  loss_mask: 0.693  loss_rpn_cls: 0.531  loss_rpn_loc: 0.048  time: 1.3740  data_time: 0.0053  lr: 0.000005  max_mem: 2446M
[11/10 09:27:32 d2.utils.events]: eta: 0:05:45  iter: 39  total_loss: 5.314  loss_cls: 3.944  loss_box_reg: 0.308  loss_mask: 0.692  loss_rpn_cls: 0.208  loss_rpn_loc: 0.032  time: 1.3528  data_time: 0.0047  lr: 0.000010  max_mem: 2446M
[11/10 09:27:59 d2.utils.events]: eta: 0:05:21  iter: 59  total_loss: 5.189  loss_cls: 3.548  loss_box_reg: 0.281  loss_mask: 0.690  loss_rpn_cls: 0.457  loss_rpn_loc: 0.047  time: 1.3535  data_time: 0.0049  lr: 0.000015  max_mem: 2446M
[11/10 09:28:25 d2.utils.events]: eta: 0:04:56  iter: 79  total_loss: 4.186  loss_cls: 2.773  loss_box_reg: 0.151  loss_mask: 0.687  loss_rpn_cls: 0.318  loss_rpn_loc: 0.028  time: 1.3474  data_time: 0.0047  lr: 0.000020  max_mem: 2446M
[11/10 09:28:52 d2.utils.events]: eta: 0:04:30  iter: 99  total_loss: 3.981  loss_cls: 2.038  loss_box_reg: 0.327  loss_mask: 0.686  loss_rpn_cls: 0.427  loss_rpn_loc: 0.053  time: 1.3479  data_time: 0.0043  lr: 0.000025  max_mem: 2471M
[11/10 09:29:21 d2.utils.events]: eta: 0:04:04  iter: 119  total_loss: 2.759  loss_cls: 1.108  loss_box_reg: 0.179  loss_mask: 0.684  loss_rpn_cls: 0.427  loss_rpn_loc: 0.050  time: 1.3643  data_time: 0.0051  lr: 0.000030  max_mem: 2552M
[11/10 09:29:48 d2.utils.events]: eta: 0:03:37  iter: 139  total_loss: 2.177  loss_cls: 0.762  loss_box_reg: 0.128  loss_mask: 0.678  loss_rpn_cls: 0.422  loss_rpn_loc: 0.057  time: 1.3598  data_time: 0.0047  lr: 0.000035  max_mem: 2552M
[11/10 09:30:14 d2.utils.events]: eta: 0:03:09  iter: 159  total_loss: 2.534  loss_cls: 0.803  loss_box_reg: 0.077  loss_mask: 0.670  loss_rpn_cls: 0.375  loss_rpn_loc: 0.076  time: 1.3544  data_time: 0.0045  lr: 0.000040  max_mem: 2552M
[11/10 09:30:42 d2.utils.events]: eta: 0:02:42  iter: 179  total_loss: 1.567  loss_cls: 0.507  loss_box_reg: 0.053  loss_mask: 0.656  loss_rpn_cls: 0.212  loss_rpn_loc: 0.043  time: 1.3572  data_time: 0.0047  lr: 0.000045  max_mem: 2552M
[11/10 09:31:09 d2.utils.events]: eta: 0:02:15  iter: 199  total_loss: 1.537  loss_cls: 0.516  loss_box_reg: 0.068  loss_mask: 0.658  loss_rpn_cls: 0.229  loss_rpn_loc: 0.043  time: 1.3580  data_time: 0.0045  lr: 0.000050  max_mem: 2552M
[11/10 09:31:37 d2.utils.events]: eta: 0:01:49  iter: 219  total_loss: 1.717  loss_cls: 0.639  loss_box_reg: 0.008  loss_mask: 0.653  loss_rpn_cls: 0.169  loss_rpn_loc: 0.022  time: 1.3586  data_time: 0.0048  lr: 0.000055  max_mem: 2552M
[11/10 09:32:04 d2.utils.events]: eta: 0:01:22  iter: 239  total_loss: 1.438  loss_cls: 0.479  loss_box_reg: 0.024  loss_mask: 0.632  loss_rpn_cls: 0.168  loss_rpn_loc: 0.043  time: 1.3592  data_time: 0.0044  lr: 0.000060  max_mem: 2552M
[11/10 09:32:31 d2.utils.events]: eta: 0:00:55  iter: 259  total_loss: 2.169  loss_cls: 0.794  loss_box_reg: 0.052  loss_mask: 0.626  loss_rpn_cls: 0.350  loss_rpn_loc: 0.093  time: 1.3583  data_time: 0.0043  lr: 0.000065  max_mem: 2552M
[11/10 09:32:59 d2.utils.events]: eta: 0:00:28  iter: 279  total_loss: 1.572  loss_cls: 0.559  loss_box_reg: 0.047  loss_mask: 0.605  loss_rpn_cls: 0.213  loss_rpn_loc: 0.037  time: 1.3609  data_time: 0.0043  lr: 0.000070  max_mem: 2552M
[11/10 09:33:26 d2.utils.events]: eta: 0:00:01  iter: 299  total_loss: 1.832  loss_cls: 0.683  loss_box_reg: 0.170  loss_mask: 0.570  loss_rpn_cls: 0.196  loss_rpn_loc: 0.041  time: 1.3593  data_time: 0.0043  lr: 0.000075  max_mem: 2552M
[11/10 09:33:27 d2.engine.hooks]: Overall training speed: 297 iterations in 0:06:45 (1.3639 s / it)
[11/10 09:33:27 d2.engine.hooks]: Total training time: 0:06:46 (0:00:01 on hooks)
OrderedDict()

But I am still confused about why the model does not infer anything. I have updated the Colab notebook with minimal code to reproduce the issue. I have also updated the notebook with TensorBoard.

ppwwyyxx commented 5 years ago

You most likely need to train longer. As the issue template says, we do not answer questions about how to train a better model.

sayakpaul commented 5 years ago

Cool. Thanks.

zsc1220 commented 4 years ago

Hi @ppwwyyxx, when I train with 1 GPU it runs well, e.g.:

python tools/train_net.py --config-file configs/COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml SOLVER.IMS_PER_BATCH 2 SOLVER.BASE_LR 0.0025

But if I train with 4 GPUs, e.g.:

python tools/train_net.py --num-gpus 4 --config-file configs/COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml

it fails with the following error, which is really strange!

Traceback (most recent call last):
  File "tools/train_net.py", line 166, in <module>
    args=(args,),
  File "/home/zsc/pythoncode/detectron2/detectron2/engine/launch.py", line 49, in launch
    daemon=False,
  File "/home/zsc/anaconda3/envs/car/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/zsc/anaconda3/envs/car/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/zsc/pythoncode/detectron2/detectron2/data/catalog.py", line 52, in get
    f = DatasetCatalog._REGISTERED[name]
KeyError: 'car'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zsc/anaconda3/envs/car/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/zsc/pythoncode/detectron2/detectron2/engine/launch.py", line 84, in _distributed_worker
    main_func(*args)
  File "/home/zsc/pythoncode/detectron2/tools/train_net.py", line 144, in main
    trainer = Trainer(cfg)
  File "/home/zsc/pythoncode/detectron2/detectron2/engine/defaults.py", line 223, in __init__
    data_loader = self.build_train_loader(cfg)
  File "/home/zsc/pythoncode/detectron2/detectron2/engine/defaults.py", line 397, in build_train_loader
    return build_detection_train_loader(cfg)
  File "/home/zsc/pythoncode/detectron2/detectron2/data/build.py", line 327, in build_detection_train_loader
    proposal_files=cfg.DATASETS.PROPOSAL_FILES_TRAIN if cfg.MODEL.LOAD_PROPOSALS else None,
  File "/home/zsc/pythoncode/detectron2/detectron2/data/build.py", line 256, in get_detection_dataset_dicts
    dataset_dicts = [DatasetCatalog.get(dataset_name) for dataset_name in dataset_names]
  File "/home/zsc/pythoncode/detectron2/detectron2/data/build.py", line 256, in <listcomp>
    dataset_dicts = [DatasetCatalog.get(dataset_name) for dataset_name in dataset_names]
  File "/home/zsc/pythoncode/detectron2/detectron2/data/catalog.py", line 56, in get
    name, ", ".join(DatasetCatalog._REGISTERED.keys())
KeyError: "Dataset 'car' is not registered! Available datasets are: coco_2014_train, coco_2014_val, coco_2014_minival, coco_2014_minival_100, coco_2014_valminusminival, coco_2017_train, coco_2017_val, coco_2017_val_100, keypoints_coco_2014_train, keypoints_coco_2014_val, keypoints_coco_2014_minival, keypoints_coco_2014_valminusminival, keypoints_coco_2014_minival_100, keypoints_coco_2017_train, keypoints_coco_2017_val, keypoints_coco_2017_val_100, coco_2017_train_panoptic_separated, coco_2017_train_panoptic_stuffonly, coco_2017_val_panoptic_separated, coco_2017_val_panoptic_stuffonly, coco_2017_val_100_panoptic_separated, coco_2017_val_100_panoptic_stuffonly, lvis_v0.5_train, lvis_v0.5_val, lvis_v0.5_val_rand_100, lvis_v0.5_test, cityscapes_fine_instance_seg_train, cityscapes_fine_sem_seg_train, cityscapes_fine_instance_seg_val, cityscapes_fine_sem_seg_val, cityscapes_fine_instance_seg_test, cityscapes_fine_sem_seg_test, voc_2007_trainval, voc_2007_train, voc_2007_val, voc_2007_test, voc_2012_trainval, voc_2012_train, voc_2012_val"

ppwwyyxx commented 4 years ago

@zsc1220 It seems the dataset is not registered when you use multiple GPUs. Where do you register the dataset? With train_net.py you might need to register it inside the main() function, as sketched below.
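For reference, a sketch of that fix, assuming the stock tools/train_net.py layout (the dataset name "car" and the paths are placeholders):

from detectron2.data.datasets import register_coco_instances

def main(args):
    # launch() runs main() once per GPU worker process, so registering here
    # makes the dataset visible in every process. Registration done only in
    # the parent (e.g. under `if __name__ == "__main__"`) is not repeated in
    # the spawned workers, which is why it fails with --num-gpus 4.
    register_coco_instances("car", {}, "datasets/car/annotations.json", "datasets/car/images")
    cfg = setup(args)            # setup() and Trainer come from train_net.py
    trainer = Trainer(cfg)
    trainer.resume_or_load(resume=args.resume)
    return trainer.train()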

zsc1220 commented 4 years ago

Wow, well done, it works, thanks a lot!!! @ppwwyyxx

GeNiaaz commented 4 years ago

cfg.DATASETS.TRAIN should contain the names of your datasets as you registered them, not the directory.

Sorry, what did you mean by this?

tand826 commented 4 years ago

cfg.DATASETS.TRAIN should contain the names of your datasets as you registered them, not the directory.

Sorry, what did you mean by this?

If you name your dataset "cool_dataset" but the files themselves live in "${HOME}/cool_dataset_is_here", then cfg.DATASETS.TRAIN should be ("cool_dataset",), the registered name, not the path; see the sketch below.
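A minimal sketch of that, with a registration check (the name and location are hypothetical):

import os
from detectron2.config import get_cfg
from detectron2.data import DatasetCatalog
from detectron2.data.datasets import register_coco_instances

# Where the files actually live (hypothetical location).
root = os.path.expandvars("${HOME}/cool_dataset_is_here")

# Register under the name "cool_dataset"; the paths only tell
# detectron2 where to find the annotations and images.
register_coco_instances("cool_dataset", {},
                        os.path.join(root, "annotations.json"),
                        os.path.join(root, "images"))

print(DatasetCatalog.list())  # "cool_dataset" should appear in this list

cfg = get_cfg()
cfg.DATASETS.TRAIN = ("cool_dataset",)  # the registered name, not the path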

ola0x commented 3 years ago

Thanks a lot, it helped

AidenFather commented 3 years ago

Wow , well done , it works , Thanks a lot !!! @ppwwyyxx

How did you do it? Can you share the script? Thanks!

madhurtripathi commented 2 years ago

Hi, can someone help me with this issue? Much appreciated. I'm deploying this on Azure via Synapse Notebooks. To register COCO instances, I'm using:

from detectron2.data.datasets import register_coco_instances

register_coco_instances("my_dataset_train", {}, "Azure Data Lake path of my json/filename.json", "Azure Data Lake path of my images/dir")
register_coco_instances("my_dataset_val", {}, "Azure Data Lake path of my json/filename.json", "Azure Data Lake path of my images/dir")

After this when i run:

from detectron2.engine import DefaultTrainer
import os

cfg = get_cfg()
cfg.MODEL.DEVICE = "cpu"
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("my_dataset_train",)
cfg.DATASETS.TEST = ()
cfg.DATALOADER.NUM_WORKERS = 2
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 2000
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 512
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 225

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()

It throws an error saying 'File not found, No directory at 'Azure Data Lake path of my json/filename.json''. Although I'm able to load and read the JSON separately in my notebook, I'm not sure why it can't fetch it from the URL. Any help is much appreciated.