cfg.DATASETS.TRAIN
should be a tuple of strings, but yours is a string.
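For example (using hypothetical dataset names), the value is a tuple of registered name strings, and more than one split can be listed:

from detectron2.config import get_cfg

cfg = get_cfg()
# A tuple of registered dataset names, not a bare string and not a directory path.
cfg.DATASETS.TRAIN = ("my_dataset_train", "my_extra_train")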
I am facing the same error, and according to your tutorial it should be a string (a path). Here are the relevant lines from your code:
import os
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer

cfg = get_cfg()
cfg.merge_from_file("./detectron2_repo/configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("balloon/train",)
cfg.DATASETS.TEST = ()  # no metrics implemented for this dataset
cfg.DATALOADER.NUM_WORKERS = 2
cfg.MODEL.WEIGHTS = "detectron2://COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x/137849600/model_final_f10217.pkl"  # initialize from model zoo
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 300  # 300 iterations seems good enough, but you can certainly train longer
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128  # faster, and good enough for this toy dataset
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1  # only has one class (balloon)

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
No. If you run it in Python, ("balloon/train",)
is not a string. It's a tuple.
Oh sorry, I had missed the ,
after the string!
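For reference, a quick check in a Python shell shows the difference the trailing comma makes:

>>> type(("balloon/train"))    # parentheses without a trailing comma: still just a str
<class 'str'>
>>> type(("balloon/train",))   # trailing comma: a one-element tuple
<class 'tuple'>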
@ppwwyyxx I updated it to a tuple, but it still does not help:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
/content/detectron2_repo/detectron2/data/catalog.py in get(name)
51 try:
---> 52 f = DatasetCatalog._REGISTERED[name]
53 except KeyError:
KeyError: 'data/'
During handling of the above exception, another exception occurred:
KeyError Traceback (most recent call last)
/content/detectron2_repo/detectron2/data/catalog.py in get(name)
54 raise KeyError(
55 "Dataset '{}' is not registered! Available datasets are: {}".format(
---> 56 name, ", ".join(DatasetCatalog._REGISTERED.keys())
57 )
58 )
KeyError: "Dataset 'data/' is not registered! Available datasets are: coco_2014_train, coco_2014_val, coco_2014_minival, coco_2014_minival_100, coco_2014_valminusminival, coco_2017_train, coco_2017_val, coco_2017_test, coco_2017_test-dev, coco_2017_val_100, keypoints_coco_2014_train, keypoints_coco_2014_val, keypoints_coco_2014_minival, keypoints_coco_2014_valminusminival, keypoints_coco_2014_minival_100, keypoints_coco_2017_train, keypoints_coco_2017_val, keypoints_coco_2017_val_100, coco_2017_train_panoptic_separated, coco_2017_train_panoptic_stuffonly, coco_2017_val_panoptic_separated, coco_2017_val_panoptic_stuffonly, coco_2017_val_100_panoptic_separated, coco_2017_val_100_panoptic_stuffonly, lvis_v0.5_train, lvis_v0.5_val, lvis_v0.5_val_rand_100, lvis_v0.5_test, cityscapes_fine_instance_seg_train, cityscapes_fine_sem_seg_train, cityscapes_fine_instance_seg_val, cityscapes_fine_sem_seg_val, cityscapes_fine_instance_seg_test, cityscapes_fine_sem_seg_test, voc_2007_trainval, voc_2007_train, voc_2007_val, voc_2007_test, voc_2012_trainval, voc_2012_train, voc_2012_val, taco_dataset"
The Colab notebook's been updated.
cfg.DATASETS.TRAIN
should contain the names of your datasets as you registered them, not the directory.
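For the balloon tutorial that roughly means registering each split under a name first, and then pointing the config at that name. A minimal sketch (get_balloon_dicts stands for the loader function defined in the tutorial notebook, and cfg is the config object built above):

from detectron2.data import DatasetCatalog, MetadataCatalog

# Register each split under a name; the lambda tells detectron2 how to load the annotation dicts.
for d in ["train", "val"]:
    DatasetCatalog.register("balloon_" + d, lambda d=d: get_balloon_dicts("balloon/" + d))
    MetadataCatalog.get("balloon_" + d).set(thing_classes=["balloon"])

# The config then refers to the registered name, not the directory:
cfg.DATASETS.TRAIN = ("balloon_train",)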
Thanks for your help throughout @ppwwyyxx. I was able to get the model to train:
[11/10 09:26:37 d2.engine.train_loop]: Starting training from iteration 0
[11/10 09:27:05 d2.utils.events]: eta: 0:06:23 iter: 19 total_loss: 5.656 loss_cls: 4.131 loss_box_reg: 0.188 loss_mask: 0.693 loss_rpn_cls: 0.531 loss_rpn_loc: 0.048 time: 1.3740 data_time: 0.0053 lr: 0.000005 max_mem: 2446M
[11/10 09:27:32 d2.utils.events]: eta: 0:05:45 iter: 39 total_loss: 5.314 loss_cls: 3.944 loss_box_reg: 0.308 loss_mask: 0.692 loss_rpn_cls: 0.208 loss_rpn_loc: 0.032 time: 1.3528 data_time: 0.0047 lr: 0.000010 max_mem: 2446M
[11/10 09:27:59 d2.utils.events]: eta: 0:05:21 iter: 59 total_loss: 5.189 loss_cls: 3.548 loss_box_reg: 0.281 loss_mask: 0.690 loss_rpn_cls: 0.457 loss_rpn_loc: 0.047 time: 1.3535 data_time: 0.0049 lr: 0.000015 max_mem: 2446M
[11/10 09:28:25 d2.utils.events]: eta: 0:04:56 iter: 79 total_loss: 4.186 loss_cls: 2.773 loss_box_reg: 0.151 loss_mask: 0.687 loss_rpn_cls: 0.318 loss_rpn_loc: 0.028 time: 1.3474 data_time: 0.0047 lr: 0.000020 max_mem: 2446M
[11/10 09:28:52 d2.utils.events]: eta: 0:04:30 iter: 99 total_loss: 3.981 loss_cls: 2.038 loss_box_reg: 0.327 loss_mask: 0.686 loss_rpn_cls: 0.427 loss_rpn_loc: 0.053 time: 1.3479 data_time: 0.0043 lr: 0.000025 max_mem: 2471M
[11/10 09:29:21 d2.utils.events]: eta: 0:04:04 iter: 119 total_loss: 2.759 loss_cls: 1.108 loss_box_reg: 0.179 loss_mask: 0.684 loss_rpn_cls: 0.427 loss_rpn_loc: 0.050 time: 1.3643 data_time: 0.0051 lr: 0.000030 max_mem: 2552M
[11/10 09:29:48 d2.utils.events]: eta: 0:03:37 iter: 139 total_loss: 2.177 loss_cls: 0.762 loss_box_reg: 0.128 loss_mask: 0.678 loss_rpn_cls: 0.422 loss_rpn_loc: 0.057 time: 1.3598 data_time: 0.0047 lr: 0.000035 max_mem: 2552M
[11/10 09:30:14 d2.utils.events]: eta: 0:03:09 iter: 159 total_loss: 2.534 loss_cls: 0.803 loss_box_reg: 0.077 loss_mask: 0.670 loss_rpn_cls: 0.375 loss_rpn_loc: 0.076 time: 1.3544 data_time: 0.0045 lr: 0.000040 max_mem: 2552M
[11/10 09:30:42 d2.utils.events]: eta: 0:02:42 iter: 179 total_loss: 1.567 loss_cls: 0.507 loss_box_reg: 0.053 loss_mask: 0.656 loss_rpn_cls: 0.212 loss_rpn_loc: 0.043 time: 1.3572 data_time: 0.0047 lr: 0.000045 max_mem: 2552M
[11/10 09:31:09 d2.utils.events]: eta: 0:02:15 iter: 199 total_loss: 1.537 loss_cls: 0.516 loss_box_reg: 0.068 loss_mask: 0.658 loss_rpn_cls: 0.229 loss_rpn_loc: 0.043 time: 1.3580 data_time: 0.0045 lr: 0.000050 max_mem: 2552M
[11/10 09:31:37 d2.utils.events]: eta: 0:01:49 iter: 219 total_loss: 1.717 loss_cls: 0.639 loss_box_reg: 0.008 loss_mask: 0.653 loss_rpn_cls: 0.169 loss_rpn_loc: 0.022 time: 1.3586 data_time: 0.0048 lr: 0.000055 max_mem: 2552M
[11/10 09:32:04 d2.utils.events]: eta: 0:01:22 iter: 239 total_loss: 1.438 loss_cls: 0.479 loss_box_reg: 0.024 loss_mask: 0.632 loss_rpn_cls: 0.168 loss_rpn_loc: 0.043 time: 1.3592 data_time: 0.0044 lr: 0.000060 max_mem: 2552M
[11/10 09:32:31 d2.utils.events]: eta: 0:00:55 iter: 259 total_loss: 2.169 loss_cls: 0.794 loss_box_reg: 0.052 loss_mask: 0.626 loss_rpn_cls: 0.350 loss_rpn_loc: 0.093 time: 1.3583 data_time: 0.0043 lr: 0.000065 max_mem: 2552M
[11/10 09:32:59 d2.utils.events]: eta: 0:00:28 iter: 279 total_loss: 1.572 loss_cls: 0.559 loss_box_reg: 0.047 loss_mask: 0.605 loss_rpn_cls: 0.213 loss_rpn_loc: 0.037 time: 1.3609 data_time: 0.0043 lr: 0.000070 max_mem: 2552M
[11/10 09:33:26 d2.utils.events]: eta: 0:00:01 iter: 299 total_loss: 1.832 loss_cls: 0.683 loss_box_reg: 0.170 loss_mask: 0.570 loss_rpn_cls: 0.196 loss_rpn_loc: 0.041 time: 1.3593 data_time: 0.0043 lr: 0.000075 max_mem: 2552M
[11/10 09:33:27 d2.engine.hooks]: Overall training speed: 297 iterations in 0:06:45 (1.3639 s / it)
[11/10 09:33:27 d2.engine.hooks]: Total training time: 0:06:46 (0:00:01 on hooks)
OrderedDict()
But I am still confused about why the model does not predict anything at inference time. I have updated the Colab notebook with minimal code to reproduce the issue, and I have also added TensorBoard to the notebook.
You most likely need to train longer. As the issue template says, we do not answer questions about how to train a better model.
Cool. Thanks.
Hi @ppwwyyxx
When I train with 1 GPU, it runs well, for example:
python tools/train_net.py --config-file configs/COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml SOLVER.IMS_PER_BATCH 2 SOLVER.BASE_LR 0.0025
But if I train with 4 GPUs, for example:
python tools/train_net.py --num-gpus 4 --config-file configs/COCO-Detection/faster_rcnn_R_101_FPN_3x.yaml
It raises the following error, which is really strange!
Traceback (most recent call last):
  File "tools/train_net.py", line 166, in <module>

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/zsc/pythoncode/detectron2/detectron2/data/catalog.py", line 52, in get
    f = DatasetCatalog._REGISTERED[name]
KeyError: 'car'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/zsc/anaconda3/envs/car/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/zsc/pythoncode/detectron2/detectron2/engine/launch.py", line 84, in _distributed_worker
    main_func(*args)
  File "/home/zsc/pythoncode/detectron2/tools/train_net.py", line 144, in main
    trainer = Trainer(cfg)
  File "/home/zsc/pythoncode/detectron2/detectron2/engine/defaults.py", line 223, in __init__
    data_loader = self.build_train_loader(cfg)
  File "/home/zsc/pythoncode/detectron2/detectron2/engine/defaults.py", line 397, in build_train_loader
    return build_detection_train_loader(cfg)
  File "/home/zsc/pythoncode/detectron2/detectron2/data/build.py", line 327, in build_detection_train_loader
    proposal_files=cfg.DATASETS.PROPOSAL_FILES_TRAIN if cfg.MODEL.LOAD_PROPOSALS else None,
  File "/home/zsc/pythoncode/detectron2/detectron2/data/build.py", line 256, in get_detection_dataset_dicts
    dataset_dicts = [DatasetCatalog.get(dataset_name) for dataset_name in dataset_names]
  File "/home/zsc/pythoncode/detectron2/detectron2/data/build.py", line 256, in <listcomp>
@zsc1220 It seems the dataset is not registered when you use multiple GPUs. Where do you register the dataset? In train_net.py, you might need to register it inside the main() function.
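With --num-gpus greater than 1, launch() spawns worker processes that each call main(); anything registered only under the parent script's if __name__ == "__main__": block never runs in those workers, so they see an empty catalog (with a single GPU, launch() calls main() directly in the same process, which is why the 1-GPU run worked). A simplified sketch of a training script that registers inside main() might look like this; the real tools/train_net.py has more in it, and the dataset name "car" and the paths below are placeholders:

from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer, default_argument_parser, default_setup, launch

def main(args):
    # Registration happens inside main() so that every worker process spawned by
    # launch() performs it before the training data loader is built.
    register_coco_instances("car", {}, "datasets/car/annotations.json", "datasets/car/images")

    cfg = get_cfg()
    cfg.merge_from_file(args.config_file)
    cfg.merge_from_list(args.opts)
    cfg.DATASETS.TRAIN = ("car",)
    cfg.freeze()
    default_setup(cfg, args)

    trainer = DefaultTrainer(cfg)
    trainer.resume_or_load(resume=args.resume)
    return trainer.train()

if __name__ == "__main__":
    args = default_argument_parser().parse_args()
    launch(main, args.num_gpus, num_machines=args.num_machines,
           machine_rank=args.machine_rank, dist_url=args.dist_url, args=(args,))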
Wow, well done, it works. Thanks a lot!!! @ppwwyyxx
cfg.DATASETS.TRAIN
should contain the names of your datasets as you registered them, not the directory.

Sorry, what did you mean by this?
If you name your dataset "cool_dataset" and the dataset is located in "${HOME}/cool_dataset_is_here":
Good
cfg.DATASETS.TRAIN = ("cool_dataset", )
Bad
home = os.environ.get("HOME")
cfg.DATASETS.TRAIN = (os.path.join(home, "cool_dataset_is_here"), )
Thanks a lot, it helped
Wow, well done, it works. Thanks a lot!!! @ppwwyyxx
How did you do it? Can you share the script? Thanks!
Hi, can someone help me with this issue? Much appreciated. I'm deploying this on Azure via Synapse Notebooks. To register COCO instances, I'm using:
from detectron2.data.datasets import register_coco_instances

register_coco_instances("my_dataset_train", {}, "Azure Data Lake path of my json/filename.json", "Azure Data Lake path of my images/dir")
register_coco_instances("my_dataset_val", {}, "Azure Data Lake path of my json/filename.json", "Azure Data Lake path of my images/dir")
After this, when I run:
import os
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer
cfg = get_cfg()
cfg.MODEL.DEVICE = "cpu"
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("my_dataset_train",)
cfg.DATASETS.TEST = ()
cfg.DATALOADER.NUM_WORKERS = 2
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.SOLVER.IMS_PER_BATCH = 2
cfg.SOLVER.BASE_LR = 0.00025
cfg.SOLVER.MAX_ITER = 2000
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 512
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 225
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
It throws an error saying 'File not found, No directory at 'Azure Data Lake path of my json/filename.json'. Although I'm able to load and read the JSON separately in my notebook, I'm not sure why it can't be fetched from that URL. Any help is much appreciated.
Hi, I am following this getting-started Colab notebook. I am trying to train a custom model on the TACO dataset, which comes as a COCO-formatted dataset.
I prepared this Colab notebook for running the experiments with the dataset. After I registered the dataset using register_coco_instances, I am not able to start the training process, and the error I get looks like the one shown earlier. The above-mentioned notebook can be used to reproduce the issue.
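For completeness, registering a COCO-formatted dataset such as TACO and then pointing the config at the registered name (rather than at a directory) might look like the sketch below; the annotation and image paths are placeholders for wherever the TACO files live:

from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances

# Register the COCO-formatted annotations under a name of your choosing.
register_coco_instances("taco_dataset", {}, "TACO/data/annotations.json", "TACO/data")

# The config then refers to that registered name.
cfg = get_cfg()
cfg.DATASETS.TRAIN = ("taco_dataset",)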