Closed by HuygheB 1 year ago
I encountered the same error a few weeks ago. The error was caused by an invalid dataset. There were items in the dataset that did not have annotations. I baked Datumaro into our data prep pipeline to clean this up.
# keep only annotated images
dataset.select(lambda item: len(item.annotations) != 0)
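If Datumaro is not part of the pipeline, a similar cleanup can be sketched directly on a COCO-format annotation dictionary. This is only a minimal sketch, not the approach used above: the function name `drop_unannotated_images` is hypothetical, and it assumes the standard COCO keys `images`, `annotations`, and `image_id`.

```python
import json


def drop_unannotated_images(coco: dict) -> dict:
    """Return a copy of a COCO-style dict without images that have no annotations."""
    # Collect the ids of all images referenced by at least one annotation.
    annotated_ids = {ann["image_id"] for ann in coco.get("annotations", [])}
    cleaned = dict(coco)  # shallow copy; other keys (categories, info) are kept as-is
    cleaned["images"] = [
        img for img in coco.get("images", []) if img["id"] in annotated_ids
    ]
    return cleaned


if __name__ == "__main__":
    # Tiny in-memory example: image 2 has no annotations and is dropped.
    example = {
        "images": [{"id": 1, "file_name": "a.jpg"}, {"id": 2, "file_name": "b.jpg"}],
        "annotations": [{"id": 10, "image_id": 1, "category_id": 3}],
        "categories": [{"id": 3, "name": "car"}],
    }
    print(json.dumps(drop_unannotated_images(example)["images"]))
```

For a real file, the dictionary would come from `json.load` and the result would be written back with `json.dump` before registering the dataset.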
Error comes from AdelaiDet. Please report there instead.
Hi, I am trying to train SOLOv2 on the Mapillary dataset. I manually converted the dataset to COCO-format annotations. When I run training, everything seems to work fine until a certain stage, where I get an IndexError (see logs). I presume this might have something to do with the format of the annotation file, but I'm not sure. Any advice would be appreciated. Thanks.
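One way to test the suspicion about the annotation file is to run a quick consistency check on it before training. The sketch below is an assumption about what is worth checking, not a diagnosis of this particular file: `summarize_coco` is a hypothetical helper, and it relies only on the standard COCO keys `images`, `annotations`, and `categories`.

```python
from collections import Counter


def summarize_coco(coco: dict) -> dict:
    """Report common consistency problems in a COCO-style annotation dict."""
    image_ids = {img["id"] for img in coco.get("images", [])}
    category_ids = {cat["id"] for cat in coco.get("categories", [])}
    annotations = coco.get("annotations", [])
    # Number of annotations that point at each image id.
    per_image = Counter(ann["image_id"] for ann in annotations)
    return {
        # Images that no annotation refers to.
        "images_without_annotations": sorted(image_ids - set(per_image)),
        # Annotations whose image_id does not appear in "images".
        "orphan_annotations": sorted(
            ann["id"] for ann in annotations if ann["image_id"] not in image_ids
        ),
        # Annotations whose category_id does not appear in "categories".
        "unknown_categories": sorted(
            ann["id"] for ann in annotations if ann["category_id"] not in category_ids
        ),
    }


if __name__ == "__main__":
    example = {
        "images": [{"id": 1}, {"id": 2}],
        "annotations": [
            {"id": 10, "image_id": 1, "category_id": 3},
            {"id": 11, "image_id": 9, "category_id": 4},
        ],
        "categories": [{"id": 3, "name": "car"}],
    }
    print(summarize_coco(example))
```

To check a real file, load it with `json.load` and pass the resulting dictionary to the helper. A non-empty `images_without_annotations` list would be consistent with an `IndexError` on an empty label tensor, though it does not prove that this is the cause.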
Instructions To Reproduce the Issue:
register mapillary dataset
register_coco_instances(
    "mapillary_train",
    {},
    "/home/benjaminh/detectron2/AdelaiDet/datasets/mapillary/annotations/filtered_instances_train2017.json",
    "/home/benjaminh/detectron2/AdelaiDet/datasets/mapillary/train2017",
)
register_coco_instances(
    "mapillary_val",
    {},
    "/home/benjaminh/detectron2/AdelaiDet/datasets/mapillary/annotations/filtered_instances_val2017.json",
    "/home/benjaminh/detectron2/AdelaiDet/datasets/mapillary/val2017",
)
print('Datasets registered')
get and set config file
cfg = get_cfg()
cfg.merge_from_file("/home/benjaminh/detectron2/AdelaiDet/configs/SOLOv2/R50_3x.yaml")
cfg.MODEL.WEIGHTS = "SOLOv2_R50_3x.pth"  # Let training initialize from model zoo
cfg.DATASETS.TRAIN = ("mapillary_train",)
cfg.DATASETS.TEST = ("mapillary_val",)
cfg.TEST.EVAL_PERIOD = 100
cfg.DATALOADER.NUM_WORKERS = 2
cfg.SOLVER.IMS_PER_BATCH = 2  # This is the real "batch size" commonly known to deep learning people
cfg.SOLVER.BASE_LR = 0.00025  # pick a good LR
cfg.SOLVER.MAX_ITER = 100000  # total number of training iterations
cfg.SOLVER.STEPS = []  # do not decay learning rate
cfg.MODEL.SOLOV2.NUM_CLASSES = 17  # 17 classes
Definition of a Validation hook to track val_loss during training:
cfg.DATASETS.VAL = ("mapillary_val",)
class ValidationLoss(HookBase):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg.clone()
        # Point the training loader at the validation split
        self.cfg.DATASETS.TRAIN = cfg.DATASETS.VAL
        self._loader = iter(build_detection_train_loader(self.cfg))
Define trainer that subclasses DefaultTrainer
class Trainer(DefaultTrainer):
    @classmethod
    def build_evaluator(cls, cfg, dataset_name, output_folder=None):
        if output_folder is None:
            output_folder = os.path.join(cfg.OUTPUT_DIR, "inference")
        return COCOEvaluator(dataset_name, cfg, True, output_folder)
Train
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = Trainer(cfg)
val_loss = ValidationLoss(cfg)
trainer.register_hooks([val_loss])
# swap the order of PeriodicWriter and ValidationLoss
trainer._hooks = trainer._hooks[:-2] + trainer._hooks[-2:][::-1]
trainer.resume_or_load(resume=True)
trainer.train()
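The slice expression applied to `trainer._hooks` exchanges only the last two entries of the list and leaves the rest in place, presumably so that the validation losses are recorded before the writer runs. A minimal sketch with placeholder strings instead of real hook objects (the names in the list are illustrative, not the actual contents of `trainer._hooks`):

```python
def swap_last_two(items: list) -> list:
    """Return a new list with the final two elements exchanged."""
    # items[:-2] keeps everything except the last two elements;
    # items[-2:][::-1] takes those two elements in reverse order.
    return items[:-2] + items[-2:][::-1]


hooks = ["IterationTimer", "LRScheduler", "PeriodicWriter", "ValidationLoss"]
print(swap_last_two(hooks))
# ['IterationTimer', 'LRScheduler', 'ValidationLoss', 'PeriodicWriter']
```

The expression builds a new list rather than mutating in place, which is why the original code assigns the result back to `trainer._hooks`.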
```
ERROR [07/07 00:34:34 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/detectron2/engine/train_loop.py", line 274, in run_step
    loss_dict = self.model(data)
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/benjaminh/detectron2/AdelaiDet/adet/modeling/solov2/solov2.py", line 136, in forward
    targets = self.get_ground_truth(gt_instances, mask_feat_size)
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/benjaminh/detectron2/AdelaiDet/adet/modeling/solov2/solov2.py", line 163, in get_ground_truth
    self.get_ground_truth_single(img_idx, gt_instances,
  File "/home/benjaminh/detectron2/AdelaiDet/adet/modeling/solov2/solov2.py", line 175, in get_ground_truth_single
    device = gt_labels_raw[0].device
IndexError: index 0 is out of bounds for dimension 0 with size 0
[07/07 00:34:34 d2.engine.hooks]: Overall training speed: 43200 iterations in 3:55:38 (0.3273 s / it)
[07/07 00:34:34 d2.engine.hooks]: Total training time: 7:08:04 (3:12:26 on hooks)
[07/07 00:34:34 d2.utils.events]: eta: 4:38:50 iter: 43202 total_loss: 1.48 loss_ins: 1.212 loss_cate: 0.2656 total_val_loss: 1.498 val_loss_ins: 1.23 val_loss_cate: 0.265 time: 0.3273 data_time: 0.0687 lr: 0.00025 max_mem: 4022M
Traceback (most recent call last):
  File "/home/benjaminh/detectron2/AdelaiDet/tools/train_solo.py", line 204, in <module>
    trainer.train()
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/detectron2/engine/defaults.py", line 484, in train
    super().train(self.start_iter, self.max_iter)
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/detectron2/engine/train_loop.py", line 274, in run_step
    loss_dict = self.model(data)
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/benjaminh/detectron2/AdelaiDet/adet/modeling/solov2/solov2.py", line 136, in forward
    targets = self.get_ground_truth(gt_instances, mask_feat_size)
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/benjaminh/detectron2/AdelaiDet/adet/modeling/solov2/solov2.py", line 163, in get_ground_truth
    self.get_ground_truth_single(img_idx, gt_instances,
  File "/home/benjaminh/detectron2/AdelaiDet/adet/modeling/solov2/solov2.py", line 175, in get_ground_truth_single
    device = gt_labels_raw[0].device
IndexError: index 0 is out of bounds for dimension 0 with size 0
```
PyTorch built with:
Testing NCCL connectivity ... this should not hang. NCCL succeeded.