facebookresearch / detectron2

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
https://detectron2.readthedocs.io/en/latest/
Apache License 2.0

IndexError: index 0 is out of bounds for dimension 0 with size 0 #4385

Closed: HuygheB closed this issue 1 year ago

HuygheB commented 2 years ago

Hi, I am trying to train SOLOv2 on the Mapillary dataset. I manually converted the dataset to COCO-format annotations. When I run training, everything seems to work fine until a certain iteration, where I get an IndexError (see logs). I presume this might have something to do with the format of the annotation file, but I'm not sure. Any advice would be appreciated. Thanks.

Instructions To Reproduce the Issue:

1. Full runnable code or full changes you made — training file `train_solo.py`:

```python
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer
from detectron2.engine import HookBase
from detectron2.data import build_detection_train_loader
from detectron2.evaluation import COCOEvaluator
from adet.config import get_cfg
import detectron2.utils.comm as comm
import os, torch
```

Register the Mapillary datasets:

```python
register_coco_instances("mapillary_train", {},
                        "/home/benjaminh/detectron2/AdelaiDet/datasets/mapillary/annotations/filtered_instances_train2017.json",
                        "/home/benjaminh/detectron2/AdelaiDet/datasets/mapillary/train2017")
register_coco_instances("mapillary_val", {},
                        "/home/benjaminh/detectron2/AdelaiDet/datasets/mapillary/annotations/filtered_instances_val2017.json",
                        "/home/benjaminh/detectron2/AdelaiDet/datasets/mapillary/val2017")
print('Datasets registered')
```
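As a quick sanity check (not part of the original script), the registered dataset can be inspected for images that carry no annotations, which is the situation the traceback below runs into. This is only a sketch and assumes the standard list-of-dicts format returned by `DatasetCatalog.get`:

```python
from detectron2.data import DatasetCatalog

# hypothetical diagnostic: count images whose "annotations" list is empty
dicts = DatasetCatalog.get("mapillary_train")
empty = sum(1 for d in dicts if len(d.get("annotations", [])) == 0)
print(f"{empty} of {len(dicts)} registered images have no annotations")
```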

Get and set the config file:

```python
cfg = get_cfg()
cfg.merge_from_file("/home/benjaminh/detectron2/AdelaiDet/configs/SOLOv2/R50_3x.yaml")
cfg.MODEL.WEIGHTS = "SOLOv2_R50_3x.pth"  # let training initialize from the model zoo

cfg.DATASETS.TRAIN = ("mapillary_train",)
cfg.DATASETS.TEST = ("mapillary_val",)

cfg.TEST.EVAL_PERIOD = 100
cfg.DATALOADER.NUM_WORKERS = 2

cfg.SOLVER.IMS_PER_BATCH = 2       # this is the real "batch size" commonly known to deep learning people
cfg.SOLVER.BASE_LR = 0.00025       # pick a good LR
cfg.SOLVER.MAX_ITER = 100000
cfg.SOLVER.STEPS = []              # do not decay learning rate
cfg.MODEL.SOLOV2.NUM_CLASSES = 17  # 17 classes
```

Definition of a validation hook to track val_loss during training:

```python
cfg.DATASETS.VAL = ("mapillary_val",)

class ValidationLoss(HookBase):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg.clone()
        self.cfg.DATASETS.TRAIN = cfg.DATASETS.VAL
        self._loader = iter(build_detection_train_loader(self.cfg))

    def after_step(self):
        data = next(self._loader)
        with torch.no_grad():
            loss_dict = self.trainer.model(data)

            losses = sum(loss_dict.values())
            assert torch.isfinite(losses).all(), loss_dict

            loss_dict_reduced = {"val_" + k: v.item() for k, v in
                                 comm.reduce_dict(loss_dict).items()}
            losses_reduced = sum(loss for loss in loss_dict_reduced.values())
            if comm.is_main_process():
                self.trainer.storage.put_scalars(total_val_loss=losses_reduced,
                                                 **loss_dict_reduced)
```

Define a trainer that subclasses DefaultTrainer:

```python
class Trainer(DefaultTrainer):
    @classmethod
    def build_evaluator(cls, cfg, dataset_name, output_folder=None):
        if output_folder is None:
            output_folder = os.path.join(cfg.OUTPUT_DIR, "inference")
        return COCOEvaluator(dataset_name, cfg, True, output_folder)
```

Train:

```python
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = Trainer(cfg)
val_loss = ValidationLoss(cfg)
trainer.register_hooks([val_loss])

# swap the order of PeriodicWriter and ValidationLoss
trainer._hooks = trainer._hooks[:-2] + trainer._hooks[-2:][::-1]

trainer.resume_or_load(resume=True)
trainer.train()
```


3. What exact command you run: python train_solo.py
4. __Full logs__ or other relevant observations:

```
ERROR [07/07 00:34:34 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/detectron2/engine/train_loop.py", line 274, in run_step
    loss_dict = self.model(data)
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/benjaminh/detectron2/AdelaiDet/adet/modeling/solov2/solov2.py", line 136, in forward
    targets = self.get_ground_truth(gt_instances, mask_feat_size)
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/benjaminh/detectron2/AdelaiDet/adet/modeling/solov2/solov2.py", line 163, in get_ground_truth
    self.get_ground_truth_single(img_idx, gt_instances,
  File "/home/benjaminh/detectron2/AdelaiDet/adet/modeling/solov2/solov2.py", line 175, in get_ground_truth_single
    device = gt_labels_raw[0].device
IndexError: index 0 is out of bounds for dimension 0 with size 0
[07/07 00:34:34 d2.engine.hooks]: Overall training speed: 43200 iterations in 3:55:38 (0.3273 s / it)
[07/07 00:34:34 d2.engine.hooks]: Total training time: 7:08:04 (3:12:26 on hooks)
[07/07 00:34:34 d2.utils.events]: eta: 4:38:50  iter: 43202  total_loss: 1.48  loss_ins: 1.212  loss_cate: 0.2656  total_val_loss: 1.498  val_loss_ins: 1.23  val_loss_cate: 0.265  time: 0.3273  data_time: 0.0687  lr: 0.00025  max_mem: 4022M
Traceback (most recent call last):
  File "/home/benjaminh/detectron2/AdelaiDet/tools/train_solo.py", line 204, in <module>
    trainer.train()
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/detectron2/engine/defaults.py", line 484, in train
    super().train(self.start_iter, self.max_iter)
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/detectron2/engine/train_loop.py", line 274, in run_step
    loss_dict = self.model(data)
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/benjaminh/detectron2/AdelaiDet/adet/modeling/solov2/solov2.py", line 136, in forward
    targets = self.get_ground_truth(gt_instances, mask_feat_size)
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/benjaminh/detectron2/AdelaiDet/adet/modeling/solov2/solov2.py", line 163, in get_ground_truth
    self.get_ground_truth_single(img_idx, gt_instances,
  File "/home/benjaminh/detectron2/AdelaiDet/adet/modeling/solov2/solov2.py", line 175, in get_ground_truth_single
    device = gt_labels_raw[0].device
IndexError: index 0 is out of bounds for dimension 0 with size 0
```
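The failing line in `get_ground_truth_single` reads the device from the first ground-truth label of an image. If an image in the batch has no instance annotations, that label tensor is empty and indexing element 0 raises exactly this error. A minimal illustration of the failure mode (the tensor below is made up for the example, not taken from the actual run):

```python
import torch

# an image with zero instance annotations yields an empty label tensor
gt_labels_raw = torch.empty(0, dtype=torch.int64)

# the equivalent of `device = gt_labels_raw[0].device` in solov2.py then fails:
device = gt_labels_raw[0].device  # IndexError: index 0 is out of bounds for dimension 0 with size 0
```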


## Expected behavior:
Expect training to continue as normal until the max iteration specified. 

## Environment:

```
sys.platform            linux
Python                  3.9.13 packaged by conda-forge (main, May 27 2022, 16:56:21) [GCC 10.3.0]
numpy                   1.22.3
detectron2              0.6 @/opt/conda/envs/detectron2/lib/python3.9/site-packages/detectron2
detectron2._C           not built correctly: /opt/conda/envs/detectron2/lib/python3.9/site-packages/detectron2/_C.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZN5torch11CppFunctionD1Ev
Compiler ($CXX)         c++ (Debian 8.3.0-6) 8.3.0
CUDA compiler           Build cuda_11.0_bu.TC445_37.28845127_0
detectron2 arch flags   7.5
DETECTRON2_ENV_MODULE   <not set>
PyTorch                 1.10.1 @/opt/conda/envs/detectron2/lib/python3.9/site-packages/torch
PyTorch debug build     False
GPU available           Yes
GPU 0,1,2,3             A100-SXM4-40GB (arch=8.0)
Driver version          460.73.01
CUDA_HOME               /usr/local/cuda
Pillow                  9.1.1
torchvision             0.11.2 @/opt/conda/envs/detectron2/lib/python3.9/site-packages/torchvision
torchvision arch flags  3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore                  0.1.5.post20220512
iopath                  0.1.9
cv2                     4.5.5
```

PyTorch built with:

Testing NCCL connectivity ... this should not hang. NCCL succeeded.

hendrenja commented 2 years ago

I encountered the same error a few weeks ago. The error was caused by an invalid dataset. There were items in the dataset that did not have annotations. I baked Datumaro into our data prep pipeline to clean this up.

See Datumaro - Python Module

```python
# keep only annotated images
dataset.select(lambda item: len(item.annotations) != 0)
```
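If you would rather not add another dependency, the same cleanup can be done directly on the COCO annotation file. Below is a minimal sketch (the file paths are placeholders, and it assumes the standard COCO instances layout with top-level `images` and `annotations` lists) that drops every image without annotations:

```python
import json

# placeholder paths; point these at your own annotation files
src = "filtered_instances_train2017.json"
dst = "filtered_instances_train2017_nonempty.json"

with open(src) as f:
    coco = json.load(f)

# collect the ids of images that have at least one annotation
annotated_ids = {ann["image_id"] for ann in coco["annotations"]}

# keep only annotated images; annotations and categories stay unchanged
coco["images"] = [img for img in coco["images"] if img["id"] in annotated_ids]

with open(dst, "w") as f:
    json.dump(coco, f)
```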
ppwwyyxx commented 1 year ago

The error comes from AdelaiDet. Please report it there instead.