facebookresearch / detectron2

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
https://detectron2.readthedocs.io/en/latest/
Apache License 2.0

IndexError: index 0 is out of bounds for dimension 0 with size 0 #4385

Closed: HuygheB closed this issue 1 year ago

HuygheB commented 2 years ago

Hi, I am trying to train SOLOv2 on the Mapillary dataset. I manually converted the dataset to COCO-format annotations. When I run training, everything seems to work fine until a certain iteration, where I get an IndexError (see logs). I presume this might have something to do with the format of the annotation file, but I'm not sure. Any advice would be appreciated. Thanks.

Instructions To Reproduce the Issue:

1. Full runnable code or full changes you made — training file `train_solo.py`:

```python
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer
from detectron2.engine import HookBase
from detectron2.data import build_detection_train_loader
from detectron2.evaluation import COCOEvaluator
from adet.config import get_cfg
import detectron2.utils.comm as comm
import os, torch
```

Register the Mapillary datasets:

```python
register_coco_instances("mapillary_train", {},
                        "/home/benjaminh/detectron2/AdelaiDet/datasets/mapillary/annotations/filtered_instances_train2017.json",
                        "/home/benjaminh/detectron2/AdelaiDet/datasets/mapillary/train2017")
register_coco_instances("mapillary_val", {},
                        "/home/benjaminh/detectron2/AdelaiDet/datasets/mapillary/annotations/filtered_instances_val2017.json",
                        "/home/benjaminh/detectron2/AdelaiDet/datasets/mapillary/val2017")
print('Datasets registered')
```
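As a quick sanity check (not part of the original script), the registered dataset can be inspected for images that carry no annotations, which is the situation the traceback below runs into. This is only a sketch and assumes the standard list-of-dicts format returned by `DatasetCatalog.get`:

```python
from detectron2.data import DatasetCatalog

# hypothetical diagnostic: count images whose "annotations" list is empty
dicts = DatasetCatalog.get("mapillary_train")
empty = sum(1 for d in dicts if len(d.get("annotations", [])) == 0)
print(f"{empty} of {len(dicts)} registered images have no annotations")
```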

Get and set the config file:

```python
cfg = get_cfg()
cfg.merge_from_file("/home/benjaminh/detectron2/AdelaiDet/configs/SOLOv2/R50_3x.yaml")
cfg.MODEL.WEIGHTS = "SOLOv2_R50_3x.pth"  # let training initialize from the model zoo

cfg.DATASETS.TRAIN = ("mapillary_train",)
cfg.DATASETS.TEST = ("mapillary_val",)

cfg.TEST.EVAL_PERIOD = 100
cfg.DATALOADER.NUM_WORKERS = 2

cfg.SOLVER.IMS_PER_BATCH = 2       # this is the real "batch size" commonly known to deep learning people
cfg.SOLVER.BASE_LR = 0.00025       # pick a good LR
cfg.SOLVER.MAX_ITER = 100000
cfg.SOLVER.STEPS = []              # do not decay learning rate
cfg.MODEL.SOLOV2.NUM_CLASSES = 17  # 17 classes
```

Definition of a validation hook to track val_loss during training:

```python
cfg.DATASETS.VAL = ("mapillary_val",)

class ValidationLoss(HookBase):
    def __init__(self, cfg):
        super().__init__()
        self.cfg = cfg.clone()
        self.cfg.DATASETS.TRAIN = cfg.DATASETS.VAL
        self._loader = iter(build_detection_train_loader(self.cfg))

    def after_step(self):
        data = next(self._loader)
        with torch.no_grad():
            loss_dict = self.trainer.model(data)

            losses = sum(loss_dict.values())
            assert torch.isfinite(losses).all(), loss_dict

            loss_dict_reduced = {"val_" + k: v.item() for k, v in
                                 comm.reduce_dict(loss_dict).items()}
            losses_reduced = sum(loss for loss in loss_dict_reduced.values())
            if comm.is_main_process():
                self.trainer.storage.put_scalars(total_val_loss=losses_reduced,
                                                 **loss_dict_reduced)
```

Define a trainer that subclasses DefaultTrainer:

```python
class Trainer(DefaultTrainer):
    @classmethod
    def build_evaluator(cls, cfg, dataset_name, output_folder=None):
        if output_folder is None:
            output_folder = os.path.join(cfg.OUTPUT_DIR, "inference")
        return COCOEvaluator(dataset_name, cfg, True, output_folder)
```

Train:

```python
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = Trainer(cfg)
val_loss = ValidationLoss(cfg)
trainer.register_hooks([val_loss])

# swap the order of PeriodicWriter and ValidationLoss
trainer._hooks = trainer._hooks[:-2] + trainer._hooks[-2:][::-1]

trainer.resume_or_load(resume=True)
trainer.train()
```


3. What exact command you run: python train_solo.py
4. __Full logs__ or other relevant observations:

```
ERROR [07/07 00:34:34 d2.engine.train_loop]: Exception during training:
Traceback (most recent call last):
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/detectron2/engine/train_loop.py", line 274, in run_step
    loss_dict = self.model(data)
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/benjaminh/detectron2/AdelaiDet/adet/modeling/solov2/solov2.py", line 136, in forward
    targets = self.get_ground_truth(gt_instances, mask_feat_size)
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/benjaminh/detectron2/AdelaiDet/adet/modeling/solov2/solov2.py", line 163, in get_ground_truth
    self.get_ground_truth_single(img_idx, gt_instances,
  File "/home/benjaminh/detectron2/AdelaiDet/adet/modeling/solov2/solov2.py", line 175, in get_ground_truth_single
    device = gt_labels_raw[0].device
IndexError: index 0 is out of bounds for dimension 0 with size 0
[07/07 00:34:34 d2.engine.hooks]: Overall training speed: 43200 iterations in 3:55:38 (0.3273 s / it)
[07/07 00:34:34 d2.engine.hooks]: Total training time: 7:08:04 (3:12:26 on hooks)
[07/07 00:34:34 d2.utils.events]: eta: 4:38:50  iter: 43202  total_loss: 1.48  loss_ins: 1.212  loss_cate: 0.2656  total_val_loss: 1.498  val_loss_ins: 1.23  val_loss_cate: 0.265  time: 0.3273  data_time: 0.0687  lr: 0.00025  max_mem: 4022M
Traceback (most recent call last):
  File "/home/benjaminh/detectron2/AdelaiDet/tools/train_solo.py", line 204, in <module>
    trainer.train()
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/detectron2/engine/defaults.py", line 484, in train
    super().train(self.start_iter, self.max_iter)
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/detectron2/engine/defaults.py", line 494, in run_step
    self._trainer.run_step()
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/detectron2/engine/train_loop.py", line 274, in run_step
    loss_dict = self.model(data)
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/benjaminh/detectron2/AdelaiDet/adet/modeling/solov2/solov2.py", line 136, in forward
    targets = self.get_ground_truth(gt_instances, mask_feat_size)
  File "/opt/conda/envs/detectron2/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 28, in decorate_context
    return func(*args, **kwargs)
  File "/home/benjaminh/detectron2/AdelaiDet/adet/modeling/solov2/solov2.py", line 163, in get_ground_truth
    self.get_ground_truth_single(img_idx, gt_instances,
  File "/home/benjaminh/detectron2/AdelaiDet/adet/modeling/solov2/solov2.py", line 175, in get_ground_truth_single
    device = gt_labels_raw[0].device
IndexError: index 0 is out of bounds for dimension 0 with size 0
```
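The failing line in `get_ground_truth_single` reads the device from the first ground-truth label of an image. If an image in the batch has no instance annotations, that label tensor is empty and indexing element 0 raises exactly this error. A minimal illustration of the failure mode (the tensor below is made up for the example, not taken from the actual run):

```python
import torch

# an image with zero instance annotations yields an empty label tensor
gt_labels_raw = torch.empty(0, dtype=torch.int64)

# the equivalent of `device = gt_labels_raw[0].device` in solov2.py then fails:
device = gt_labels_raw[0].device  # IndexError: index 0 is out of bounds for dimension 0 with size 0
```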


## Expected behavior:
Expect training to continue as normal until the max iteration specified. 

## Environment:

```
sys.platform            linux
Python                  3.9.13 packaged by conda-forge (main, May 27 2022, 16:56:21) [GCC 10.3.0]
numpy                   1.22.3
detectron2              0.6 @/opt/conda/envs/detectron2/lib/python3.9/site-packages/detectron2
detectron2._C           not built correctly: /opt/conda/envs/detectron2/lib/python3.9/site-packages/detectron2/_C.cpython-39-x86_64-linux-gnu.so: undefined symbol: _ZN5torch11CppFunctionD1Ev
Compiler ($CXX)         c++ (Debian 8.3.0-6) 8.3.0
CUDA compiler           Build cuda_11.0_bu.TC445_37.28845127_0
detectron2 arch flags   7.5
DETECTRON2_ENV_MODULE   <not set>
PyTorch                 1.10.1 @/opt/conda/envs/detectron2/lib/python3.9/site-packages/torch
PyTorch debug build     False
GPU available           Yes
GPU 0,1,2,3             A100-SXM4-40GB (arch=8.0)
Driver version          460.73.01
CUDA_HOME               /usr/local/cuda
Pillow                  9.1.1
torchvision             0.11.2 @/opt/conda/envs/detectron2/lib/python3.9/site-packages/torchvision
torchvision arch flags  3.5, 5.0, 6.0, 7.0, 7.5, 8.0, 8.6
fvcore                  0.1.5.post20220512
iopath                  0.1.9
cv2                     4.5.5
```

PyTorch built with:

Testing NCCL connectivity ... this should not hang. NCCL succeeded.

hendrenja commented 2 years ago

I encountered the same error a few weeks ago. The error was caused by an invalid dataset. There were items in the dataset that did not have annotations. I baked Datumaro into our data prep pipeline to clean this up.

See Datumaro - Python Module

```python
# keep only annotated images
dataset.select(lambda item: len(item.annotations) != 0)
```
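If you would rather not add another dependency, the same cleanup can be done directly on the COCO annotation file. Below is a minimal sketch (the file paths are placeholders, and it assumes the standard COCO instances layout with top-level `images` and `annotations` lists) that drops every image without annotations:

```python
import json

# placeholder paths; point these at your own annotation files
src = "filtered_instances_train2017.json"
dst = "filtered_instances_train2017_nonempty.json"

with open(src) as f:
    coco = json.load(f)

# collect the ids of images that have at least one annotation
annotated_ids = {ann["image_id"] for ann in coco["annotations"]}

# keep only annotated images; annotations and categories stay unchanged
coco["images"] = [img for img in coco["images"] if img["id"] in annotated_ids]

with open(dst, "w") as f:
    json.dump(coco, f)
```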
ppwwyyxx commented 1 year ago

The error comes from AdelaiDet. Please report it there instead.