bug while running the sample model training

EdenBelouadah commented 2 years ago

Hello again, I tried to run the training code provided in the Readme.md file using 1 GPU.

./tools/dist_train.sh configs/ld/ld_r50_gflv1_r101_fpn_coco_1x.py 1

The only modification I made to the config file was to specify that I want to run the program for one epoch:

runner = dict(type='EpochBasedRunner', max_epochs=1)

I get an error at the end of training, around the time the checkpoint is saved. This is the complete traceback:

2022-02-03 02:45:24,914 - mmdet - INFO - Epoch [1][58550/58633] lr: 2.500e-03, eta: 0:00:46, time: 0.558, data_time: 0.004, memory: 4122, loss_cls: 0.7503, loss_bbox: 0.3822, loss_dfl: 0.2494, loss_ld: 0.2703, loss_ld_vlr: 0.4174, loss_kd: 0.2851, loss_kd_neg: 0.0000, loss_im: 0.3847, loss: 2.7392
2022-02-03 02:45:52,665 - mmdet - INFO - Epoch [1][58600/58633] lr: 2.500e-03, eta: 0:00:18, time: 0.555, data_time: 0.004, memory: 4122, loss_cls: 0.7615, loss_bbox: 0.3864, loss_dfl: 0.2457, loss_ld: 0.2216, loss_ld_vlr: 0.3467, loss_kd: 0.2940, loss_kd_neg: 0.0000, loss_im: 0.3496, loss: 2.6055
2022-02-03 02:46:16,693 - mmdet - INFO - Saving checkpoint at 1 epochs
[                                                  ] 0/5000, elapsed: 0s, ETA:Traceback (most recent call last):
  File "./tools/train.py", line 187, in <module>
    main()
  File "./tools/train.py", line 183, in main
    meta=meta)
  File "/home/edouard/eden/work/codes/LD/mmdet/apis/train.py", line 170, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/edouard/anaconda3/envs/LD/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 125, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/edouard/anaconda3/envs/LD/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
    self.call_hook('after_train_epoch')
  File "/home/edouard/anaconda3/envs/LD/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 308, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/edouard/eden/work/codes/LD/mmdet/core/evaluation/eval_hooks.py", line 276, in after_train_epoch
    gpu_collect=self.gpu_collect)
  File "/home/edouard/eden/work/codes/LD/mmdet/apis/test.py", line 97, in multi_gpu_test
    result = model(return_loss=False, rescale=True, **data)
  File "/home/edouard/anaconda3/envs/LD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/edouard/anaconda3/envs/LD/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 458, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/edouard/anaconda3/envs/LD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/edouard/anaconda3/envs/LD/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 84, in new_func
    return old_func(*args, **kwargs)
  File "/home/edouard/eden/work/codes/LD/mmdet/models/detectors/base.py", line 183, in forward
    return self.forward_test(img, img_metas, **kwargs)
  File "/home/edouard/eden/work/codes/LD/mmdet/models/detectors/base.py", line 160, in forward_test
    return self.simple_test(imgs[0], img_metas[0], **kwargs)
  File "/home/edouard/eden/work/codes/LD/mmdet/models/detectors/single_stage.py", line 120, in simple_test
    *outs, img_metas, rescale=rescale)
  File "/home/edouard/anaconda3/envs/LD/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 164, in new_func
    return old_func(*args, **kwargs)
  File "/home/edouard/eden/work/codes/LD/mmdet/models/dense_heads/anchor_head.py", line 583, in get_bboxes
    scale_factors, cfg, rescale)
  File "/home/edouard/eden/work/codes/LD/mmdet/models/dense_heads/gfl_head.py", line 560, in _get_bboxes
    cfg.max_per_img)
  File "/home/edouard/eden/work/codes/LD/mmdet/core/post_processing/bbox_nms.py", line 187, in multiclass_nms
    return dets, labels[keep]
IndexError: index 8663 is out of bounds for dimension 0 with size 100
Traceback (most recent call last):
  File "/home/edouard/anaconda3/envs/LD/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/edouard/anaconda3/envs/LD/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/edouard/anaconda3/envs/LD/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/edouard/anaconda3/envs/LD/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/edouard/anaconda3/envs/LD/bin/python', '-u', './tools/train.py', '--local_rank=0', 'configs/ld/ld_r50_gflv1_r101_fpn_coco_1x.py', '--launcher', 'pytorch']' returned non-zero exit status 1.

P.S. Everything is installed as recommended in the README file.

Thank you very much for your help.

Zzh-tju commented 2 years ago

This is an NMS bug. Please update LD/mmdet/core/post_processing/bbox_nms.py.
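
For anyone reading the traceback: the final line says that `labels` had 100 entries while `keep` contained the index 8663, so the kept indices and the labels no longer referred to the same set of boxes. The snippet below is only a sketch of that kind of mismatch, using plain Python lists instead of tensors. It is not the code in bbox_nms.py, the helper `select_labels` is hypothetical, and the sizes (10,000 candidates, 80 classes) are assumptions chosen for illustration.

```python
# Illustrative only: plain Python lists stand in for tensors, and
# `select_labels` is a hypothetical helper, not a function from mmdet.


def select_labels(labels, keep):
    """Return labels[i] for each i in keep, validating the indices first."""
    n = len(labels)
    bad = [i for i in keep if not 0 <= i < n]
    if bad:
        raise IndexError(
            f"index {bad[0]} is out of bounds for dimension 0 with size {n}"
        )
    return [labels[i] for i in keep]


# Assumed setup: 10,000 candidate boxes before NMS, each with a class label.
all_labels = [i % 80 for i in range(10_000)]

# Indices kept by NMS refer to positions among the pre-NMS candidates.
keep = [8663, 12, 4051]

# Consistent: index the same list that `keep` was computed against.
print(select_labels(all_labels, keep))

# Mismatch: labels already cut down to 100 entries, `keep` still pre-NMS.
truncated_labels = all_labels[:100]
try:
    select_labels(truncated_labels, keep)
except IndexError as err:
    print(err)
```

Whatever the exact cause in this repository, the fix has to restore that invariant: `keep` must index the same array it was computed from.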

EdenBelouadah commented 2 years ago

Thank you for your answer. Should I update it using the current bbox_nms.py file from the mmdetection repository? Thanks again.

HikariTJU / LD

bug while running the sample model training #21