Closed EdenBelouadah closed 1 year ago
Hello again, I tried to run the training command provided in the README.md file using 1 GPU.
./tools/dist_train.sh configs/ld/ld_r50_gflv1_r101_fpn_coco_1x.py 1
The only modification I made to the config file was to specify that I want to train for one epoch:
runner = dict(type='EpochBasedRunner', max_epochs=1)
I get an error when the checkpoint is saved at the end of training. This is the complete output:
2022-02-03 02:45:24,914 - mmdet - INFO - Epoch [1][58550/58633] lr: 2.500e-03, eta: 0:00:46, time: 0.558, data_time: 0.004, memory: 4122, loss_cls: 0.7503, loss_bbox: 0.3822, loss_dfl: 0.2494, loss_ld: 0.2703, loss_ld_vlr: 0.4174, loss_kd: 0.2851, loss_kd_neg: 0.0000, loss_im: 0.3847, loss: 2.7392
2022-02-03 02:45:52,665 - mmdet - INFO - Epoch [1][58600/58633] lr: 2.500e-03, eta: 0:00:18, time: 0.555, data_time: 0.004, memory: 4122, loss_cls: 0.7615, loss_bbox: 0.3864, loss_dfl: 0.2457, loss_ld: 0.2216, loss_ld_vlr: 0.3467, loss_kd: 0.2940, loss_kd_neg: 0.0000, loss_im: 0.3496, loss: 2.6055
2022-02-03 02:46:16,693 - mmdet - INFO - Saving checkpoint at 1 epochs
[ ] 0/5000, elapsed: 0s, ETA:Traceback (most recent call last):
  File "./tools/train.py", line 187, in <module>
    main()
  File "./tools/train.py", line 183, in main
    meta=meta)
  File "/home/edouard/eden/work/codes/LD/mmdet/apis/train.py", line 170, in train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/edouard/anaconda3/envs/LD/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 125, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/edouard/anaconda3/envs/LD/lib/python3.7/site-packages/mmcv/runner/epoch_based_runner.py", line 54, in train
    self.call_hook('after_train_epoch')
  File "/home/edouard/anaconda3/envs/LD/lib/python3.7/site-packages/mmcv/runner/base_runner.py", line 308, in call_hook
    getattr(hook, fn_name)(self)
  File "/home/edouard/eden/work/codes/LD/mmdet/core/evaluation/eval_hooks.py", line 276, in after_train_epoch
    gpu_collect=self.gpu_collect)
  File "/home/edouard/eden/work/codes/LD/mmdet/apis/test.py", line 97, in multi_gpu_test
    result = model(return_loss=False, rescale=True, **data)
  File "/home/edouard/anaconda3/envs/LD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/edouard/anaconda3/envs/LD/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 458, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/edouard/anaconda3/envs/LD/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/edouard/anaconda3/envs/LD/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 84, in new_func
    return old_func(*args, **kwargs)
  File "/home/edouard/eden/work/codes/LD/mmdet/models/detectors/base.py", line 183, in forward
    return self.forward_test(img, img_metas, **kwargs)
  File "/home/edouard/eden/work/codes/LD/mmdet/models/detectors/base.py", line 160, in forward_test
    return self.simple_test(imgs[0], img_metas[0], **kwargs)
  File "/home/edouard/eden/work/codes/LD/mmdet/models/detectors/single_stage.py", line 120, in simple_test
    *outs, img_metas, rescale=rescale)
  File "/home/edouard/anaconda3/envs/LD/lib/python3.7/site-packages/mmcv/runner/fp16_utils.py", line 164, in new_func
    return old_func(*args, **kwargs)
  File "/home/edouard/eden/work/codes/LD/mmdet/models/dense_heads/anchor_head.py", line 583, in get_bboxes
    scale_factors, cfg, rescale)
  File "/home/edouard/eden/work/codes/LD/mmdet/models/dense_heads/gfl_head.py", line 560, in _get_bboxes
    cfg.max_per_img)
  File "/home/edouard/eden/work/codes/LD/mmdet/core/post_processing/bbox_nms.py", line 187, in multiclass_nms
    return dets, labels[keep]
IndexError: index 8663 is out of bounds for dimension 0 with size 100
Traceback (most recent call last):
  File "/home/edouard/anaconda3/envs/LD/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/edouard/anaconda3/envs/LD/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/edouard/anaconda3/envs/LD/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/edouard/anaconda3/envs/LD/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/edouard/anaconda3/envs/LD/bin/python', '-u', './tools/train.py', '--local_rank=0', 'configs/ld/ld_r50_gflv1_r101_fpn_coco_1x.py', '--launcher', 'pytorch']' returned non-zero exit status 1.
P.S. Everything is installed as recommended in the README file.
Thank you very much for your help.
This is an NMS bug. Please update LD/mmdet/core/post_processing/bbox_nms.py.
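For readers who hit the same IndexError: the message says that labels holds 100 entries while keep contains the index 8663, so the two no longer refer to the same set of candidate boxes. The sketch below is a plain-Python illustration of that kind of mismatch, not the actual code in bbox_nms.py. The function names and the truncate-first ordering are assumptions made purely for illustration, and the real cause in this repository may differ.

```python
def select_labels(labels, keep, max_num):
    """Gather labels for the boxes kept by NMS, then cap the result.

    `keep` holds indices into the full pre-NMS candidate list, so
    `labels` must be indexed before anything is truncated.
    (Hypothetical helper, for illustration only.)
    """
    kept = [labels[i] for i in keep]
    return kept[:max_num] if max_num > 0 else kept


def select_labels_truncate_first(labels, keep, max_num):
    """Same idea with the steps in the wrong order (illustrative bug).

    Truncating `labels` to `max_num` entries first leaves indices such
    as 8663 pointing past the end of the shortened list.
    """
    truncated = labels[:max_num]
    return [truncated[i] for i in keep]


if __name__ == "__main__":
    # 10000 candidate boxes with made-up class ids in [0, 80).
    labels = [i % 80 for i in range(10000)]
    keep = [8663, 5, 42]  # indices that survived NMS (made up)

    print(select_labels(labels, keep, max_num=100))
    try:
        select_labels_truncate_first(labels, keep, max_num=100)
    except IndexError as err:
        print("IndexError:", err)
```

Whatever the real cause is, the invariant to check in an updated bbox_nms.py is the same: keep and labels have to index the same candidate set at the point where labels[keep] is evaluated.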
Thank you for your answer. Should I update it using the current bbox_nms.py file from the mmdetection repository? Thanks again.