RangiLyu / nanodet

NanoDet-Plus⚡Super fast and lightweight anchor-free object detection model. 🔥Only 980 KB(int8) / 1.8MB (fp16) and run 97FPS on cellphone🔥
Apache License 2.0
5.61k stars 1.03k forks

ProcessGroupNCCL.cpp:784, unhandled system error #285

Open waduhekx opened 2 years ago

waduhekx commented 2 years ago

(nanodet) simon@Simon:~/nanodet$ python tools/train.py config/nanodet-m-416.yml
[root][07-16 11:17:37]INFO:Using Tensorboard, logs will be saved in workspace/nanodet_m_416/logs
[root][07-16 11:17:37]INFO:Setting up data...
loading annotations into memory...
Done (t=16.84s)
creating index...
index created!
loading annotations into memory...
Done (t=0.72s)
creating index...
index created!
[root][07-16 11:17:55]INFO:Creating model...
model size is 1.0x
init weights...
=> loading pretrained model https://download.pytorch.org/models/shufflenetv2_x1-5666bf0f80.pth
Finish initialize Lite GFL Head.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
initializing ddp: GLOBAL_RANK: 0, MEMBER: 1/1
Traceback (most recent call last):
  File "tools/train.py", line 100, in <module>
    main(args)
  File "tools/train.py", line 95, in main
    trainer.fit(task, train_dataloader, val_dataloader)
  File "/home/simon/anaconda3/envs/nanodet/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 460, in fit
    self._run(model)
  File "/home/simon/anaconda3/envs/nanodet/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 714, in _run
    self.accelerator.setup_environment()
  File "/home/simon/anaconda3/envs/nanodet/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 80, in setup_environment
    self.training_type_plugin.setup_environment()
  File "/home/simon/anaconda3/envs/nanodet/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 118, in setup_environment
    self.setup_distributed()
  File "/home/simon/anaconda3/envs/nanodet/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 206, in setup_distributed
    self.init_ddp_connection()
  File "/home/simon/anaconda3/envs/nanodet/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 273, in init_ddp_connection
    torch_distrib.init_process_group(self.torch_distributed_backend, rank=global_rank, world_size=world_size)
  File "/home/simon/anaconda3/envs/nanodet/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 455, in init_process_group
    barrier()
  File "/home/simon/anaconda3/envs/nanodet/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1960, in barrier
    work = _default_pg.barrier()
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370172916/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, unhandled system error, NCCL version 2.7.8

waduhekx commented 2 years ago

I have spent days trying to solve this problem, but I cannot find a working solution online. Has anyone solved this error? Please help, thanks a lot.

RangiLyu commented 2 years ago

It seems that there are some problems with your pytorch DDP. If you are using only one GPU, try to delete this line: https://github.com/RangiLyu/nanodet/blob/4ecfb1cbf7378582713a62bc31b331f635dd82c0/tools/train.py#L87
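A minimal sketch of the idea behind this suggestion: request DDP only when more than one GPU is configured. It assumes the linked line is the `accelerator="ddp"` argument passed to `pl.Trainer` (the thread does not show the line itself, only that the traceback goes through the DDP plugin), and that the installed PyTorch Lightning version still accepts `gpus=` and `accelerator="ddp"`. The helper name `build_trainer_kwargs` is hypothetical and not part of nanodet.

```python
def build_trainer_kwargs(gpu_ids, base_kwargs=None):
    """Return keyword arguments for pl.Trainer (hypothetical helper).

    DDP is requested only for multi-GPU runs, so a single-GPU run never
    reaches init_process_group() and the NCCL barrier seen in the traceback.
    """
    kwargs = dict(base_kwargs or {})
    kwargs["gpus"] = list(gpu_ids)
    if len(gpu_ids) > 1:
        # Assumption: this is the argument the linked line sets unconditionally.
        kwargs["accelerator"] = "ddp"
    return kwargs


single_gpu = build_trainer_kwargs([0])
multi_gpu = build_trainer_kwargs([0, 1])
```

The resulting dictionary would be unpacked into the trainer call, for example `pl.Trainer(**build_trainer_kwargs(cfg.device.gpu_ids, other_kwargs))`, where `other_kwargs` stands for the remaining arguments already used in `tools/train.py`.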

waduhekx commented 2 years ago

Thanks a lot, that problem is solved. But another error came up, as follows:

python ./tools/train.py ./config/nanodet-m-416.yml
[root][07-23 14:36:11]INFO:Using Tensorboard, logs will be saved in workspace/nanodet_m_416/logs
[root][07-23 14:36:11]INFO:Setting up data...
loading annotations into memory...
Done (t=14.44s)
creating index...
index created!
loading annotations into memory...
Done (t=0.44s)
creating index...
index created!
[root][07-23 14:36:27]INFO:Creating model...
model size is 1.0x
init weights...
=> loading pretrained model https://download.pytorch.org/models/shufflenetv2_x1-5666bf0f80.pth
Finish initialize Lite GFL Head.
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
Traceback (most recent call last):
  File "./tools/train.py", line 99, in <module>
    main(args)
  File "./tools/train.py", line 94, in main
    trainer.fit(task, train_dataloader, val_dataloader)
  File "/home/simon/anaconda3/envs/nanodet/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 460, in fit
    self._run(model)
  File "/home/simon/anaconda3/envs/nanodet/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 717, in _run
    self.accelerator.setup(self, model)  # note: this sets up self.lightning_module
  File "/home/simon/anaconda3/envs/nanodet/lib/python3.8/site-packages/pytorch_lightning/accelerators/gpu.py", line 41, in setup
    return super().setup(trainer, model)
  File "/home/simon/anaconda3/envs/nanodet/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in setup
    self.setup_optimizers(trainer)
  File "/home/simon/anaconda3/envs/nanodet/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 374, in setup_optimizers
    optimizers, lr_schedulers, optimizer_frequencies = self.training_type_plugin.init_optimizers(
  File "/home/simon/anaconda3/envs/nanodet/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 190, in init_optimizers
    return trainer.init_optimizers(model)
  File "/home/simon/anaconda3/envs/nanodet/lib/python3.8/site-packages/pytorch_lightning/trainer/optimizers.py", line 34, in init_optimizers
    optim_conf = model.configure_optimizers()
  File "/mnt/f/ubuntu_temp/nanodet/nanodet/trainer/task.py", line 174, in configure_optimizers
    optimizer = build_optimizer(params=self.parameters(), **optimizer_cfg)
TypeError: __init__() got an unexpected keyword argument 'momentum'

Have you run into this problem?

waduhekx commented 2 years ago

I am using Adam instead of SGD, so the 'momentum' entry in the config file is not accepted by the optimizer. Commenting it out with # solved this problem!
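For anyone hitting the same TypeError, here is a rough sketch of why it happens and one way to spot the offending key. The traceback shows every entry of the optimizer config being forwarded as a keyword argument (`**optimizer_cfg`), so a key the chosen optimizer does not accept fails in its constructor. `FakeSGD` and `FakeAdam` below are stand-ins that only mimic some keyword names of `torch.optim.SGD` and `torch.optim.Adam` so the example runs without PyTorch; `split_supported_kwargs` and the config values are hypothetical and not part of nanodet.

```python
import inspect


class FakeSGD:
    """Stand-in mimicking a subset of torch.optim.SGD keyword names."""

    def __init__(self, params, lr=0.01, momentum=0.0, weight_decay=0.0):
        self.defaults = dict(lr=lr, momentum=momentum, weight_decay=weight_decay)


class FakeAdam:
    """Stand-in mimicking a subset of torch.optim.Adam keyword names (no momentum)."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), weight_decay=0.0):
        self.defaults = dict(lr=lr, betas=betas, weight_decay=weight_decay)


def split_supported_kwargs(optimizer_cls, cfg):
    """Split cfg into keys the constructor accepts and keys it would reject."""
    accepted = set(inspect.signature(optimizer_cls.__init__).parameters) - {"self"}
    supported = {k: v for k, v in cfg.items() if k in accepted}
    rejected = {k: v for k, v in cfg.items() if k not in accepted}
    return supported, rejected


# Hypothetical optimizer section left over from an SGD setup.
cfg = {"lr": 0.001, "momentum": 0.9, "weight_decay": 0.0001}

# Forwarding every key reproduces the error from the traceback.
try:
    FakeAdam(params=[], **cfg)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Checking the signature first shows which key has to go.
supported, rejected = split_supported_kwargs(FakeAdam, cfg)
optimizer = FakeAdam(params=[], **supported)
sgd_supported, sgd_rejected = split_supported_kwargs(FakeSGD, cfg)
```

Here `rejected` contains only `momentum`, which matches the fix above: remove or comment out that key in the config when switching from SGD to Adam.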