IDEA-Research / DINO

[ICLR 2023] Official implementation of the paper "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection"
Apache License 2.0

Training using multi GPU #120

Open sekhaish opened 1 year ago

sekhaish commented 1 year ago

I am trying to run your script on a 4-GPU machine and am getting the error below. I made the following change in DINO_train_dist.sh:

python3 -m torch.distributed.launch --nproc_per_node=4 main.py \
    --output_dir logs/DINO/R50-MS4 -c config/DINO/DINO_4scale.py --coco_path $coco_path \
    --options dn_scalar=100 embed_init_tgt=TRUE \
    dn_label_coef=1.0 dn_bbox_coef=1.0 use_ema=False \
    dn_box_noise_scale=1.0

world_size:4 rank:2 local_rank:2
| distributed init (rank 3): env://
| distributed init (rank 1): env://
| distributed init (rank 0): env://
| distributed init (rank 2): env://
Traceback (most recent call last):
  File "main.py", line 398, in <module>
    main(args)
  File "main.py", line 96, in main
    utils.init_distributed_mode(args)
  File "/data-mount/DINO/util/misc.py", line 514, in init_distributed_mode
    world_size=args.world_size, rank=args.rank)
  File "/home/user/miniconda3/envs/detrex/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 422, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/home/user/miniconda3/envs/detrex/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 172, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
Traceback (most recent call last):
  File "/home/user/miniconda3/envs/detrex/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/user/miniconda3/envs/detrex/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/user/miniconda3/envs/detrex/lib/python3.7/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/home/user/miniconda3/envs/detrex/lib/python3.7/site-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/user/miniconda3/envs/detrex/bin/python3', '-u', 'main.py', '--local_rank=3', '--output_dir', 'logs/DINO/R50-MS4', '-c', 'config/DINO/DINO_4scale.py', '--coco_path', 'COCODIR/', '--options', 'dn_scalar=100', 'embed_init_tgt=TRUE', 'dn_label_coef=1.0', 'dn_bbox_coef=1.0', 'use_ema=False', 'dn_box_noise_scale=1.0']' returned non-zero exit status 1.

Could you help me solve this? Thank you.

HaoZhang534 commented 1 year ago

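@sekhaish This is an "Address already in use" error: the port used for distributed initialization is already taken by another process on the machine. You need to assign a different port, for example by passing --master_port with an unused port number to torch.distributed.launch.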
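
One way to pick another port is sketched below. This is a minimal illustration and not part of the DINO repository: the helper names find_free_port and build_launch_command are hypothetical. It asks the operating system for an unused TCP port and builds the launch command from the question with --master_port added (torch.distributed.launch falls back to port 29500 when the flag is omitted). The chosen port could in principle be taken by another process between the check and the launch, so treat the result as a best effort.

```python
import socket


def find_free_port(host="127.0.0.1"):
    """Ask the OS for a currently unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 lets the OS choose a free port
        return sock.getsockname()[1]


def build_launch_command(nproc_per_node, master_port, script_args):
    """Build the torch.distributed.launch command with an explicit port.

    Hypothetical helper for illustration; it only assembles the argument
    list and does not start any process.
    """
    return [
        "python3", "-m", "torch.distributed.launch",
        "--nproc_per_node={}".format(nproc_per_node),
        "--master_port={}".format(master_port),
        "main.py",
    ] + list(script_args)


if __name__ == "__main__":
    port = find_free_port()
    # Script arguments copied from the command in the question.
    cmd = build_launch_command(4, port, [
        "--output_dir", "logs/DINO/R50-MS4",
        "-c", "config/DINO/DINO_4scale.py",
        "--coco_path", "COCODIR/",
        "--options", "dn_scalar=100", "embed_init_tgt=TRUE",
        "dn_label_coef=1.0", "dn_bbox_coef=1.0", "use_ema=False",
        "dn_box_noise_scale=1.0",
    ])
    print(" ".join(cmd))
```

The printed command can be pasted into DINO_train_dist.sh or run directly. Choosing any fixed port that is known to be unused (for example --master_port=29501) works just as well.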