IDEA-Research / detrex

detrex is a research platform for DETR-based object detection, segmentation, pose estimation and other visual recognition tasks.
https://detrex.readthedocs.io/en/latest/
Apache License 2.0

An issue about slurm and hydra #325

Closed wangzhaoyang-508 closed 9 months ago

wangzhaoyang-508 commented 9 months ago

When I train with hydra and slurm, the config file passed on the command line appears to be read as a .yaml file instead of a .py file. However, when I use train_net.py, training works fine.

The command I ran:

python tools/hydra_train_net.py \
    num_machines=2 num_gpus=4 auto_output_dir=true \
    config_file=projects/dino/configs/dino-resnet/dino_r50_4scale_12ep_custom1.py \
    +model.num_queries=50 \
    +slurm=Nvidia_A800

.sh file

#!/bin/bash

# Parameters
#SBATCH --cpus-per-task=8
#SBATCH --error=/public/home/wangchaoyang/wzy_tests/Projects/detrex-main/outputs/+model.num_queries.50-num_gpus.4-num_machines.2/20231210-23:02:10/%j_0_log.err
#SBATCH --gres=gpu:2
#SBATCH --job-name=detrex
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
#SBATCH --open-mode=append
#SBATCH --output=/public/home/wangchaoyang/wzy_tests/Projects/detrex-main/outputs/+model.num_queries.50-num_gpus.4-num_machines.2/20231210-23:02:10/%j_0_log.out
#SBATCH --partition=Nvidia_A800
#SBATCH --signal=USR2@120
#SBATCH --time=14400
#SBATCH --wckey=submitit

# command
export SUBMITIT_EXECUTOR=slurm
srun --unbuffered --output /public/home/wangchaoyang/wzy_tests/Projects/detrex-main/outputs/+model.num_queries.50-num_gpus.4-num_machines.2/20231210-23:02:10/%j_%t_log.out --error /public/home/wangchaoyang/wzy_tests/Projects/detrex-main/outputs/+model.num_queries.50-num_gpus.4-num_machines.2/20231210-23:02:10/%j_%t_log.err /public/home/wangchaoyang/anaconda3/envs/detrex-wzy/bin/python -u -m submitit.core._submit /public/home/wangchaoyang/wzy_tests/Projects/detrex-main/outputs/+model.num_queries.50-num_gpus.4-num_machines.2/20231210-23:02:10
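
As a sanity check on the generated script, the task counts should agree: `--nodes=2` times `--ntasks-per-node=2` gives `--ntasks=4`, which matches `num_gpus=4` spread over `num_machines=2`. A minimal Python sketch of that check follows. `parse_sbatch` and `tasks_consistent` are hypothetical helper names for illustration, not part of submitit or detrex.

```python
import re


def parse_sbatch(script: str) -> dict:
    """Collect `#SBATCH --key=value` directives from a batch script into a dict."""
    pattern = re.compile(r"^#SBATCH\s+--([\w-]+)=(.*)$")
    options = {}
    for line in script.splitlines():
        match = pattern.match(line.strip())
        if match:
            options[match.group(1)] = match.group(2).strip()
    return options


def tasks_consistent(options: dict) -> bool:
    """True when ntasks equals nodes * ntasks-per-node."""
    expected = int(options["nodes"]) * int(options["ntasks-per-node"])
    return int(options["ntasks"]) == expected


# A few directives copied from the .sh file above.
script = """\
#SBATCH --gres=gpu:2
#SBATCH --nodes=2
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2
"""
options = parse_sbatch(script)
print(options["gres"], tasks_consistent(options))
```

The scheduler settings look consistent here, so the failure is more likely in how the config file is loaded than in the job layout.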

logs

submitit INFO (2023-12-10 23:11:26,701) - Starting with JobEnvironment(job_id=52674, hostname=gpu23, local_rank=0(2), node=0(2), global_rank=0(4))
submitit INFO (2023-12-10 23:11:26,701) - Loading pickle: /public/home/wangchaoyang/wzy_tests/Projects/detrex-main/outputs/+model.num_queries.50-num_gpus.4-num_machines.2/20231210-23:11:24/52674_submitted.pkl
------------ <class 'dict'> {}
------------ <class 'dict'> {}
------------ <class 'dict'> {}
------------ <class 'dict'> {}
------------ <class 'dict'> {'ENABLED': False}
------------ <class 'dict'> {}
------------ <class 'dict'> {}
------------ <class 'dict'> {}
------------ <class 'dict'> {}
------------ <class 'dict'> {}
------------ <class 'dict'> {}
------------ <class 'dict'> {}
------------ <class 'dict'> {}
------------ <class 'dict'> {}
------------ <class 'dict'> {}
------------ <class 'dict'> {}
------------ <class 'dict'> {}
------------ <class 'dict'> {}
------------ <class 'dict'> {}
------------ <class 'dict'> {'ENABLED': True}
------------ <class 'dict'> {}
------------ <class 'dict'> {}
------------ <class 'dict'> {}
------------ <class 'dict'> {'ENABLED': False}
------------ <class 'dict'> {'ENABLED': False}
------------ <class 'dict'> {}
------------ <class 'dict'> {'ENABLED': False}
------------ <class 'dict'> {'ENABLED': False}
------------ <class 'dict'> {}
Process group: 4 tasks, rank: 0
| distributed init (rank 0): tcp://gpu23:47031, gpu 0
[23:11:32.008615] ------------ <class 'str'> from detrex.config import get_config from ..models.dino_r50_custom import model
dataloader = get_config('common/data/custom_su.py').dataloader optimizer = get_config("common/optim.py").AdamW lr_multiplier = get_config("common/coco_schedule.py").lr_multiplier_12ep train = get_config("common/train.py").train
train.init_checkpoint = "detectron2://ImageNetPretrained/torchvision/R-50.pkl" train.output_dir = "./output/dino_r50_4scale_12ep"
train.max_iter = 90000 train.eval_period = 5000 train.log_period = 20 train.checkpointer.period = 5000
train.clip_grad.enabled = True train.clip_grad.params.max_norm = 0.1 train.clip_grad.params.norm_type = 2
train.device = "cuda" model.device = train.device
optimizer.lr = 1e-4 optimizer.betas = (0.9, 0.999) optimizer.weight_decay = 1e-4 optimizer.params.lr_factor_func = lambda module_name:0.1 if "backbone" in module_name else 1
dataloader.train.num_workers = 16
dataloader.train.total_batch_size = 16
dataloader.evaluator.output_dir = train.output_dir
submitit ERROR (2023-12-10 23:11:32,015) - Submitted job triggered an exception

error log

submitit ERROR (2023-12-10 23:11:32,015) - Submitted job triggered an exception
Traceback (most recent call last):
  File "/public/home/wangchaoyang/anaconda3/envs/detrex-wzy/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/public/home/wangchaoyang/anaconda3/envs/detrex-wzy/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/public/home/wangchaoyang/anaconda3/envs/detrex-wzy/lib/python3.9/site-packages/submitit/core/_submit.py", line 11, in <module>
    submitit_main()
  File "/public/home/wangchaoyang/anaconda3/envs/detrex-wzy/lib/python3.9/site-packages/submitit/core/submission.py", line 76, in submitit_main
    process_job(args.folder)
  File "/public/home/wangchaoyang/anaconda3/envs/detrex-wzy/lib/python3.9/site-packages/submitit/core/submission.py", line 69, in process_job
    raise error
  File "/public/home/wangchaoyang/anaconda3/envs/detrex-wzy/lib/python3.9/site-packages/submitit/core/submission.py", line 55, in process_job
    result = delayed.result()
  File "/public/home/wangchaoyang/anaconda3/envs/detrex-wzy/lib/python3.9/site-packages/submitit/core/utils.py", line 133, in result
    self._result = self.function(*self.args, **self.kwargs)
  File "/public/home/wangchaoyang/wzy_tests/Projects/detrex-main/tools/hydra_train_net.py", line 108, in __call__
    main(self.args)
  File "/public/home/wangchaoyang/wzy_tests/Projects/detrex-main/detectron2/tools/train_net.py", line 125, in main
    cfg = setup(args)
  File "/public/home/wangchaoyang/wzy_tests/Projects/detrex-main/detectron2/tools/train_net.py", line 117, in setup
    cfg.merge_from_file(args.config_file)
  File "/public/home/wangchaoyang/wzy_tests/Projects/detrex-main/detectron2/detectron2/config/config.py", line 47, in merge_from_file
    loaded_cfg = type(self)(loaded_cfg)
  File "/public/home/wangchaoyang/anaconda3/envs/detrex-wzy/lib/python3.9/site-packages/yacs/config.py", line 86, in __init__
    init_dict = self._create_config_tree_from_dict(init_dict, key_list)
  File "/public/home/wangchaoyang/anaconda3/envs/detrex-wzy/lib/python3.9/site-packages/yacs/config.py", line 124, in _create_config_tree_from_dict
    for k, v in dic.items():
AttributeError: 'str' object has no attribute 'items'
srun: error: gpu24: task 3: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=52674.0
slurmstepd: error: *** STEP 52674.0 ON gpu23 CANCELLED AT 2023-12-10T23:11:32 ***
srun: error: gpu24: task 2: Exited with exit code 1
srun: error: gpu23: task 0: Exited with exit code 1
srun: error: gpu23: task 1: Exited with exit code 1
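
The last frames of the traceback suggest what went wrong: `main` was taken from `detectron2/tools/train_net.py`, which calls the yacs-style `cfg.merge_from_file`, and that path expects a mapping. The `<class 'str'>` line in the log shows the `.py` config arriving as one string instead, which is consistent with the file being read as data rather than executed as Python, and a string has no `.items()`. The sketch below reproduces that failure mode in plain Python. `build_tree` is a hypothetical stand-in, not the yacs implementation.

```python
def build_tree(loaded):
    """Hypothetical stand-in for a YAML-style config builder: it walks a mapping."""
    return {key: value for key, value in loaded.items()}


# A mapping, as a parsed .yaml config would provide.
print(build_tree({"ENABLED": False}))

# A string, as seen in the log when the .py config text was loaded as data.
py_config_text = 'train.device = "cuda"'
try:
    build_tree(py_config_text)
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```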

rentainhe commented 9 months ago

Sorry for the late reply. We did not implement the hydra and slurm scripts ourselves, so you may want to refer to this PR for more details: https://github.com/IDEA-Research/detrex/pull/215

wangzhaoyang-508 commented 9 months ago

This is the same bug as #216. I tried

pip uninstall detrex && pip uninstall detectron2 && pip install -e . && pip install -e detectron2 (tested)

and the problem was solved.
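
One way to confirm that a reinstall like this took effect is to check which file Python resolves each package from. After `pip install -e`, the paths should point into the working tree. A small sketch using only the standard library follows. `module_origin` is a hypothetical helper, and the module names passed to it are assumptions taken from the traceback in this issue.

```python
import importlib.util


def module_origin(name: str):
    """Return the file a module would be imported from, or None if it cannot be found.

    Note: for dotted names, find_spec imports the parent package first.
    """
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # Raised when a parent package is missing or a module has no usable spec.
        return None
    return spec.origin if spec is not None else None


for name in ("detrex", "detectron2", "tools.train_net"):
    print(name, "->", module_origin(name))
```

If `tools.train_net` resolves to a file under `detectron2/tools/`, the hydra launcher is picking up detectron2's training script instead of the one in detrex's own `tools/` directory, which would match the traceback.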

rentainhe commented 9 months ago

It seems the problem has been solved, so I'm closing this issue. Feel free to reopen it if needed.