Hi,
Sorry for the unexpected behavior; our config is only for 8 GPUs. The config specifies the per-GPU batch size when DATALOADER.USE_DIFF_BS_SIZE=True (all Detic configs) and ignores SOLVER.IMS_PER_BATCH. Assuming your GPU has sufficient memory, you will need to set DATALOADER.DATASET_BS=(16,64). I'll add an assert in the dataloader to ensure this is consistent with SOLVER.IMS_PER_BATCH.
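For reference, here is a sketch of what that looks like on the command line, using detectron2's trailing key-value overrides; the config path is a placeholder, and the IMS_PER_BATCH value simply assumes "consistent" means DATASET_BS[0] x num-gpus:
# Single-GPU sketch; substitute your actual Detic config for configs/XXXX.yaml.
python train_net.py --num-gpus 1 --config-file configs/XXXX.yaml \
    DATALOADER.DATASET_BS "(16,64)" \
    SOLVER.IMS_PER_BATCH 16  # ignored when USE_DIFF_BS_SIZE=True, but kept consistent here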
Gotcha, thanks! I'll find some more GPUs and try it out again. Once I can train and reproduce, I'll close the issue :)
Hi @xingyizhou,
Thank you for the explanation of how to scale to fewer than 8 GPUs. Should any other hyperparameters, such as the learning rate, be adjusted accordingly, for example when using 4 GPUs with DATALOADER.DATASET_BS=[8,32]?
Could you also provide some help on how to run the configuration on Slurm nodes (e.g., 4 GPUs, 2 nodes)?
I see that with these lines in train_net.py, https://github.com/facebookresearch/Detic/blob/cfe14bcc231986cea9bfbe56b97761534e219d0c/train_net.py#L247 to https://github.com/facebookresearch/Detic/blob/cfe14bcc231986cea9bfbe56b97761534e219d0c/train_net.py#L260, the logs are generated as many times as the number of GPUs. It would be great if you could provide some guidance on Slurm training as well.
Thank you in advance.
Hi, thank you for your interest. Yes, if you change the total batch size, the learning rate should be scaled according to the linear learning rate rule.
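To make the rule concrete, here is a sketch for the 4-GPU case above; the config path and the base learning rate are placeholders, so take the real SOLVER.BASE_LR from your config:
# Going from 8 GPUs to 4 GPUs at the same per-GPU batch halves the total batch size,
# so the linear scaling rule halves the learning rate as well.
BASE_LR_8GPU=0.0002  # placeholder: use the SOLVER.BASE_LR of the 8-GPU config
NEW_LR=$(python -c "print(${BASE_LR_8GPU} * 4 / 8)")
python train_net.py --num-gpus 4 --config-file configs/XXXX.yaml \
    DATALOADER.DATASET_BS "[8,32]" \
    SOLVER.BASE_LR "${NEW_LR}"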
I share my scripts for multi-node training below (2 nodes x 8 GPUs each); however, I believe there are easier ways to do this.
I used two files, train-2nodes.sh and multi-node_run.sh. Run sbatch train-2nodes.sh --config-file configs/XXXX to start.
train-2nodes.sh
#!/bin/bash
#SBATCH -p YOUR_PARTITION
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=80
#SBATCH --mem=496G
#SBATCH --time 4320
#SBATCH -o "slurm-output/slurm-%j.out"
srun multi-node_run.sh "$@"
multi-node_run.sh
#!/bin/bash
# Use the first node in the Slurm allocation as the rendezvous host.
MASTER_NODE=$(scontrol show hostname "$SLURM_NODELIST" | head -n1)
DIST_URL="tcp://$MASTER_NODE:12399"
# Point gloo at the network interface that carries the default route.
SOCKET_NAME=$(ip r | grep default | awk '{print $5}')
export GLOO_SOCKET_IFNAME=$SOCKET_NAME
python -u train_net.py --num-gpus 8 --num-machines 2 --machine-rank "$SLURM_NODEID" --dist-url "$DIST_URL" "$@"
Hi, since the batch sizes of both datasets are defined, what is the role of DATASET_RATIO: [1:4]?
@wusize I believe DATASET_RATIO will be ignored when USE_DIFF_BS_SIZE is on. Please check the code to confirm.
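One quick way to check is to grep for both keys; the detic/ package path is a guess, so search from the repo root if the layout differs:
grep -rnE "USE_DIFF_BS_SIZE|DATASET_RATIO" detic/ train_net.py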
Hi, Xingyi!
I used your scripts but got the following error. Have you encountered this issue?
Traceback (most recent call last):
File "/mnt/cache/wusize/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/mnt/cache/wusize/work_dirs/detectron2/detectron2/engine/launch.py", line 126, in _distributed_worker
main_func(*args)
File "/mnt/cache/wusize/projects/detic/train_net.py", line 250, in main
do_train(cfg, model, resume=args.resume)
File "/mnt/cache/wusize/projects/detic/train_net.py", line 137, in do_train
data_loader = build_detection_train_loader(cfg, mapper=mapper)
File "/mnt/cache/wusize/work_dirs/detectron2/detectron2/config/config.py", line 207, in wrapped
explicit_args = _get_args_from_config(from_config, *args, **kwargs)
File "/mnt/cache/wusize/work_dirs/detectron2/detectron2/config/config.py", line 245, in _get_args_from_config
ret = from_config_func(*args, **kwargs)
File "/mnt/cache/wusize/work_dirs/detectron2/detectron2/data/build.py", line 366, in _train_loader_from_config
sampler = TrainingSampler(len(dataset))
File "/mnt/cache/wusize/work_dirs/detectron2/detectron2/data/samplers/distributed_sampler.py", line 52, in __init__
seed = comm.shared_random_seed()
File "/mnt/cache/wusize/work_dirs/detectron2/detectron2/utils/comm.py", line 166, in shared_random_seed
all_ints = all_gather(ints)
File "/mnt/cache/wusize/work_dirs/detectron2/detectron2/utils/comm.py", line 114, in all_gather
group = _get_global_gloo_group() # use CPU group by default, to reduce GPU RAM usage.
File "/mnt/cache/wusize/work_dirs/detectron2/detectron2/utils/comm.py", line 94, in _get_global_gloo_group
return dist.new_group(backend="gloo")
File "/mnt/cache/wusize/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2900, in new_group
pg = _new_process_group_helper(
File "/mnt/cache/wusize/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 685, in _new_process_group_helper
pg = ProcessGroupGloo(prefix_store, rank, world_size, timeout=timeout)
RuntimeError: No device(s) specified
Hi, I met the same issue. Have you solved it?
+1, could you share the solutions?
Hello authors, thank you for your work. I am trying to reproduce your runs on Open-vocabulary COCO. I have prepared the data and I am trying to run the Detic_CLIP_R50_1x_image model.
I have downloaded the pretrained BoxSup_OVCOCO_CLIP_R50_1x.pth model to the specified location. I am using a single GPU, and my current command is:
However, after 680 iterations of training, I get:
Is this training instability and divergence expected?