facebookresearch / Detic

Code release for "Detecting Twenty-thousand Classes using Image-level Supervision".
Apache License 2.0

Open-vocabulary COCO training keeps diverging #22

Open greeneggsandyaml opened 2 years ago

greeneggsandyaml commented 2 years ago

Hello authors, thank you for your work. I am trying to reproduce your runs on Open-vocabulary COCO. I have prepared the data and I am trying to run the Detic_CLIP_R50_1x_image model.

I have downloaded the pretrained BoxSup_OVCOCO_CLIP_R50_1x.pth model to the specified location.

I am using a single GPU, and my current command is:

python train_net.py --num-gpus 1 --config-file ./configs/Detic_OVCOCO_CLIP_R50_1x_max-size.yaml

However, after 680 iterations of training, I get:

[02/04 01:35:21 d2.utils.events]:  eta: 1 day, 12:22:15  iter: 680  total_loss: 0.9863  loss_cls: 0.04699  loss_box_reg: 0.05717  image_loss: 0.2118  loss_rpn_cls: 0.1054  loss_rpn_loc: 0.03833  time: 1.2271  data_time: 0.0522  lr: 0.013586  max_mem: 4325M
Traceback (most recent call last):
  ...
  File "/path/to/Detic/detectron2/detectron2/modeling/proposal_generator/proposal_utils.py", line 99, in find_top_rpn_proposals
    raise FloatingPointError(
FloatingPointError: Predicted boxes or scores contain Inf/NaN. Training has diverged.

Is this training instability and divergence expected?

xingyizhou commented 2 years ago

Hi, sorry for the unexpected behavior; our configs are written for 8 GPUs. The config specifies the per-GPU batch size when DATALOADER.USE_DIFF_BS_SIZE=True (as in all Detic configs) and ignores SOLVER.IMS_PER_BATCH. Assuming your GPU has sufficient memory, you will need to set DATALOADER.DATASET_BS=(16,64). I'll add an assert in the dataloader to ensure this stays consistent with SOLVER.IMS_PER_BATCH.
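
For reference, a command-line sketch of that change, assuming train_net.py accepts detectron2's usual trailing KEY VALUE config overrides and that a single GPU can actually hold the full batch:

python train_net.py --num-gpus 1 \
    --config-file ./configs/Detic_OVCOCO_CLIP_R50_1x_max-size.yaml \
    DATALOADER.DATASET_BS "(16,64)"

Note that one GPU now holds the entire batch, so memory use will be far above the 4325M in the log above; if it does not fit, the batch sizes and the learning rate need to be scaled down together (see the discussion below).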

greeneggsandyaml commented 2 years ago

Gotcha, thanks! I'll find some more GPUs and try it out again. Once I can train and reproduce, I'll close the issue :)

hanoonaR commented 2 years ago

Hi @xingyizhou,

Thank you for explaining how to scale to fewer than 8 GPUs. Should any other hyperparameters, such as the learning rate, be adjusted accordingly, for example when using 4 GPUs with DATALOADER.DATASET_BS=[8, 32]?

Could you please provide some guidance on how to run this configuration on SLURM nodes (e.g., 4 GPUs, 2 nodes)?

I see that with these lines in train_net.py, https://github.com/facebookresearch/Detic/blob/cfe14bcc231986cea9bfbe56b97761534e219d0c/train_net.py#L247 to https://github.com/facebookresearch/Detic/blob/cfe14bcc231986cea9bfbe56b97761534e219d0c/train_net.py#L260, the logs are generated as many times as the number of GPUs. It would be great if you could provide some help with SLURM training as well.

Thank you in advance.

xingyizhou commented 2 years ago

Hi, thank you for your interest. Yes, if you change the total batch size, the learning rate should be scaled according to the linear learning-rate rule.
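
As a concrete sketch of that rule, using (16, 64) as the reference total batch sizes (the single-GPU values quoted earlier) and assuming detectron2's usual trailing KEY VALUE overrides: DATASET_BS is per GPU, so 4 GPUs with DATALOADER.DATASET_BS=[4, 16] reproduces the (16, 64) totals and needs no learning-rate change, while [8, 32] per GPU gives (32, 128), twice the reference, so SOLVER.BASE_LR should be doubled as well:

# hypothetical override; 0.04 assumes the config's default BASE_LR is 0.02,
# so check your config and scale whatever value it actually sets
python train_net.py --num-gpus 4 \
    --config-file ./configs/Detic_OVCOCO_CLIP_R50_1x_max-size.yaml \
    DATALOADER.DATASET_BS "[8,32]" \
    SOLVER.BASE_LR 0.04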

I share my scripts for multi-node training below (2 nodes x 8 GPUs each). However, I believe there are easier ways to do this.

I used two files, train-2nodes.sh and multi-node_run.sh. Run sbatch train-2nodes.sh --config-file configs/XXXX to start.

train-2nodes.sh

#!/bin/bash
#SBATCH -p YOUR_PARTITION
#SBATCH --nodes=2
#SBATCH --gres=gpu:8
#SBATCH --gpus-per-node=8
#SBATCH --cpus-per-task=80
#SBATCH --mem=496G
#SBATCH --time 4320
#SBATCH -o "slurm-output/slurm-%j.out"

# Launch the per-node worker script on every node in the allocation
# (multi-node_run.sh must be executable: chmod +x multi-node_run.sh).
srun multi-node_run.sh "$@"

multi-node_run.sh

#!/bin/bash
# Use the first node in the allocation as the rendezvous host for torch.distributed.
MASTER_NODE=$(scontrol show hostname "$SLURM_NODELIST" | head -n1)
DIST_URL="tcp://$MASTER_NODE:12399"
# Point gloo at the network interface that carries this node's default route.
SOCKET_NAME=$(ip r | grep default | awk '{print $5}')
export GLOO_SOCKET_IFNAME=$SOCKET_NAME

# srun launches this script once per node; SLURM_NODEID is the node's index in
# the job and serves as the machine rank.
python -u train_net.py --num-gpus 8 --num-machines 2 --machine-rank "$SLURM_NODEID" --dist-url "$DIST_URL" "$@"
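
For example, to submit the open-vocabulary COCO config from this issue (partition, memory, CPUs, and the port in multi-node_run.sh are cluster-specific; also note that on most clusters sbatch will not create the slurm-output/ directory given to -o, so create it first):

mkdir -p slurm-output
sbatch train-2nodes.sh --config-file ./configs/Detic_OVCOCO_CLIP_R50_1x_max-size.yaml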

wusize commented 2 years ago

> The config specifies the per-GPU batch size when DATALOADER.USE_DIFF_BS_SIZE=True (as in all Detic configs) and ignores SOLVER.IMS_PER_BATCH. Assuming your GPU has sufficient memory, you will need to set DATALOADER.DATASET_BS=(16,64). [...]

Hi, since the batch sizes of both datasets are defined, what is the role of DATASET_RATIO: [1, 4]?

xingyizhou commented 2 years ago

@wusize I believe DATASET_RATIO will be ignored when USE_DIFF_BS_SIZE is on. Please check the code to confirm.
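
One quick way to confirm is to grep the repository for where the two options are actually read (nothing here is specific to Detic beyond the option names themselves):

grep -rn "DATASET_RATIO" --include="*.py" .
grep -rn "USE_DIFF_BS_SIZE" --include="*.py" .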

wusize commented 2 years ago

> I share my scripts for multi-node training below (2 nodes x 8 GPUs each). However, I believe there are easier ways to do this. [...]

Hi Xingyi,

I used your scripts but got the following error. Have you encountered this issue?

Traceback (most recent call last):
  File "/mnt/cache/wusize/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/mnt/cache/wusize/work_dirs/detectron2/detectron2/engine/launch.py", line 126, in _distributed_worker
    main_func(*args)
  File "/mnt/cache/wusize/projects/detic/train_net.py", line 250, in main
    do_train(cfg, model, resume=args.resume)
  File "/mnt/cache/wusize/projects/detic/train_net.py", line 137, in do_train
    data_loader = build_detection_train_loader(cfg, mapper=mapper)
  File "/mnt/cache/wusize/work_dirs/detectron2/detectron2/config/config.py", line 207, in wrapped
    explicit_args = _get_args_from_config(from_config, *args, **kwargs)
  File "/mnt/cache/wusize/work_dirs/detectron2/detectron2/config/config.py", line 245, in _get_args_from_config
    ret = from_config_func(*args, **kwargs)
  File "/mnt/cache/wusize/work_dirs/detectron2/detectron2/data/build.py", line 366, in _train_loader_from_config
    sampler = TrainingSampler(len(dataset))
  File "/mnt/cache/wusize/work_dirs/detectron2/detectron2/data/samplers/distributed_sampler.py", line 52, in __init__
    seed = comm.shared_random_seed()
  File "/mnt/cache/wusize/work_dirs/detectron2/detectron2/utils/comm.py", line 166, in shared_random_seed
    all_ints = all_gather(ints)
  File "/mnt/cache/wusize/work_dirs/detectron2/detectron2/utils/comm.py", line 114, in all_gather
    group = _get_global_gloo_group()  # use CPU group by default, to reduce GPU RAM usage.
  File "/mnt/cache/wusize/work_dirs/detectron2/detectron2/utils/comm.py", line 94, in _get_global_gloo_group
    return dist.new_group(backend="gloo")
  File "/mnt/cache/wusize/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2900, in new_group
    pg = _new_process_group_helper(
  File "/mnt/cache/wusize/anaconda3/envs/open-mmlab/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 685, in _new_process_group_helper
    pg = ProcessGroupGloo(prefix_store, rank, world_size, timeout=timeout)
RuntimeError: No device(s) specified

l412198735 commented 1 year ago

> I used your scripts but got the following error. Have you encountered this issue?

Hi, I ran into the same issue. Have you solved it?

wdrink commented 1 year ago

> > I used your scripts but got the following error. Have you encountered this issue?
>
> Hi, I ran into the same issue. Have you solved it?

+1, could you share the solutions?