NVIDIA / ContrastiveLosses4VRD

Implementation for the CVPR2019 paper "Graphical Contrastive Losses for Scene Graph Generation"

Question regarding multi-gpu training in ContrastiveLosses4VRD/tools/train_net_step_rel.py #13

Open sandeep-ipk opened 4 years ago

sandeep-ipk commented 4 years ago

```python
### Adaptively adjust some configs ###
cfg.NUM_GPUS = torch.cuda.device_count()
original_batch_size = cfg.NUM_GPUS * cfg.TRAIN.IMS_PER_BATCH
original_ims_per_batch = cfg.TRAIN.IMS_PER_BATCH
original_num_gpus = cfg.NUM_GPUS
if args.batch_size is None:
    args.batch_size = original_batch_size
assert (args.batch_size % cfg.NUM_GPUS) == 0, \
    'batch_size: %d, NUM_GPUS: %d' % (args.batch_size, cfg.NUM_GPUS)
cfg.TRAIN.IMS_PER_BATCH = args.batch_size // cfg.NUM_GPUS
effective_batch_size = args.iter_size * args.batch_size
print('effective_batch_size = batch_size * iter_size = %d * %d' % (args.batch_size, args.iter_size))

print('Adaptive config changes:')
print('    effective_batch_size: %d --> %d' % (original_batch_size, effective_batch_size))
print('    NUM_GPUS:             %d --> %d' % (original_num_gpus, cfg.NUM_GPUS))
print('    IMS_PER_BATCH:        %d --> %d' % (original_ims_per_batch, cfg.TRAIN.IMS_PER_BATCH))

### Adjust learning based on batch size change linearly
# For iter_size > 1, gradients are `accumulated`, so lr is scaled based
# on batch_size instead of effective_batch_size
old_base_lr = cfg.SOLVER.BASE_LR
cfg.SOLVER.BASE_LR *= args.batch_size / original_batch_size
print('Adjust BASE_LR linearly according to batch_size change:\n'
      '    BASE_LR: {} --> {}'.format(old_base_lr, cfg.SOLVER.BASE_LR))

### Adjust solver steps
step_scale = original_batch_size / effective_batch_size
old_solver_steps = cfg.SOLVER.STEPS
old_max_iter = cfg.SOLVER.MAX_ITER
cfg.SOLVER.STEPS = list(map(lambda x: int(x * step_scale + 0.5), cfg.SOLVER.STEPS))
cfg.SOLVER.MAX_ITER = int(cfg.SOLVER.MAX_ITER * step_scale + 0.5)
print('Adjust SOLVER.STEPS and SOLVER.MAX_ITER linearly based on effective_batch_size change:\n'
      '    SOLVER.STEPS: {} --> {}\n'
      '    SOLVER.MAX_ITER: {} --> {}'.format(old_solver_steps, cfg.SOLVER.STEPS,
                                              old_max_iter, cfg.SOLVER.MAX_ITER))
```
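For reference, the `step_scale` arithmetic above can be exercised in isolation. This is a minimal sketch: the helper function `adjust_solver` is hypothetical (not part of the repo), the `num_gpus` / `ims_per_batch` values are illustrative, and only MAX_ITER = 62723 comes from this thread.

```python
def adjust_solver(num_gpus, ims_per_batch, max_iter, batch_size=None, iter_size=1):
    """Reproduce the scaling arithmetic from train_net_step_rel.py (sketch)."""
    original_batch_size = num_gpus * ims_per_batch
    if batch_size is None:                    # default path: --batch_size not given
        batch_size = original_batch_size
    effective_batch_size = iter_size * batch_size
    step_scale = original_batch_size / effective_batch_size
    return step_scale, int(max_iter * step_scale + 0.5)

# Default path: batch_size falls back to original_batch_size, so
# step_scale is 1 no matter how many GPUs are detected.
print(adjust_solver(num_gpus=4, ims_per_batch=2, max_iter=62723))  # (1.0, 62723)
print(adjust_solver(num_gpus=8, ims_per_batch=2, max_iter=62723))  # (1.0, 62723)

# Explicitly requesting a smaller batch is what changes MAX_ITER.
print(adjust_solver(num_gpus=4, ims_per_batch=2, max_iter=62723,
                    batch_size=4))           # (2.0, 125446)
```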

Here you assign cfg.NUM_GPUS and original_num_gpus the same value, so why do you expect them to change in the "Adaptive config changes" printout? Also, there are 62723 images (MAX_ITER = 62723); if you use 4 GPUs, that should come down to 62723 / 4 ≈ 15680. But MAX_ITER doesn't change during training: it remains 62723, because step_scale stays 1 instead of becoming 1/4, no matter how many GPUs we use.

So, is there a problem with this logic, or am I missing something? Can you please explain? Also, did you train for only one epoch (i.e. 62723 images), or for many epochs of 62723 images each?

Thank you.

luckyyy00 commented 2 years ago

I think that when you use 8 GPUs, MAX_ITER is 62723, but when you use 4 GPUs, MAX_ITER is 62723 × 2; the code adjusts it adaptively.
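One way to check that claim against the code's arithmetic (this assumes the config sets cfg.TRAIN.IMS_PER_BATCH = 2 and that the 4-GPU run passes --batch_size 4 explicitly; both are assumptions, since the thread does not show the config used):

```python
max_iter = 62723

# step_scale = original_batch_size / effective_batch_size, so the
# doubling comes from halving the effective batch, not from the GPU
# count by itself.

# 8 GPUs, IMS_PER_BATCH = 2, default --batch_size (= 8 * 2 = 16):
print(int(max_iter * (8 * 2) / 16 + 0.5))  # 62723

# 4 GPUs, same config, --batch_size 4 passed explicitly:
print(int(max_iter * (4 * 2) / 4 + 0.5))   # 125446 = 62723 * 2
```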