daodaofr / AlignPS

Code for CVPR 2021 paper: Anchor-Free Person Search

Whether to support distributed training #4

Open · dqshuai opened this issue 3 years ago

dqshuai commented 3 years ago

Hello, thanks for your project. I want to know whether it supports distributed training, and what I should do to make it support distributed training.

daodaofr commented 3 years ago

Hi, we didn't try training with multiple GPUs, but MMDetection supports distributed training; please refer to https://github.com/daodaofr/AlignPS/blob/master/tools/dist_train.sh

dqshuai commented 3 years ago

Thanks for your reply. I am now trying distributed training with the command `./tools/dist_train.sh configs/fcos/prw_dcn_base_focal_labelnorm_sub_ldcn_fg15_wd7-4.py 8 --launcher pytorch --no-validate`. It trains normally, but I don't know whether it will affect the final performance. Normally distributed training should not hurt performance, is that correct?

daodaofr commented 3 years ago

Normally you can still get fair performance; you may need to adjust the batch size and learning rate to get the best results.
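
As a reference for other readers, here is a minimal sketch of the config fields typically tuned for multi-GPU runs. The field names follow standard MMDetection conventions, but the concrete values are illustrative assumptions, not the settings from the AlignPS configs:

```python
# Config fragment (illustrative values, not the paper's settings):
# the effective batch size is samples_per_gpu * num_gpus, and a common
# heuristic is to scale the learning rate roughly linearly with it.
data = dict(
    samples_per_gpu=4,    # per-GPU batch size (assumed)
    workers_per_gpu=4,
)
optimizer = dict(
    type='SGD',
    lr=0.001 * 8,         # single-GPU lr scaled by an assumed 8 GPUs
    momentum=0.9,
    weight_decay=0.0005,  # placeholder; keep the value from the original config
)
```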

dqshuai commented 3 years ago

Hi, I just finished training with multiple GPUs on the PRW dataset. Compared to the results in the paper, mAP is 2% lower but R1 is 1% higher. When I checked the config, I found that the bbox_head is `FCOSReidHeadFocalOimSub` without the triplet loss: https://github.com/daodaofr/AlignPS/blob/c20cf329b2934a8693e2064435d3e3f65c496095/configs/fcos/prw_dcn_base_focal_labelnorm_sub_ldcn_fg15_wd7-4.py#L11 I want to know whether the difference in results is related to this; I did not find an ablation experiment on this in your paper. Thanks!

daodaofr commented 3 years ago

Thanks for your results; I think they are normal. In my experience, the triplet loss has only a very slight influence on PRW, less than 1%. Different environments (mmcv, pytorch, cuda) can also bring a 1%-2% performance difference. PRW is smaller than CUHK-SYSU, so it is normal to see some fluctuation.

dqshuai commented 3 years ago

When I try to train the model on CUHK-SYSU with multiple GPUs, mAP is 89.15 and R1 is 89.79 without adjusting any parameters. After that, I tried the following: (1) adjusting lr from 0.001 to 0.01; (2) using `model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)` to check whether the problem was unsynced BN. But these measures did not work. Can you give me some suggestions? Thanks! My environment: mmcv-full==1.1.5, pytorch==1.5.1, cuda==10.2, but I don't think the environment can bring a 4% mAP difference. :)
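
For readers unfamiliar with the SyncBN conversion mentioned above, here is a minimal sketch (toy model, illustrative only); in MMDetection the same effect is usually obtained via `norm_cfg=dict(type='SyncBN', requires_grad=True)` in the config:

```python
import torch.nn as nn

# Toy example: convert every BatchNorm layer to SyncBatchNorm before wrapping
# the model with DistributedDataParallel (requires torch.distributed to be
# initialized by the launcher; local_rank below is a placeholder).
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.ReLU())
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
# model = nn.parallel.DistributedDataParallel(model.cuda(local_rank),
#                                             device_ids=[local_rank])
```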

daodaofr commented 3 years ago

I am sorry, but I haven't tried distributed training, so I cannot give practical suggestions on that. If you want to reproduce the results, please try to use a single GPU.

dqshuai commented 3 years ago

Thanks for your reply. I received a system email in which you suggested using all_gather to update the lookup_table with global features. The example you provided has some problems due to the inconsistent feature size on each rank. I made some modifications and then adjusted the learning rate; the current mAP can reach 92.91. Why did I not see that reply in the issue, and are there any other details I need to pay attention to in order to get a higher mAP?

daodaofr commented 3 years ago

I also noticed the feature-size inconsistency issue, which made the network stop training, so I deleted that reply. It would be nice if you could give an example of your modified code to help others with distributed training. Maybe more epochs are needed with multiple GPUs.

dqshuai commented 3 years ago

> I also noticed the feature-size inconsistency issue, which made the network stop training, so I deleted that reply. It would be nice if you could give an example of your modified code to help others with distributed training. Maybe more epochs are needed with multiple GPUs.

My current implementation is a bit ugly. :)

```python
# Imports assumed by this snippet (the original comment omitted them).
# Note: get_dist_info() is expected to return (rank, world_size, is_dist)
# and is assumed to be provided by the surrounding codebase; mmcv.runner's
# get_dist_info() returns only (rank, world_size).
import torch
import torch.distributed as dist
from torch.autograd import Function


@torch.no_grad()
def all_gather_tensor(x, gpu=None, save_memory=False):
    """Gather a tensor from every rank; in the distributed case the per-rank
    copies are returned as a list."""
    rank, world_size, is_dist = get_dist_info()
    if not is_dist:
        return x
    if not save_memory:
        # all-gather features in parallel:
        # costs more GPU memory but less time
        # x = x.cuda(gpu)
        x_gather = [torch.empty_like(x) for _ in range(world_size)]
        dist.all_gather(x_gather, x, async_op=False)
        # x_gather = torch.cat(x_gather, dim=0)
    else:
        # broadcast features in sequence:
        # costs more time but less GPU memory
        container = torch.empty_like(x).cuda(gpu)
        x_gather = []
        for k in range(world_size):
            container.data.copy_(x)
            print("gathering features from rank no.{}".format(k))
            dist.broadcast(container, k)
            x_gather.append(container.cpu())
        # x_gather = torch.cat(x_gather, dim=0)
        # return cpu tensors
    return x_gather


def undefined_l_gather(features, pid_labels):
    """Pad per-rank features/labels to a fixed size, all-gather them across
    ranks, then trim each rank's chunk back to its true length."""
    resized_num = 10000
    pos_num = min(features.size(0), resized_num)
    if features.size(0) > resized_num:
        print(f'{features.size(0)} out of {resized_num}')
    # Fixed-size buffers so that all_gather sees identically shaped tensors on every rank.
    resized_features = torch.empty((resized_num, features.size(1))).to(features.device)
    resized_features[:pos_num, :] = features[:pos_num, :]
    resized_pid_labels = torch.empty((resized_num,)).to(pid_labels.device)
    resized_pid_labels[:pos_num] = pid_labels[:pos_num]
    pos_num = torch.tensor([pos_num]).to(features.device)
    # Also gather the true lengths so the padding can be stripped afterwards.
    all_pos_num = all_gather_tensor(pos_num)
    all_features = all_gather_tensor(resized_features)
    all_pid_labels = all_gather_tensor(resized_pid_labels)
    gather_features = []
    gather_pid_labels = []
    for index, p_num in enumerate(all_pos_num):
        p_num = int(p_num)  # 1-element tensor -> python int for slicing
        gather_features.append(all_features[index][:p_num, :])
        gather_pid_labels.append(all_pid_labels[index][:p_num])
    gather_features = torch.cat(gather_features, dim=0)
    gather_pid_labels = torch.cat(gather_pid_labels, dim=0)
    return gather_features, gather_pid_labels


class LabeledMatching(Function):
    @staticmethod
    def forward(ctx, features, pid_labels, lookup_table, momentum=0.5):
        # The lookup_table can't be saved with ctx.save_for_backward(), as we would
        # modify the variable which has the same memory address in backward()
        # ctx.save_for_backward(features, pid_labels)
        # Save the globally gathered features/labels so that every rank updates
        # the lookup table with the same global batch in backward().
        gather_features, gather_pid_labels = undefined_l_gather(features, pid_labels)
        ctx.save_for_backward(gather_features, gather_pid_labels)
        ctx.lookup_table = lookup_table
        ctx.momentum = momentum
        scores = features.mm(lookup_table.t())
        # print(features, lookup_table, scores)
        pos_feats = lookup_table.clone().detach()
        pos_idx = pid_labels > 0
        pos_pids = pid_labels[pos_idx]
        pos_feats = pos_feats[pos_pids]
        # pos_feats.require_grad = False
        return scores, pos_feats, pos_pids

    @staticmethod
    def backward(ctx, grad_output, grad_feat, grad_pids):
        features, pid_labels = ctx.saved_tensors
        pid_labels = pid_labels.long()
        lookup_table = ctx.lookup_table
        momentum = ctx.momentum
        grad_feats = None
        if ctx.needs_input_grad[0]:
            grad_feats = grad_output.mm(lookup_table)
        # Update the lookup table, but not by standard backpropagation with gradients.
        for indx, label in enumerate(pid_labels):
            if label >= 0:
                lookup_table[label] = (
                    momentum * lookup_table[label] + (1 - momentum) * features[indx]
                )
                # lookup_table[label] /= lookup_table[label].norm()
        return grad_feats, None, None, None
```
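
For anyone adapting the code above, a tighter (but untested here) variant is sketched below: it gathers each rank's true row count first and pads only up to the maximum count across ranks, instead of allocating a fixed 10000-row buffer. It assumes the same `dist` and `get_dist_info` helpers as the snippet above; `gather_varsize` is just a name introduced for this sketch.

```python
@torch.no_grad()
def gather_varsize(features):
    """Sketch: all-gather 2-D feature tensors whose row counts differ per rank."""
    rank, world_size, is_dist = get_dist_info()
    if not is_dist:
        return features
    # Exchange the per-rank row counts first.
    n = torch.tensor([features.size(0)], device=features.device)
    sizes = [torch.zeros_like(n) for _ in range(world_size)]
    dist.all_gather(sizes, n)
    max_n = int(torch.stack(sizes).max())
    # Pad every rank to the same (maximal) size so all_gather shapes match.
    padded = features.new_zeros((max_n, features.size(1)))
    padded[:features.size(0)] = features
    gathered = [torch.empty_like(padded) for _ in range(world_size)]
    dist.all_gather(gathered, padded)
    # Trim each chunk back to its true length and concatenate.
    return torch.cat([g[:int(s)] for g, s in zip(gathered, sizes)], dim=0)
```
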
daodaofr commented 3 years ago

Great! Thanks :)

anDoer commented 3 years ago

I think `all_gather_tensor` should return a list if `is_dist` is False:

```python
@torch.no_grad()
def all_gather_tensor(x, gpu=None, save_memory=False):
    rank, world_size, is_dist = get_dist_info()
    if not is_dist:
        return [x]
    # remaining code here...
```
hh23333 commented 3 years ago

> When I try to train the model on CUHK-SYSU with multiple GPUs, mAP is 89.15 and R1 is 89.79 without adjusting any parameters. After that, I tried the following: (1) adjusting lr from 0.001 to 0.01; (2) using `model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)` to check whether the problem was unsynced BN. But these measures did not work. Can you give me some suggestions? Thanks! My environment: mmcv-full==1.1.5, pytorch==1.5.1, cuda==10.2, but I don't think the environment can bring a 4% mAP difference. :)

@dqshuai Thanks for sharing your modified dist_training code. I have several questions about the two points you mentioned above. How many GPUs did you use, and what was the batch size per GPU, with which you got 92.91 mAP? What is the empirical ratio between the lr for single-GPU training and multi-GPU training? Did using sync_batchnorm affect the final results? Thanks!

dqshuai commented 3 years ago

> @dqshuai Thanks for sharing your modified dist_training code. I have several questions about the two points you mentioned above. How many GPUs did you use, and what was the batch size per GPU, with which you got 92.91 mAP? What is the empirical ratio between the lr for single-GPU training and multi-GPU training? Did using sync_batchnorm affect the final results? Thanks!

(1) My number of GPUs is 8, and the batch size per GPU is 4. When I set lr=0.05, I get 92.91 mAP. At first I thought the empirical lr ratio is about single_gpu_lr(0.001)*num_of_gpus, but I did not get a better result when using an lr of 0.008 or 0.01. (2) Using sync_batchnorm reduces the result, and I don't know why. If you have any other findings, you can share them with me. I haven't fully reproduced the results of the paper with multiple GPUs. Thanks!

hh23333 commented 3 years ago

Got it, Thanks!

qixiong-wang commented 3 years ago

Hi, I tried the distributed implementation of @dqshuai, but the performance got worse. I noticed that there is a toolkit in mmdet/models/dense_heads/oim_utils.py which contains distributed tools. Was this implemented by you, @daodaofr? Can I use it to fix the feature-size inconsistency across ranks?

daodaofr commented 3 years ago

> Hi, I tried the distributed implementation of @dqshuai, but the performance got worse. I noticed that there is a toolkit in mmdet/models/dense_heads/oim_utils.py which contains distributed tools. Was this implemented by you, @daodaofr? Can I use it to fix the feature-size inconsistency across ranks?

That was just an attempt of mine; it didn't work out.