Open: dqshuai opened this issue 3 years ago
Hi, we didn't try to train with multiple GPUs. But MMDetection supports distributed training, please refer to https://github.com/daodaofr/AlignPS/blob/master/tools/dist_train.sh
Thanks for your reply. I am now trying distributed training with the command "./tools/dist_train.sh configs/fcos/prw_dcn_base_focal_labelnorm_sub_ldcn_fg15_wd7-4.py 8 --launcher pytorch --no-validate". It trains normally, but I don't know whether it will affect the final performance. Normally, distributed training should not hurt performance, is that correct?
Normally, you can still get fair performance; you may need to adjust the batch size and learning rate to get the best results.
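For reference, a minimal sketch of how those two knobs are usually adjusted in an MMDetection-style config (the concrete values and the linear scaling assumption here are illustrative, not taken from this repo):

# Illustrative config fragment (values are assumptions, not the repo defaults).
# Linear scaling rule: keep lr proportional to the total batch size
# (samples_per_gpu * number of GPUs).
data = dict(
    samples_per_gpu=4,   # per-GPU batch size
    workers_per_gpu=4,
)
# e.g. a single-GPU baseline of lr 0.001 with batch size 4: with 8 GPUs the total
# batch size becomes 32, so the linearly scaled lr would be 0.001 * 8 = 0.008
optimizer = dict(type='SGD', lr=0.008, momentum=0.9, weight_decay=0.0005)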
Hi, I just finished training with multiple GPUs on the PRW dataset. Compared with the results in the paper, mAP is 2% lower but rank-1 is 1% higher. When I checked the config, I found that the bbox_head is 'FCOSReidHeadFocalOimSub', i.e. without the triplet loss: https://github.com/daodaofr/AlignPS/blob/c20cf329b2934a8693e2064435d3e3f65c496095/configs/fcos/prw_dcn_base_focal_labelnorm_sub_ldcn_fg15_wd7-4.py#L11 I want to know whether the difference in results is related to this; I did not find an ablation study on this in your paper. Thanks!
Thanks for your results, I think the results are normal. According to my experience, the triplet loss only has a very slight influence on PRW, less than 1%. Different environments (mmcv, pytorch, cuda) can also bring 1%-2% performance difference. PRW is smaller compared to CUHK-SYSU, so it is normal to see some fluctuations.
When I try to train the model on CUHK-SYSU with multiple GPUs, I get mAP 89.15 and rank-1 89.79 without adjusting any parameters. After that, I tried the following: (1) adjusting lr from 0.001 to 0.01; (2) using 'model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)' to check whether BN was not being synchronized. But these measures did not work. Can you give me some suggestions? Thanks! My environment: mmcv-full==1.1.5, pytorch==1.5.1, cuda==10.2, but I don't think the environment can cause a 4% mAP difference. :)
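As a side note, besides convert_sync_batchnorm there is also a config-level way to get synchronized BN in MMDetection; a minimal sketch, assuming the usual norm_cfg hook in the backbone of this repo's configs:

# Ask MMDetection to build SyncBN layers when the detector is constructed
# (the exact keys depend on the inherited base config; treat this as a sketch).
norm_cfg = dict(type='SyncBN', requires_grad=True)
model = dict(
    backbone=dict(norm_cfg=norm_cfg),
)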
I am sorry, but I haven't tried distributed training. So I cannot give practical suggestions on that. If you want to reproduce the results, please try to use a single GPU.
Thanks for your reply. I received a system email in which you suggested using all_gather to update the lookup_table with global features. The example you provided has some problems due to the inconsistent feature size on each rank. I made some modifications and then adjusted the learning rate; the current mAP can reach 92.91. Why did I not see this reply in the issue, and are there any other details I need to pay attention to in order to get a higher mAP?
I also noticed the feature-size inconsistency issue (the network stops training), so I deleted the reply. It would be nice if you could give an example of your modified code, to help others with distributed training. Maybe more epochs are needed with multiple GPUs.
My current implementation is a bit ugly. :)
# Assumed imports and helper for this snippet (they were not shown in the original post):
import torch
import torch.distributed as dist
from torch.autograd import Function


def get_dist_info():
    # Assumed helper returning (rank, world_size, is_dist). Note that
    # mmcv.runner.get_dist_info() only returns (rank, world_size), so a small
    # wrapper like this is needed for the 3-value unpacking below.
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank(), dist.get_world_size(), True
    return 0, 1, False


@torch.no_grad()
def all_gather_tensor(x, gpu=None, save_memory=False):
    # Gather a tensor from every rank and return the per-rank tensors as a list.
    rank, world_size, is_dist = get_dist_info()
    if not is_dist:
        return x
    if not save_memory:
        # all gather features in parallel
        # cost more GPU memory but less time
        # x = x.cuda(gpu)
        x_gather = [torch.empty_like(x) for _ in range(world_size)]
        dist.all_gather(x_gather, x, async_op=False)
        # x_gather = torch.cat(x_gather, dim=0)
    else:
        # broadcast features in sequence
        # cost more time but less GPU memory
        container = torch.empty_like(x).cuda(gpu)
        x_gather = []
        for k in range(world_size):
            container.data.copy_(x)
            print("gathering features from rank no.{}".format(k))
            dist.broadcast(container, k)
            x_gather.append(container.cpu())
        # x_gather = torch.cat(x_gather, dim=0)
    # return a list of per-rank tensors (CPU tensors in the save_memory path)
    return x_gather
def undefined_l_gather(features, pid_labels):
    # Gather variable-sized features/labels from all ranks by padding them to a fixed size.
    resized_num = 10000  # upper bound on the number of features per rank
    pos_num = min(features.size(0), resized_num)
    if features.size(0) > resized_num:
        print(f'{features.size(0)} out of {resized_num}')
    # pad features/labels to a fixed shape so all_gather sees equal-sized tensors on every rank
    resized_features = torch.empty((resized_num, features.size(1))).to(features.device)
    resized_features[:pos_num, :] = features[:pos_num, :]
    resized_pid_labels = torch.empty((resized_num,)).to(pid_labels.device)
    resized_pid_labels[:pos_num] = pid_labels[:pos_num]
    # gather the valid sizes, the padded features and the labels from every rank
    pos_num = torch.tensor([pos_num]).to(features.device)
    all_pos_num = all_gather_tensor(pos_num)
    all_features = all_gather_tensor(resized_features)
    all_pid_labels = all_gather_tensor(resized_pid_labels)
    # strip each rank's padding and concatenate
    gather_features = []
    gather_pid_labels = []
    for index, p_num in enumerate(all_pos_num):
        gather_features.append(all_features[index][:p_num, :])
        gather_pid_labels.append(all_pid_labels[index][:p_num])
    gather_features = torch.cat(gather_features, dim=0)
    gather_pid_labels = torch.cat(gather_pid_labels, dim=0)
    return gather_features, gather_pid_labels
class LabeledMatching(Function):
    @staticmethod
    def forward(ctx, features, pid_labels, lookup_table, momentum=0.5):
        # The lookup_table can't be saved with ctx.save_for_backward(), as we would
        # modify the variable which has the same memory address in backward()
        # ctx.save_for_backward(features, pid_labels)

        # gather features/labels from all ranks so that backward() updates the
        # lookup table with global features
        gather_features, gather_pid_labels = undefined_l_gather(features, pid_labels)
        ctx.save_for_backward(gather_features, gather_pid_labels)
        ctx.lookup_table = lookup_table
        ctx.momentum = momentum

        # similarity of the local features against the whole lookup table
        scores = features.mm(lookup_table.t())
        # print(features, lookup_table, scores)
        pos_feats = lookup_table.clone().detach()
        pos_idx = pid_labels > 0
        pos_pids = pid_labels[pos_idx]
        pos_feats = pos_feats[pos_pids]
        # pos_feats.require_grad = False
        return scores, pos_feats, pos_pids

    @staticmethod
    def backward(ctx, grad_output, grad_feat, grad_pids):
        features, pid_labels = ctx.saved_tensors  # gathered (global) features/labels
        pid_labels = pid_labels.long()
        lookup_table = ctx.lookup_table
        momentum = ctx.momentum

        grad_feats = None
        if ctx.needs_input_grad[0]:
            grad_feats = grad_output.mm(lookup_table)

        # Update lookup table, but not by standard backpropagation with gradients
        for indx, label in enumerate(pid_labels):
            if label >= 0:
                lookup_table[label] = (
                    momentum * lookup_table[label] + (1 - momentum) * features[indx]
                )
                # lookup_table[label] /= lookup_table[label].norm()
        return grad_feats, None, None, None
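Not part of the original post, but for readers wondering how this custom Function is wired in: a hypothetical sketch of an OIM-style loss that calls it, where the function name, the scalar, and the label convention are all assumptions rather than the repo's actual code.

import torch.nn.functional as F

def oim_loss(features, pid_labels, lookup_table, scalar=30.0, momentum=0.5):
    # features:     (N, D) re-id embeddings produced on this rank
    # pid_labels:   (N,)   person ids, negative for unlabeled/background (assumed convention)
    # lookup_table: (num_pids, D) persistent buffer shared across iterations
    scores, pos_feats, pos_pids = LabeledMatching.apply(
        features, pid_labels, lookup_table, momentum)
    scores = scores * scalar  # temperature-like scaling before the softmax
    labeled = pid_labels >= 0
    if labeled.sum() == 0:
        return scores.sum() * 0  # keep the graph alive when the batch has no labeled ids
    return F.cross_entropy(scores[labeled], pid_labels[labeled].long())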
Great! Thanks :)
I think all_gather_tensor should return a list when is_dist is false:
@torch.no_grad()
def all_gather_tensor(x, gpu=None, save_memory=False):
    rank, world_size, is_dist = get_dist_info()
    if not is_dist:
        # wrap in a list so callers can always iterate over per-rank results
        return [x]
    # remaining code here...
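A quick, purely illustrative single-process check of that change (assuming get_dist_info reports is_dist=False when no process group is initialized):

feats = torch.randn(5, 256)
gathered = all_gather_tensor(feats)
assert isinstance(gathered, list) and len(gathered) == 1
assert torch.equal(gathered[0], feats)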
@dqshuai Thanks for sharing your modified dist_training code. I have several questions about the two points you mentioned earlier. How many GPUs did you use, and what was the batch size per GPU when you got 92.91 mAP? What is the empirical ratio between the lr for single-GPU training and for multi-GPU training? Did using sync_batchnorm affect the final results? Thanks!
(1) I use 8 GPUs with a batch size of 4 per GPU. When I set lr=0.05, I get 92.91 mAP. At first I thought the empirical ratio would be about single_gpu_lr (0.001) * num_of_gpus, but I did not get a better result with an lr of 0.008 or 0.01. (2) Using sync_batchnorm reduces the result, and I don't know why. If you have any other findings, you can share them with me. I haven't fully reproduced the results of the paper with multiple GPUs. Thanks!
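For context, here is the arithmetic behind the "expected" learning rate under the linear scaling rule versus the value that actually worked in this thread (the single-GPU baseline lr of 0.001 comes from the discussion above; everything else is an assumption):

single_gpu_lr = 0.001
num_gpus = 8
samples_per_gpu = 4
total_batch = num_gpus * samples_per_gpu      # 32 images per iteration
linear_scaled_lr = single_gpu_lr * num_gpus   # 0.008, the naively expected value
empirical_best_lr = 0.05                      # the value that gave 92.91 mAP above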
Got it, Thanks!
Hi, I tried the distributed implementation of @dqshuai, but the performance got worse. I notice that there is a toolkit in mmdet/models/dense_heads/oim_utils.py which contains the distributed tools. Is this implemented by you, @daodaofr? Can I use it to fix the feature-size inconsistency across ranks?
That was just an attempt of mine, but it doesn't work out.
Hello, thanks for your project. I want to know whether distributed training is supported, and what I should do to make it work.