Alibaba-MIIL / ML_Decoder

Official PyTorch implementation of "ML-Decoder: Scalable and Versatile Classification Head" (2021)
MIT License

about multi GPU training error #10

Closed myh12138 closed 2 years ago

myh12138 commented 2 years ago

Hi, while training this model on multiple GPUs I ran into an error. My code is:

    from torch.utils.data.distributed import DistributedSampler

    local_rank = torch.distributed.get_rank()
    torch.cuda.set_device(local_rank)

    model = create_model(args, load_head=True)
    model = torch.nn.DataParallel(model, device_ids=device_ids)
    model.to(device)

    # for i, (inputData, target) in pbar:
    for i, (inputData, target) in enumerate(train_loader):
        inputData = inputData.to(device)
        target = target.to(device)  # (batch, 3, num_classes)
        # target = target.max(dim=1)[0]
        with autocast():  # mixed precision
            output = model(inputData).float()  # sigmoid will be done in loss!
        # print("target shape = ", target)
        # print("output shape = ", output)
        loss = criterion(output, target)

But this error happened:

    RuntimeError: Assertion `THCTensor_(checkGPU)(state, 3, input, output, weight)' failed. Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one. at /opt/conda/conda-bld/pytorch_1603728993639/work/aten/src/THCUNN/generic/SpatialDepthwiseConvolution.cu:16

I then changed

    self.filt = filt[None, None, :, :].repeat((self.channels, 1, 1, 1)).cuda().half()

to

    self.filt = filt[None, None, :, :].repeat((self.channels, 1, 1, 1)).cuda(device=int(os.environ.get('RANK', 0))).half()

as you mentioned in TResNet, but then another error happened:

    RuntimeError: attribute lookup is not defined on python value of type '_Environ':
    File "/home/kpl/code/multilabel/ML_Decoder/src_files/models/tresnet/layers/anti_aliasing.py", line 35
        filt = filt / torch.sum(filt)

        self.filt = filt[None, None, :, :].repeat((self.channels, 1, 1, 1)).cuda().half()

        self.filt = filt[None, None, :, :].repeat((self.channels, 1, 1, 1)).cuda(device=int(os.environ.get('RANK', 0))).half()
                                                                                            ~~~~~~~~~~~~~~ <--- HERE

You said you had added a --remove_aa_jit option and that running with it should fix this, but I can't find --remove_aa_jit. Can you give some suggestions? Thanks very much.

mrT23 commented 2 years ago

I made some changes to the anti_aliasing.py file. Try again and let me know if the issue persists.
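For readers hitting the same "tensors are located on different GPUs" assertion: the thread does not say what the change to anti_aliasing.py was, so the following is only a sketch of one common way to avoid the problem, not the repository's actual fix. A filter stored as a plain attribute and created with `.cuda()` stays on one device, while a tensor registered with `register_buffer` follows the module when it is moved or replicated by `DataParallel`. The class name `BlurPool` and its 3x3 binomial filter are hypothetical stand-ins for the real layer.

```python
import torch
from torch import nn
import torch.nn.functional as F


class BlurPool(nn.Module):
    """Hypothetical stand-in for an anti-aliasing layer: 3x3 blur with stride 2."""

    def __init__(self, channels):
        super().__init__()
        self.channels = channels
        a = torch.tensor([1.0, 2.0, 1.0])
        filt = a[:, None] * a[None, :]
        filt = filt / torch.sum(filt)
        # A buffer moves with .to()/.cuda() and is copied to every replica,
        # so no device index has to be hard-coded here.
        self.register_buffer("filt", filt[None, None, :, :].repeat(channels, 1, 1, 1))

    def forward(self, x):
        x = F.pad(x, (1, 1, 1, 1), mode="reflect")
        # Cast the filter to the input dtype so half-precision inputs also work.
        return F.conv2d(x, self.filt.to(x.dtype), stride=2, groups=self.channels)
```

Because no environment lookup such as `os.environ.get('RANK', 0)` happens inside the module, this form also avoids the `'_Environ'` error, which TorchScript raises when scripted code references `os.environ`.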

P.S. DDP (DistributedDataParallel) is recommended over DataParallel for multi-GPU usage in PyTorch.
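A minimal sketch of what a DDP training loop could look like, assuming one process per GPU started by a launcher that sets `RANK`, `WORLD_SIZE` and `LOCAL_RANK` (for example `torchrun` in recent PyTorch versions). The toy `nn.Linear` model, the random dataset, and the function names `train_one_epoch` and `main` are illustrative assumptions, not code from this repository.

```python
import os

import torch
import torch.distributed as dist
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def train_one_epoch(model, dataset, device):
    """Run one epoch with DDP; assumes the process group is already initialized."""
    model = model.to(device)
    # device_ids is only passed for CUDA; CPU (gloo) runs leave it as None.
    ddp_model = DDP(model, device_ids=[device.index] if device.type == "cuda" else None)
    sampler = DistributedSampler(dataset)  # each rank iterates over its own shard
    loader = DataLoader(dataset, batch_size=4, sampler=sampler)
    criterion = nn.BCEWithLogitsLoss()  # sigmoid is applied inside the loss
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    sampler.set_epoch(0)  # reshuffles the shards; pass the epoch index in a real loop
    last_loss = None
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(ddp_model(inputs), targets)
        loss.backward()  # gradients are averaged across ranks during backward
        optimizer.step()
        last_loss = loss.item()
    return last_loss


def main():
    # Assumption: the launcher has set RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    use_cuda = torch.cuda.is_available()
    dist.init_process_group(backend="nccl" if use_cuda else "gloo")
    device = torch.device("cuda", local_rank) if use_cuda else torch.device("cpu")
    if use_cuda:
        torch.cuda.set_device(device)
    model = nn.Linear(8, 3)  # toy stand-in for the real model
    dataset = TensorDataset(torch.randn(16, 8), torch.randint(0, 2, (16, 3)).float())
    train_one_epoch(model, dataset, device)
    dist.destroy_process_group()
```

To use it, call `main()` at the bottom of a script and start that script with the launcher, one process per GPU. Unlike `DataParallel`, each process then owns a full copy of the model on a single device, so every tensor a layer creates lives on that process's GPU.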

myh12138 commented 2 years ago

Ohhh, great job, that fixed it. Now I can train on multiple GPUs.

myh12138 commented 2 years ago

By the way, can you explain the idea behind the ACC computation? I understand parts of it, but not the overall meaning. Besides, I think the code should look like this:

    ap = np.zeros((preds.shape[0]))
    # compute average precision for each class
    for k in range(preds.shape[0]):
        # sort scores
        scores = preds[k, :]
        targets = targs[k, :]
        # compute average precision
        ap[k] = average_precision(scores, targets)
    return 100 * ap.mean()

Because if class one does not appear in this batch, the target is [0, 0, 0, 0, ...], and if preds is also [0, 0, 0, 0, ...], the prediction is right but the acc is 0.

mrT23 commented 2 years ago

I am not sure I understood your message. What is "ACC computing"?

Regarding "because, if in this batch no class one, this batch target is [0,0,0,0,...], and if preds is [0,0,0,0,...]. This result means the prediction is right but the acc is 0": in a typical multi-label dataset (COCO, OpenImages), I think you are guaranteed that each image has at least one positive class.

myh12138 commented 2 years ago

The "ACC computing" is `mAP_score = validate_multi(train_loader, model, ema)`. Yes, each image has at least one positive class. But your code is `for k in range(preds.shape[1])`, not `for k in range(preds.shape[0])`. dim[0] is the batch and dim[1] is the classes, so

    for k in range(preds.shape[1]):
        scores = preds[:, k]
        targets = targs[:, k]

takes slices of shape [batch, 1], not [1, classes]. Therefore I think the code should be `for k in range(preds.shape[0])`.

mrT23 commented 2 years ago

I think it's OK. It's standard mAP calculation code: you aggregate the results from all the validation data (preds.shape[0] == number of validation images, not the batch size), calculate AP per class, and then average.

see similar code in: https://github.com/Megvii-Nanjing/ML-GCN/blob/master/util.py#L220
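The per-class calculation described above can be sketched as follows. This is an illustrative NumPy version written for this thread, not the repository's `validate_multi` code; the function names and the convention that a class with no positives contributes an AP of 0 are assumptions.

```python
import numpy as np


def average_precision(scores, targets):
    """AP for one class: mean of precision@k over the ranks k that hold a positive."""
    order = np.argsort(-scores, kind="stable")  # highest score first
    hits = targets[order] == 1
    if not hits.any():
        return 0.0  # assumption: a class with no positives contributes 0
    ranks = np.arange(1, len(scores) + 1)
    precision_at_k = np.cumsum(hits) / ranks
    return float(precision_at_k[hits].mean())


def mean_average_precision(preds, targs):
    """preds, targs: arrays of shape (num_images, num_classes)."""
    num_classes = preds.shape[1]
    ap = np.zeros(num_classes)
    for k in range(num_classes):
        # Column k holds the scores of class k for every validation image.
        ap[k] = average_precision(preds[:, k], targs[:, k])
    return 100 * ap.mean()


# Four images, two classes.
preds = np.array([[0.9, 0.1], [0.8, 0.7], [0.2, 0.6], [0.1, 0.3]])
targs = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
print(mean_average_precision(preds, targs))
```

Each AP value ranks all images for one class, which is why the loop runs over `preds.shape[1]` and slices columns. Ranking the classes within a single image (rows) would measure something different.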