Closed myh12138 closed 2 years ago
I made some changes to the anti_aliasing.py file. Try again and let me know if the issue persists.
P.S. DDP (DistributedDataParallel) is recommended over DataParallel for multi-GPU usage in PyTorch.
OHHHHHHHHHHHHHHHH, great job, it is useful. Now I can use multiple GPUs.
By the way, can you share the idea behind the ACC computation? I understand parts of it, but not the overall meaning. Besides, I think the code should be like this:
ap = np.zeros((preds.shape[0]))
for k in range(preds.shape[0]):
    # sort scores
    scores = preds[k, :]
    targets = targs[k, :]
    # compute average precision
    ap[k] = average_precision(scores, targets)
return 100 * ap.mean()
because if a batch contains no samples of class one, the target for this batch is [0, 0, 0, 0, ...], and if preds is also [0, 0, 0, 0, ...], the prediction is right but the acc is 0.
I am not sure I understood your message. What is "ACC computing"?
"because, if in this batch no class one, this batch target is [0,0,0,0,...], and if preds is [0,0,0,0,...]. This result means the prediction is right but the acc is 0." In a typical multi-label dataset (COCO, OpenImages), I think you are guaranteed that each image has at least one positive class.
The "ACC computing" is "mAP_score = validate_multi(train_loader, model, ema)".
Yes, each image has at least one positive class. But your code is
for k in range(preds.shape[1])
not
for k in range(preds.shape[0])
dim[0] is batch and dim[1] is classes, so
for k in range(preds.shape[1])
scores = preds[:, k]
targets = targs[:, k]
means [batch, 1], not [1, classes]. Therefore I think the code should be for k in range(preds.shape[0]).
I think it's ok. It's standard mAP calculation code: you aggregate the results from all the validation data (preds.shape[0] == number of validation images, not the batch size), calculate the average precision per class, and then average over classes.
see similar code in: https://github.com/Megvii-Nanjing/ML-GCN/blob/master/util.py#L220
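To make the per-class loop concrete, here is a minimal pure-Python sketch of that kind of calculation. It is not the repository's code: the names `average_precision` and `mean_average_precision` and the toy data are made up for illustration, and returning 0 for a class with no positives is an assumption of this sketch. Rows are images, columns are classes, so each AP is computed over one column of the aggregated predictions.

```python
def average_precision(scores, targets):
    """AP for a single class: mean of precision@rank taken at each positive."""
    # Rank the images by descending score for this class.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if targets[idx] == 1:
            hits += 1
            precision_sum += hits / rank
    # Assumption of this sketch: a class with no positives contributes 0.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(targs, preds):
    """targs and preds are lists of rows shaped [num_images][num_classes]."""
    num_classes = len(preds[0])
    aps = []
    for k in range(num_classes):
        # Column k: the scores and labels of every image for class k.
        scores = [row[k] for row in preds]
        targets = [row[k] for row in targs]
        aps.append(average_precision(scores, targets))
    return 100.0 * sum(aps) / num_classes


# Hypothetical toy data: 4 validation images, 2 classes.
preds = [[0.9, 0.1], [0.8, 0.7], [0.3, 0.6], [0.2, 0.4]]
targs = [[1, 0], [0, 1], [1, 0], [0, 1]]
score = mean_average_precision(targs, preds)
```

Looping over rows instead would rank the classes within a single image, which is a different, per-image metric. Ranking within a column only makes sense once predictions from the whole validation set have been collected, since a single batch may contain no positives for a given class.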
Hi, while training this model on multiple GPUs, an error happened. My code is:
`from torch.utils.data.distributed import DistributedSampler
local_rank = torch.distributed.get_rank()
torch.cuda.set_device(local_rank)
model = create_model(args,load_head=True)
model = torch.nn.DataParallel(model,device_ids=device_ids)
model.to(device)
for i, (inputData,target) in pbar:
for i, (inputData, target) in enumerate(train_loader):
But an error happened:
RuntimeError: Assertion THCTensor_(checkGPU)(state, 3, input, output, weight)' failed. Some of weight/gradient/input tensors are located on different GPUs. Please move them to a single one. at /opt/conda/conda-bld/pytorch_1603728993639/work/aten/src/THCUNN/generic/SpatialDepthwiseConvolution.cu:16`
When I changed
self.filt = filt[None, None, :, :].repeat((self.channels, 1, 1, 1)).cuda().half()
to
self.filt = filt[None, None, :, :].repeat((self.channels, 1, 1, 1)).cuda(device=int(os.environ.get('RANK', 0))).half()
as you mentioned in TResNet, an error still happened:
`RuntimeError: attribute lookup is not defined on python value of type '_Environ':
File "/home/kpl/code/multilabel/ML_Decoder/src_files/models/tresnet/layers/anti_aliasing.py", line 35
filt = filt / torch.sum(filt)
self.filt = filt[None, None, :, :].repeat((self.channels, 1, 1, 1)).cuda().half()`
You said you added an option --remove_aa_jit and that running with it should be ok, but I can't find --remove_aa_jit. Can you give some suggestions? Thanks very much.