HongwenZhang / JVCR-3Dlandmark

[TIP 2019] Adversarial Learning Semantic Volume for 2D/3D Face Shape Regression in the Wild
MIT License

A random bug #9

Open liumarcus70s opened 5 years ago

liumarcus70s commented 5 years ago

Hi everyone,

When I train the net, I get a random bug: an error occurs at a random batch.


Processing |########################## | (50860/61225) Data: 2.597300s | Batch: 3.278s | Total: 0:56:45 |
Processing |########################## | (50880/61225) Data: 0.000299s | Batch: 0.681s | Total: 0:56:46 |
Processing |########################## | (50900/61225) Data: 0.000489s | Batch: 0.691s | Total: 0:56:47 |
Processing |########################## | (50920/61225) Data: 0.000502s | Batch: 0.683s | Total: 0:56:47 |
Processing |########################## | (50940/61225) Data: 2.483688s | Batch: 3.165s | Total: 0:56:50 | ETA: 0:10:09 | LOSS vox: 0.0337; coord: 0.0034 | NME: 0.3116
Traceback (most recent call last):
  File "train.py", line 281, in <module>
    main(parser.parse_args())
  File "train.py", line 90, in main
    run(model, train_loader, mode, criterion_vox, criterion_coord, optimizer_G, optimizer_P)
  File "train.py", line 144, in run
    for i, (inputs, target, meta) in enumerate(data_loader):
  File "/home/jliu9/anaconda3/envs/jvcr/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 623, in __next__
    return self._process_next_batch(batch)
  File "/home/jliu9/anaconda3/envs/jvcr/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 658, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
ValueError: Traceback (most recent call last):
  File "/home/jliu9/anaconda3/envs/jvcr/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 138, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/jliu9/Codes/JVCR-3Dlandmark/datasets/fa68pt3D.py", line 151, in __getitem__
    target_j = draw_labelvolume(target_j, tpts[j] - 1, self.sigma, type=self.label_type)
  File "/home/jliu9/Codes/JVCR-3Dlandmark/utils/imutils.py", line 123, in draw_labelvolume
    img[img_y[0]:img_y[1], img_x[0]:img_x[1]] = g[g_y[0]:g_y[1], g_x[0]:g_x[1]]
ValueError: could not broadcast input array from shape (7,7) into shape (7,8)


So, what's the problem?

HongwenZhang commented 5 years ago

Replacing int() with np.int() at utils/imutils.py#L94-L95 may solve this problem.
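For context, the corner and slice computation in this kind of heatmap/label-volume drawing code usually looks like the sketch below; this is an assumed reconstruction for illustration, not the exact code at utils/imutils.py#L94-L95 and #L119. The broadcast error means the destination slice of img and the source slice of g ended up differing by one pixel, which can only happen if ul and br are derived from inconsistent (e.g. differently truncated) values of pt.

import numpy as np

def draw_gaussian_patch(img, pt, sigma=1):
    # Assumed corner computation (cf. utils/imutils.py#L94-L95):
    # upper-left and bottom-right corners of a (6*sigma+1)-sized patch.
    ul = [int(pt[0] - 3 * sigma), int(pt[1] - 3 * sigma)]
    br = [int(pt[0] + 3 * sigma + 1), int(pt[1] + 3 * sigma + 1)]

    # Build the Gaussian patch.
    size = 6 * sigma + 1
    x = np.arange(0, size, 1, float)
    y = x[:, np.newaxis]
    x0 = y0 = size // 2
    g = np.exp(-((x - x0) ** 2 + (y - y0) ** 2) / (2 * sigma ** 2))

    # Clip the patch to the image bounds (cf. the assignment at utils/imutils.py#L119).
    g_x = max(0, -ul[0]), min(br[0], img.shape[1]) - ul[0]
    g_y = max(0, -ul[1]), min(br[1], img.shape[0]) - ul[1]
    img_x = max(0, ul[0]), min(br[0], img.shape[1])
    img_y = max(0, ul[1]), min(br[1], img.shape[0])

    # With consistent integer ul/br, both slices always have the same shape;
    # the reported (7,7) vs (7,8) mismatch points to ul/br being computed from
    # values of pt that differ between the two corner lines above.
    img[img_y[0]:img_y[1], img_x[0]:img_x[1]] = g[g_y[0]:g_y[1], g_x[0]:g_x[1]]
    return img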

JackLongKing commented 5 years ago

I met this problem too. After modifying int to np.int, the error still happens. I use PyTorch 0.4.0. Hope you can help! @HongwenZhang

JackLongKing commented 5 years ago

Did you solve this problem? @liumarcus70s

HongwenZhang commented 5 years ago

Hi @JackLongKing, could you print the values of ul, br, and pt when the bug occurs?

JackLongKing commented 5 years ago

The information flow is as follows:

//============================================================================
('pt: \n', tensor([ 48.4674, 5.6901, -0.0979]))
('ul: \n', [45, 0])
('br: \n', [52, 7]
Traceback (most recent call last):
  File "train.py", line 278, in <module>
    main(parser.parse_args())
  File "train.py", line 90, in main
    run(model, train_loader, mode, criterion_vox, criterion_coord, optimizer_G, optimizer_P)
  File "train.py", line 144, in run
    for i, (inputs, target, meta) in enumerate(data_loader):
  File "/home/gulong/project/env/python2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 272, in __next__
    return self._process_next_batch(batch)
  File "/home/gulong/project/env/python2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 307, in _process_next_batch
    raise batch.exc_type(batch.exc_msg)
ValueError: Traceback (most recent call last):
  File "/home/gulong/project/env/python2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 57, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/gulong/project/face/landmark/JVCR-3Dlandmark/datasets/fa68pt3D.py", line 151, in __getitem__
    target_j = draw_labelvolume(target_j, tpts[j] - 1, self.sigma, type=self.label_type)
  File "/home/gulong/project/face/landmark/JVCR-3Dlandmark/utils/imutils.py", line 124, in draw_labelvolume
    img[img_y[0]:img_y[1], img_x[0]:img_x[1]] = g[g_y[0]:g_y[1], g_x[0]:g_x[1]]
ValueError: could not broadcast input array from shape (7,7) into shape (8,7)
//============================================================================

@HongwenZhang Thanks a lot for your help!

HongwenZhang commented 5 years ago

These values seem inconsistent with utils/imutils.py#L94-L95. If sigma is 1, then int(5.6901 - 3 * 1) should give 2 for ul[1], shouldn't it? Could you carefully check and provide the values at utils/imutils.py#L94, as well as img_x, img_y, g_x, g_y at utils/imutils.py#L119?
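For reference, a quick check of that arithmetic in a Python shell:

>>> sigma = 1
>>> int(5.6901 - 3 * sigma)  # truncation toward zero
2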

JackLongKing commented 5 years ago

The print code is as follows:

//==================================================================
print("pt: {}\n".format(pt))
print("ul: {}\n".format(ul))
print("br: {}\n".format(br))
print('g_x[0]: {},g_x[1]: {}\n'.format(g_x[0],g_x[1]))
print('g_y[0]: {},g_y[1]: {}\n'.format(g_y[0],g_y[1]))
print('img_x[0]: {},img_x[1]: {}\n'.format(img_x[0],img_x[1]))
print('img_y[0]: {},img_y[1]: {}\n'.format(img_y[0],img_y[1]))
//==================================================================

And the output information is as follows:

//==================================================================
pt: tensor([ 50.2262, 18.8357, -0.0273])
ul: [47, 15]
br: [54, 22]
g_x[0]: 0,g_x[1]: 7
g_y[0]: 0,g_y[1]: 7
img_x[0]: 47,img_x[1]: 54
img_y[0]: 15,img_y[1]: 22

pt: tensor([ 49.
ValueError: Traceback (most recent call last):
  File "/home/gulong/project/env/python2/lib/python2.7/site-packages/torch/utils/data/dataloader.py", line 57, in _worker_loop
    samples = collate_fn([dataset[i] for i in batch_indices])
  File "/home/gulong/project/face/landmark/JVCR-3Dlandmark/datasets/fa68pt3D.py", line 151, in __getitem__
    target_j = draw_labelvolume(target_j, tpts[j] - 1, self.sigma, type=self.label_type)
  File "/home/gulong/project/face/landmark/JVCR-3Dlandmark/utils/imutils.py", line 130, in draw_labelvolume
    img[img_y[0]:img_y[1], img_x[0]:img_x[1]] = g[g_y[0]:g_y[1], g_x[0]:g_x[1]]
ValueError: could not broadcast input array from shape (7,7) into shape (7,8)
//==================================================================

From the output information, maybe this is caused by pt? @HongwenZhang

HongwenZhang commented 5 years ago

These values are so weird. Given these values, both img[15:22, 47:54] and g[0:7, 0:7] should have the same shape of (7,7). So, I think it's better to replace utils/imutils.py#L119 with the following code for debugging.

try:
    img[img_y[0]:img_y[1], img_x[0]:img_x[1]] = g[g_y[0]:g_y[1], g_x[0]:g_x[1]]
except:
    print('something wrong happened.\n')
    print('pt: {}\n'.format(pt))
    print('ul: {}\n'.format(ul))
    print('br: {}\n'.format(br))
    print('sigma: {}\n'.format(sigma))
    print('g_x[0]: {},g_x[1]: {}\n'.format(g_x[0],g_x[1]))
    print('g_y[0]: {},g_y[1]: {}\n'.format(g_y[0],g_y[1]))
    print('img_x[0]: {},img_x[1]: {}\n'.format(img_x[0],img_x[1]))
    print('img_y[0]: {},img_y[1]: {}\n'.format(img_y[0],img_y[1]))
    print('img shape: {}\n'.format(img.shape))
    print('g shape: {}\n'.format(g.shape))
    raise
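As a side note, if the goal is just to keep training running while the root cause is being tracked down, a defensive workaround sometimes used in similar heatmap-drawing code is to copy only the region that both slices actually share. This is a sketch of that idea using the same variable names as above, not the repository's fix, and it only hides the underlying indexing inconsistency:

# Hypothetical workaround: clamp the copy to the overlapping region so a
# one-pixel disagreement between the two slices cannot raise a broadcast error.
dst = img[img_y[0]:img_y[1], img_x[0]:img_x[1]]
src = g[g_y[0]:g_y[1], g_x[0]:g_x[1]]
h = min(dst.shape[0], src.shape[0])
w = min(dst.shape[1], src.shape[1])
img[img_y[0]:img_y[0] + h, img_x[0]:img_x[0] + w] = src[:h, :w]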

JackLongKing commented 5 years ago

Yes, I used the try...except in utils/imutils.py, and then hit another problem, out of memory, which needs a separate look. My device is a Titan X (12GB). My log is as follows; thank you for your help! @HongwenZhang

//=================================================================
==> creating model: stacks=4, blocks=1, z-res=[1, 2, 4, 64]
coarse to fine mode: True
p2v params: 13.01M
v2c params: 19.46M
using ADAM optimizer.

Epoch: 1 | LR: 0.00025000
pre_training...
train.py:201: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  losses_vox.update(loss_vox.data[0], inputs.size(0))
train.py:202: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  losses_coord.update(loss_coord.data[0], inputs.size(0))
train.py:217: UserWarning: invalid index of a 0-dim tensor. This will be an error in PyTorch 0.5. Use tensor.item() to convert a 0-dim tensor to a Python number
  loss='vox: {:.4f}; coord: {:.4f}'.format(loss_vox.data[0], loss_coord.data[0]),
train.py:122: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
  input_var = torch.autograd.Variable(inputs.cuda(), volatile=True)
train.py:124: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
  range(len(target))]
train.py:125: UserWarning: volatile was removed and now has no effect. Use with torch.no_grad(): instead.
  coord_var = torch.autograd.Variable(meta['tpts_inp'].cuda(async=True), volatile=True)
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Exception NameError: "global name 'FileNotFoundError' is not defined" in <bound method _DataLoaderIter.__del__ of <torch.utils.data.dataloader._DataLoaderIter object at 0x7f57d4ad7fd0>> ignored
Traceback (most recent call last):
  File "train.py", line 278, in <module>
    main(parser.parse_args())
  File "train.py", line 95, in main
    optimizer_P)
  File "train.py", line 151, in run
    pred_vox, _, pred_coord = model.forward(input_var)
  File "/home/gulong/project/face/landmark/JVCR-3Dlandmark/models/pix2vox2coord.py", line 55, in forward
    vox_list = self.pix2vox(x)
  File "/home/gulong/project/env/python2/lib/python2.7/site-packages/torch/nn/modules/module.py", line 491, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/gulong/project/env/python2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 114, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/gulong/project/env/python2/lib/python2.7/site-packages/torch/nn/parallel/data_parallel.py", line 124, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/gulong/project/env/python2/lib/python2.7/site-packages/torch/nn/parallel/parallel_apply.py", line 65, in parallel_apply
    raise output
RuntimeError: cuda runtime error (2) : out of memory at /pytorch/aten/src/THC/generic/THCStorage.cu:58
//=================================================================

HongwenZhang commented 5 years ago

The 'out of memory' error is outside the scope of this issue. To reproduce the bug that occurred in the dataloader, we can bypass the forward pass of the network by adding continue at train.py#L145.
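For example, a minimal sketch of the bypassed loop, assuming train.py#L145 is the first statement inside the dataloader loop shown in the traceback:

# Debug-only change: iterate the dataloader without touching the network,
# so the ValueError from draw_labelvolume can be reproduced quickly.
for i, (inputs, target, meta) in enumerate(data_loader):
    continue  # skip the forward/backward pass entirely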