Adaptive Pixel Intensity Loss generated NaN values while training

ThiruRJST commented 2 years ago

I was training on a custom human dataset. Batch size = 8, number of training images = 3800.

Number of steps trained before the error appeared = 75

After the 75th step, it raised this error:

RuntimeError: Function 'UpsampleBilinear2DBackward1' returned nan values in its 0th output.

The model trained successfully when using BCE loss.

We also checked for NaN values using torch.autograd.set_detect_anomaly(True), but it returned False, indicating that no NaN values were found.
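For readers following along: anomaly detection is a switch for the autograd engine rather than a per-tensor check, and it reports NaN values by raising an error during backward. The sketch below is a toy example, not TRACER code, and `backward_is_clean` is a hypothetical helper name. It shows a case where every forward value is finite and the NaN appears only in the backward pass, which is one way a forward-only NaN check can come back clean while training still fails.

```python
import torch

def backward_is_clean(x):
    """Run a toy forward/backward pass under anomaly detection.

    Returns (forward_finite, error_message). error_message is None when
    backward finished without autograd reporting NaN values.
    """
    x = x.clone().requires_grad_(True)
    with torch.autograd.detect_anomaly():
        y = torch.sqrt(x) * 0.0  # every forward value is finite
        forward_finite = bool(torch.isfinite(y).all())
        try:
            # sqrt backward computes grad / (2 * sqrt(x)); at x = 0 this is 0 / 0 = NaN
            y.sum().backward()
        except RuntimeError as err:
            return forward_finite, str(err)
    return forward_finite, None

forward_ok, message = backward_is_clean(torch.tensor([1.0, 0.0]))
print(forward_ok)  # whether the forward output was free of NaN/Inf
print(message)     # the autograd error text, if any
```

In a real training loop the same context manager would wrap the forward pass and `loss.backward()`. Anomaly detection slows training noticeably, so it is usually enabled only while debugging.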

ThiruRJST commented 2 years ago

File "/opt/conda/envs/test/lib/python3.7/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/opt/conda/envs/test/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/opt/conda/envs/test/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/envs/test/lib/python3.7/site-packages/torch/nn/parallel/parallel_apply.py", line 61, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/envs/test/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/jupyter/TRACER/model/TRACER.py", line 38, in forward
    features, edge = self.model.get_blocks(x, H, W)
  File "/home/jupyter/TRACER/model/EfficientNet.py", line 250, in get_blocks
    edge = F.interpolate(edge, size=(H, W), mode='bilinear')
  File "/opt/conda/envs/test/lib/python3.7/site-packages/torch/nn/functional.py", line 3709, in interpolate
    return torch._C._nn.upsample_bilinear2d(input, output_size, align_corners, scale_factors)
 (function _print_stack)
 16%|███████████████▎                                                                                | 76/475 [04:51<25:29,  3.83s/it]
Traceback (most recent call last):
  File "main.py", line 49, in <module>
    main(cfg)
  File "main.py", line 34, in main
    Trainer(cfg, save_path)
  File "/home/jupyter/TRACER/trainer.py", line 59, in __init__
    train_loss, train_mae = self.training(args)
  File "/home/jupyter/TRACER/trainer.py", line 117, in training
    loss.backward()
  File "/opt/conda/envs/test/lib/python3.7/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/envs/test/lib/python3.7/site-packages/torch/autograd/__init__.py", line 149, in backward
    allow_unreachable=True, accumulate_grad=True)  # allow_unreachable flag
RuntimeError: Function 'UpsampleBilinear2DBackward1' returned nan values in its 0th output.

This is the full stack trace of the error.
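The trace points at the backward pass of the bilinear `F.interpolate` call on the edge map. One way to narrow this down is to inspect the gradient arriving at the upsampled tensor with a tensor hook. The following is a minimal sketch on a toy tensor, not the TRACER model: `count_bad_values`, `make_hook`, and `grad_report` are hypothetical names, and `torch.log` stands in for whatever loss term sits downstream of the edge map.

```python
import torch
import torch.nn.functional as F

def count_bad_values(grad):
    """Return (number of NaN entries, number of Inf entries) in a gradient."""
    return int(torch.isnan(grad).sum()), int(torch.isinf(grad).sum())

grad_report = {}

def make_hook(name):
    # A tensor hook receives the gradient flowing into that tensor during backward.
    def hook(grad):
        grad_report[name] = count_bad_values(grad)
    return hook

# Toy stand-in for the edge map: all zeros, so that log() below misbehaves.
edge = torch.zeros(1, 1, 4, 4, requires_grad=True)
up = F.interpolate(edge, size=(8, 8), mode='bilinear', align_corners=False)
up.register_hook(make_hook("upsampled_edge"))

# log has derivative 1 / x, which is infinite at x = 0,
# so the gradient reaching `up` is already non-finite.
torch.log(up).sum().backward()
print(grad_report)
```

If the hook reports Inf or NaN entries, the problem originates in the loss terms downstream of the upsampling rather than in the interpolation itself.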

Karel911 commented 2 years ago

Hi, it seems the MEAM did not clearly generate the edges. I recommend removing all lines related to the edge generation parts (e.g., generating edges or computing the edge loss).

ThiruRJST commented 2 years ago

But how did it run completely fine when using BCE loss alone?

Karel911 commented 2 years ago

I don't know the dataset you used, so I'm not sure what the problem is. But the error you posted shows that the MEAM module could not capture the edges. What does it report when you run torch.autograd.set_detect_anomaly(True)? Also, does training work with the API loss once the lines related to the edge parts are excluded?

ThiruRJST commented 2 years ago

Actually, using torch.autograd.set_detect_anomaly returns False for all tensors.

Karel911 commented 2 years ago

> Hi, it seems the MEAM did not clearly generate the edges. I recommend removing all lines related to the edge generation parts (e.g., generating edges or computing the edge loss).

How about this approach? Does it work?

hackkhai commented 2 years ago

@Karel911, can you help me with removing the edge generation parts? I am facing a similar issue.

ThiruRJST commented 2 years ago

@Karel911, my teammate @hackkhai is working on that.

Karel911 commented 2 years ago

> @Karel911 can you help me with removing the edge generation parts? because i am facing a similar issue.

I am also curious about which part causes this issue. I released a version of TRACER without edge generation. Replace your existing scripts with the released ones. I only tested it briefly, so if there is any problem, please let me know.

Thanks.

hackkhai commented 2 years ago

Thanks, let me check this out.

Karel911 / TRACER

Adaptive Pixel Intensity Loss generated NaN values while training #9