Linear Sum Assignment Fails: Avoid numeric errors in float16

FabianSchuetze commented 1 year ago

After training maskdino for some time, the linear sum assignment regularly fails with the traceback below:

  File "/notebooks/detrex/tools/train_net.py", line 303, in main
    do_train(args, cfg)
  File "/notebooks/detrex/tools/train_net.py", line 276, in do_train
    trainer.train(start_iter, cfg.train.max_iter)
  File "/notebooks/detrex/detectron2/detectron2/engine/train_loop.py", line 149, in train
    self.run_step()
  File "/notebooks/detrex/tools/train_net.py", line 101, in run_step
    loss_dict = self.model(data)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/notebooks/detrex/projects/maskdino/maskdino.py", line 165, in forward
    losses = self.criterion(outputs, targets,mask_dict)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/notebooks/detrex/projects/maskdino/modeling/criterion.py", line 353, in forward
    indices = self.matcher(outputs_without_aux, targets)
  File "/usr/local/lib/python3.9/dist-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/lib/python3.9/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/notebooks/detrex/projects/maskdino/modeling/matcher.py", line 233, in forward
    return self.memory_efficient_forward(outputs, targets, cost)
  File "/usr/local/lib/python3.9/dist-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/notebooks/detrex/projects/maskdino/modeling/matcher.py", line 203, in memory_efficient_forward
    indices.append(linear_sum_assignment(C))
  File "/usr/local/lib/python3.9/dist-packages/scipy/optimize/_lsap.py", line 100, in linear_sum_assignment
    return _lsap_module.calculate_assignment(cost_matrix)
ValueError: matrix contains invalid numeric entries

The problem in my case is that some of the values in cost_class are inf. This was caused by a numeric error in the log calculation: in float16, 1e-8 rounds to zero, so 1 - out_prob + 1e-8 evaluates to zero when out_prob is 1, and the log of zero is -inf. Increasing the safety buffer from 1e-8 to 1e-7 solved the problem. To verify that 1e-8 is truncated to zero, evaluate torch.tensor(1e-8, dtype=torch.float16) and compare it with torch.tensor(1e-7, dtype=torch.float16).
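For anyone who wants to reproduce the rounding without a training run, here is a minimal sketch. It uses NumPy's float16 as a stand-in for torch.float16 (an assumption on my part; both are IEEE 754 half precision), and the helper neg_log_term is a hypothetical name that only mimics the -log(1 - out_prob + eps) term, not the actual matcher code.

```python
import numpy as np


def neg_log_term(out_prob, eps):
    """Evaluate -log(1 - out_prob + eps) entirely in float16.

    Hypothetical helper for illustration; it is not taken from matcher.py.
    """
    p = np.float16(out_prob)
    e = np.float16(eps)  # the epsilon itself is rounded to float16 here
    # log(0) emits a divide-by-zero warning in NumPy; silence it for the demo.
    with np.errstate(divide="ignore"):
        return -np.log(np.float16(1) - p + e)


if __name__ == "__main__":
    for eps in (1e-8, 1e-7, 1e-5):
        rounded = float(np.float16(eps))
        cost = float(neg_log_term(1.0, eps))
        # 1e-8 is below half of the smallest float16 subnormal (2**-24),
        # so it rounds to 0.0 and the cost becomes inf.
        print(f"eps={eps:g} float16(eps)={rounded:g} cost={cost}")
```

With out_prob equal to 1, the 1e-8 case produces an inf cost, which is the kind of entry that linear_sum_assignment rejects; the larger epsilons survive the cast and keep the cost finite.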

HaoZhang534 commented 1 year ago

@FabianSchuetze Thank you very much for reporting the bug. 1e-7 does not seem to be enough to avoid overflow, so we will change it to 1e-5.

FabianSchuetze commented 1 year ago

Solved with b978bf2

IDEA-Research / detrex

Linear Sum Assignment Fails: Avoid numeric errors in float16 #248