DeepRegNet / DeepReg

Medical image registration using deep learning
Apache License 2.0
564 stars 76 forks source link

690 nan inf loss #719

Closed mathpluscode closed 3 years ago

mathpluscode commented 3 years ago

Description

After a long debugging, I found the final source of inf was due to a negative discrete variance (~1e-5) in LNCC loss function. This can happen is that, within a window, the voxels might have the same values thus the variance should be 0, then numerical issues causes (somehow) a negative value in the end.

The fix is to normalize the kernel weights directly so that we do not need to divide the kernel volume later. During debugging, the min variance is now ~-1e-11, althou still negative, it's around machine error now.

To be safe, we now clip the variance to 0 and add an EPS 1e-5 (EPS was 1e-7, but VoxelMorph adopts 1e-5 and if we consider mixed-precision later, 1e-5 will be safer).

Fixes #690

Type of change

What types of changes does your code introduce to DeepReg?

Please check the boxes that apply after submitting the pull request.

Checklist

Please check the boxes that apply after submitting the pull request.

If you're unsure about any of them, don't hesitate to ask. We're here to help! This is simply a reminder of what we are going to look for before merging your code.

codecov[bot] commented 3 years ago

Codecov Report

Merging #719 (6ad3312) into main (c48b78b) will not change coverage. The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##              main      #719   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           38        39    +1     
  Lines         2445      2443    -2     
=========================================
- Hits          2445      2443    -2     
Impacted Files Coverage Δ
deepreg/constant.py 100.00% <100.00%> (ø)
deepreg/dataset/loader/interface.py 100.00% <100.00%> (ø)
deepreg/loss/image.py 100.00% <100.00%> (ø)
deepreg/loss/label.py 100.00% <100.00%> (ø)
deepreg/loss/util.py 100.00% <100.00%> (ø)
deepreg/model/network.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update c48b78b...6ad3312. Read the comment docs.

mathpluscode commented 3 years ago

I've been testing the new code overnight and no more inf/crash ^^ So I consider the bug to be fixed.

YipengHu commented 3 years ago

for record: Project-MONAI/MONAI#1868

YipengHu commented 3 years ago

@mathpluscode have you tested if the 2-pass algorithm produces negative var?

mathpluscode commented 3 years ago

@mathpluscode have you tested if the 2-pass algorithm produces negative var?

It won't as you do conv on the square of tensors, and conv having all positive weights. So by design, this will be >= 0.

I've tested the current implementation for multiple epochs and it seems to be safe.

YipengHu commented 3 years ago

@mathpluscode have you tested if the 2-pass algorithm produces negative var?

It won't as you do conv on the square of tensors, and conv having all positive weights. So by design, this will be >= 0.

I've tested the current implementation for multiple epochs and it seems to be safe.

so it won;t have the inf issue

mathpluscode commented 3 years ago

@mathpluscode have you tested if the 2-pass algorithm produces negative var?

It won't as you do conv on the square of tensors, and conv having all positive weights. So by design, this will be >= 0. I've tested the current implementation for multiple epochs and it seems to be safe.

so it won;t have the inf issue

It should not.

YipengHu commented 3 years ago

@acasamitjana can you have a look at this please?