ELEKTRONN / elektronn3

A PyTorch-based library for working with 3D and 2D convolutional neural networks, with focus on semantic segmentation of volumetric biomedical image data
MIT License

"Invalid" targets with out-of-bounds elements #10

Closed (mdraw closed this issue 5 years ago)

mdraw commented 6 years ago

In some of the batches that are created by PatchCreator, the target tensor contains elements that are not inside the expected value range (which is given by the number of unique classes that exist in the data set).

Quoting a comment from a previous commit message (https://github.com/ELEKTRONN/elektronn3/commit/46d0b2b0fc1f2beccdb02e1c788667dc218e73a9):

I found that the values of the maximum elements of the invalid targets are usually quite similar. Here are the last few examples from the warning message at cnndata:145, collected from a few different training runs at random steps:

65072
39121
65535
63480
65535 # found directly after the previous value
65509
64205

All of those are below 65536, which is 2**16. Most are only slightly smaller than 65536. (Why 2**16? Everything should be 32 bit (float) or 64 bit (int)...)
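The arithmetic behind that observation can be checked quickly. The snippet below is only a sanity check on the numbers quoted above; the variable names are made up for illustration, and it says nothing about where a 16-bit pattern would come from.

```python
import numpy as np

# Maximum target values quoted from the warning messages above.
reported_maxima = [65072, 39121, 65535, 63480, 65535, 65509, 64205]

# Largest value an unsigned 16-bit integer can hold: 2**16 - 1.
uint16_max = int(np.iinfo(np.uint16).max)

# True if every reported value fits into the unsigned 16-bit range.
fits_uint16 = all(0 <= v <= uint16_max for v in reported_maxima)

# Distance of each reported value from 2**16.
gaps = [2**16 - v for v in reported_maxima]
```

Two of the seven values are exactly 2**16 - 1, and all but one are within about 2,100 of 2**16.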

Such invalid targets are automatically detected by PatchCreator and their batches are discarded as a workaround for this problem, but that's certainly not a good way of dealing with it in the long term. We need to find out what's causing this bug. We may find the root of the problem somewhere around this code block: https://github.com/ELEKTRONN/elektronn3/blob/05dcd88340a66e703c79b6c6bc5e55e5939e899f/elektronn3/data/transformations.py#L355 or in the numba-jitted functions that are called from there.
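For reference, the detect-and-discard workaround described above can be sketched roughly as follows. This is a minimal illustration, not the actual PatchCreator code: the function names `target_is_valid` and `filter_valid_batches` are hypothetical, and it assumes that class labels are the integers 0 to n_classes - 1.

```python
import numpy as np

def target_is_valid(target, n_classes):
    """Return True if every label lies in [0, n_classes).

    Hypothetical helper; assumes labels are integers 0 .. n_classes - 1.
    """
    target = np.asarray(target)
    if target.size == 0:
        # An empty target has no out-of-range elements.
        return True
    return bool(target.min() >= 0 and target.max() < n_classes)

def filter_valid_batches(batches, n_classes):
    """Yield only (input, target) pairs whose target passes the range check."""
    for inp, target in batches:
        if target_is_valid(target, n_classes):
            yield inp, target
```

A check like this only hides the symptom. It catches garbage values that happen to fall outside the label range, but garbage that lands inside the range would pass unnoticed.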

mdraw commented 6 years ago

Accidentally closed this via a commit message...

mdraw commented 5 years ago

I can now finally reproduce this bug (which wasn't easy, because it happens only about once every 200,000 iterations) and found out what's causing it. The issue happens in the numba-jitted generalized ufunc code at https://github.com/ELEKTRONN/elektronn3/blob/076efe043db0badf092cd3a70e8a48c16dfe751a/elektronn3/data/coord_transforms.py#L24-L30. The reported garbage values appear if u, v and/or w point to indices that are out of the bounds of src, so line 30 reads from unallocated memory. I didn't think that was possible, because I expected the process to just segfault, but it turns out that segfaults happen only sometimes in this case, while in most cases the dest array is silently filled with garbage values. Segfaults seem to happen more often if the out-of-bounds memory access is further away from the actually allocated values.

This is hard to debug because we can't set breakpoints, make shape checks or raise errors in jitted generalized ufuncs, so I'm not yet sure why exactly map_coordinates_nearest() is sometimes called in a way that causes these problems, but I'm working on finding out.
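Since checks can't run inside the jitted kernel, one option is to validate the coordinates in plain NumPy before they are passed to it. The following is a minimal sketch under assumptions, not the elektronn3 implementation: the function names are hypothetical, `coords` is assumed to stack the u, v, w coordinates along its first axis, and the lookup is assumed to round each coordinate to the nearest integer index.

```python
import numpy as np

def out_of_bounds_mask(src_shape, coords):
    """Return a boolean mask marking coordinates that fall outside src_shape.

    coords has shape (3, ...) and holds the u, v, w coordinates
    (an assumed layout, chosen for this illustration).
    """
    idx = np.rint(coords).astype(np.int64)
    # Reshape the source shape so it broadcasts against idx along axis 0.
    shape = np.asarray(src_shape).reshape((3,) + (1,) * (idx.ndim - 1))
    return ((idx < 0) | (idx >= shape)).any(axis=0)

def checked_nearest_lookup(src, coords):
    """Nearest-neighbour lookup that raises instead of reading out of bounds."""
    mask = out_of_bounds_mask(src.shape, coords)
    if mask.any():
        raise IndexError(
            f"{int(mask.sum())} coordinate(s) out of bounds "
            f"for src with shape {src.shape}"
        )
    u, v, w = np.rint(coords).astype(np.int64)
    return src[u, v, w]
```

Negative indices are rejected explicitly because NumPy fancy indexing would otherwise wrap them around silently. Running a check like this right before the jitted call during debugging should show which inputs trigger the out-of-bounds reads.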