GalSim-developers / GalSim

The modular galaxy image simulation toolkit. Documentation:
http://galsim-developers.github.io/GalSim/
Other
227 stars 107 forks source link

Fix subtle floating point bug in ProbabilityTree #1080

Closed rmjarvis closed 4 years ago

rmjarvis commented 4 years ago

@jchiang87 found a bug in drawing InterpolatedImage objects when there were a bunch of pixels with values around 1.e-16 -- 1.e-12 of the total flux where it would seg fault on the draw command.

It turned out to be due to a subtle bug in the ProbabilityTree code where some sums could have rounding errors that caused some of the totals to get out of sync, eventually leading a pointer to go one higher than it should have been able to. Once that happened, it went into a loop that sent it off into the depths of the computer's memory dereferencing things willy nilly until it seg faulted.

Anyway, the solution was to add an extra check to make sure that the variable (mid) doesn't go past the part of the array being worked on (end), rather than relying on the floating point math to ensure that, since it doesn't necessarily do so, apparently.

Along the way, I was also getting frustrated trying to get the asserts to trigger, since the python compiler automatically sets -DNDEBUG, which turns off asserts, and it's hard to undo that it turns out. Plus, the regular assert bombs out with an abort, which isn't very python-friendly.

So I rolled my own assert statement (copied from TreeCorr actually), which will now always run some checks regardless of NDEBUG status to make sure we don't get seg faults if things aren't what we expect. And if they do hit, it raises a runtime_error, rather than abort, so a bit less catastrophic when running from python.

We have some other asserts throughout the code, and I checked that they are all not doing any significant calculations. Most of them check for non-NULL pointers and similar things that would lead to badness if they were false. So probably worth keeping them all turned on all the time.

There is also an xassert we use in various places which is (still) not run unless DEBUGLOGGING is turned on. So those are mostly for trying to track down bugs. Some of these do have arithmetic in them, but they turn into a no op in normal running, so that's ok.

rmjarvis commented 4 years ago

Note: This is listed to merge into releases/2.2. Once merged, I'll also cherry-pick it to master. But we should probably cut a bugfix release in the 2.2 series with this fix.