cctbx / cctbx_project

Computational Crystallography Toolbox
https://cci.lbl.gov/docs/cctbx

Treatment of NaN in flex #478

Open graeme-winter opened 4 years ago

graeme-winter commented 4 years ago

It seems that NaNs confuse flex.histogram...

    ratio = flex.log10(t_i.select(t_nn) / r_i.select(r_nn)).as_double()

    print("Min: %f" % flex.min(ratio))
    print("Number < -5: %d" % (ratio < -5).count(True))

    r_hist = flex.histogram(ratio, data_min=-10, data_max=10, n_slots=10)

    for c, v in zip(r_hist.slot_centers(), r_hist.slots()):
        print(c, v)

It turns out there are 6,338 NaNs in ratio, all of which end up in the smallest bin:

Min: -5.173720
Number < -5: 2
-9.0 6338
-7.0 0
-5.0 9
-3.0 59
-1.0 481347
1.0 343104
3.0 61
5.0 1
7.0 0
9.0 0

I believe flex.histogram should not be treating NaN in this way...

graeme-winter commented 4 years ago

      std::size_t
      get_i_slot(ValueType const& d_)
      {
        std::size_t i_slot = 0;
        ValueType d = d_ - data_min_;
        if (d != 0 && d >= slot_width_) {
          i_slot = static_cast<std::size_t>(d / slot_width_);
          if (i_slot >= slots_.size()) i_slot = slots_.size() - 1;
        }
        return i_slot;
      }

    protected:
      void
      assign_to_slot(ValueType const& d)
      {
        slots_[get_i_slot(d)]++;
      }

in scitbx/histogram.h. With a NaN input both comparisons in get_i_slot evaluate to false (NaN compares false against everything), so i_slot stays at 0 and every NaN is counted in the first slot. My guess is that

if (!std::isnan(d)) {   // requires <cmath>
  slots_[get_i_slot(d)]++;
}

would fix this, opinions?
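
To make the failure mode concrete, here is a minimal standalone sketch, plain C++ rather than cctbx code, that replicates the binning arithmetic above and bolts on the proposed isnan guard (sketch_i_slot is just an illustrative name). Every comparison involving NaN is false, so an unguarded NaN always falls through to slot 0, the leftmost bin, which matches the 6,338 counts seen above.

// Standalone sketch, not cctbx code: the same binning arithmetic as get_i_slot above.
#include <cmath>
#include <cstddef>
#include <iostream>

std::size_t sketch_i_slot(double d_, double data_min, double slot_width, std::size_t n_slots) {
  std::size_t i_slot = 0;
  double d = d_ - data_min;                  // NaN minus anything is still NaN
  if (d != 0 && d >= slot_width) {           // every comparison with NaN is false...
    i_slot = static_cast<std::size_t>(d / slot_width);
    if (i_slot >= n_slots) i_slot = n_slots - 1;
  }
  return i_slot;                             // ...so a NaN always falls through to slot 0
}

int main() {
  const double data_min = -10.0, slot_width = 2.0;  // 10 slots over [-10, 10], as in the report
  const std::size_t n_slots = 10;
  for (double v : {std::nan(""), -9.0, 0.5}) {
    std::cout << v << " -> slot " << sketch_i_slot(v, data_min, slot_width, n_slots)
              << (std::isnan(v) ? "  (skipped entirely with the isnan guard)" : "")
              << "\n";
  }
  return 0;
}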

graeme-winter commented 4 years ago

Update: this is not limited to flex.histogram - flex.min gives some weird results:

from scitbx.array_family import flex
import math

x = flex.double(range(-10, 100)) + 0.5
l = flex.log10(x)  # the first ten entries (log10 of a negative number) are NaN

# print the first non-NaN value for reference
for _ in l:
    if math.isnan(_):
        continue
    print(_)
    break

print(flex.min(l), flex.max(l))

gives

-0.3010299956639812
1.9800033715837464 1.9978230807457256

i.e. flex.min() gives the wrong answer if there are NaNs in the input array

graeme-winter commented 4 years ago

To be clear, following the last comment: flex.min should return either NaN or the minimum of the non-NaN values, not some other arbitrary element. Returning NaN would be consistent with std::min; returning the true minimum of the non-NaN values would be consistent with fmin from math.h.
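
For reference, a small standalone illustration of that distinction in plain C++ (nothing cctbx-specific). Strictly, std::min only propagates a NaN when it appears as the first argument, because every comparison with NaN is false, whereas fmin always prefers the non-NaN operand:

#include <algorithm>  // std::min
#include <cmath>      // std::fmin, std::nan
#include <iostream>

int main() {
  const double qnan = std::nan("");
  // std::min is defined via operator<, and comparisons with NaN are always false,
  // so the result depends on argument order...
  std::cout << std::min(qnan, 1.0) << " " << std::min(1.0, qnan) << "\n";  // nan 1
  // ...whereas fmin treats NaN as missing data and returns the other operand.
  std::cout << std::fmin(qnan, 1.0) << " " << std::fmin(1.0, qnan) << "\n";  // 1 1
  return 0;
}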

phyy-nx commented 4 years ago

Hi, so first I note that your test behaves differently on my centos7 build and my macos build. Since #463, macos builds have removed -ffast-math, so that's likely the reason.

Test:

from scitbx.array_family import flex
import math

x = flex.double(range(-10, 100)) + 0.5
l = flex.log10(x)
print(flex.min(l), flex.max(l))

Min result on centos7: -0.3010299956639812
Min result on macos: NaN

In neither case do I get your min result of 1.9800033715837464. Not sure what is going on, but it's enough to say that how the software is compiled greatly impacts the results.

That said, I do have some context that could help inform this discussion (likely familiar to all parties). To my knowledge, cctbx doesn't claim IEEE compliance. Indeed, until #324, we explicitly crashed developer environments when many operations that produce NaNs were executed (such as 1/0). The main reason is that we wanted -ffast-math. So it was up to the developer not to let their code reach a situation where NaNs were possible.

Since #324, it could be argued that cctbx should support NaNs. This is a big statement to make. Your commit in #479 adds a check for NaN for every entry in a flex array while creating a histogram. The same check would need to be added to every flex operation, not just min and max, but mean, median, etc. That seems to me like a big overhead, but maybe modern compilers with branch prediction are good enough that this doesn't have an impact. Regardless, it's a big set of changes. Is that really what we want?
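
To picture what that per-element handling would look like, here is a sketch of a NaN-skipping min reduction built on fmin (illustration only, not flex code; nan_aware_min is a made-up name). Every element costs an extra call or branch, which is the overhead in question:

// Illustration only, not flex code: a NaN-skipping min reduction built on fmin.
#include <cmath>
#include <iostream>
#include <vector>

double nan_aware_min(const std::vector<double>& v) {
  double result = std::nan("");        // fmin(NaN, x) == x, so the first non-NaN
  for (double x : v) {                 // element replaces the initial NaN
    result = std::fmin(result, x);     // NaN operands never win
  }
  return result;                       // NaN only if v is empty or all-NaN
}

int main() {
  std::vector<double> v = {std::nan(""), 2.0, -0.5, std::nan("")};
  std::cout << nan_aware_min(v) << "\n";  // -0.5
  return 0;
}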

If we want to be IEEE compliant then we should drop -ffast-math, but that's a big decision, likely a community one. Otherwise my vote is to keep the current policy: buyer beware if you generate NaNs. cctbx is specific to a scientific field (e.g. crystallography, cryo-EM, and modeling macromolecules), which generally means that the numbers we work with are finite. We're not like numpy, which is for general math. Usually, any algorithm that regularly generates NaN or infinity values has some issue that should be fixed at the algorithm level.
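
For completeness, the reason NaN checks and -ffast-math don't mix: -ffast-math implies -ffinite-math-only, which permits the compiler to assume NaNs never occur, so an isnan guard may be optimized away entirely. A small illustration (what you observe depends on the compiler and flags, not on the code itself):

// Illustration only: the observable behaviour depends on the compiler and flags.
#include <cmath>
#include <cstdio>

int main(int argc, char**) {
  // argc keeps the value from being constant-folded away at compile time.
  double x = (argc > 100) ? 1.0 : std::log10(-1.0);  // NaN at run time in practice
  // A plain build prints "1 1"; with -ffast-math (-ffinite-math-only) the compiler
  // is allowed to assume x is never NaN and may fold both checks to 0.
  std::printf("%d %d\n", std::isnan(x) ? 1 : 0, (x != x) ? 1 : 0);
  return 0;
}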

Thoughts or comments? Seems like a useful discussion to have.

Anthchirp commented 4 years ago

I just verified this using an installation of DIALS 1.1 we still keep around (March 2016, and thus well before any NaN discussions from #324). Evidently cctbx did not guarantee that NaNs were always avoided; it just happened to crash some of the time.

$ dials.python
Python 2.7.8 (default_cci, Apr  8 2016, 15:33:21) 
[GCC 4.4.7 20120313 (Red Hat 4.4.7-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from scitbx.array_family import flex
>>> import math
>>> x = flex.double(range(-10, 100)) + 0.5
>>> l = flex.log10(x)
>>> list(l)
[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, -0.3010299956639812, 0.17609125905568124, 0.3979400086720376, 0.5440680443502757, 0.6532125137753437, 0.7403626894942439, 0.8129133566428556, 0.8750612633917001, 0.9294189257142927, 0.9777236052888477, 1.021189299069938, 1.0606978403536118, 1.0969100130080565, 1.130333768495006, 1.1613680022349748, 1.1903316981702914, 1.2174839442139063, 1.2430380486862944, 1.2671717284030137, 1.290034611362518, 1.3117538610557542, 1.3324384599156054, 1.3521825181113625, 1.3710678622717363, 1.3891660843645326, 1.4065401804339552, 1.423245873936808, 1.4393326938302626, 1.4548448600085102, 1.469822015978163, 1.4842998393467859, 1.4983105537896004, 1.5118833609788744, 1.5250448070368452, 1.5378190950732742, 1.550228353055094, 1.5622928644564746, 1.5740312677277188, 1.5854607295085006, 1.5965970956264601, 1.6074550232146685, 1.6180480967120927, 1.6283889300503116, 1.6384892569546374, 1.6483600109809315, 1.6580113966571124, 1.667452952889954, 1.6766936096248666, 1.6857417386022637, 1.6946051989335686, 1.7032913781186614, 1.711807229041191, 1.7201593034059568, 1.7283537820212285, 1.7363965022766423, 1.7442929831226763, 1.7520484478194385, 1.7596678446896306, 1.7671558660821804, 1.7745169657285496, 1.7817553746524688, 1.7888751157754168, 1.7958800173440752, 1.8027737252919755, 1.8095597146352678, 1.816241299991783, 1.8228216453031045, 1.8293037728310249, 1.8356905714924256, 1.841984804590114, 1.8481891169913987, 1.8543060418010806, 1.8603380065709938, 1.866287339084195, 1.8721562727482928, 1.8779469516291882, 1.8836614351536176, 1.8893017025063104, 1.8948696567452525, 1.9003671286564703, 1.9057958803678685, 1.9111576087399766, 1.916453948549925, 1.921686475483602, 1.9268567089496924, 1.9319661147281726, 1.9370161074648142, 1.9420080530223132, 1.9469432706978254, 1.951823035315912, 1.9566485792052033, 1.9614210940664483, 1.9661417327390325, 1.9708116108725178, 1.975431808509263, 1.9800033715837464, 1.9845273133437926, 1.9890046156985368, 1.9934362304976116, 1.9978230807457253]
>>> print(flex.min(l), flex.max(l))
(-0.3010299956639812, 1.9978230807457253)
>>> import dials.util.version
>>> dials.util.version.dials_version()
'DIALS 1.1.3-ga7ac5d8-release'

I do note, however, that in 2016 on RHEL 7 it returned the correct values.

Anthchirp commented 4 years ago

Same results for all versions since (cs03r-sc-serv-16, RHEL 7.8 server):

$ for x in 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 1.10 1.11 1.12 1.14 2.0 2.1 2.2; do module load dials/$x; dials.python test.py; done
DIALS 1.2.6-gc8a142d-release
(-0.3010299956639812, 1.9978230807457253)
DIALS 1.3.4-g9f5e5cd-release
(-0.3010299956639812, 1.9978230807457253)
DIALS 1.4.5-g37e56cb3-release
(-0.3010299956639812, 1.9978230807457253)
DIALS 1.5.1-gf72becc3-release
(-0.3010299956639812, 1.9978230807457253)
DIALS 1.6.4-g47a579eb-release
(-0.3010299956639812, 1.9978230807457253)
DIALS 1.7.4-gbe933d59-release
(-0.3010299956639812, 1.9978230807457253)
DIALS 1.8.6-gc63d8697-release
(-0.3010299956639812, 1.9978230807457253)
DIALS 1.9.3-gb491019a-release
(-0.3010299956639812, 1.9978230807457253)
DIALS 1.10.4-gc14d9720b-release
(-0.3010299956639812, 1.9978230807457253)
DIALS 1.11.4-g9c616a499-release
(-0.3010299956639812, 1.9978230807457253)
DIALS 1.12.6-gdcf0b845a-release
(-0.3010299956639812, 1.9978230807457253)
DIALS 1.14.13-g10ecfbb15-release
(-0.3010299956639812, 1.9978230807457253)
DIALS 2.0.4-g0252e41c3-release
(-0.3010299956639812, 1.9978230807457253)
DIALS 2.1.5-g118c27f8c-release
(-0.3010299956639812, 1.9978230807457253)
DIALS 2.2.4-g04de204b4-release
-0.3010299956639812 1.9978230807457253

phyy-nx commented 4 years ago

@Anthchirp Regarding crashes, the #324 traps aren't present in the release builds, so they won't trigger. Regardless, good to see the consistent results from the linux builds. I'd be curious whether the mac builds suddenly changed from -0.3 to NaN when #463 was merged.

graeme-winter commented 4 years ago

> Thoughts or comments? Seems like a useful discussion to have.

Agree with this, certainly. Personally I am very much in favour of having code which behaves as IEEE says it should - robust and working trumps -ffast-math in my book.