Closed ralfulrich closed 3 years ago
... same thing for make_histogram_with(dense_storage<mean>(), ...), which I tried as an alternative solution.
Not at all. Returning NaN is the intended behavior. The variance is undefined when you have only a single sample per bin; returning anything other than NaN makes no sense. The problem is that you have signaling NaNs enabled, which is not the default.
There are three options: 1) You check for .count() > 1 before calling .variance(). 2) You turn off signaling NaNs. 3) You write your own accumulator that implements the behavior you want and use that with Boost.Histogram. The library was designed to make using custom accumulators easy.
After inspecting the code, I am reopening this. I maintain that reporting NaN for .count() == 1 is a sensible choice, but I admit that there is an annoying inconsistency, because .variance() is 0 if .count() is 0, but .variance() is NaN if .count() is 1. This is awkward. Ideally, it should be either NaN in both cases or 0 in both cases.
@ralfulrich I want to give you some context on why the design is the way it is, so that you understand why I cannot just "fix" this. In the C++ stdlib and in Boost we aim for the Zero Overhead Principle: library code should be the most efficient implementation that anyone could write, and it should not be possible for the user to write more efficient code. An example is std::vector::operator[], which does not check whether the index is valid, because performing that check on every call would also penalize users who know for sure that their indices are always valid.
The builtin accumulators were written with this principle in mind. The consequence is that validity checks are sometimes pushed onto the user. In this case, the user is supposed to check .count() before calling .variance(). If I included that check in the call to .variance(), it would incur a runtime penalty for all users, even those who know for sure that their bins always have at least two entries.
There are three options to go about this: 1) I change the code to always return 0 for .count() < 2. This makes the code slightly slower for everyone. 2) I change the code to always return NaN for .count() < 2. This also makes the code slightly slower for everyone. 3) I do nothing and document that the return value of .variance() for .count() < 2 is undefined. This would be the usual C++ solution, compliant with the Zero Overhead Principle.
Behavior is now documented (option 3) in develop and master.
Running a standard profile and filling it with a single entry will produce a SIGFPE when using streaming to cout:

produces:

which indicates that the variance() method is not robust at all. At least it should check whether sum_ > 1 and return 0 otherwise, or something smarter (?).