Thanks so much for the detailed report!
The short-circuiting has been entirely removed in #43604, so at least the incoming 1.8 will handle `NaN`/`missing` correctly.

As for signed zeros, I think it would be easier to fix the `isbadzero` branch directly. Although it might further aggravate the overhead of the repeated traversal, the influence should be minimal if `f` is cheap. For an expensive `f`, I believe the allocation overhead of `minimum(f.(A))` can be ignored.

As for `Arb`, a temporary solution is to overload `Base._fast` for your own type, although this function is for internal use and might be removed in a future release (see #45581).
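For instance, something along these lines might work for now (a sketch, not tested; `Base._fast` is internal, so this can break without notice):

```julia
using Arblib  # provides the Arb type

# Opt Arb out of the simplified AbstractFloat fast path by falling back
# to the ordinary min/max, which account for the tracked rounding errors.
Base._fast(::typeof(min), x::Arb, y::Arb) = min(x, y)
Base._fast(::typeof(max), x::Arb, y::Arb) = max(x, y)
```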
Short-circuiting for `missing` is maybe worthwhile, but I agree it's not for `NaN`.
I wonder if we can make `BigFloat` `min` and `max` never allocate: surely either one argument or the other is always a correct answer?
I've been swamped with real-life stuff lately. I hope to actually finish #45581 soon. This should restore correctness and improve performance across a range of cases. The optimizations fail to apply with `missing` elements, but at least it will restore correctness in that case (and it appears it might still be faster than the current version anyway, although not as fast as possible). It should address all the points you raised above.
I hate function-specialized versions of general-purpose functions like `Base.mapreduce_impl`. They're a maintenance nightmare and tend to have unintended consequences (e.g., #45581 significantly improves `minimum` over `Int` elements just by liberating them from the trap of the specialized `Base.mapreduce_impl` that was added to accelerate `IEEEFloat` elements).
Ah, so with #43604 one of the issues is fixed! If #45581 might be finished somewhat soon, then maybe the best idea is to just wait for that?

Indeed, with #43604 fixing the short-circuiting, overloading `Base._fast` is a solution for `Arb`. I will just have to keep track of any new internal changes that might happen.
Regarding a non-allocating `min` and `max` for `BigFloat`: I don't see why the implementation used for `Float64` would not also work for `BigFloat`. Looking at the `mpfr_min` source code, it doesn't seem to do anything special that we would have to take into account. I think the only difference is that the current implementation allocates a fresh result and rounds it to the current precision. But if one of the arguments is `NaN` it returns that argument directly, so you are already not guaranteed a freshly allocated result or a result at the current precision.
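A minimal sketch of what a non-allocating version could look like, assuming it is acceptable to return one of the arguments instead of a freshly rounded copy (`min_noalloc` is just an illustrative name):

```julia
# Mirrors the Float64 semantics, including NaN propagation and the
# ordering of signed zeros, but returns one of the arguments instead
# of allocating a rounded copy.
function min_noalloc(x::BigFloat, y::BigFloat)
    isnan(x) && return x
    isnan(y) && return y
    if iszero(x) && iszero(y)
        return signbit(x) ? x : y  # min(-0.0, 0.0) === -0.0
    end
    return x < y ? x : y
end
```

Whether handing back an argument, with whatever precision it happens to carry, is acceptable is exactly the question about the current behaviour raised above.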
> Regarding a non-allocating `min` and `max` for `BigFloat`.
Regarding signed zeros, my understanding is that `-0.0 == 0.0` and that they must be treated as the same number on every occasion. I'm surprised there's any handling of signed zeros in the codebase at all, other than perhaps in specialized functions.
As for `mapreduce`, the discussion reminds me a bit of #45000, which I still consider a major issue despite it having been swiftly closed. My understanding is that the algorithm we use in general for `mapreduce` should only have been used for one specific application: accurate summation of arrays of arbitrary length. For moderate-sized arrays the approach is actually less accurate today than a simple summation loop compiled for a modern processor, because of vectorization. I personally think there's a place for specialized algorithms, but using this pairwise-reduction algorithm by default is not great; it's a specialized algorithm being used in general. A default reduction should simply translate into a basic for-loop. By using something else, we risk preventing compiler optimizations, and I believe this actually happens in many situations.
https://github.com/JuliaLang/julia/pull/43604 removed the short-circuiting in the presence of `NaN` or `missing`, so some examples in the OP now return correct results.
IIUC the only example in the OP that remains broken is

```julia
julia> minimum(abs, fill(-0.0, 16))
-0.0

julia> minimum(abs.(fill(-0.0, 16)))
0.0
```
I have some comments regarding correctness issues with `minimum` and `maximum`. One is about the handling of signed zeros for floating points and the other about short-circuiting for `NaN` and `missing`. I believe the current implementation favours performance over correctness, something I think one should be extremely careful with.

Before discussing the issues I want to give some context about the `minimum` and `maximum` functions and why I think it is extra important that these methods are implemented correctly. The call `minimum(A::AbstractArray)` eventually calls `mapreduce(identity, min, A)`; similarly, `minimum(f, A)` eventually calls `mapreduce(f, min, A)`. After some more indirection this calls `Base._mapreduce(f, min, IndexLinear(), A)`. This method has some special cases for when the length of `A` is small, more precisely it handles `length(A) < 16` directly; otherwise it calls `Base.mapreduce_impl(f, min, A, firstindex(A), lastindex(A))`. So far everything uses the default implementations of the reduction functions, with no special handling of the `min` argument, but the `Base.mapreduce_impl` function does have a specialised method for `min` and `max`, found here.
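Schematically (internal names, so the details may change between versions):

```julia
# call chain for minimum(A) on an array with length(A) >= 16:
#
# minimum(A)
#   -> mapreduce(identity, min, A)
#   -> Base._mapreduce(identity, min, IndexLinear(), A)
#        # length(A) < 16 is handled directly here
#   -> Base.mapreduce_impl(identity, min, A, firstindex(A), lastindex(A))
#        # the method specialised for min/max
```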
The above indirection leads to a couple of potential issues:

1. It is not possible to dispatch on the element type or on the function `f`. This means that if `Base.mapreduce_impl(f, min, A, firstindex(A), lastindex(A))` makes some optimizations that are not valid for your custom type `T`, there is no way to handle this. For most Julia functions you can create a custom method by dispatching on your type, but here this is not possible. This calls for being extra careful that the implementation is correct in all possible cases.
2. Any special behaviour would naturally be documented on `minimum` and `maximum`. However, since they call `mapreduce(f, min, A)`, you would get the same correctness issues if you for example call `reduce(min, A)`, and it is not really possible to add documentation to `reduce` and `mapreduce` about special handling of `min` and `max`. Again this calls for being extra careful with correctness, since it is not possible to document the behaviour fully.
Now to the actual issues. I give all examples using `minimum`, but `maximum` has exactly the same issues. Whenever I refer to timings, they are for an AMD Ryzen 9 5900X.

## Handling of signed zeros
The main optimization done in the implementation of `Base.mapreduce_impl(f, min, A, firstindex(A), lastindex(A))` is in the way it handles signed zeros. It calls `_fast(min, x, y)`, which is defined by:
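Quoting `base/reduce.jl` (reconstructed here, so minor details may differ between versions):

```julia
_fast(::typeof(min), x, y) = min(x, y)
_fast(::typeof(max), x, y) = max(x, y)
function _fast(::typeof(max), x::AbstractFloat, y::AbstractFloat)
    ifelse(isnan(x),
        x,
        ifelse(x > y, x, y))
end
function _fast(::typeof(min), x::AbstractFloat, y::AbstractFloat)
    ifelse(isnan(x),
        x,
        ifelse(x < y, x, y))
end
```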
So in general it calls `min` directly and there is nothing to discuss. But if `x` and `y` are floats it uses a simplified version of `min` which doesn't take into account the ordering of `-0.0` and `0.0`. To still enforce the ordering of the signed zeros, it ends with:
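From the same file (again reconstructed, so details may vary):

```julia
# enforce correct order of 0.0 and -0.0
# e.g. maximum([0.0, -0.0]) === 0.0
# should hold
if isbadzero(op, v)
    # v is a zero of the "wrong" sign; rescan for a properly signed zero
    for i in first:last
        x = @inbounds A[i]
        isgoodzero(op, x) && return x
    end
end
return v
```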
However, the issue with this code is that it doesn't apply the function `f` to `x` before the comparison, so it only works correctly for `f = identity` (or as long as `f` preserves the sign, I guess). For example we have:
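(A reconstructed REPL session; `abs` flips the sign of `-0.0`, so the zero fixup rescan returns the wrong zero:)

```julia
julia> minimum(abs, fill(-0.0, 16))  # expected 0.0
-0.0

julia> minimum(abs.(fill(-0.0, 16)))
0.0
```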
Note that if you give fewer than 16 elements it uses another implementation and you get a correct result (so the example `maximum([0.0, -0.0]) === 0.0` in the comment doesn't actually use the code below the comment), which makes it harder to track down this issue.

From what I can tell there is no simple fix for this which doesn't sacrifice performance. One possibility is to add the application of `f` to the check for `-0.0`. It would give correct results but would mean applying `f` twice to each element, which I don't think is an option. The most "correct" solution would be to avoid using the `_fast` method and just use `min` directly. However this comes with a fairly large performance penalty compared to the current implementation, about a 100% increase in runtime for `Float64` in some benchmarks I did. Another, slightly more involved, solution is to special case on `f = identity` and apply this optimization in that case but not otherwise.

Personally I would lean towards the correct but slower solution: avoid using `_fast` and call `min` directly. But I don't know what kind of performance regressions people are willing to accept.

Lastly I can also mention that the current version incurs a quite noticeable performance penalty if the minimum happens to be `0.0`, since the array is traversed twice.

## Short circuiting for `NaN` and `missing`
The implementation has a short circuit meant for when it encounters `NaN` or `missing`. The relevant code is:
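Reconstructed from `base/reduce.jl` (`v1` through `v4` are the four accumulators of the unrolled loop):

```julia
# short circuit in case of NaN or missing
(v1 == v1) === true || return v1
(v2 == v2) === true || return v2
(v3 == v3) === true || return v3
(v4 == v4) === true || return v4
```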
The benefit of this is that we can avoid looking through the whole vector in some cases. However it has some issues. The main issue is that it short circuits on BOTH `NaN` and `missing` and hence can return `NaN` in cases where it would be more correct to return `missing`. Exactly which one is returned is also slightly involved, which makes it very hard to test for this case. Some examples:
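(A reconstructed illustration; the array has to be long enough to reach the chunked loop for the short circuit to trigger, and the exact outputs depend on the lengths and positions involved:)

```julia
julia> minimum([NaN; fill(1.0, 255); missing])  # long: short circuits on NaN
NaN

julia> minimum([NaN, missing])  # short: generic path, missing propagates
missing
```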
Here I think the obvious solution is to just remove this short circuiting. As mentioned in #35939, it seems unlikely that it sits in the hot path in many cases. For the vast majority of cases this should not lead to any performance reductions; it might even lead to very slight performance improvements. If one really wants to keep this behaviour, then one option could be to only short circuit on `missing`, in which case the behaviour would be correct. If one really wants to, I guess it might also be possible to check whether the return type of `f` contains `missing` and, if it does not, short circuit on `NaN`; in that case the check should be updated to use `isnan`. But as I said, I would opt for just removing the short circuiting altogether.

## Behaviour for other types
Finally, I want to give some observations on the behaviour for other types. The reason I took a look at this function in the first place was due to issues I encountered when developing Arblib.jl.
For `AbstractFloat` it uses `ifelse(isnan(x), x, ifelse(x < y, x, y))` instead of `min` directly. Except for not handling signed zeros this is equivalent for `Float64` (and most other `AbstractFloat`s in Base). For `BigFloat` it has the benefit of not allocating (which `min` does). In fact I don't know why `min(x::BigFloat, y::BigFloat)` is written to allocate a new variable for the result; in contrast, the implementation of `min(x::BigInt, y::BigInt)` does not allocate a new variable.
My issue with using this approach instead of `min` directly is that it is not valid for the `Arb` type from Arblib.jl. The `Arb` type can, slightly simplified, be described as a floating point number with some extra information keeping track of rounding errors; when taking the minimum these rounding errors have to be taken into account, and this is not done by the above check. Now, this might very well be my own fault for subtyping `AbstractFloat` even though the type doesn't satisfy the assumptions Julia makes, but in many cases it is very natural to treat an `Arb` as a floating point number and see the tracking of the rounding errors as a bonus. In most cases you can then just dispatch on the `Arb` type when you need a different implementation, but as mentioned in the beginning this is not possible for `minimum`.
The second observation is related to the current implementation of the short circuiting. It currently assumes that `(x == x) === true` is false only for `NaN` and `missing`. While this is true for all types in base Julia, I don't think it should be assumed for custom types. For example, the `Arb` type mentioned above doesn't satisfy this assumption (`x == x` is false if the rounding error is non-zero). It would be better to be more explicit and instead use something along the lines of `(x isa Number && isnan(x)) || ismissing(x)`, as mentioned in #35989; this should also be slightly faster in many cases. Though my stance would be that the short circuiting should be removed entirely.

## What to do now?
I would be happy to prepare a PR to handle some of these issues. But exactly what approach to take depends on how much performance one is willing to sacrifice, or how much code complexity one is willing to add. Below are some possible solutions; I can add more benchmark information if it is of interest.

The simplest possible implementation which handles all of the above mentioned issues would be:
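Something along these lines (a sketch following the description: call `op` directly, with no `_fast` and no short circuiting):

```julia
function mapreduce_impl(f, op::Union{typeof(max),typeof(min)},
                        A::AbstractArrayOrBroadcasted, first::Int, last::Int)
    a1 = @inbounds A[first]
    v = mapreduce_first(f, op, a1)
    @inbounds for i in first+1:last
        v = op(v, f(A[i]))
    end
    return v
end
```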
The drawbacks of this approach are no short circuiting and worse performance for floating points. For `Float64` it is about 100% slower. For `BigFloat` it is about 300% slower, due to `min` allocating whereas the previous version didn't. The performance regression for `BigFloat` I think is acceptable; if you want to make it faster you can use MutableArithmetics.jl. The performance penalty for `Float64` I'm not sure is okay or not.

A solution which fixes some of the issues without paying much in terms of performance would be to replace the `op` in the above implementation with the `_fast` version. This would mean that it still doesn't handle signed zeros correctly, but at least it handles `NaN` and `missing` correctly. The only performance regression would be the removal of the short circuiting. In this case a note should probably be added to `minimum` about signed zeros not being handled.
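A sketch of that variant (the same loop as above with `_fast` swapped in; untested):

```julia
function mapreduce_impl(f, op::Union{typeof(max),typeof(min)},
                        A::AbstractArrayOrBroadcasted, first::Int, last::Int)
    a1 = @inbounds A[first]
    v = mapreduce_first(f, op, a1)
    @inbounds for i in first+1:last
        # _fast handles NaN and missing correctly and never short
        # circuits, but it ignores the ordering of signed zeros
        v = _fast(op, v, f(A[i]))
    end
    return v
end
```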
Finally, the most complicated solution would be to special case on `f = identity`: use `_fast` in that case together with an explicit check for the result being zero at the end, and fall back to `min` if `f` is not `identity`.

None of the solutions keep the short circuiting; this I think is a good trade-off in terms of correctness versus performance. But others might think differently.