Thanks so much for the detailed report!
The short-circuiting has been entirely removed in #43604, so at least the incoming 1.8 will handle `NaN`/`missing` correctly.

As for signed zeros, I think it would be easier to fix the `isbadzero` branch directly. Although it might further aggravate the overhead of the repeated traversal, the influence should be minimal if `f` is cheap. For an expensive `f`, I believe the allocation overhead of `minimum(f.(A))` can be ignored.

As for `Arb`, a temporary solution is to overload `Base._fast` for your own type, although this function is for internal use and might be removed in a future release (see #45581).
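For instance, something along these lines might work for now (a sketch, not tested; `Base._fast` is internal, so this can break without notice):

```julia
using Arblib  # provides the Arb type

# Opt Arb out of the simplified AbstractFloat fast path by falling back
# to the ordinary min/max, which account for the tracked rounding errors.
Base._fast(::typeof(min), x::Arb, y::Arb) = min(x, y)
Base._fast(::typeof(max), x::Arb, y::Arb) = max(x, y)
```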
Short-circuiting for `missing` is maybe worthwhile, but I agree it's not for `NaN`.
I wonder if we can make `BigFloat` `min` and `max` never allocate: surely either one argument or the other is always a correct answer?
I've been swamped with real-life stuff lately. I hope to actually finish #45581 soon. This should restore correctness and improve performance across a range of cases. The optimizations fail to apply with `missing` elements, but at least it will restore correctness in that case (and it appears it might still be faster than the current version anyway, although not as fast as possible). It should address all the points you raised above.
I hate function-specialized versions of general-purpose functions like `Base.mapreduce_impl`. They're a maintenance nightmare and tend to have unintended consequences (e.g., #45581 significantly improves `minimum` over `Int` elements just by liberating them from the trap of the specialized `Base.mapreduce_impl` that was added to accelerate `IEEEFloat` elements).
Ah, so with #43604 one of the issues is fixed! If #45581 might be finished somewhat soon, then maybe the best idea is to just wait for that?

Indeed, with #43604 fixing the short-circuiting, overloading `Base._fast` is a solution for `Arb`. I will just have to keep track of any new internal changes that might happen.
Regarding a non-allocating `min` and `max` for `BigFloat`: I don't see why the implementation used for `Float64` would not also work for `BigFloat`. Looking at the `mpfr_min` source code, it doesn't seem to do anything special that we would have to take into account. I think the only difference is that the current implementation allocates a fresh result and rounds it to the current precision. But if one of the arguments is `NaN` it returns that argument directly, so you are already not guaranteed a freshly allocated result or a result at the current precision.
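A minimal sketch of what a non-allocating version could look like, assuming it is acceptable to return one of the arguments instead of a freshly rounded copy (`min_noalloc` is just an illustrative name):

```julia
# Mirrors the Float64 semantics, including NaN propagation and the
# ordering of signed zeros, but returns one of the arguments instead
# of allocating a rounded copy.
function min_noalloc(x::BigFloat, y::BigFloat)
    isnan(x) && return x
    isnan(y) && return y
    if iszero(x) && iszero(y)
        return signbit(x) ? x : y  # min(-0.0, 0.0) === -0.0
    end
    return x < y ? x : y
end
```

Whether handing back an argument, with whatever precision it happens to carry, is acceptable is exactly the question about the current behaviour raised above.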
> Regarding a non-allocating `min` and `max` for `BigFloat`.
Regarding signed zeros, my understanding is that `-0.0 == 0.0` and that they must be treated as the same number on every occasion. I'm surprised there's any handling of signed zeros in the codebase at all, other than perhaps in specialized functions.
As for `mapreduce`, the discussion reminds me a bit of #45000, which I still consider a major issue despite it having been swiftly closed. My understanding is that the algorithm we use in general for `mapreduce` should only have been used for one specific application: accurate summation of arrays of arbitrary length. For moderate-sized arrays the approach is actually less accurate today than a simple summation loop compiled for a modern processor, because of vectorization. I personally think there's a place for specialized algorithms, but using this pairwise-reduction algorithm by default is not great; it's a specialized algorithm being used in general. A default reduction should simply translate into a basic for-loop. By using something else, we risk preventing compiler optimizations, and I believe this actually happens in many situations.
https://github.com/JuliaLang/julia/pull/43604 removed the short-circuiting in the presence of `NaN` or `missing`, so some examples in the OP now return correct results.
IIUC the only example in the OP that remains broken is

```julia
julia> minimum(abs, fill(-0.0, 16))
-0.0

julia> minimum(abs.(fill(-0.0, 16)))
0.0
```
I have some comments regarding correctness issues with `minimum` and `maximum`. One is about the handling of signed zeros for floating points and the other about short-circuiting for `NaN` and `missing`. I believe the current implementation favours performance over correctness, something I think one should be extremely careful with.

Before discussing the issues I want to give some context about the `minimum` and `maximum` functions and why I think it is extra important that these methods are implemented correctly. The call `minimum(A::AbstractArray)` eventually calls `mapreduce(identity, min, A)`; similarly, `minimum(f, A)` eventually calls `mapreduce(f, min, A)`. After some more indirection this calls `Base._mapreduce(f, min, IndexLinear(), A)`. This method has some special cases for when the length of `A` is small, more precisely it handles `length(A) < 16` directly; otherwise it calls `Base.mapreduce_impl(f, min, A, firstindex(A), lastindex(A))`. So far everything uses the default implementations of the reduction functions, with no special handling of the `min` argument, but the `Base.mapreduce_impl` function does have a specialised method for `min` and `max`, found here.
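Schematically (internal names, so the details may change between versions):

```julia
# call chain for minimum(A) on an array with length(A) >= 16:
#
# minimum(A)
#   -> mapreduce(identity, min, A)
#   -> Base._mapreduce(identity, min, IndexLinear(), A)
#        # length(A) < 16 is handled directly here
#   -> Base.mapreduce_impl(identity, min, A, firstindex(A), lastindex(A))
#        # the method specialised for min/max
```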
The above indirection leads to a couple of potential issues:

1. It is not possible to dispatch on the element type or on the function `f`. This means that if `Base.mapreduce_impl(f, min, A, firstindex(A), lastindex(A))` makes some optimizations that are not valid for your custom type `T`, there is no way to handle this. For most Julia functions you can create a custom method by dispatching on your type, but here this is not possible. This calls for being extra careful that the implementation is correct in all possible cases.
2. Any special behaviour would naturally be documented on `minimum` and `maximum`. However, since they call `mapreduce(f, min, A)`, you would get the same correctness issues if you for example call `reduce(min, A)`, and it is not really possible to add documentation to `reduce` and `mapreduce` about special handling of `min` and `max`. Again this calls for being extra careful with correctness, since it is not possible to document the behaviour fully.
Now to the actual issues. I give all examples using `minimum`, but `maximum` has exactly the same issues. Whenever I refer to timings, they are for an AMD Ryzen 9 5900X.

## Handling of signed zeros
The main optimization done in the implementation of `Base.mapreduce_impl(f, min, A, firstindex(A), lastindex(A))` is in the way it handles signed zeros. It calls `_fast(min, x, y)`, which is defined by:
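Quoting `base/reduce.jl` (reconstructed here, so minor details may differ between versions):

```julia
_fast(::typeof(min), x, y) = min(x, y)
_fast(::typeof(max), x, y) = max(x, y)
function _fast(::typeof(max), x::AbstractFloat, y::AbstractFloat)
    ifelse(isnan(x),
        x,
        ifelse(x > y, x, y))
end
function _fast(::typeof(min), x::AbstractFloat, y::AbstractFloat)
    ifelse(isnan(x),
        x,
        ifelse(x < y, x, y))
end
```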
So in general it calls `min` directly and there is nothing to discuss. But if `x` and `y` are floats it uses a simplified version of `min` which doesn't take into account the ordering of `-0.0` and `0.0`. To still enforce the ordering of the signed zeros, it ends with:
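From the same file (again reconstructed, so details may vary):

```julia
# enforce correct order of 0.0 and -0.0
# e.g. maximum([0.0, -0.0]) === 0.0
# should hold
if isbadzero(op, v)
    # v is a zero of the "wrong" sign; rescan for a properly signed zero
    for i in first:last
        x = @inbounds A[i]
        isgoodzero(op, x) && return x
    end
end
return v
```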
However, the issue with this code is that it doesn't apply the function `f` to `x` before the comparison, so it only works correctly for `f = identity` (or as long as `f` preserves the sign, I guess). For example we have:
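(A reconstructed REPL session; `abs` flips the sign of `-0.0`, so the zero fixup rescan returns the wrong zero:)

```julia
julia> minimum(abs, fill(-0.0, 16))  # expected 0.0
-0.0

julia> minimum(abs.(fill(-0.0, 16)))
0.0
```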
Note that if you give fewer than 16 elements it uses another implementation and you get a correct result (so the example `maximum([0.0, -0.0]) === 0.0` in the comment doesn't actually use the code below the comment), which makes it harder to track down this issue.

From what I can tell there is no simple fix for this which doesn't sacrifice performance. One possibility is to add the application of `f` to the check for `-0.0`. It would give correct results but would mean applying `f` twice to each element, which I don't think is an option. The most "correct" solution would be to avoid using the `_fast` method and just use `min` directly. However this comes with a fairly large performance penalty compared to the current implementation, about a 100% increase in runtime for `Float64` in some benchmarks I did. Another, slightly more involved, solution is to special case on `f = identity` and apply this optimization in that case but not otherwise.

Personally I would lean towards the correct but slower solution: avoid using `_fast` and call `min` directly. But I don't know what kind of performance regressions people are willing to accept.

Lastly I can also mention that the current version incurs a quite noticeable performance penalty if the minimum happens to be `0.0`, since the array is traversed twice.

## Short circuiting for `NaN` and `missing`
The implementation has a short circuit meant for when it encounters `NaN` or `missing`. The relevant code is:
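Reconstructed from `base/reduce.jl` (`v1` through `v4` are the four accumulators of the unrolled loop):

```julia
# short circuit in case of NaN or missing
(v1 == v1) === true || return v1
(v2 == v2) === true || return v2
(v3 == v3) === true || return v3
(v4 == v4) === true || return v4
```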
The benefit of this is that we can avoid looking through the whole vector in some cases. However it has some issues. The main issue is that it short circuits on BOTH `NaN` and `missing` and hence can return `NaN` in cases where it would be more correct to return `missing`. Exactly which one is returned is also slightly involved, which makes it very hard to test for this case. Some examples:
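(A reconstructed illustration; the array has to be long enough to reach the chunked loop for the short circuit to trigger, and the exact outputs depend on the lengths and positions involved:)

```julia
julia> minimum([NaN; fill(1.0, 255); missing])  # long: short circuits on NaN
NaN

julia> minimum([NaN, missing])  # short: generic path, missing propagates
missing
```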
Here I think the obvious solution is to just remove this short circuiting. As mentioned in #35939, it seems unlikely that it sits in the hot path in many cases. For the vast majority of cases this should not lead to any performance reductions; it might even lead to very slight performance improvements. If one really wants to keep this behaviour, then one option could be to only short circuit on `missing`, in which case the behaviour would be correct. If one really wants to, I guess it might also be possible to check whether the return type of `f` contains `missing` and, if it does not, short circuit on `NaN`; in that case the check should be updated to use `isnan`. But as I said, I would opt for just removing the short circuiting altogether.

## Behaviour for other types
Finally, I want to give some observations on the behaviour for other types. The reason I took a look at this function in the first place was due to issues I encountered when developing Arblib.jl.
For `AbstractFloat` it uses `ifelse(isnan(x), x, ifelse(x < y, x, y))` instead of `min` directly. Except for not handling signed zeros this is equivalent for `Float64` (and most other `AbstractFloat`s in Base). For `BigFloat` it has the benefit of not allocating (which `min` does). In fact I don't know why `min(x::BigFloat, y::BigFloat)` is written to allocate a new variable for the result; in contrast, the implementation of `min(x::BigInt, y::BigInt)` does not allocate a new variable.
My issue with using this approach instead of `min` directly is that it is not valid for the `Arb` type from Arblib.jl. The `Arb` type can, slightly simplified, be described as a floating point number with some extra information keeping track of rounding errors; when taking the minimum these rounding errors have to be taken into account, and this is not done by the above check. Now, this might very well be my own fault for subtyping `AbstractFloat` even though the type doesn't satisfy the assumptions Julia makes, but in many cases it is very natural to treat an `Arb` as a floating point number and see the tracking of the rounding errors as a bonus. In most cases you can then just dispatch on the `Arb` type when you need a different implementation, but as mentioned in the beginning this is not possible for `minimum`.
The second observation is related to the current implementation of the short circuiting. It currently assumes that `(x == x) === true` is false only for `NaN` and `missing`. While this is true for all types in base Julia, I don't think it should be assumed for custom types. For example, the `Arb` type mentioned above doesn't satisfy this assumption (`x == x` is false if the rounding error is non-zero). It would be better to be more explicit and instead use something along the lines of `(x isa Number && isnan(x)) || ismissing(x)`, as mentioned in #35989; this should also be slightly faster in many cases. Though my stance would be that the short circuiting should be removed entirely.

## What to do now?
I would be happy to prepare a PR to handle some of these issues. But exactly what approach to take depends on how much performance one is willing to sacrifice, or how much code complexity one is willing to add. Below are some possible solutions; I can add more benchmark information if it is of interest.

The simplest possible implementation which handles all of the above mentioned issues would be:
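Something along these lines (a sketch following the description: call `op` directly, with no `_fast` and no short circuiting):

```julia
function mapreduce_impl(f, op::Union{typeof(max),typeof(min)},
                        A::AbstractArrayOrBroadcasted, first::Int, last::Int)
    a1 = @inbounds A[first]
    v = mapreduce_first(f, op, a1)
    @inbounds for i in first+1:last
        v = op(v, f(A[i]))
    end
    return v
end
```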
The drawbacks of this approach are no short circuiting and worse performance for floating points. For `Float64` it is about 100% slower. For `BigFloat` it is about 300% slower, due to `min` allocating whereas the previous version didn't. The performance regression for `BigFloat` I think is acceptable; if you want to make it faster you can use MutableArithmetics.jl. The performance penalty for `Float64` I'm not sure is okay or not.

A solution which fixes some of the issues without paying much in terms of performance would be to replace the `op` in the above implementation with the `_fast` version. This would mean that it still doesn't handle signed zeros correctly, but at least it handles `NaN` and `missing` correctly. The only performance regression would be the removal of the short circuiting. In this case a note should probably be added to `minimum` about signed zeros not being handled.
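A sketch of that variant (the same loop as above with `_fast` swapped in; untested):

```julia
function mapreduce_impl(f, op::Union{typeof(max),typeof(min)},
                        A::AbstractArrayOrBroadcasted, first::Int, last::Int)
    a1 = @inbounds A[first]
    v = mapreduce_first(f, op, a1)
    @inbounds for i in first+1:last
        # _fast handles NaN and missing correctly and never short
        # circuits, but it ignores the ordering of signed zeros
        v = _fast(op, v, f(A[i]))
    end
    return v
end
```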
Finally, the most complicated solution would be to special case on `f = identity`: use `_fast` in that case together with an explicit check for the result being zero at the end, and fall back to `min` if `f` is not `identity`.

None of the solutions keep the short circuiting; this I think is a good trade-off in terms of correctness versus performance. But others might think differently.