Simple sum loop faster when adding `zero(T)`

In the example below, computing the sum of an Array{Union{Float64, Missing}} using a simple loop is faster when inserting a branch which adds zero(T) to the accumulator when an entry is missing rather than doing nothing. This is because the latter uses SIMD instructions, but not the former (even with @simd). Adding -zero(T) instead of zero(T) disables SIMD. Is this expected or could the compiler improve?

This means that sum(skipmissing(::Array{Union{Float64, Missing}})) is slower than what a specialized implementation which would add zero(T) could do (note that pairwise sum by passing ismissing(x) ? zero(T) : x has similar performance). But of course adding zero(T) doesn't give the same result in corner cases (when the vector contains only -0.0 AFAICT; are there other cases?). If that's the only solution, we could check whether there's at least one entry which is not -0.0, and if so switch to adding zero(T)?

julia> using BenchmarkTools

julia> function sum_nonmissing(X::AbstractArray)
           s = -zero(eltype(X))
           @inbounds @simd for x in X
               if x !== missing
                   s += x
               end
           end
           s
       end
sum_nonmissing (generic function with 1 method)

julia> function sum_nonmissing2(X::AbstractArray)
           s = -zero(eltype(X))
           @inbounds @simd for x in X
               if x !== missing
                   s += x
               else
                   s += zero(eltype(X))
               end
           end
           s
       end
sum_nonmissing2 (generic function with 1 method)

julia> function sum_nonmissing3(X::AbstractArray)
           s = -zero(eltype(X))
           @inbounds @simd for x in X
               if x !== missing
                   s += x
               else
                   s += -zero(eltype(X))
               end
           end
           s
       end
sum_nonmissing3 (generic function with 1 method)

julia> X1 = rand(10_000_000);

julia> X2 = Vector{Union{Missing, Float64}}(X1);

julia> X3 = ifelse.(rand(length(X2)) .< 0.9, X2, missing);

julia> @btime sum_nonmissing(X3)
  17.278 ms (1 allocation: 16 bytes)
4.499755611211945e6

julia> @btime sum_nonmissing2(X3)
  6.788 ms (1 allocation: 16 bytes)
4.4997556112116e6

julia> @btime sum_nonmissing3(X3)
  16.542 ms (1 allocation: 16 bytes)
4.499755611211945e6

JuliaLang / julia

Simple sum loop faster when adding `zero(T)` #40599