JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.77k stars 5.49k forks source link

binomial is sluggish on Int128; @generated binomial(n, ::Val{k}) where {k} #31093

Open chethega opened 5 years ago

chethega commented 5 years ago

We are missing

julia> binomial(Int128(137), 5)
ERROR: MethodError: no method matching binomial(::Int128, ::Int64)

Further, binomial is very slow on Int128:

julia> @btime binomial(Int128(137), Int128(5))
  2.678 μs (64 allocations: 992 bytes)
373566942

This is due to the widemul. At the very least, we should probably use Base.GMP.MPZ to do all operations in-place (I think we only need to allocate 2 BigInt that we can re-use). But it is probably not too hard to avoid BigInt altogether. binomial on Int128 is quite relevant, due to its tendency to overflow on small integer types.

Lastly, it would be very cool to have binomial(n, ::Val{k}) where k: Many applications know how many (small k) elements they want to choose, out of a runtime n. Unfortunately, constant-prop is currently not up to the task of speeding this up (even if binomial were @inline). Knowing k at compile time has three advantages: Most importantly, expensive integer divisions can be replaced by multiplications and shifts. LLVM is quite adept at this. Second, the loop can be fully unrolled. Third, we only need a single overflow check at the beginning: We know k at compile time, so we need two comparisons: One to check whether we can return any result at all, and a second one to check whether widening is needed at all (if n and k are relatively small Int64, then we don't need to promote intermediate values to Int128). This could look like

julia> @inline function Base.binomial(n::Int64, ::Val{k}) where k
       if @generated
           rss = Expr(:block)
           rs = rss.args
           push!(rs, :(acc = 1))
           for i=0:k-1
               push!(rs, :(acc = div(acc*(n-$i), $(i+1))))
           end
           push!(rs, :(return acc))
           return rss
       else
           return binomial(n,k)
       end
       end
julia> ac1 = n->binomial(n, 5); ac2 = n->binomial(n, Val(5)); ns = rand(1:(1<<12), 10_000);

julia> using BenchmarkTools

julia> @btime ac1.($ns);
  835.487 μs (4 allocations: 78.23 KiB)

julia> @btime ac2.($ns);
  61.778 μs (4 allocations: 78.23 KiB)

That is a hefty speedup from 80 to 6 nanoseconds. A real implementation would behave slightly different: Due to the necessary branches, it could not simd; but we could possibly save some cycles by pulling the divisions out, as long as that is possible without overflow (without overflow checking, that brings be to 3ns for binomial(n, Val(5)); that means that we beat lookup tables, if Val is applicable).

Is there interest in such a variant, or is this something that should rather live in a package?

It should be possible to make this non-@generated, by using recursion; but that would be more complicated. Is there an advantage to that?

laborg commented 1 year ago

This has improved

julia> @btime binomial(Int128(137), 5)
  5.048 ns (0 allocations: 0 bytes)
3.73566942e8

but its still slow for two Int128 arguments:

julia> @btime binomial(Int128(137), Int128(5))
  2.898 μs (52 allocations: 992 bytes)
373566942