Open chethega opened 5 years ago
This has improved:

```julia
julia> @btime binomial(Int128(137), 5)
5.048 ns (0 allocations: 0 bytes)
3.73566942e8
```
but it's still slow for two `Int128` arguments:

```julia
julia> @btime binomial(Int128(137), Int128(5))
2.898 μs (52 allocations: 992 bytes)
373566942
```
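One way to avoid widening past `Int128` (and with it any `BigInt` allocation) is to cancel common factors before each multiplication, so every intermediate value is itself a binomial coefficient. The following is a minimal sketch under assumptions, not what Base does: `binomial_nobig` is a hypothetical name, it only handles `n >= 0`, and it has not been benchmarked here.

```julia
# Hypothetical helper (not part of Base): binomial for Int128 without BigInt.
# Assumes n >= 0; overflow is reported via Base.checked_mul.
function binomial_nobig(n::Int128, k::Int128)
    n >= 0 || throw(DomainError(n, "this sketch only handles n >= 0"))
    0 <= k <= n || return zero(Int128)
    k = min(k, n - k)                 # use the symmetry C(n, k) == C(n, n - k)
    r = one(Int128)                   # invariant: r == C(n - k + i, i) after step i
    for i in 1:k
        f = n - k + i
        # We want r * f / i, which is an exact integer. Cancel gcd(r, i) first;
        # the remaining divisor d is coprime to r and therefore divides f.
        g = gcd(r, i)
        r = div(r, g)
        d = div(i, g)
        r = Base.checked_mul(r, div(f, d))
    end
    return r
end
```

Because the product is only formed after the division has been cancelled, `checked_mul` should only throw when one of the intermediate binomial coefficients itself does not fit in `Int128`.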
We are missing
Further, `binomial` is very slow on `Int128`:

This is due to the `widemul`. At the very least, we should probably use `Base.GMP.MPZ` to do all operations in-place (I think we only need to allocate 2 `BigInt`s that we can re-use). But it is probably not too hard to avoid `BigInt` altogether. `binomial` on `Int128` is quite relevant, due to its tendency to overflow on small integer types.

Lastly, it would be very cool to have
`binomial(n, ::Val{k}) where k`: Many applications know how many (small `k`) elements they want to choose, out of a runtime `n`. Unfortunately, constant-prop is currently not up to the task of speeding this up (even if `binomial` were `@inline`). Knowing `k` at compile time has three advantages: Most importantly, expensive integer divisions can be replaced by multiplications and shifts; LLVM is quite adept at this. Second, the loop can be fully unrolled. Third, we only need a single overflow check at the beginning: since we know `k` at compile time, we need just two comparisons, one to check whether we can return any result at all, and a second one to check whether widening is needed at all (if `n` and `k` are relatively small `Int64`, then we don't need to promote intermediate values to `Int128`). This could look like

That is a hefty speedup from 80 to 6 nanoseconds. A real implementation would behave slightly differently: due to the necessary branches, it could not SIMD; but we could possibly save some cycles by pulling the divisions out, as long as that is possible without overflow (without overflow checking, that brings me to 3 ns for `binomial(n, Val(5))`; that means that we beat lookup tables, if `Val` is applicable).

Is there interest in such a variant, or is this something that should rather live in a package?
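To make the `Val{k}` idea concrete, here is a minimal sketch of an unrolled `@generated` variant. It is an illustration only, not the code benchmarked above: `binomial_val` is a hypothetical name, and the sketch deliberately omits the overflow check and widening logic described in the text.

```julia
# Hypothetical sketch (not part of Base): k is a compile-time constant, so the
# loop is unrolled while the function body is generated and every division has
# a constant divisor. No overflow checking is performed.
@generated function binomial_val(n::Integer, ::Val{k}) where {k}
    k isa Int && k >= 0 ||
        return :(throw(ArgumentError("k must be a non-negative Int")))
    body = Expr(:block, :(r = one(n)))
    for i in 1:k
        # After this step r == C(n - k + i, i), so each division is exact.
        push!(body.args, :(r = div(r * (n - $(k - i)), $i)))
    end
    push!(body.args, :(return r))
    return body
end
```

Calling `binomial_val(137, Val(5))` then runs five multiply/divide steps with the literal divisors 1 through 5. Whether the divisions actually get lowered to multiplications and shifts can be checked with `@code_llvm`.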
It should be possible to make this non-`@generated` by using recursion, but that would be more complicated. Is there an advantage to that?
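For comparison, a recursive non-`@generated` formulation might look roughly like the sketch below. The names `binomial_rec` and `_binom_step` are hypothetical, overflow checking is again left out, and whether the compiler fully inlines the recursion for a given `k` is an assumption that would need to be verified.

```julia
# Hypothetical sketch (not part of Base): recursion over a Val counter.

# Base case: C(n - k, 0) == 1.
@inline _binom_step(n, ::Val{k}, ::Val{0}) where {k} = one(n)

# Recursive case: C(n - k + i, i) computed from C(n - k + i - 1, i - 1).
@inline function _binom_step(n, ::Val{k}, ::Val{i}) where {k, i}
    prev = _binom_step(n, Val(k), Val(i - 1))
    return div(prev * (n - (k - i)), i)   # exact division by the constant i
end

function binomial_rec(n::Integer, ::Val{k}) where {k}
    k isa Int && k >= 0 || throw(ArgumentError("k must be a non-negative Int"))
    return _binom_step(n, Val(k), Val(k))
end
```

One possible advantage of this form is that it stays within ordinary dispatch and avoids the restrictions placed on generated functions, at the cost of relying on the inliner instead of explicit unrolling.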