Open · carstenbauer opened this issue 3 years ago
More?
Support non-10 bases, like digits does (I know you said you're ignoring the base, but that's easy to add in Alex's implementation).
This can be made O(n) instead of O(n log(n)) by not calculating 10^(i-1) in each iteration, but by building it up iteratively instead. Not sure that will make a big difference for the ranges Int can cover, though.
Should we also be careful and check that abs(x) < 10 for each x?
As for names, perhaps this could even be a method of the Int constructor, Int(::AbstractArray{<:Integer}; base=10)? Or would that be potentially confusing?
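To make the constructor idea concrete, here is a minimal sketch of what such a method could compute, written as a standalone function rather than as an actual method of Int (the name int_from_digits is just a placeholder) and assuming the little-endian digit order that digits produces:

function int_from_digits(digs::AbstractArray{<:Integer}; base::Integer=10)
    # `int_from_digits` is a placeholder name for illustration only;
    # the proposal would spell this roughly Int(digs; base=10)
    res = 0
    factor = 1
    for d in digs                     # least significant digit first, as returned by `digits`
        abs(d) < base || throw(ArgumentError("invalid digit $d for base $base"))
        res += d * factor
        factor *= base                # running power of the base, no repeated exponentiation
    end
    return res
end

# round trip: int_from_digits(digits(1234)) == 1234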
by not calculating 10^(i-1) in each iteration,
You can also just precompute them; that's what I see people typically do in various parsers. That assumes you will only go up to e.g. Int64, though.
Or would that be potentially confusing?
I would say so, yes.
I don't think precomputing is worth it here; it also won't work with different bases:
julia> rev2(arr; base=10) = begin
           res = 0
           factor = 1
           length(arr) > 19 && throw(ArgumentError("Only up to 19 digits (resulting in at most typemax(Int64)) are supported"))
           for x in arr
               abs(x) >= base && throw(ArgumentError("Invalid digit $x for given base $base"))
               res += x*factor
               factor *= base
           end
           res
       end
rev2 (generic function with 1 method)
julia> rev3(arr; base=10) = begin
           res = 0
           length(arr) > 19 && throw(ArgumentError("Only up to 19 digits (resulting in at most typemax(Int64)) are supported"))
           factors = [1 10 100 1_000 10_000 100_000 1_000_000 10_000_000 100_000_000 1_000_000_000 10_000_000_000 100_000_000_000 1_000_000_000_000 10_000_000_000_000 100_000_000_000_000 1_000_000_000_000_000 10_000_000_000_000_000 100_000_000_000_000_000 1_000_000_000_000_000_000]
           for (i,x) in enumerate(arr)
               abs(x) >= base && throw(ArgumentError("Invalid digit $x for given base $base"))
               res += x*factors[i]
           end
           res
       end
rev3 (generic function with 1 method)
julia> @benchmark rev2(a) setup=(a=rand(0:9, 18)) evals=1
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.800 μs (0.00% GC)
  median time:      1.900 μs (0.00% GC)
  mean time:        2.048 μs (0.00% GC)
  maximum time:     36.500 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
julia> @benchmark rev3(a) setup=(a=rand(0:9, 18)) evals=1
BenchmarkTools.Trial:
  memory estimate:  240 bytes
  allocs estimate:  1
  --------------
  minimum time:     1.900 μs (0.00% GC)
  median time:      2.000 μs (0.00% GC)
  mean time:        2.849 μs (0.00% GC)
  maximum time:     296.700 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
@simd and @inbounds had no effect here.
Nice one, @mgkuhn! But sadly slower than the naive loop version from the spoiler tag above:
julia> @benchmark evalpoly(10, (a...,)) setup=(a=rand(0:9, 18)) evals=1
BenchmarkTools.Trial:
  memory estimate:  160 bytes
  allocs estimate:  1
  --------------
  minimum time:     2.600 μs (0.00% GC)
  median time:      2.800 μs (0.00% GC)
  mean time:        4.417 μs (0.00% GC)
  maximum time:     1.790 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
though it may be different for different bases
I think it's fast if you remove the splat, just call evalpoly(10, digits(-1234)) etc.
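For example, without the splat (digits returns the little-endian order that evalpoly expects for its coefficients, so the round trip works directly):

julia> evalpoly(10, digits(1234))
1234

julia> evalpoly(10, digits(-1234))
-1234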
What this doesn't do is any checking that the digits are single digits, and all the same sign.
Indeed - not any better than the version with error checking though. Branch predictors are truly marvellous!
julia> @benchmark evalpoly(10, a) setup=(a=rand(0:9, 18)) evals=1
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.800 μs (0.00% GC)
  median time:      1.900 μs (0.00% GC)
  mean time:        2.207 μs (0.00% GC)
  maximum time:     314.900 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
evalpoly is probably faster for longer inputs, but those would become larger than typemax(Int64).
This is dramatically slower than what I am seeing on 1.6, though it's slightly tricky to benchmark:
jl> @btime evalpoly(10, a) setup=(a=rand(0:9, 18)) evals=1;
0.001 ns (0 allocations: 0 bytes)
jl> @btime evalpoly(10, a) setup=(a=rand(0:9, 18));
15.716 ns (0 allocations: 0 bytes)
jl> @btime evalpoly(10, $(Ref(a))[]);
16.650 ns (0 allocations: 0 bytes)
Even including input checking, it's around 20ns.
Here's one suggestion for an implementation. I suggest the name joindigits since it is a bit similar to the join function, which, incidentally, has this behaviour:
jl> join([2,3,4,5])
"2345"
There is also a joinpath function that is somewhat similar.
function joindigits(v, base::Integer=10)
    maxlength = log(base, typemax(base))
    ind = findlast(!iszero, v)
    if isnothing(ind)
        return zero(base)
    end
    vlength = length(v) + ind - lastindex(v)
    0 < vlength <= ceil(maxlength) || error("Max input length is $(ceil(Int, maxlength)), discounting trailing zeros.")
    all(x->0<=x<base, v) || error("Digits are out of range")
    val = evalpoly(base, v)
    if val < 0
        error("Numerical overflow.")
    end
    return val
end
The error messages are just placeholders for now.
The implementation checks the length and the range of the digits, and it also discounts trailing zeros in the input, which do not affect the final number. I wanted to use the base input to select the output type and to pass it as a keyword argument. Unfortunately, you cannot dispatch on the type of a keyword argument, e.g. for BigInts, so it is just a positional argument here. Not sure if this is the right interface.
using BenchmarkTools
base = 10
n = typemax(Int)
v = digits(n; base=base)
@btime joindigits($v, $base)
# 39.376 ns (0 allocations: 0 bytes)
@btime joindigits($v, $(Int128(base)))
# 66.360 ns (0 allocations: 0 bytes)
This will fail if the input vector has non-standard indexing, due to the behaviour of evalpoly; I've opened an issue about this here: https://github.com/JuliaLang/julia/issues/40454
The implementation will fail if one tries to use a BigInt as base, so I made an alternative implementation for BigInt:
function joindigits(v, base::BigInt=10)
    all(x->0<=x<base, v) || error("Digits are out of range")
    return _bigevalpoly(base, v)
end

function _bigevalpoly(n::BigInt, p)
    ex = big(last(p))
    for i in reverse(eachindex(p))[2:end]
        Base.GMP.MPZ.mul!(ex, n)
        Base.GMP.MPZ.add_ui!(ex, p[i])
    end
    return ex
end
Performance of this is not too bad (for BigInt):
@btime joindigits($v, $(big(base)))
# 252.676 ns (3 allocations: 48 bytes)
Note that the BigInt implementation will work for non-standard indexing.
I've spent a bit of time fiddling with this problem and came up with this:
fromdigits_unchecked(digs::AbstractVector{D}; base::D=D(10)) where D<:Integer =
    fromdigits_unchecked(D, digs; base=base)

function fromdigits_unchecked(T::Type{<:Integer}, digs::AbstractVector{D}; base::D=D(10)) where D<:Integer
    # return evalpoly(base, digs)
    # return foldr((d, x) -> base*x + d, digs; init=zero(T))
    x = zero(T)
    for i in reverse(eachindex(digs))
        x = base*x + digs[i]
    end
    x
end

fromdigits(digs::AbstractVector{D}; base::D=D(10)) where D<:Integer =
    fromdigits(D, digs; base=base)

function fromdigits(T::Type{<:Integer}, digs::AbstractVector{D}; base::D=D(10)) where D<:Integer
    all(d -> d >= 0, digs) || all(d -> d <= 0, digs) || throw(ArgumentError("All the digits must have the same sign"))
    x = zero(T)
    for i in reverse(eachindex(digs))
        d = digs[i]
        abs(d) < base || throw(ArgumentError("Invalid digit $d for given base $base"))
        base_x = T(Base.Checked.checked_mul(base, x))
        x = T(Base.Checked.checked_add(base_x, d))
    end
    x
end
The version fromdigits_unchecked does not do any validation of the input. It is equivalent to the evalpoly/foldr versions in the comments, while being slightly faster than both. It does not check for overflow in the computations.
The version fromdigits does validation of the input and checks for overflows. The digits must be all non-negative or all non-positive, and they must be less than base in absolute value. The digits and the base must be of the same type, and the result will be computed in that type. Optionally, a type can be passed in as the first argument, in which case the result will be computed with that type.
Here are some examples:
julia> fromdigits(fill(1, 19))
1111111111111111111
julia> fromdigits(fill(1, 20))
ERROR: OverflowError: 10 * 1111111111111111111 overflowed for type Int64
Stacktrace:
[1] checked_mul
@ ./checked.jl:288 [inlined]
[2] fromdigits(T::Type{Int64}, digs::Vector{Int64}; base::Int64)
@ Main ./REPL[31]:7
...
julia> fromdigits(BigInt, fill(1, 25))
1111111111111111111111111
julia> fromdigits(digits(UInt32, 12345))
0x00003039
julia> fromdigits(UInt16[0x1, 0x2, 0x3]; base=0x0010)
0x0321
fromdigits_unchecked is roughly twice as fast as fromdigits. fromdigits is about 25% faster than joindigits, while doing more checks and being more versatile.
julia> @benchmark fromdigits_unchecked(a) setup=(a=rand(0:9,18))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     16.445 ns (0.00% GC)
  median time:      16.671 ns (0.00% GC)
  mean time:        17.325 ns (0.00% GC)
  maximum time:     64.004 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998
julia> @benchmark fromdigits(a) setup=(a=rand(0:9,18))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     29.861 ns (0.00% GC)
  median time:      30.471 ns (0.00% GC)
  mean time:        31.110 ns (0.00% GC)
  maximum time:     80.827 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     994
julia> @benchmark joindigits(a) setup=(a=rand(0:9,18))
BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     39.266 ns (0.00% GC)
  median time:      40.510 ns (0.00% GC)
  mean time:        40.931 ns (0.00% GC)
  maximum time:     100.346 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     991
PS: I have a couple of ideas on how to improve the performance even more, but they lengthen the code because Signed/Unsigned are treated separately. I'll probably put all of this in a package as soon as possible (done). Moreover, the BigInt implementation can be specialized and optimized as in the previous answer.
fromdigits(digs::AbstractVector{D}; base::D=D(10)) where D<:Integer = fromdigits(D, digs; base=base)
I think the types here exclude one use of such a function -- shouldn't something like undigits([false, true, false, true, false, true]; base=2) return an Int? Doing this by bit-shifting should also be faster:
function undigits(v::AbstractVector{Bool})
    out = 0
    for i in 1:length(v)
        @inbounds out += (v[i] << (i - 1))
    end
    out
end
I agree with you that the signature could be more flexible. I intentionally constrained the digits and the base to be of the same type for more safety, but it is probably unnecessary. Also, a specialization can be added for BitVector, and as you say, bit-shifting could be used when the base is a power of 2. I've started a package, FromDigits.jl, for experimentation, with the goal of developing the fastest, safest, easiest to use, and most versatile fromdigits and fromdigits_unchecked.
I've updated FromDigits.jl. Now there is even a test for this case. Unfortunately, until the weekend I will not have time to implement the other features (optimized BitArray, BigInts, Unsigned, ...).
PS: this is mostly a note to self:
julia> v = [false, true, false, true, false, true]
julia> BitVector(v).chunks[1] |> Int
42
Should we create a PR for this? (@FedericoStra maybe you want to give it a try given that you've spent some time thinking about this)
The hardest part seems (to me) to be a good name. But you could make a PR with a strawman name and then we could figure it out from there.
How bad would it be to have digits return some Digits type (probably <: AbstractVector{Int} to avoid breakage), which would then allow us to do Int(::Digits) = this proposal?
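Purely to make the idea concrete, a rough sketch of such a wrapper could look like this (the type name and fields here are hypothetical, not a worked-out proposal):

# hypothetical sketch of a `Digits` wrapper; not an actual Base type
struct Digits{T<:Integer} <: AbstractVector{T}
    digs::Vector{T}   # little-endian, as `digits` returns today
    base::Int
end

Base.size(d::Digits) = size(d.digs)
Base.getindex(d::Digits, i::Int) = d.digs[i]

# `Int(::Digits)` would then be "this proposal", e.g. a Horner-style loop:
function Base.Int(d::Digits)
    n = 0
    for x in Iterators.reverse(d.digs)   # most significant digit first
        n = d.base * n + x
    end
    return n
end

With that, Int(Digits(digits(1234), 10)) would give back 1234.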
For performance, I think it may be hard to beat something along the lines of
function digitstoint(A, base=10)
    n = 0
    l = length(A)
    for i = 1:l
        n = base*n + A[i]
    end
    return n
end
julia> digitstoint([1,2,3,4])
1234
julia> @benchmark digitstoint($[1,2,3,4])
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.553 ns … 22.348 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.883 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.388 ns ±  1.350 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted]  3.55 ns  Histogram: frequency by time  7.77 ns <
 Memory estimate: 0 bytes, allocs estimate: 0.
In principle, a bitshift could be faster for bases that are powers of two, but in practice
function digitstoint2(A)
    n = 0
    l = length(A)
    for i = 1:l
        n = n<<1 + A[i]
    end
    return n
end
it seems to come out about the same
julia> @benchmark digitstoint2($[1,2,3,4])
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min … max):  3.820 ns … 39.756 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     3.831 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   4.175 ns ±  1.168 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 [histogram omitted]  3.82 ns  Histogram: log(frequency) by time  8.33 ns <
 Memory estimate: 0 bytes, allocs estimate: 0.
See also: https://stackoverflow.com/questions/68493608/convert-an-integer-vector-to-an-integer
That's basically the version by @FedericoStra, except there's no error checking and the base is not generic, so that's probably why it's faster.
Ref Advent of Code 2021, Day 3 "Binary Diagnostic".
@mschauer you just need evalpoly(2, reverse(x)), where x is the BitVector of Bools you're left with.
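For instance, with the bits stored most-significant-first as they are read from the puzzle input (just a usage illustration):

julia> x = BitVector([1, 0, 1, 1, 0]);   # binary 10110

julia> evalpoly(2, reverse(x))           # reverse to least-significant-first for evalpoly
22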
We have the digits function, which implements integer -> digits. However, we don't have the "inverse function" that implements digits -> integer. A top-of-the-head implementation (ignoring the base and pad keyword arguments) could be something like the loop sketched below (cc: @ararslan); a very ugly alternative is parse(Int, join(digits)). Either way, feeding the output of digits back through such a function should return the original integer.
I feel we should have something like this, what do you think?
Potential issues/difficulties:
- The result might not fit into an Int. We would need to dynamically extend to BigInt or similar, which is obviously suboptimal.
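The code block from the original post is not preserved in this extract; purely as an illustration, such an inverse could look roughly like the following (the name undigits is just a placeholder here, not the proposed API):

function undigits(digs::AbstractVector{<:Integer})
    # `undigits` is a placeholder name used only for this sketch.
    # Horner-style accumulation over the little-endian digit vector that
    # `digits` returns: digs[1] is the least significant digit.
    n = 0
    for d in Iterators.reverse(digs)
        n = 10n + d
    end
    return n
end

# round trip: undigits(digits(1234)) == 1234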