JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.72k stars 5.48k forks source link

digits2integer: inverse digits function #40393

Open carstenbauer opened 3 years ago

carstenbauer commented 3 years ago

We have the digits function for integer -> digits:

julia> digits(1234)
4-element Vector{Int64}:
 4
 3
 2
 1

However, we don't have the "inverse function" that implements digits -> integer. A top of the head implementation (ignoring base and pad keyword arguments) could be something like (cc: @ararslan)

digits2integer(digits) = sum(((i, x),) -> x * 10^(i - 1), pairs(digits))

(Very ugly alternative: parse(Int, join(digits)))

such that

julia> digits2integer(digits(1234))
1234

I feel we should have something like this, what do you think?

Potentially issues/difficulties:

giordano commented 3 years ago

More?

Support non-10 bases, like digits does (I know you said you're ignoring the base, but that's easy to add in Alex's implementation 😅)

simeonschaub commented 3 years ago

This can be made O(n) instead of O(n log(n)) by not calculating 10^(i-1) in each iteration, but building it up iteratively instead. Not sure that will make a big difference for the ranges Int can cover though. Should we also be careful and check that abs(x) < 10 for each x? As for names, perhaps this could even be a method of the Int constructor Int(::AbstractArray{<:Integer}; base=10)? Or would that be potentially confusing?

KristofferC commented 3 years ago

by not calculating 10^(i-1) in each iteration,

You can also just precompute them, that's what I see people typically do in various parsers. Assumes you will only go up to e.g. Int64 though.

Or would that be potentially confusing?

I would say so, yes.

Seelengrab commented 3 years ago

I don't think precomputing is worth it here - also won't work with different bases

julia> rev2(arr; base=10) = begin                                                                                                                                                                                  
        res = 0                                                                                                                                                                                                    
        factor = 1                                                                                                                                                                                                 
        length(arr) > 19 && throw(ArgumentError("Only up to 19 digits (resulting in maximally typemax(Int64) are supported"))                                                                                      
        for x in arr                                                                                                                                                                                               
          abs(x) >= base && throw(ArgumentError("Invalid digit $x for given base $base"))                                                                                                                          
          res += x*factor                                                                                                                                                                                          
          factor *= base                                                                                                                                                                                             
        end                                                                                                                                                                                                        
        res                                                                                                                                                                                                        
        end                                                                                                                                                                                                        
rev2 (generic function with 1 method)                                                                                                                                                                              

julia> rev3(arr; base=10) = begin                                                                                                                                                                                  
        res = 0                                                                                                                                                                                                    
        length(arr) > 19 && throw(ArgumentError("Only up to 19 digits (resulting in maximally typemax(Int64) are supported"))                                                                                      
        factors = [ 1 10 100 1_000 10_000 100_000 1_000_000 10_000_000 100_000_000 1_000_000_000 10_000_000_000 100_000_000_000 1_000_000_000_000 10_000_000_000_000 100_000_000_000_000 1_000_000_000_000_000 10_0
00_000_000_000_000 100_000_000_000_000_000 1_000_000_000_000_000_000 ]                                                                                                                                             
        for (i,x) in enumerate(arr)                                                                                                                                                                                
          abs(x) >= base && throw(ArgumentError("Invalid digit $x for given base $base"))                                                                                                                          
          res += x*factors[i]                                                                                                                                                                                      
        end                                                                                                                                                                                                        
        res                                                                                                                                                                                                        
        end                                                                                                                                                                                                        
rev3 (generic function with 1 method)                                                                                                                                                                              

julia> @benchmark rev2(a) setup=(a=rand(0:9, 18)) evals=1
BenchmarkTools.Trial:                                    
  memory estimate:  0 bytes                              
  allocs estimate:  0                                    
  --------------                                         
  minimum time:     1.800 Ξs (0.00% GC)                  
  median time:      1.900 Ξs (0.00% GC)                  
  mean time:        2.048 Ξs (0.00% GC)                  
  maximum time:     36.500 Ξs (0.00% GC)                 
  --------------                                         
  samples:          10000                                
  evals/sample:     1                                    

julia> @benchmark rev3(a) setup=(a=rand(0:9, 18)) evals=1
BenchmarkTools.Trial:                                    
  memory estimate:  240 bytes                            
  allocs estimate:  1                                    
  --------------                                         
  minimum time:     1.900 Ξs (0.00% GC)                  
  median time:      2.000 Ξs (0.00% GC)                  
  mean time:        2.849 Ξs (0.00% GC)                  
  maximum time:     296.700 Ξs (0.00% GC)                
  --------------                                         
  samples:          10000                                
  evals/sample:     1                                    

@simd and @inbounds had no effect here.

mgkuhn commented 3 years ago

@evalpoly

julia> @evalpoly(10, digits(1234)...)
1234
Seelengrab commented 3 years ago

Nice one, @mgkuhn ! But sadly slower than the naive loop version from the spoiler tag above:

julia> @benchmark evalpoly(10, (a...,)) setup=(a=rand(0:9, 18)) evals=1
BenchmarkTools.Trial:                                                  
  memory estimate:  160 bytes                                          
  allocs estimate:  1                                                  
  --------------                                                       
  minimum time:     2.600 Ξs (0.00% GC)                                
  median time:      2.800 Ξs (0.00% GC)                                
  mean time:        4.417 Ξs (0.00% GC)                                
  maximum time:     1.790 ms (0.00% GC)                                
  --------------                                                       
  samples:          10000                                              
  evals/sample:     1                                                  

though it may be different for different bases

mcabbott commented 3 years ago

I think it's fast if you remove the splat, just call evalpoly(10, digits(-1234)) etc.

What this doesn't do is any checking that the digits are single digits, and all the same sign.

Seelengrab commented 3 years ago

Indeed - not any better than the version with error checking though. Branch predictors are truly marvellous!

julia> @benchmark evalpoly(10, a) setup=(a=rand(0:9, 18)) evals=1
BenchmarkTools.Trial:                                            
  memory estimate:  0 bytes                                      
  allocs estimate:  0                                            
  --------------                                                 
  minimum time:     1.800 Ξs (0.00% GC)                          
  median time:      1.900 Ξs (0.00% GC)                          
  mean time:        2.207 Ξs (0.00% GC)                          
  maximum time:     314.900 Ξs (0.00% GC)                        
  --------------                                                 
  samples:          10000                                        
  evals/sample:     1                                            

evalpoly is probably faster for longer inputs, but those would become larger than typemax(Int64)

DNF2 commented 3 years ago

This is dramatically slower than what I am seeing on 1.6, though it's slightly tricky to benchmark:

jl> @btime evalpoly(10, a) setup=(a=rand(0:9, 18)) evals=1;
  0.001 ns (0 allocations: 0 bytes)

jl> @btime evalpoly(10, a) setup=(a=rand(0:9, 18));
  15.716 ns (0 allocations: 0 bytes)

jl> @btime evalpoly(10, $(Ref(a))[]);
  16.650 ns (0 allocations: 0 bytes)

Even including input checking, it's around 20ns.

DNF2 commented 3 years ago

Here's one suggestion for an implementation. I suggest the name joindigits since it is a bit similar to the join function, which, incidentally has this behaviour

jl> join([2,3,4,5])
"2345"

There is also a joinpath function that is somewhat similar.

function joindigits(v, base::Integer=10)
    maxlength = log(base, typemax(base))
    ind = findlast(!iszero, v)
    if isnothing(ind)
        return zero(base)
    end
    vlength = length(v) + ind - lastindex(v)
    0 < vlength <= ceil(maxlength) || error("Max input length is $(ceil(Int, maxlength)), discounting trailing zeros.")
    all(x->0<=x<base, v) || error("Digits are out of range")
    val = evalpoly(base, v)
    if val < 0
        error("Numerical overflow.")
    end
    return val
end

The error messages are just placeholders for now.

The implementation checks length and range of digits, and also discounts trailing zeros in the inputs which will not affect the final number. I wanted to use the base input to select the output type, and using it as a keyword argument. Unfortunately, that means you cannot dispatch on its type, e.g. for BigInts. Therefore it is just a positional argument here. Not sure if this is the right interface.

using BenchmarkTools
base = 10
n = typemax(Int)
v = digits(n; base=base)

@btime joindigits($v, $base)
#  39.376 ns (0 allocations: 0 bytes)
@btime joindigits($v, $(Int128(base)))
#  66.360 ns (0 allocations: 0 bytes)

This will fail if the input vector is a vector with non-standard indexing, due to the behaviour of evalpoly, I've opened an issue about this here: https://github.com/JuliaLang/julia/issues/40454

The implementation will fail if one tries to use BigInt as base, so I made an alternative implementation for BigInt:

function joindigits(v, base::BigInt=10)
    all(x->0<=x<base, v) || error("Digits are out of range")
    return _bigevalpoly(base, v)
end

function _bigevalpoly(n::BigInt, p)
    ex = big(last(p))
    for i in reverse(eachindex(p))[2:end]
        Base.GMP.MPZ.mul!(ex, n)
        Base.GMP.MPZ.add_ui!(ex, p[i])
    end
    return ex
end

Performance of this is not too bad (for BigInt):

@btime joindigits($v, $(big(base)))
# 252.676 ns (3 allocations: 48 bytes)

Note that the BigInt implementation will work for non-standard indexing.

FedericoStra commented 3 years ago

I've spent a bit of time fiddling with this problem and came up with this

fromdigits_unchecked(digs::AbstractVector{D}; base::D=D(10)) where D<:Integer =
    fromdigits_unchecked(D, digs; base=base)

function fromdigits_unchecked(T::Type{<:Integer}, digs::AbstractVector{D}; base::D=(10)) where D<:Integer
    # return evalpoly(base, digs)
    # return foldr((d, x) -> base*x + d, digs; init=zero(T))
    x = zero(T)
    for i in reverse(eachindex(digs))
        x = base*x + digs[i]
    end
    x
end

fromdigits(digs::AbstractVector{D}; base::D=D(10)) where D<:Integer =
    fromdigits(D, digs; base=base)

function fromdigits(T::Type{<:Integer}, digs::AbstractVector{D}; base::D=D(10)) where D<:Integer
    all(d -> d >= 0, digs) || all(d -> d <= 0, digs) || throw(ArgumentError("All the digits must have the same sign"))
    x = zero(T)
    for i in reverse(eachindex(digs))
        d = digs[i]
        abs(d) < base || throw(ArgumentError("Invalid digit $d for given base $base"))
        base_x = T(Base.Checked.checked_mul(base, x))
        x = T(Base.Checked.checked_add(base_x, d))
    end
    x
end

The version fromdigits_unckecked does not do any validation of the input. It is equivalent to the evalpoly/foldr in the comments, while being slightly faster than both. It does not check for overflow in the computations.

The version fromdigits does validation of the input and checks for overflows. The digits must be all non-negative or all non-positive and they must be less than base in absolute value. The digits and the base must be of the same type, and the result will be computed in that type. Optionally, a type can be passed in as the first argument, in which case the result will be computed with that type.

Here are some examples:

julia> fromdigits(fill(1, 19))
1111111111111111111

julia> fromdigits(fill(1, 20))
ERROR: OverflowError: 10 * 1111111111111111111 overflowed for type Int64
Stacktrace:
 [1] checked_mul
   @ ./checked.jl:288 [inlined]
 [2] fromdigits(T::Type{Int64}, digs::Vector{Int64}; base::Int64)
   @ Main ./REPL[31]:7
 ...

julia> fromdigits(BigInt, fill(1, 25))
1111111111111111111111111

julia> fromdigits(digits(UInt32, 12345))
0x00003039

julia> fromdigits(UInt16[0x1, 0x2, 0x3]; base=0x0010)
0x0321

Benchmarks

fromdigits_unchecked is roughly twice as fast as fromdigits. fromdigits is about 25% faster than joindigits, while doing more checks and being more versatile.

julia> @benchmark fromdigits_unchecked(a) setup=(a=rand(0:9,18))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     16.445 ns (0.00% GC)
  median time:      16.671 ns (0.00% GC)
  mean time:        17.325 ns (0.00% GC)
  maximum time:     64.004 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     998

julia> @benchmark fromdigits(a) setup=(a=rand(0:9,18))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     29.861 ns (0.00% GC)
  median time:      30.471 ns (0.00% GC)
  mean time:        31.110 ns (0.00% GC)
  maximum time:     80.827 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     994

julia> @benchmark joindigits(a) setup=(a=rand(0:9,18))
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     39.266 ns (0.00% GC)
  median time:      40.510 ns (0.00% GC)
  mean time:        40.931 ns (0.00% GC)
  maximum time:     100.346 ns (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     991

PS: I have a couple ideas on how to improve the performance even more, but it lengthen the code because it treats Signed/Unsiged separately. I'll probably put all of this in a package as soon as possible (done). Moreover, the BigInt implementation can be specialized and optimized as in the previous answer.

mcabbott commented 3 years ago
fromdigits(digs::AbstractVector{D}; base::D=D(10)) where D<:Integer =
    fromdigits(D, digs; base=base)

I think the types here exclude one use of such a function -- shouldn't something like undigits([false, true, false, true, false, true]; base=2) return an Int? Doing this by bit-shifting should also be faster:

function undigits(v::AbstractVector{Bool})
  out = 0
  for i in 1:length(v) @inbounds out += (v[i] << (i - 1)) end
  out
end
FedericoStra commented 3 years ago

I agree with you that the signature could be more flexible. I intentionally constrained the digits and the base to be of the same type for more safety, but it is probably unnecessary. Also, a specialization can be added for BitVector. And as you say bitshifting could be used when base is a power of 2. I've started a package FromDigits,jl for experimentation and with the goal of developing the fastest, safest, easiest to use, most versatile fromdigits and fromdigits_unchecked.

I've updated FromDigits.jl. Now there is even a test for this case.

Unfortunately until the weekend I will not have time to implement the other features (optimized BitArray, BigInts, Unsigned...).

PS: this is mostly a note to self:

julia> v = [false, true, false, true, false, true]

julia> BitVector(v).chunks[1] |> Int
42
carstenbauer commented 3 years ago

Should we create a PR for this? (@FedericoStra maybe you want to give it a try given that you've spent some time thinking about this)

StefanKarpinski commented 3 years ago

The hardest part seems (to me) to be a good name. But you could make a PR with a strawman name and then we could figure it out from there.

ararslan commented 3 years ago

How bad would it be to have digits return some Digits type (probably <: AbstractVector{Int} to avoid breakage), which would then allow us to do Int(::Digits) = this proposal?

brenhinkeller commented 3 years ago

For performance, I think it may be hard to beat something along the lines of

function digitstoint(A,base=10)
    n = 0
    l = length(A)
    for i = 1:l
        n = base*n + A[i]
    end
    return n
end
julia> digitstoint([1,2,3,4])
1234

julia> @benchmark digitstoint($[1,2,3,4])
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min â€Ķ max):  3.553 ns â€Ķ 22.348 ns  ┊ GC (min â€Ķ max): 0.00% â€Ķ 0.00%
 Time  (median):     3.883 ns              ┊ GC (median):    0.00%
 Time  (mean Âą σ):   4.388 ns Âą  1.350 ns  ┊ GC (mean Âą σ):  0.00% Âą 0.00%

  █▁                     ▂
  ██▃▂▆▂▄▂▁█▁▃▂▁▁█▃▂▁▁▁▂▁█▂▂▂▂▂▁▁▂▂▁▂▂▁▂▁▁▂▁▁▂▂▂▁▂▂▂▁▂▁▂▁▁▁▄ ▃
  3.55 ns        Histogram: frequency by time        7.77 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

In principle, a bitshift could be faster for bases that are powers of two, but in practice

function digitstoint2(A)
    n = 0
    l = length(A)
    for i = 1:l
        n = n<<1 + A[i]
    end
    return n
end

it seems to come out about the same

julia> @benchmark digitstoint2($[1,2,3,4])
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min â€Ķ max):  3.820 ns â€Ķ 39.756 ns  ┊ GC (min â€Ķ max): 0.00% â€Ķ 0.00%
 Time  (median):     3.831 ns              ┊ GC (median):    0.00%
 Time  (mean Âą σ):   4.175 ns Âą  1.168 ns  ┊ GC (mean Âą σ):  0.00% Âą 0.00%

  █▁  ▃    ▃     ▃       ▃                                 ▃ ▁
  ███▆█▆▆▅▅█▄▅▁▁▃█▁▁▃▃▃▁▁█▃▁▁▁▃▃▁▁▁▁▁▃▁▃▁▁▃▁▁▁▁▁▁▁▁▁▃▁▁▁▁▁▁█ █
  3.82 ns      Histogram: log(frequency) by time     8.33 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

See also: https://stackoverflow.com/questions/68493608/convert-an-integer-vector-to-an-integer

Seelengrab commented 3 years ago

That's basically the version by @FedericoStra, except there's no error checking and the base is not generic so that's probably why it's faster.

mschauer commented 2 years ago

Ref Advent of Code 2021, Day 3 "Binary Diagnostic" 😄

Moelf commented 2 years ago

@mschauer just need:

evalpoly(2, reverse(x))

where x is the BitVector of Bool[] you're left with