JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.68k stars 5.48k forks source link

BitVector has slow lexicographic compare due to storage order #28993

Open chethega opened 6 years ago

chethega commented 6 years ago
julia> using BenchmarkTools
julia> a=BitVector(rand(Bool,1<<10)); b=copy(a);
julia> @btime ==($a,$b)
  13.865 ns (0 allocations: 0 bytes)
true
julia> @btime <($a,$b)
  2.252 μs (0 allocations: 0 bytes)
false
julia> versioninfo()
Julia Version 1.0.0
Commit 5d4eaca0c9 (2018-08-08 20:58 UTC)

The immediate problem is that lexicographic comparison falls back to AbstractArray. However, it is not clear how to write a fast (chunk-wise) comparison with the current layout, because the most significant bit of the bitvector (wrt julia's convention for vector comparison) is stored in the least significant bit of the unterlying UInt64 (wrt hardware integer comparison convention). See:

julia> a=BitVector(rand(Bool,1<<6));
julia> a
64-element BitArray{1}:
  true
  true
 false
 false
  true
     ⋮
  true
 false
  true
 false

julia> bitstring(a.chunks[1])
"0101011000101101010100010100000110100011100101000000101011110011"

I'd propose to switch the layout of BitVector to match the machine convention. Then, one would need to also replace trailing_zeros by leading_zeros for iterations, but this has just as good hardware support. Or is there a strong reason for the current layout?

mbauman commented 6 years ago

Changing the storage order is a big change that I'd guess lots of folks rely upon and would probably be breaking. Not sure about other tradeoffs here, but in this specific case we could use a bit-twiddling algorithm to flip the bits and compare on the result. That'd likely be much faster than doing up to 64 bit-extractions.

chethega commented 6 years ago

Darn, should have brought this up before 1.0. I guess one can then use the following, which is at least branchfree ans should be faster than reversing:

julia> function rev_isless(a::UInt64, b::UInt64)
       bt = trailing_zeros(xor(a,b))
       return  0!= (0x01 & (a>>bt))
       end
simonfxr commented 6 years ago

Testing the least significant bit can be done without extracting it directly and building a mask instead:

rev_isless(a::UInt, b::UInt) = begin
    d = xor(x, y)
    lsb = d & -d;
    return (y & lsb) != 0
end
simonfxr commented 6 years ago

Here is the full code, shall I make a pull request?


function bvcmp(a::BitVector, b::BitVector)
    # the last chunk of a and b might not be fully populated
    # assume those unused bits are zero, is this correct?
    n = min(length(a.chunks), length(b.chunks))
    @inbounds for i in 1:n
        x, y = a.chunks[i], b.chunks[i]
        if x != y
            # compute differing bits
            d = xor(x, y)
            # compute a mask where only the least significant bit is set
            lsb = d & -d;
            # a < b <=> x & lsb == 0 and y & lsb == 1

            # on x86 `y & lsb` can be fused into a `test y, lsb` instruction :-)
            return y & lsb != 0 ? -1 : 1
        end
    end
    return cmp(length(a), length(b))
end

bvisless(va, vb) = bvcmp(va, vb) < 0

Benchmarks: (on 0.7.0 linux x64)

julia> @btime cmp($(falses(1)), $(falses(1)))
  9.601 ns (0 allocations: 0 bytes)
0

julia> @btime bvcmp($(falses(1)), $(falses(1)))
  4.815 ns (0 allocations: 0 bytes)
0

julia> @btime cmp($(falses(10000)), $(falses(10000)))
  23.969 μs (0 allocations: 0 bytes)
0

julia> @btime bvcmp($(falses(10000)), $(falses(10000)))
  120.681 ns (0 allocations: 0 bytes)
0

Roughly 200x as fast :-)

KristofferC commented 6 years ago

A PR with performance improvements and perhaps some extra tests thrown in there is always welcome :)

JeffBezanson commented 6 years ago

Yes please PR that! I don't think lexicographic comparison of BitVectors is important enough to change the layout, but it's very much worth optimizing with that code.