Open chethega opened 6 years ago
Changing the storage order is a big change that I'd guess lots of folks rely upon and would probably be breaking. Not sure about other tradeoffs here, but in this specific case we could use a bit-twiddling algorithm to flip the bits and compare on the result. That'd likely be much faster than doing up to 64 bit-extractions.
Darn, should have brought this up before 1.0. I guess one can then use the following, which is at least branchfree ans should be faster than reversing:
julia> function rev_isless(a::UInt64, b::UInt64)
bt = trailing_zeros(xor(a,b))
return 0!= (0x01 & (a>>bt))
end
Testing the least significant bit can be done without extracting it directly and building a mask instead:
rev_isless(a::UInt, b::UInt) = begin
d = xor(x, y)
lsb = d & -d;
return (y & lsb) != 0
end
Here is the full code, shall I make a pull request?
function bvcmp(a::BitVector, b::BitVector)
# the last chunk of a and b might not be fully populated
# assume those unused bits are zero, is this correct?
n = min(length(a.chunks), length(b.chunks))
@inbounds for i in 1:n
x, y = a.chunks[i], b.chunks[i]
if x != y
# compute differing bits
d = xor(x, y)
# compute a mask where only the least significant bit is set
lsb = d & -d;
# a < b <=> x & lsb == 0 and y & lsb == 1
# on x86 `y & lsb` can be fused into a `test y, lsb` instruction :-)
return y & lsb != 0 ? -1 : 1
end
end
return cmp(length(a), length(b))
end
bvisless(va, vb) = bvcmp(va, vb) < 0
Benchmarks: (on 0.7.0 linux x64)
julia> @btime cmp($(falses(1)), $(falses(1)))
9.601 ns (0 allocations: 0 bytes)
0
julia> @btime bvcmp($(falses(1)), $(falses(1)))
4.815 ns (0 allocations: 0 bytes)
0
julia> @btime cmp($(falses(10000)), $(falses(10000)))
23.969 μs (0 allocations: 0 bytes)
0
julia> @btime bvcmp($(falses(10000)), $(falses(10000)))
120.681 ns (0 allocations: 0 bytes)
0
Roughly 200x as fast :-)
A PR with performance improvements and perhaps some extra tests thrown in there is always welcome :)
Yes please PR that! I don't think lexicographic comparison of BitVectors is important enough to change the layout, but it's very much worth optimizing with that code.
The immediate problem is that lexicographic comparison falls back to
AbstractArray
. However, it is not clear how to write a fast (chunk-wise) comparison with the current layout, because the most significant bit of the bitvector (wrt julia's convention for vector comparison) is stored in the least significant bit of the unterlying UInt64 (wrt hardware integer comparison convention). See:I'd propose to switch the layout of
BitVector
to match the machine convention. Then, one would need to also replacetrailing_zeros
byleading_zeros
for iterations, but this has just as good hardware support. Or is there a strong reason for the current layout?