`reinterpret` to `Int64` can increase throughput

Moelf commented 6 months ago

julia> hamming_distance(x1::T, x2::T) where T<:Integer = count_ones(x1 ⊻ x2)

julia> function hamming_distance(x1, x2)
           s = 0
           @inbounds @simd for i in eachindex(x1, x2)
               s += hamming_distance(x1[i], x2[i])
           end
           s
       end

julia> a = rand(Int8, 1024); b = rand(Int8, 1024);

julia> @be hamming_distance($a,$b)
Benchmark: 2973 samples with 263 evaluations
min    110.281 ns
median 112.148 ns
mean   112.414 ns
max    171.270 ns

julia> @be hamming_distance(reinterpret(UInt64, $a), reinterpret(UInt64, $b))
Benchmark: 3055 samples with 803 evaluations
min    35.022 ns
median 36.007 ns
mean   36.216 ns
max    48.721 ns

btw, there are a lot of unnecessary annotations, for example, count_ones always return Int no matter what already:

julia> Base.url(@which count_ones(0xf2)) "https://github.com/JuliaLang/julia/tree/0b4590a5507d3f3046e5bafc007cacbbfc9b310b/base/int.jl#L415"

domluna commented 6 months ago

does this translate to a speedup in the search function as well? Some things I've done which speedup the hamming_distance function don't seem to make a difference when doing the actual search

domluna commented 6 months ago

julia> q1 = SVector{8, UInt64}(rand(UInt64,8));

julia> q2 = SVector{8, UInt64}(rand(UInt64,8));

julia> @be hamming_distance($q1, $q2)
Benchmark: 5613 samples with 5887 evaluations
min    2.484 ns
median 2.605 ns
mean   2.841 ns
max    7.899 ns

julia> @code_llvm hamming_distance(q1, q2)
; Function Signature: hamming_distance(StaticArraysCore.SArray{Tuple{8}, UInt64, 1, 8}, StaticArraysCore.SArray{Tuple{8}, UInt64, 1, 8})
;  @ /Users/lunaticd/code/tiny-binary-rag/rag.jl:28 within `hamming_distance`
define i64 @julia_hamming_distance_12257(ptr nocapture noundef nonnull readonly align 8 dereferenceable(64) %"x1::SArray", ptr nocapture noundef nonnull readonly align 8 dereferenceable(64) %"x2::SArray") #0 {
top:
;  @ /Users/lunaticd/code/tiny-binary-rag/rag.jl:33 within `hamming_distance`
; ┌ @ simdloop.jl:77 within `macro expansion` @ /Users/lunaticd/code/tiny-binary-rag/rag.jl:34
; │┌ @ /Users/lunaticd/code/tiny-binary-rag/rag.jl:14 within `hamming_distance`
; ││┌ @ int.jl:373 within `xor`
     %0 = load <8 x i64>, ptr %"x1::SArray", align 8
     %1 = load <8 x i64>, ptr %"x2::SArray", align 8
     %2 = xor <8 x i64> %1, %0
; ││└
; ││┌ @ int.jl:415 within `count_ones`
     %3 = call <8 x i64> @llvm.ctpop.v8i64(<8 x i64> %2)
; │└└
; │┌ @ int.jl:87 within `+`
    %4 = call i64 @llvm.vector.reduce.add.v8i64(<8 x i64> %3)
; └└
;  @ /Users/lunaticd/code/tiny-binary-rag/rag.jl:36 within `hamming_distance`
  ret i64 %4
}

It's not going to get better than this.

even if each comparison was 1ns the best over 1M rows would be 1ms, which it's already very close to

julia> @be k_closest($X1, $q1, 100)
Benchmark: 28 samples with 1 evaluation
min    3.207 ms (5 allocs: 3.281 KiB)
median 3.433 ms (5 allocs: 3.281 KiB)
mean   3.562 ms (5 allocs: 3.281 KiB)
max    4.749 ms (5 allocs: 3.281 KiB)

julia> a = rand(UInt64, 8)
8-element Vector{UInt64}:
 0xb08159b400b994e8
 0x719aafe1251df126
 0x4be73d16ff742a65
 0x9af6c257b0e17a28
 0x5c8128573f810275
 0x6cb73fed5ee64e79
 0x57b7c96fb9093e66
 0x55d6863d8cd7de69

julia> @be k_closest($X1, $a, 100)
Benchmark: 21 samples with 1 evaluation
min    4.425 ms (5 allocs: 3.281 KiB)
median 4.511 ms (5 allocs: 3.281 KiB)
mean   4.639 ms (5 allocs: 3.281 KiB)
max    5.490 ms (5 allocs: 3.281 KiB)

the best combo seems to be a StaticArray query vector and a matrix as the DB

domluna commented 6 months ago

julia> @be k_closest_parallel($X1, $q1, 100)
Benchmark: 83 samples with 1 evaluation
min    1.099 ms (58 allocs: 23.906 KiB)
median 1.117 ms (58 allocs: 23.906 KiB)
mean   1.180 ms (58 allocs: 23.906 KiB)
max    2.570 ms (58 allocs: 23.906 KiB)

domluna / tinyrag

`reinterpret` to `Int64` can increase throughput #1