Closed StefanKarpinski closed 4 years ago
What is the trickier implementation?
Oops, edited to add that.
I was on the edge of my seat there for a minute :)
#!/usr/bin/julia -f
macro time_func(f, args...)
    args = eval(current_module(), Expr(:tuple, args...))::Tuple
    argnames = Symbol[gensym() for i in 1:length(args)]
    types = map(typeof, args)
    quote
        function wrapper($(argnames...))
            $(Expr(:meta, :noinline))
            $f($(argnames...))
        end
        function timing_wrapper()
            println($f, $types)
            wrapper($(args...))
            gc()
            @time for i in 1:1000000000
                wrapper($(args...))
            end
            gc()
        end
        timing_wrapper()
    end
end
ror1(x::UInt64, k::Int8) = (x >>> (0x3f & k)) | (x << (0x3f & -k))
function ror2(x::UInt64, k::Int8)
    Base.llvmcall("""
    %3 = tail call i64 asm \"rorq \$1,\$0\", \"=r,{cx},0,~{dirflag},~{fpsr},~{flags}\"(i8 %1, i64 %0)
    ret i64 %3
    """, UInt64, Tuple{UInt64, UInt8}, x, k)
end
for i in 1:10
    println("$i: $(hex(ror1(UInt64(1), Int8(i))))")
end
for i in 1:10
    println("$i: $(hex(ror2(UInt64(1), Int8(i))))")
end
code_native(ror1, Tuple{UInt64, Int8})
code_native(ror2, Tuple{UInt64, Int8})
@time_func(ror1, UInt64(1), Int8(10))
@time_func(ror2, UInt64(1), Int8(10))
Output:
1: 8000000000000000
2: 4000000000000000
3: 2000000000000000
4: 1000000000000000
5: 800000000000000
6: 400000000000000
7: 200000000000000
8: 100000000000000
9: 80000000000000
10: 40000000000000
1: 8000000000000000
2: 4000000000000000
3: 2000000000000000
4: 1000000000000000
5: 800000000000000
6: 400000000000000
7: 200000000000000
8: 100000000000000
9: 80000000000000
10: 40000000000000
.text
Filename: rol.jl
Source line: 0
pushq %rbp
movq %rsp, %rbp
Source line: 25
movb %sil, %cl
rorq %cl, %rdi
movq %rdi, %rax
popq %rbp
retq
.text
Filename: rol.jl
Source line: 0
pushq %rbp
movq %rsp, %rbp
Source line: 28
movb %sil, %cl
rorq %cl, %rdi
movq %rdi, %rax
popq %rbp
retq
ror1(UInt64,Int8)
2.076 seconds
ror2(UInt64,Int8)
2.071 seconds
The llvmcall body is copied from clang -O3 --emit-llvm output, and llvmcall works.
P.S. By "copied from clang -O3 --emit-llvm output" I meant the output for the following inline asm.
uint64_t
rotr64(uint64_t x, uint8_t r)
{
    asm("rorq %1,%0" : "+r" (x) : "c" (r));
    return x;
}
Which is in turn adapted from here
So, you add this for Intel platforms, and fall back to the old for ARM, Power, etc.? (or get somebody to figure out the correct inline asm for those platforms)? :+1: For my bit twiddling, this would be very nice, esp. since I'd no longer have to deal with inline asm for each platform myself.
Well, IMHO, although the inline assembly works, it should not be done in Julia in most cases.
In general, it should really be the job of LLVM to emit the best assembly, and recent versions already can. Inline assembly can probably be used for something that directly addresses special hardware features, but probably not for general-purpose functions, and especially not this one, since LLVM can already do it.
Ok, I thought people had said that llvm didn't support it... It does as of what version?
It doesn't support it as an LLVM instruction, but the x86_64 backend can recognize the code and emit a rol instruction. As mentioned above, this is LLVM 3.6.
@yuyichao, you said:
LLVM 3.6.1 seems to be smart enough to optimize it.
I asked as of what version it was supported? (i.e. the earliest version, not what you used). Is that supported in 3.3?
I suppose @StefanKarpinski was on 3.3, and as I just checked, it doesn't optimize the function to rol. But inline assembly doesn't seem to be supported by the old JIT anyway, so it won't work.
cc: @VicDrastik
Just an update.
I happened to notice that LLVM (3.7) no longer emits rorq for the pure Julia version with FastISel, i.e. on current master (disabling FastISel makes it emit the rorq instruction again).
The difference in performance on my Haswell laptop is ~10%. Not sure if this is good enough...
c.c. @Keno
Isn't there an -O flag for Julia that should make speed/performance tradeoffs such as calling FastISel or not?
Now that we've deprecated the old ror and rol meanings, and now that it appears that LLVM is happy generating the x86 intrinsics with 1.0, is it worth exposing these bitwise operations as first-class Julia functions?
julia> ror(x::Int, k::Int) = (x >>> (0x3f & k)) | (x << (0x3f & -k))
ror (generic function with 1 method)
julia> @code_native ror(rand(Int), 3)
.section __TEXT,__text,regular,pure_instructions
; Function ror {
; Location: REPL[1]:1
; Function |; {
; Location: REPL[1]:1
movl %esi, %ecx
decl %eax
rorl %cl, %edi
;}
decl %eax
movl %edi, %eax
retl
nopl (%eax)
;}
Yes, I think we should certainly expose these. That still seems like a lot of instructions for a simple ror operation. Are the movl and decl instructions part of the function preamble and postamble these days?
I get
julia> @code_native ror(rand(Int), 3)
.section __TEXT,__text,regular,pure_instructions
; Function ror {
; Location: REPL[11]:1
; Function |; {
; Location: REPL[11]:1
movl %esi, %ecx
rorq %cl, %rdi
;}
movq %rdi, %rax
retq
nopl (%rax)
;}
Looks like @mbauman is on a 32bit machine?
Nope, that's the official 1.0.0 binary on my Mac (an old westmere/nehalem system). On master I see the same as you, @KristofferC.
Yep, same as you with 1.0.1 binary for me.
Also get the decl, rorl, decl instructions with nightly, so perhaps a difference between source build and binary.
rorl only operates on the low 32 bits though, so something is odd about that.
The whole code fragment operates on 32-bit values, so it's self-consistent.
Since you are on Mac, I presume that you're seeing https://github.com/JuliaLang/julia/issues/28046
1.1.0 release shows the correct output now (https://github.com/JuliaLang/julia/issues/28046)
True, so we can now generate efficient native code, and also have it disassemble correctly on Macs, but shouldn't this issue also include exposing ror and rol as first-class Julia functions (as per @mbauman's post above)? I think non-expert users may have a hard time figuring out how to generate efficient code, even if they find this issue (example from discourse).
Seems like a good idea.
bump
Someone just needs to make a PR defining these, adding some tests and NEWS.
Should these be implemented for all different integer types (Int8, Int16, Int32, Int64, Int128)? Also for the unsigned integer types?
At least the unsigned ones. I'm not sure what the right definition of ror and rol for signed types is except if you want to just rotate them as if they were unsigned, i.e. cast, rotate, cast back.
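The "cast, rotate, cast back" idea for signed types can be sketched as follows. This is a minimal illustration for Int8 only; the names ror_u and ror_s are hypothetical and not part of Base. Because >>> is a logical (unsigned) shift in Julia even on signed integers, applying the masked formula directly to a signed value should give the same bits as the round trip through the unsigned type.

```julia
# Hypothetical helper names for illustration; not part of Base.

# Rotate right on the unsigned 8-bit type, with the shift count reduced modulo 8.
ror_u(x::UInt8, k::Integer) = (x >>> (0x07 & k)) | (x << (0x07 & -k))

# Signed version: reinterpret as unsigned, rotate, reinterpret back.
ror_s(x::Int8, k::Integer) = reinterpret(Int8, ror_u(reinterpret(UInt8, x), k))

# The same masked formula applied directly to the signed value.
ror_direct(x::Int8, k::Integer) = (x >>> (0x07 & k)) | (x << (0x07 & -k))

ror_s(Int8(1), 1)    # the low bit wraps around into the sign bit
ror_s(Int8(-2), 1) == ror_direct(Int8(-2), 1)
```

Rotating the raw bits this way ignores the sign entirely, which is the behavior suggested in the comment above.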
I had this in my files: (I use rotate with signed ints)
for (T,K) in ((Int128, 0x7f), (Int64, 0x3f), (Int32, 0x1f), (Int16, 0x0f), (Int8, 0x07),
              (UInt128, 0x7f), (UInt64, 0x3f), (UInt32, 0x1f), (UInt16, 0x0f), (UInt8, 0x07))
    @eval begin
        ror(x::$T, k::I) where {I<:Integer} = (x >>> ($K & k)) | (x << ($K & -k))
        rol(x::$T, k::I) where {I<:Integer} = (x << ($K & k)) | (x >>> ($K & -k))
    end
end
Nice. The colons before the UInt8 literals are unnecessary and they could be computed from sizeof(T), but that's a nice generic implementation. Having thought a little, I do think that rotating the raw bits of signed integers is the only reasonable thing to do.
I removed the colons. I do not see how to use sizeof(T) without adding bloat to the functions.
for T in Base.BitInteger_types
    mask = UInt8(sizeof(T) << 3 - 1)
    @eval begin
        ror(x::$T, k::Integer) = (x >>> ($mask & k)) | (x << ($mask & -k))
        rol(x::$T, k::Integer) = (x << ($mask & k)) | (x >>> ($mask & -k))
    end
end
elegant
So here's a kooky idea. We don't have great words for ror and rol. Nor do we have great words for popcnt/count_ones or tzcnt/trailing_zeros and friends. BUT we do have great words for them if we were to treat the bits as elements in an array.
Perhaps this is too clever by half, but I like the idea of exposing this through a special Bits struct:
struct Bits{T <: Integer} <: AbstractVector{Bool}
    data::T
end
Base.size(::Bits{T}) where {T} = (sizeof(T)*8,)
Base.getindex(b::Bits, i::Int) = b.data & (1 << (i-1)) != 0
This is likely more useful than bitstring, and it can provide the relevant specializations for these hard-to-name bit-twiddling optimizations:
| old | alternative |
|---|---|
| ror(x, k) | Integer(circshift(Bits(x), k)) |
| count_ones(x) | count(Bits(x)) |
| count_zeros(x) | count(!, Bits(x)) |
| leading_zeros(x) | sizeof(x)*8 - findlast(Bits(x)) |
| leading_ones(x) | sizeof(x)*8 - findlast(!, Bits(x)) |
| trailing_zeros(x) | findfirst(Bits(x)) - 1 |
| trailing_ones(x) | findfirst(!, Bits(x)) - 1 |
Introduce one new name, get rid of 6 of the 37 remaining Base exports that are multi_word_with_underscores. Alright, so those leading_* guys look awful, but most of the time you actually want just the findlast(Bits(x)) index. Perhaps the axes should be 0:sizeof(T)*8-1, but then we still need a nice name for the function that does the sum(x.*2.^axes(x, 1)) integer reconstitution. It's not really an integer conversion, and even a constructor feels weird: would it take all vectors of bool values? Note that Bits(::BigInt) could generate the appropriate BitArray (and back, however we do that).
The biggest downside is that when you're doing bit-twiddling, you pretty much always want to know that you're using those strangely-named intrinsics. This makes them look like any other array function (because they are).
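A minimal sketch of how such a wrapper could forward an array function to the bit intrinsic, assuming 1-based axes as in the struct above. The count method is an assumed specialization written for illustration, not an existing Base or package API, and getindex is written with shifts on the stored type so that it also works for 128-bit integers.

```julia
# Sketch of the proposed wrapper: an integer viewed as a vector of its bits.
struct Bits{T<:Integer} <: AbstractVector{Bool}
    data::T
end

Base.size(::Bits{T}) where {T} = (sizeof(T) * 8,)

# Index 1 is the least significant bit.
Base.getindex(b::Bits{T}, i::Int) where {T} = (b.data >>> (i - 1)) & one(T) != zero(T)

# Assumed specialization: counting true elements is a population count.
Base.count(b::Bits) = count_ones(b.data)

b = Bits(0x16)        # bit pattern 00010110
count(b)              # uses count_ones under the hood
findfirst(b) - 1      # same value as trailing_zeros(0x16)
8 - findlast(b)       # same value as leading_zeros(0x16)
```

The generic findfirst and findlast fallbacks walk the bits one at a time here; a real implementation would presumably specialize those on trailing_zeros and leading_zeros as well.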
I like the idea of exposing this through a special Bits struct
For the record, this is exactly what the Bits package does. It needs a bit more love (I ported a small C++ library of mine, which I cleaned up partially in the package, but stopped after getting the bits I needed at that time), for example to specialize some array methods like count. My big question was indeed what the axes should be. It's currently 1:n, so in a way it's really like an (immutable) BitVector but with only one word.
The dual of Bits(x) would be something like Support(x), which is like a BitSet with only one word, which contains the indexes of ones in x (it's not published yet in the library).
The obvious way to implement bit rotation is this:
This has a couple of issues, however. First, rotating by more than a word is broken:
Second, the native code is awful:
Both can be improved with a slightly trickier implementation:
That's way better machine code, but ror is an x86 instruction, so this should just boil down to that. Given that LLVM does not expose rotate instructions, what do we have to do here to get this to emit a single x86 instruction?
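The code snippets from this description did not survive here, so the following is only a hedged sketch of the contrast being described, not the original code. ror_naive is an assumed form of the "obvious" implementation; ror_masked uses the same masking as ror1 earlier in the thread.

```julia
# Assumed "obvious" version: only correct for 0 <= k < 64.
ror_naive(x::UInt64, k::Integer) = (x >>> k) | (x << (64 - k))

# Masked version, same formula as ror1 above: shift counts are reduced modulo 64.
ror_masked(x::UInt64, k::Integer) = (x >>> (0x3f & k)) | (x << (0x3f & -k))

x = 0x0123456789abcdef
ror_naive(x, 4) == ror_masked(x, 4)     # both agree for small shift counts
ror_masked(x, 68) == ror_masked(x, 4)   # rotating by 68 wraps around to 4
ror_naive(x, 68) == ror_masked(x, 4)    # false: the naive version does not wrap
```

In the naive version, x >>> 68 is zero and x << (64 - 68) becomes a right shift by 4, so the bits that should wrap around are lost.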