JuliaIO / LibExpat.jl

Julia interface to the Expat XML parser library

Make cfunction's local variables #83

Closed musm closed 7 years ago

musm commented 7 years ago

close https://github.com/JuliaIO/LibExpat.jl/issues/77

musm commented 7 years ago

@stevengj it may actually be better to make these global constant refs instead of allocating new pointers every time the make_parser function is called, e.g.


const cb = Ref{Ptr{Void}}()

function test()
    cb[] = cfunction( ...)
end

but perhaps not worth it?

stevengj commented 7 years ago

@musm, it would certainly be possible, e.g. you could initialize them in __init__. I'm not sure it makes any measurable difference, however, since on subsequent calls cfunction just returns a pointer to the previously compiled code. (It doesn't "allocate" a pointer, since pointers themselves aren't heap-allocated.)
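For reference, a minimal sketch of that __init__-cached pattern (the handler name is hypothetical; note that on modern Julia the spelling is @cfunction and Ptr{Cvoid} rather than the cfunction/Ptr{Void} used in this thread):

```julia
# Module-level cache for the callback pointer, filled in at load time.
const start_cb = Ref{Ptr{Cvoid}}(C_NULL)

# Hypothetical callback with a C-compatible signature.
my_start_handler(data::Ptr{Cvoid}) = nothing

function __init__()
    # Function pointers are not valid across precompilation,
    # so they must be created when the module is (re)loaded.
    start_cb[] = @cfunction(my_start_handler, Cvoid, (Ptr{Cvoid},))
end
```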

In a quick benchmark, it seems like calling cfunction repeatedly is about the same speed as looking up a Ref value, and both involve no allocations:

julia> using BenchmarkTools, Compat;

julia> foo(x) = x + 1
foo (generic function with 1 method)

julia> f() = cfunction(foo, Int, (Int,))
f (generic function with 1 method)

julia> g() = C_NULL
g (generic function with 1 method)

julia> @btime f();
  1.825 ns (0 allocations: 0 bytes)

julia> @btime g();
  0.031 ns (0 allocations: 0 bytes)

julia> const cb = Ref{Ptr{Void}}()
Base.RefValue{Ptr{Void}}(Ptr{Void} @0x0000000109733040)

julia> cb[] = f()
Ptr{Void} @0x00000001195fe640

julia> f_cached() = cb[]
f_cached (generic function with 1 method)

julia> @btime f_cached();
  1.827 ns (0 allocations: 0 bytes)

stevengj commented 7 years ago

Actually, it looks like the cfunction pointer is just inlined in the compiled code nowadays, so there is zero overhead:

julia> @code_llvm f()

define i8* @julia_f_61016() #0 !dbg !5 {
top:
  ret i8* bitcast (i64 (i64)* @jlcapi_foo_60882 to i8*)
}

julia> @code_llvm f_cached()

define i8* @julia_f_cached_61063() #0 !dbg !5 {
top:
  %0 = load i8*, i8** inttoptr (i64 4426056496 to i8**), align 16
  ret i8* %0
}

musm commented 7 years ago

Yeah, the only difference is that directly calling the cfunction compiles to a bitcast:

ret i8* bitcast (i64 (i64)* @jlcapi_foo_61627 to i8*)

while using a const Ref initialized in __init__ compiles to a load:

%0 = load i8*, i8** inttoptr (i64 169524560 to i8**), align 16
ret i8* %0

which, as you mention, has essentially the same cost, so the difference is basically nil.

stevengj commented 7 years ago

bitcast is just LLVM keeping track of the type, I think; it doesn't actually translate to any machine instruction. The native code just pushes a literal address into the return register:

julia> @code_native f()
    .section    __TEXT,__text,regular,pure_instructions
Filename: REPL[3]
    pushq   %rbp
    movq    %rsp, %rbp
Source line: 1
    movabsq $4815779184, %rax       ## imm = 0x11F0AF570
    popq    %rbp
    retq

stevengj commented 7 years ago

Whereas the cached version requires an additional instruction to dereference the ref pointer:

julia> @code_native f_cached()
    .section    __TEXT,__text,regular,pure_instructions
Filename: REPL[8]
    pushq   %rbp
    movq    %rsp, %rbp
Source line: 1
    movabsq $4528309056, %rax       ## imm = 0x10DE88340
    movq    (%rax), %rax
    popq    %rbp
    retq
    nopw    %cs:(%rax,%rax)

However, the additional cost doesn't seem to be measurable by BenchmarkTools. (And who knows if it actually costs any extra cycles thanks to pipelining etc... CPUs are complicated.)

In any case, the non-cached version is simpler, so I would just stick with that.
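For comparison, the simpler non-cached version settled on here just asks for the pointer at each call site (sketched in modern @cfunction syntax; the handler and function names are hypothetical):

```julia
# Hypothetical callback with a C-compatible signature.
my_handler(data::Ptr{Cvoid}) = nothing

function make_parser_sketch()
    # Obtaining the pointer here compiles down to a literal address,
    # so there is no per-call allocation or lookup cost.
    return @cfunction(my_handler, Cvoid, (Ptr{Cvoid},))
end
```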