JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

`jl_cumulative_compile_time_ns_before` tracks compilation time per thread rather than per task #41271

Open rgankema opened 3 years ago

rgankema commented 3 years ago

For my work at RelationalAI I'm trying to gather metrics on how much time is spent in run-time compilation while evaluating queries. I was hoping to use jl_cumulative_compile_time_ns_[before/after] for this, a little bit like this:

function evaluate_query(query_fn)
    time_before = time_ns()
    compile_before = jl_cumulative_compile_time_ns_before()

    query_fn()

    compile_after = jl_cumulative_compile_time_ns_after()
    time_after = time_ns()
    inc_metric!(metrics, "query_eval_total_time", time_after - time_before)
    inc_metric!(metrics, "query_jl_compile_time", compile_after - compile_before)
end

This works fine if there are no background tasks and queries run sequentially. It also works fine in a multi-threaded scenario, as long as only a single task runs at a time on any given thread, because jl_cumulative_compile_time_ns_[before/after] appear to use separate counters per thread. However, it breaks when multiple tasks are multiplexed on the same thread. For instance, if one call to evaluate_query has a very long-running query_fn, and other tasks that incur compilation time are running at the same time on that thread, that first evaluate_query call will also record all the compilation time triggered by those other tasks. This means we overestimate how much time was spent in compilation for that query, potentially by a very large margin.

To see this behavior in action, please consider the following MRE:

# Note: `@async` is exported from Base, so no import is needed.

function do_something_and_measure_compilation_time(a, b)
    compile_before = Base.cumulative_compile_time_ns_before()

    # Function that we want to measure.
    x = do_something(a, b)

    compile_after = Base.cumulative_compile_time_ns_after()
    compile_duration_s = (compile_after - compile_before) / 1_000_000_000
    @show compile_duration_s
    return x
end

# Simulates a long running function that costs a lot more time to run than to compile.
function do_something(a, b)
    x = foo(a, a + 3, b, b + 5)
    sleep(5)
    return x
end

# Something to compile.
function foo(x1, x2, y1, y2)
    xs = [x1:x2...]
    ys = [y1:y2...]
    return sum(xs) * max(xs...) * sum(ys) * max(ys...)
end

# Do a couple of runs. We should see compilation time at first, but that should quickly go
# away.
do_something_and_measure_compilation_time(1, 2)
do_something_and_measure_compilation_time(1, 2)
do_something_and_measure_compilation_time(1, 2)
do_something_and_measure_compilation_time(1, 2)

# Call foo with types we haven't seen before, in the background.
@async begin
    sleep(1)
    for T1 in [Int32, Int16, Int8]
        for T2 in [Int32, Int16, Int8]
            @time foo(zero(T1), one(T1), zero(T2), one(T2))
        end
    end
end

# Final run, this time we'll measure compilation time from the above, even though it's
# not compilation that comes from this task.
do_something_and_measure_compilation_time(1, 2)

This outputs:

compile_duration_s = 0.024834394
compile_duration_s = 0.002964853
compile_duration_s = 0.0
compile_duration_s = 0.0
  0.037290 seconds (87.25 k allocations: 5.084 MiB, 99.79% compilation time)
  0.077509 seconds (169.89 k allocations: 10.369 MiB, 99.61% compilation time)
  0.068803 seconds (170.97 k allocations: 10.438 MiB, 99.69% compilation time)
  0.008911 seconds (4.97 k allocations: 271.748 KiB, 99.42% compilation time)
  0.009805 seconds (4.92 k allocations: 266.718 KiB, 99.52% compilation time)
  0.014783 seconds (4.97 k allocations: 269.733 KiB, 98.73% compilation time)
  0.009418 seconds (4.97 k allocations: 270.107 KiB, 99.38% compilation time)
  0.009743 seconds (4.97 k allocations: 269.671 KiB, 99.47% compilation time)
  0.013891 seconds (4.92 k allocations: 266.905 KiB, 99.33% compilation time)
compile_duration_s = 0.298222365

As you can see, the last call to do_something_and_measure_compilation_time recorded the compilation time of the background task. Unfortunately that behavior makes it unusable for the use-case described above. We'd therefore like to request that jl_cumulative_compile_time_ns_[before/after] is changed such that it keeps a counter per task, rather than per thread. Thanks!
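Until something like that lands, one partial workaround is to serialize measured regions so that at most one task is attributing compile time at any moment. This is only a sketch: `measure_compile_time` and `COMPILE_MEASURE_LOCK` are invented names, and it assumes the Julia-1.6-era `Base.cumulative_compile_time_ns_before/after` API used in the MRE above. Compilation triggered by background tasks that are *not* measuring can still pollute the counter, so this narrows the problem rather than fixing it.

```julia
# Sketch of a partial workaround: serialize measured regions behind a lock,
# so two measuring tasks never overlap and double-count each other's
# compilation. (Names invented; not part of Julia itself.)
const COMPILE_MEASURE_LOCK = ReentrantLock()

function measure_compile_time(f)
    lock(COMPILE_MEASURE_LOCK) do
        before = Base.cumulative_compile_time_ns_before()
        result = f()
        after = Base.cumulative_compile_time_ns_after()
        # Returns the function's result plus the compile-time delta in ns.
        return result, after - before
    end
end
```

With this, `result, compile_ns = measure_compile_time(query_fn)` would replace the inline bookkeeping in `evaluate_query`, at the cost of serializing the measured regions themselves (trading concurrency for attribution accuracy).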

NHDaly commented 3 years ago

Here's another, maybe a bit simpler, example, which shows the same effect as above using @time. In this program you can see that timing one Task which doesn't incur any compilation time, while it runs concurrently with another Task on the same thread that does, reports that compilation time in the call to @time, even though it wasn't caused by the timed Task.

(Note that this example doesn't work if you copy/paste it into the REPL; I think that's because the REPL is compiling the display output at the same time as the program runs, so both @time calls show some compilation time. Instead, I run this example as a script.)

# async-time-example.jl

const running = Threads.Atomic{Bool}(true)

loop() = while running[] ; yield(); GC.safepoint(); end

# Run once as warmup
running[] = true
t = @async loop()
sleep(0.1) ; running[] = false

wait(t)

# ----------------------------

running[] = true

# Start timing a task, which doesn't have any compilation time.
t = @async @time loop()

# Finish timing the task. Note no compilation time.
sleep(0.1) ; running[] = false

wait(t)

# ----------------------------

running[] = true

# Now time the *same* loop in a new task, which shouldn't have any compilation time.
t = @async @time while running[] ; yield(); GC.safepoint(); end

# But start another task *on the same thread,* which DOES spend some compilation time.
t2 = @async (@eval f() = 2+2; @eval f())

# Finish timing the original task, and observe that the compilation time from `t2`
# was recorded by the `@time` in `t`.
sleep(0.1) ; running[] = false

wait(t) ; wait(t2)

And here's the output:

$ julia ~/Downloads/async-time-example.jl
  0.079854 seconds (72 allocations: 4.594 KiB)
  0.093527 seconds (2.27 k allocations: 158.235 KiB, 4.59% compilation time)

The request here is to record compilation time via task-local storage, instead of through thread-local storage.

NHDaly commented 3 years ago

Worth noting that this is actually not unique to compilation time. The number of allocations is tracked globally as well, as shown in this simple example:

julia> function f(x::Array{Any})  x[1] + x[1]  end
f (generic function with 1 method)

julia> t = @async @time sleep(10)
Task (runnable) @0x000000010b059000

julia> @time for _ in 1:3000000 f(Any[2]) end
  0.204893 seconds (3.00 M allocations: 274.658 MiB, 16.42% gc time)

julia>  10.005931 seconds (3.04 M allocations: 277.261 MiB, 0.34% gc time, 0.19% compilation time)

It would be great to change this for all the metrics in @time, though I don't know how feasible that is.

vtjnash commented 3 years ago

IIRC, the code is currently taking extra effort to make this counter global (instead of per-thread). The challenge may be deciding whether we should ever "charge" thread costs (time, memory, etc.) to the parent Tasks (all the tasks that called `wait` on it), and how to handle tasks migrating between threads.

ElSeidy commented 3 years ago

Tracking this discussion. I just came across the same issue :)

NHDaly commented 3 years ago

Thanks @vtjnash - that makes sense. Interesting!

Yeah, after thinking more about it, I can see why the global accounting is actually really valuable in some situations.

It seems like maybe we'd ideally want both types of metrics? In addition to the global numbers we currently report, it would be great to also be able to track task-local metrics: the compilation time (and allocations, GC time, etc.) attributable to the current task.

Does that make sense to you? That we could (have the option to) record both types of metrics: task-local (and maybe including any child tasks that we wait on - as you say), and global?
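For concreteness, here's one purely hypothetical shape such a dual report could take. Nothing like this exists in Base today; all names are invented for illustration.

```julia
# Hypothetical result type for a timed region that reports both accountings.
struct CompileTimeDelta
    global_ns::UInt64      # all compilation anywhere in the process during the window
    task_local_ns::UInt64  # compilation attributable to this task (and awaited children)
end

# A caller could then decide which number to feed into its metrics, e.g.
# the fraction of a query's wall time spent compiling *for that query*:
overhead_fraction(d::CompileTimeDelta, wall_ns) = d.task_local_ns / wall_ns
```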

NHDaly commented 3 years ago

@vtjnash -

IIRC, the code is currently taking extra effort to make this counter global (instead of per-thread).

After looking at this a bit more, I don't think jl_cumulative_compile_time_ns_before is global, though. It's per-thread.

And actually, @janrous-rai just pointed out to me that I think there's a race condition here: if multiple Tasks scheduled on the same thread are both measuring with @time, the first one to finish will disable jl_measure_compile_time for that thread, and the other one will miss some of its compilation time because measurement was disabled out from under it! 😮

https://github.com/JuliaLang/julia/blob/5dbf45a2997acd26e0f72abaa320c9d0d54956a1/src/jitlayers.cpp#L89

To fix the race condition, do you think we could at least change jl_measure_compile_time to be: A) a thread-safe and/or atomic variable, and B) a counter instead of a boolean? That way it would remain enabled as long as at least one Task on the thread is currently measuring compilation time.

Alternatively, is it really so expensive to measure the compilation time that we need to enable/disable it? Could we just keep this as a cumulative count of all compilation time, and remove the jl_measure_compile_time check entirely?
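To illustrate option (B), here's a minimal sketch of the boolean flag replaced by an atomic reference count. It's written in Julia for readability, though the real change would be to the C++ flag in src/jitlayers.cpp, and the helper names are invented:

```julia
# Sketch: replace the per-thread boolean with an atomic counter, so
# measurement stays enabled while at least one task is measuring.
const measuring_tasks = Threads.Atomic{Int}(0)

start_measuring!() = Threads.atomic_add!(measuring_tasks, 1)

function stop_measuring!()
    old = Threads.atomic_sub!(measuring_tasks, 1)  # returns the previous value
    @assert old >= 1 "stop_measuring!() without a matching start_measuring!()"
end

# The JIT would check this instead of the boolean flag.
is_measuring() = measuring_tasks[] > 0
```

With this scheme, a task finishing its measurement can no longer disable accounting out from under another task that is still measuring on the same thread.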