Closed DilumAluthge closed 3 years ago
I'll add counters and @assert
s to the while loops to try and find out where it's getting suck.
CI is now passing on master, so I'll close this.
Reopening this.
Failure on 1.5 and 1.6: https://github.com/chriselrod/ThreadingUtilities.jl/runs/1725829471?check_suite_focus=true
One timeout is on Linux, and one timeout is on macOS. So presumably it's not an OS-related bug.
Maybe run some kind of loop on your local machines?
E.g.
import Pkg
while true
Pkg.test("ThreadingUtilities")
end
You could do the loop in bash
instead of Julia.
And maybe run all of that inside of rr
? So that when you finally catch a timeout, you can replay it?
Here's another failure, on macOS: https://github.com/chriselrod/ThreadingUtilities.jl/runs/1726038737
Got another failure, this time on Linux: https://github.com/chriselrod/ThreadingUtilities.jl/runs/1726061663
And on this one, we can see the versioninfo
:
Testing Running tests...
Julia Version 1.6.0-beta1
Commit b84990e1ac (2021-01-08 12:42 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
Ubuntu 18.04.5 LTS
uname: Linux 5.4.0-1036-azure #38~18.04.1-Ubuntu SMP Wed Jan 6 18:26:30 UTC 2021 x86_64 x86_64
CPU: Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz:
speed user nice sys idle irq
#1 2397 MHz 395 s 0 s 87 s 562 s 0 s
#2 2397 MHz 270 s 0 s 109 s 643 s 0 s
Memory: 6.791389465332031 GB (4148.625 MB free)
Uptime: 116.0 sec
Load Avg: 0.99 0.4 0.15
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-11.0.0 (ORCJIT, haswell)
Environment:
JULIA_NUM_THREADS = 5
JULIA_LOAD_PATH = @:/tmp/jl_KHoQmi
LEIN_HOME = /usr/local/lib/lein
M2_HOME = /usr/share/apache-maven-3.6.3
ANDROID_HOME = /usr/local/lib/android/sdk
JAVA_HOME_11_X64 = /usr/lib/jvm/adoptopenjdk-11-hotspot-amd64
JAVA_HOME_12_X64 = /usr/lib/jvm/adoptopenjdk-12-hotspot-amd64
GITHUB_ACTION_PATH = /home/runner/work/_actions/julia-actions/julia-runtest/v1
SWIFT_PATH = /usr/share/swift/usr/bin
JAVA_HOME = /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64
GRADLE_HOME = /usr/share/gradle
PIPX_HOME = "/opt/pipx"
HOME = /home/runner
JAVA_HOME_8_X64 = /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64
DEPLOYMENT_BASEPATH = /opt/runner
GITHUB_EVENT_PATH = /home/runner/work/_temp/_github_workflow/event.json
HOMEBREW_REPOSITORY = "/home/linuxbrew/.linuxbrew/Homebrew"
GITHUB_PATH = /home/runner/work/_temp/_runner_file_commands/add_path_6a2c13ff-4d32-45c3-9f22-fd85b6f914b6
JAVA_HOME_7_X64 = /usr/lib/jvm/zulu-7-azure-amd64
HOMEBREW_PREFIX = "/home/linuxbrew/.linuxbrew"
ANT_HOME = /usr/share/ant
PATH = /opt/hostedtoolcache/julia/1.6.0-beta1/x64/bin:/home/linuxbrew/.linuxbrew/bin:/home/linuxbrew/.linuxbrew/sbin:/opt/pipx_bin:/usr/share/rust/.cargo/bin:/home/runner/.config/composer/vendor/bin:/home/runner/.dotnet/tools:/snap/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
SELENIUM_JAR_PATH = /usr/share/java/selenium-server-standalone.jar
HOMEBREW_CELLAR = "/home/linuxbrew/.linuxbrew/Cellar"
JULIA_LOAD_PATH = @:/tmp/jl_KHoQmi
┌ Info:
└ Sys.CPU_THREADS = 2
┌ Info:
└ VectorizationBase.NUM_CORES = 2
┌ Info:
└ length(ThreadingUtilities.TASKS) = 1
[ Info: Beginning to run the ThreadingUtilities test suite
Test Summary: | Pass Total
THREADPOOL | 4 4
0.472754 seconds (741.11 k allocations: 41.440 MiB, 19.56% gc time, 99.36% compilation time)
Test Summary: | Pass Total
Internals | 2 2
0.000345 seconds (229 allocations: 29.820 KiB)
Error: The operation was canceled.
Here's a failing job: https://github.com/chriselrod/ThreadingUtilities.jl/runs/1726160608
And here's a passing job on the same commit: https://github.com/chriselrod/ThreadingUtilities.jl/runs/1726160653
Log for the failing job:
┌ Info:
└ Threads.nthreads() = 5
┌ Info:
└ Sys.CPU_THREADS = 2
┌ Info:
└ VectorizationBase.NUM_CORES = 2
┌ Info:
└ length(ThreadingUtilities.TASKS) = 1
[ Info: Beginning to run the ThreadingUtilities test suite
Test Summary: | Pass Total
THREADPOOL | 4 4
0.468356 seconds (741.25 k allocations: 41.449 MiB, 19.53% gc time, 99.50% compilation time)
Test Summary: | Pass Total
Internals | 3 3
0.004237 seconds (266 allocations: 32.570 KiB, 92.60% compilation time)
Skipping Base.active_repl
Skipping Base.active_repl_backend
Skipping Base.Filesystem.JL_O_RANDOM
Skipping Base.Filesystem.JL_O_SEQUENTIAL
Skipping Base.Filesystem.JL_O_SHORT_LIVED
Skipping Base.Filesystem.JL_O_TEMPORARY
Skipping Base.BinaryPlatforms.compiler_abi
Skipping Base.active_repl
Skipping Base.active_repl_backend
Skipping Base.Filesystem.JL_O_RANDOM
Skipping Base.Filesystem.JL_O_SEQUENTIAL
Skipping Base.Filesystem.JL_O_SHORT_LIVED
Skipping Base.Filesystem.JL_O_TEMPORARY
Skipping Base.BinaryPlatforms.compiler_abi
Error: The operation was canceled.
Log for the passing job:
┌ Info:
└ Threads.nthreads() = 5
┌ Info:
└ Sys.CPU_THREADS = 3
┌ Info:
└ VectorizationBase.NUM_CORES = 3
┌ Info:
└ length(ThreadingUtilities.TASKS) = 2
[ Info: Beginning to run the ThreadingUtilities test suite
Test Summary: | Pass Total
THREADPOOL | 4 4
0.299020 seconds (741.38 k allocations: 41.456 MiB, 2.36% gc time, 98.06% compilation time)
Test Summary: | Pass Total
Internals | 3 3
0.004492 seconds (256 allocations: 32.102 KiB, 80.02% compilation time)
Skipping Base.active_repl
Skipping Base.active_repl_backend
Skipping Base.Filesystem.JL_O_RANDOM
Skipping Base.Filesystem.JL_O_SEQUENTIAL
Skipping Base.Filesystem.JL_O_SHORT_LIVED
Skipping Base.Filesystem.JL_O_TEMPORARY
Skipping Base.BinaryPlatforms.compiler_abi
Skipping Base.active_repl
Skipping Base.active_repl_backend
Skipping Base.Filesystem.JL_O_RANDOM
Skipping Base.Filesystem.JL_O_SEQUENTIAL
Skipping Base.Filesystem.JL_O_SHORT_LIVED
Skipping Base.Filesystem.JL_O_TEMPORARY
Skipping Base.BinaryPlatforms.compiler_abi
Test Summary: | Pass Total
Aqua | 7 7
10.775370 seconds (2.96 M allocations: 175.123 MiB, 0.57% gc time, 19.82% compilation time)
[ Info: foo 1
[ Info: foo 2
[ Info: foo 3
[ Info: foo 4
[ Info: foo 5
[ Info: foo 6
[ Info: foo 7
Test Summary: | Pass Total
ThreadingUtilities.jl | 2 2
[ Info: Finished running the ThreadingUtilities test suite
Testing ThreadingUtilities tests passed
This is where it gets really interesting. From my PR that triggered these jobs, looking specifically at the commit that those jobs are from:
It never got to line 83 (the @info "foo 1"
) line.
So this implies that the timeout is actually happening inside the Aqua.test_all(ThreadingUtilities)
.
I suspect the timeout is in __init__()
then.
Me too. I'm particularly suspicious of that pause()
inside a while loop.
pause
only takes a few nanoseconds.
Yeah but what if the condition in the while loop never evaluates to false?
julia> using VectorizationBase
julia> @code_native syntax=:intel pause()
.text
; ┌ @ VectorizationBase.jl:317 within `pause'
pause
ret
nop word ptr cs:[rax + rax]
; └
pause
is an assembly instruction that hints to the CPU that you're inside a spin loop, waiting for another thread to do something.
Yes, I suspect that the condition never evaluates to false
.
In this job, we never successfully import the package, which smells like an init problem: https://github.com/chriselrod/ThreadingUtilities.jl/runs/1726922690
But in a different job, we did successfully import the package. But it still timed out: https://github.com/chriselrod/ThreadingUtilities.jl/runs/1726922889
What is supposed to happen:
m = THREADPOOL[tid]
GC.@preserve m _atomic_min!(pointer(m), SPIN) # set to SPIN
We set it to SPIN
.
We then wake up the thread:
t = Task(m); t.sticky = true # create and pin
# set to tid, we have tasks 2...nthread, from 1-based ind perspective
ccall(:jl_set_task_tid, Cvoid, (Any, Cint), t, tid % Cint)
TASKS[tid] = t
wake_thread!(tid) # task should immediately sleep
When awake, it runs:
function (tt::ThreadTask)()
p = pointer(tt)
memory = tt.memory
max_wait = 1 << 20
wait_counter = max_wait
GC.@preserve memory begin
while true
if _atomic_cas_cmp!(p, TASK, LOCK)
_call(p)
@assert _atomic_cas_cmp!(p, LOCK, SPIN)
wait_counter = 0
continue
end
pause()
if (wait_counter += 1) > max_wait
wait_counter = 0
if _atomic_cas_cmp!(p, SPIN, WAIT)
wait()
_call(p)
@assert _atomic_cas_cmp!(p, LOCK, SPIN)
end
end
end
end
end
It should be set to SPIN
(because we just set it to SPIN
), and then immediately the wait_counter
should be greater than max_wait
, so that if _atomic_cas_cmp!(p, SPIN, WAIT)
is true
, and it gets set to WAIT
.
Then of course
while !_atomic_cas_cmp!(pointer(m), WAIT, WAIT)
pause()
end
_atomic_cas_cmp!(pointer(m), WAIT, WAIT)
is true, and thus !
the condition is false and the loop is done.
Check out the logs for https://github.com/chriselrod/ThreadingUtilities.jl/runs/1726933020
This job ran on commit https://github.com/chriselrod/ThreadingUtilities.jl/pull/12/commits/b9054e071415abb19848ddf51ae10c322d382cd7
Could you add @show _atomic_max!(p, SPIN)
?
Yes. It'd also help to know which loop it is in at the time, so maybe @show @__LINE__, _atomic_max!(p, SPIN)
I'm about to leave the house- want to push those changes to my branch? You should be able to.
Okay, thanks for all your help so far. I can already see here:
[ Info: threadtask beginning to pause...
ERROR: [ Info: threadtask finished pausing
InitError: KeyError: key 0x9b3e6abc04ee57a9 not found
That the current state is uninitialized junk, rather than one of the enum
values.
I just tried moving the setting-to-0 inside of the closure, maybe that helps.
But the junk shouldn't really have been a problem, as the setting inside __init__
should've reached the thread eventually?
Everything should be fixed as of the "add utils.jl" commit. I'll reopen this if we see more intermittent timeouts.
Some of the jobs are failing, and others are timing out.