JuliaSIMD / ThreadingUtilities.jl

Utilities for low overhead threading in Julia.
MIT License
17 stars 4 forks source link

CI is intermittently timing out on master #3

Closed DilumAluthge closed 3 years ago

DilumAluthge commented 3 years ago

Some of the jobs are failing, and others are timing out.

chriselrod commented 3 years ago

I'll add counters and @asserts to the while loops to try and find out where it's getting suck.

DilumAluthge commented 3 years ago

CI is now passing on master, so I'll close this.

DilumAluthge commented 3 years ago

Reopening this.

chriselrod commented 3 years ago

Failure on 1.5 and 1.6: https://github.com/chriselrod/ThreadingUtilities.jl/runs/1725829471?check_suite_focus=true

DilumAluthge commented 3 years ago

One timeout is on Linux, and one timeout is on macOS. So presumably it's not an OS-related bug.

DilumAluthge commented 3 years ago

Maybe run some kind of loop on your local machines?

E.g.

import Pkg

while true
    Pkg.test("ThreadingUtilities")
end

You could do the loop in bash instead of Julia.

And maybe run all of that inside of rr? So that when you finally catch a timeout, you can replay it?

DilumAluthge commented 3 years ago

Here's another failure, on macOS: https://github.com/chriselrod/ThreadingUtilities.jl/runs/1726038737

DilumAluthge commented 3 years ago

Got another failure, this time on Linux: https://github.com/chriselrod/ThreadingUtilities.jl/runs/1726061663

And on this one, we can see the versioninfo:

     Testing Running tests...
Julia Version 1.6.0-beta1
Commit b84990e1ac (2021-01-08 12:42 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
      Ubuntu 18.04.5 LTS
  uname: Linux 5.4.0-1036-azure #38~18.04.1-Ubuntu SMP Wed Jan 6 18:26:30 UTC 2021 x86_64 x86_64
  CPU: Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz: 
              speed         user         nice          sys         idle          irq
       #1  2397 MHz        395 s          0 s         87 s        562 s          0 s
       #2  2397 MHz        270 s          0 s        109 s        643 s          0 s

  Memory: 6.791389465332031 GB (4148.625 MB free)
  Uptime: 116.0 sec
  Load Avg:  0.99  0.4  0.15
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.0 (ORCJIT, haswell)
Environment:
  JULIA_NUM_THREADS = 5
  JULIA_LOAD_PATH = @:/tmp/jl_KHoQmi
  LEIN_HOME = /usr/local/lib/lein
  M2_HOME = /usr/share/apache-maven-3.6.3
  ANDROID_HOME = /usr/local/lib/android/sdk
  JAVA_HOME_11_X64 = /usr/lib/jvm/adoptopenjdk-11-hotspot-amd64
  JAVA_HOME_12_X64 = /usr/lib/jvm/adoptopenjdk-12-hotspot-amd64
  GITHUB_ACTION_PATH = /home/runner/work/_actions/julia-actions/julia-runtest/v1
  SWIFT_PATH = /usr/share/swift/usr/bin
  JAVA_HOME = /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64
  GRADLE_HOME = /usr/share/gradle
  PIPX_HOME = "/opt/pipx"
  HOME = /home/runner
  JAVA_HOME_8_X64 = /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64
  DEPLOYMENT_BASEPATH = /opt/runner
  GITHUB_EVENT_PATH = /home/runner/work/_temp/_github_workflow/event.json
  HOMEBREW_REPOSITORY = "/home/linuxbrew/.linuxbrew/Homebrew"
  GITHUB_PATH = /home/runner/work/_temp/_runner_file_commands/add_path_6a2c13ff-4d32-45c3-9f22-fd85b6f914b6
  JAVA_HOME_7_X64 = /usr/lib/jvm/zulu-7-azure-amd64
  HOMEBREW_PREFIX = "/home/linuxbrew/.linuxbrew"
  ANT_HOME = /usr/share/ant
  PATH = /opt/hostedtoolcache/julia/1.6.0-beta1/x64/bin:/home/linuxbrew/.linuxbrew/bin:/home/linuxbrew/.linuxbrew/sbin:/opt/pipx_bin:/usr/share/rust/.cargo/bin:/home/runner/.config/composer/vendor/bin:/home/runner/.dotnet/tools:/snap/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
  SELENIUM_JAR_PATH = /usr/share/java/selenium-server-standalone.jar
  HOMEBREW_CELLAR = "/home/linuxbrew/.linuxbrew/Cellar"
  JULIA_LOAD_PATH = @:/tmp/jl_KHoQmi
┌ Info: 
└   Sys.CPU_THREADS = 2
┌ Info: 
└   VectorizationBase.NUM_CORES = 2
┌ Info: 
└   length(ThreadingUtilities.TASKS) = 1
[ Info: Beginning to run the ThreadingUtilities test suite
Test Summary: | Pass  Total
THREADPOOL    |    4      4
  0.472754 seconds (741.11 k allocations: 41.440 MiB, 19.56% gc time, 99.36% compilation time)
Test Summary: | Pass  Total
Internals     |    2      2
  0.000345 seconds (229 allocations: 29.820 KiB)
Error: The operation was canceled.
DilumAluthge commented 3 years ago

Here's a failing job: https://github.com/chriselrod/ThreadingUtilities.jl/runs/1726160608

And here's a passing job on the same commit: https://github.com/chriselrod/ThreadingUtilities.jl/runs/1726160653

Log for the failing job:

┌ Info: 
└   Threads.nthreads() = 5
┌ Info: 
└   Sys.CPU_THREADS = 2
┌ Info: 
└   VectorizationBase.NUM_CORES = 2
┌ Info: 
└   length(ThreadingUtilities.TASKS) = 1
[ Info: Beginning to run the ThreadingUtilities test suite
Test Summary: | Pass  Total
THREADPOOL    |    4      4
  0.468356 seconds (741.25 k allocations: 41.449 MiB, 19.53% gc time, 99.50% compilation time)
Test Summary: | Pass  Total
Internals     |    3      3
  0.004237 seconds (266 allocations: 32.570 KiB, 92.60% compilation time)
Skipping Base.active_repl
Skipping Base.active_repl_backend
Skipping Base.Filesystem.JL_O_RANDOM
Skipping Base.Filesystem.JL_O_SEQUENTIAL
Skipping Base.Filesystem.JL_O_SHORT_LIVED
Skipping Base.Filesystem.JL_O_TEMPORARY
Skipping Base.BinaryPlatforms.compiler_abi
Skipping Base.active_repl
Skipping Base.active_repl_backend
Skipping Base.Filesystem.JL_O_RANDOM
Skipping Base.Filesystem.JL_O_SEQUENTIAL
Skipping Base.Filesystem.JL_O_SHORT_LIVED
Skipping Base.Filesystem.JL_O_TEMPORARY
Skipping Base.BinaryPlatforms.compiler_abi
Error: The operation was canceled.

Log for the passing job:

┌ Info: 
└   Threads.nthreads() = 5
┌ Info: 
└   Sys.CPU_THREADS = 3
┌ Info: 
└   VectorizationBase.NUM_CORES = 3
┌ Info: 
└   length(ThreadingUtilities.TASKS) = 2
[ Info: Beginning to run the ThreadingUtilities test suite
Test Summary: | Pass  Total
THREADPOOL    |    4      4
  0.299020 seconds (741.38 k allocations: 41.456 MiB, 2.36% gc time, 98.06% compilation time)
Test Summary: | Pass  Total
Internals     |    3      3
  0.004492 seconds (256 allocations: 32.102 KiB, 80.02% compilation time)
Skipping Base.active_repl
Skipping Base.active_repl_backend
Skipping Base.Filesystem.JL_O_RANDOM
Skipping Base.Filesystem.JL_O_SEQUENTIAL
Skipping Base.Filesystem.JL_O_SHORT_LIVED
Skipping Base.Filesystem.JL_O_TEMPORARY
Skipping Base.BinaryPlatforms.compiler_abi
Skipping Base.active_repl
Skipping Base.active_repl_backend
Skipping Base.Filesystem.JL_O_RANDOM
Skipping Base.Filesystem.JL_O_SEQUENTIAL
Skipping Base.Filesystem.JL_O_SHORT_LIVED
Skipping Base.Filesystem.JL_O_TEMPORARY
Skipping Base.BinaryPlatforms.compiler_abi
Test Summary: | Pass  Total
Aqua          |    7      7
 10.775370 seconds (2.96 M allocations: 175.123 MiB, 0.57% gc time, 19.82% compilation time)
[ Info: foo 1
[ Info: foo 2
[ Info: foo 3
[ Info: foo 4
[ Info: foo 5
[ Info: foo 6
[ Info: foo 7
Test Summary:         | Pass  Total
ThreadingUtilities.jl |    2      2
[ Info: Finished running the ThreadingUtilities test suite
     Testing ThreadingUtilities tests passed 
DilumAluthge commented 3 years ago

This is where it gets really interesting. From my PR that triggered these jobs, looking specifically at the commit that those jobs are from:

https://github.com/chriselrod/ThreadingUtilities.jl/blob/d08385f1eedf5dfa4a9631390cc5badd5b37eb9d/test/runtests.jl#L78-L85

DilumAluthge commented 3 years ago

It never got to line 83 (the @info "foo 1") line.

So this implies that the timeout is actually happening inside the Aqua.test_all(ThreadingUtilities).

DilumAluthge commented 3 years ago

It might not be Aqua. https://github.com/chriselrod/ThreadingUtilities.jl/runs/1726363081?check_suite_focus=true

chriselrod commented 3 years ago

I suspect the timeout is in __init__() then.

DilumAluthge commented 3 years ago

Me too. I'm particularly suspicious of that pause() inside a while loop.

chriselrod commented 3 years ago

pause only takes a few nanoseconds.

DilumAluthge commented 3 years ago

Yeah but what if the condition in the while loop never evaluates to false?

chriselrod commented 3 years ago
julia> using VectorizationBase

julia> @code_native syntax=:intel pause()
        .text
; ┌ @ VectorizationBase.jl:317 within `pause'
        pause
        ret
        nop     word ptr cs:[rax + rax]
; └

pause is an assembly instruction that hints to the CPU that you're inside a spin loop, waiting for another thread to do something.

Yes, I suspect that the condition never evaluates to false.

DilumAluthge commented 3 years ago

In this job, we never successfully import the package, which smells like an init problem: https://github.com/chriselrod/ThreadingUtilities.jl/runs/1726922690

DilumAluthge commented 3 years ago

But in a different job, we did successfully import the package. But it still timed out: https://github.com/chriselrod/ThreadingUtilities.jl/runs/1726922889

chriselrod commented 3 years ago

What is supposed to happen:

        m = THREADPOOL[tid]
        GC.@preserve m _atomic_min!(pointer(m), SPIN) # set to SPIN

We set it to SPIN. We then wake up the thread:

        t = Task(m); t.sticky = true # create and pin
        # set to tid, we have tasks 2...nthread, from 1-based ind perspective
        ccall(:jl_set_task_tid, Cvoid, (Any, Cint), t, tid % Cint)
        TASKS[tid] = t
        wake_thread!(tid) # task should immediately sleep

When awake, it runs:

function (tt::ThreadTask)()
    p = pointer(tt)
    memory = tt.memory
    max_wait = 1 << 20
    wait_counter = max_wait
    GC.@preserve memory begin
        while true
            if _atomic_cas_cmp!(p, TASK, LOCK)
                _call(p)
                @assert _atomic_cas_cmp!(p, LOCK, SPIN)
                wait_counter = 0
                continue
            end
            pause()
            if (wait_counter += 1) > max_wait
                wait_counter = 0
                if _atomic_cas_cmp!(p, SPIN, WAIT)
                    wait()
                    _call(p)
                    @assert _atomic_cas_cmp!(p, LOCK, SPIN)
                end
            end
        end
    end
end

It should be set to SPIN (because we just set it to SPIN), and then immediately the wait_counter should be greater than max_wait, so that if _atomic_cas_cmp!(p, SPIN, WAIT) is true, and it gets set to WAIT. Then of course

        while !_atomic_cas_cmp!(pointer(m), WAIT, WAIT)
            pause()
        end

_atomic_cas_cmp!(pointer(m), WAIT, WAIT) is true, and thus ! the condition is false and the loop is done.

DilumAluthge commented 3 years ago

Check out the logs for https://github.com/chriselrod/ThreadingUtilities.jl/runs/1726933020

This job ran on commit https://github.com/chriselrod/ThreadingUtilities.jl/pull/12/commits/b9054e071415abb19848ddf51ae10c322d382cd7

chriselrod commented 3 years ago

Could you add @show _atomic_max!(p, SPIN)?

DilumAluthge commented 3 years ago

Like this? https://github.com/DilumAluthge/ThreadingUtilities.jl/commit/6fb0ecb42756af66477a3d770c74915b7a2da1f8

chriselrod commented 3 years ago

Yes. It'd also help to know which loop it is in at the time, so maybe @show @__LINE__, _atomic_max!(p, SPIN)

DilumAluthge commented 3 years ago

I'm about to leave the house- want to push those changes to my branch? You should be able to.

chriselrod commented 3 years ago

Okay, thanks for all your help so far. I can already see here:

 [ Info: threadtask beginning to pause...
ERROR: [ Info: threadtask finished pausing
InitError: KeyError: key 0x9b3e6abc04ee57a9 not found

That the current state is uninitialized junk, rather than one of the enum values.

I just tried moving the setting-to-0 inside of the closure, maybe that helps.

But the junk shouldn't really have been a problem, as the setting inside __init__ should've reached the thread eventually?

DilumAluthge commented 3 years ago

On master: https://github.com/chriselrod/ThreadingUtilities.jl/runs/1727659327

chriselrod commented 3 years ago

Everything should be fixed as of the "add utils.jl" commit. I'll reopen this if we see more intermittent timeouts.