JuliaGPU / Metal.jl

Metal programming in Julia
MIT License
355 stars 38 forks source link

More sporadic 1.11 hangs #412

Closed christiangnrd closed 1 month ago

christiangnrd commented 2 months ago

Now that 1.11 compilations are fixed, we're back to sporadic hanging of CI.

See example 1, example 2, example 3.

Bonus local output ``` (Metal) pkg> test Testing Metal Status `/private/var/folders/4g/lnkpkf3s4rxd_wbl8vwnqs4r0000gn/T/jl_aiC8B7/Project.toml` [79e6a3ab] Adapt v4.0.4 [ab4f0b2a] BFloat16s v0.5.0 [6e4b80f9] BenchmarkTools v1.5.0 [0c68f7d7] GPUArrays v10.3.1 [63c18a36] KernelAbstractions v0.9.25 [929cbde3] LLVM v9.0.0 [dde4c033] Metal v1.3.0 `~/.julia/dev/Metal` [e86c9b32] ObjectiveC v3.0.0 [276daf66] SpecialFunctions v2.4.0 [90137ffa] StaticArrays v1.9.7 [10745b16] Statistics v1.11.1 [ade2ca70] Dates v1.11.0 [8ba89e20] Distributed v1.11.0 [b77e0a4c] InteractiveUtils v1.11.0 [37e2e46d] LinearAlgebra v1.11.0 [de0858da] Printf v1.11.0 [3fa0cd96] REPL v1.11.0 [9a3f8284] Random v1.11.0 [2f01184e] SparseArrays v1.11.0 [8dfed614] Test v1.11.0 Status `/private/var/folders/4g/lnkpkf3s4rxd_wbl8vwnqs4r0000gn/T/jl_aiC8B7/Manifest.toml` [79e6a3ab] Adapt v4.0.4 [a9b6321e] Atomix v0.1.0 [ab4f0b2a] BFloat16s v0.5.0 [6e4b80f9] BenchmarkTools v1.5.0 [fa961155] CEnum v0.5.0 [523fee87] CodecBzip2 v0.8.4 [ffbed154] DocStringExtensions v0.9.3 [e2ba6199] ExprTools v0.1.10 [0c68f7d7] GPUArrays v10.3.1 [46192b85] GPUArraysCore v0.1.6 [61eb1bfa] GPUCompiler v0.27.4 [92d709cd] IrrationalConstants v0.2.2 [692b3bcd] JLLWrappers v1.5.0 [682c06a0] JSON v0.21.4 [63c18a36] KernelAbstractions v0.9.25 [929cbde3] LLVM v9.0.0 [2ab3a3ac] LogExpFunctions v0.3.28 [1914dd2f] MacroTools v0.5.13 [dde4c033] Metal v1.3.0 `~/.julia/dev/Metal` [d8793406] ObjectFile v0.4.2 [e86c9b32] ObjectiveC v3.0.0 [69de0a69] Parsers v2.8.1 [aea7be01] PrecompileTools v1.2.1 [21216c6a] Preferences v1.4.3 [189a3867] Reexport v1.2.2 [ae029012] Requires v1.3.0 [6c6a2e73] Scratch v1.2.1 [276daf66] SpecialFunctions v2.4.0 [90137ffa] StaticArrays v1.9.7 [1e83bf80] StaticArraysCore v1.4.3 [10745b16] Statistics v1.11.1 [53d494c1] StructIO v0.3.1 [a759f4b9] TimerOutputs v0.5.24 [3bb67fe8] TranscodingStreams v0.11.2 [013be700] UnsafeAtomics v0.2.1 [d80eeb9a] UnsafeAtomicsLLVM v0.2.1 [6e34b625] Bzip2_jll v1.0.8+1 [2e619515] Expat_jll v2.6.2+0 [f52de702] LLVMDowngrader_jll v0.3.0+1 [dad2f222] LLVMExtra_jll v0.0.33+0 [7106de7a] LibMPDec_jll v2.5.1+0 ⌅ [e9f186c6] Libffi_jll v3.2.2+1 [458c3c95] OpenSSL_jll v3.0.14+0 [efe28fd5] OpenSpecFun_jll v0.5.5+0 [93d3a430] Python_jll v3.10.14+0 [76ed43ae] SQLite_jll v3.45.3+0 [ffd25f8a] XZ_jll v5.4.6+0 [0dad84c5] ArgTools v1.1.2 [56f22d72] Artifacts v1.11.0 [2a0f44e3] Base64 v1.11.0 [ade2ca70] Dates v1.11.0 [8ba89e20] Distributed v1.11.0 [f43a241f] Downloads v1.6.0 [7b1f6079] FileWatching v1.11.0 [b77e0a4c] InteractiveUtils v1.11.0 [4af54fe1] LazyArtifacts v1.11.0 [b27032c2] LibCURL v0.6.4 [76f85450] LibGit2 v1.11.0 [8f399da3] Libdl v1.11.0 [37e2e46d] LinearAlgebra v1.11.0 [56ddb016] Logging v1.11.0 [d6f4376e] Markdown v1.11.0 [a63ad114] Mmap v1.11.0 [ca575930] NetworkOptions v1.2.0 [44cfe95a] Pkg v1.11.0 [de0858da] Printf v1.11.0 [9abbd945] Profile v1.11.0 [3fa0cd96] REPL v1.11.0 [9a3f8284] Random v1.11.0 [ea8e919c] SHA v0.7.0 [9e88b42a] Serialization v1.11.0 [6462fe0b] Sockets v1.11.0 [2f01184e] SparseArrays v1.11.0 [f489334b] StyledStrings v1.11.0 [fa267f1f] TOML v1.0.3 [a4e569a6] Tar v1.10.0 [8dfed614] Test v1.11.0 [cf7118a7] UUIDs v1.11.0 [4ec0a83e] Unicode v1.11.0 [e66e0078] CompilerSupportLibraries_jll v1.1.1+0 [deac9b47] LibCURL_jll v8.6.0+0 [e37daf67] LibGit2_jll v1.7.2+0 [29816b5a] LibSSH2_jll v1.11.0+1 [c8ffd9c3] MbedTLS_jll v2.28.6+0 [14a3606d] MozillaCACerts_jll v2023.12.12 [4536629a] OpenBLAS_jll v0.3.27+1 [05823500] OpenLibm_jll v0.8.1+2 [bea87d4a] SuiteSparse_jll v7.7.0+0 [83775a58] Zlib_jll v1.2.13+1 [8e850b90] libblastrampoline_jll v5.11.0+0 [8e850ede] nghttp2_jll v1.59.0+0 [3f19e933] p7zip_jll v17.4.0+2 Info Packages marked with ⌅ have new versions available but compatibility constraints restrict them from upgrading. Testing Running tests... ┌ Info: System information: │ macOS 14.6.1, Darwin 23.6.0 │ │ Toolchain: │ - Julia: 1.11.0-rc3 │ - LLVM: 16.0.6 │ │ Julia packages: │ - Metal.jl: 1.3.0 │ - LLVMDowngrader_jll: 0.3.0+1 │ │ 1 device: └ - Apple M2 Max (64.000 KiB allocated) [ Info: Running 8 tests in parallel. If this is too many, specify the `--jobs` argument to the tests, or set the JULIA_CPU_THREADS environment variable. | | ---------------- CPU ---------------- | Test (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) | metallib (8) | 0.62 | 0.01 | 2.0 | 214.88 | 547.77 | pool (9) | 1.14 | 0.03 | 2.5 | 325.48 | 595.09 | From worker 8: Starting recording with the Blank template and GPU, Time Profiler, Metal Application, Metal GPU Counters, Metal Resource Events, os_signpost Instruments. Attaching to: julia (23790). From worker 8: Ctrl-C to stop the recording From worker 8: Stopping recording... metal (7) | 4.59 | 0.17 | 3.8 | 930.59 | 813.38 | From worker 7: ┌ Warning: Skipping script tests From worker 7: └ @ Main ~/.julia/dev/Metal/test/scripts.jl:9 scripts (7) | 0.55 | 0.00 | 0.0 | 78.42 | 822.00 | From worker 10: 2024-08-29 10:43:58.916 julia[23792:284856] Metal GPU Frame Capture Enabled From worker 8: Recording completed. Saving output file... From worker 8: Output file saved as: julia_1.trace From worker 8: [ Info: System trace saved to /private/var/folders/4g/lnkpkf3s4rxd_wbl8vwnqs4r0000gn/T/jl_pNVkk4/julia_1.trace; open the resulting trace in Instruments profiling (8) | 9.36 | 0.01 | 0.1 | 114.46 | 570.70 | From worker 10: [ Info: GPU frame capture saved to /private/var/folders/4g/lnkpkf3s4rxd_wbl8vwnqs4r0000gn/T/jl_UfApxE/julia_1.gputrace; open the resulting trace in Xcode From worker 10: [ Info: GPU frame capture saved to /private/var/folders/4g/lnkpkf3s4rxd_wbl8vwnqs4r0000gn/T/jl_UfApxE/julia_2.gputrace; open the resulting trace in Xcode capturing (10) | 9.87 | 0.18 | 1.8 | 1080.20 | 719.06 | execution (5) | 18.02 | 0.28 | 1.6 | 2198.88 | 766.20 | mps/matrix (11) | 0.58 | 0.00 | 0.0 | 82.76 | 556.66 | mps/size (11) | 0.04 | 0.00 | 0.0 | 1.41 | 564.39 | examples (4) | 23.70 | 0.57 | 2.4 | 2662.96 | 2038.67 | mps/vector (11) | 0.13 | 0.00 | 0.0 | 12.66 | 565.38 | random (9) | 24.50 | 0.40 | 1.6 | 3500.97 | 978.94 | gpuarrays/indexing scalar (4) | 6.65 | 0.08 | 1.2 | 1091.20 | 2089.61 | kernelabstractions (6) | 31.11 | 0.65 | 2.1 | 4947.82 | 1138.53 | mps/linalg (5) | 16.84 | 0.35 | 2.1 | 3096.05 | 976.09 | mps/copy (8) | 33.03 | 0.75 | 2.3 | 6544.86 | 989.70 | array (2) | 45.54 | 0.71 | 1.6 | 6176.39 | 1094.27 | gpuarrays/interface (2) | 0.94 | 0.01 | 1.4 | 236.47 | 1142.72 | gpuarrays/indexing find (5) | 15.72 | 0.42 | 2.7 | 5267.93 | 1181.33 | gpuarrays/math/power (4) | 20.75 | 0.48 | 2.3 | 4750.50 | 2305.39 | gpuarrays/reductions/any all count (2) | 8.32 | 0.12 | 1.5 | 1691.97 | 1204.00 | gpuarrays/uniformscaling (4) | 4.67 | 0.02 | 0.5 | 502.92 | 2437.33 | gpuarrays/linalg/mul!/vector-matrix (6) | 28.20 | 0.41 | 1.5 | 5208.05 | 1341.78 | gpuarrays/math/intrinsics (6) | 0.91 | 0.00 | 0.0 | 107.54 | 1347.39 | gpuarrays/indexing multidimensional (8) | 25.80 | 0.45 | 1.7 | 5203.15 | 1279.70 | gpuarrays/reductions/mapreducedim!_large (2) | 27.58 | 0.79 | 2.9 | 6800.30 | 1778.45 | gpuarrays/linalg/norm (6) | 28.24 | 0.54 | 1.9 | 6401.18 | 1584.08 | gpuarrays/reductions/reducedim! (11) | 67.84 | 1.30 | 1.9 | 12473.56 | 1249.77 | gpuarrays/vectors (11) | 0.18 | 0.00 | 0.0 | 27.23 | 1252.45 | gpuarrays/linalg/mul!/matrix-matrix (4) | 38.05 | 0.45 | 1.2 | 5872.34 | 2530.66 | gpuarrays/statistics (8) | 26.77 | 0.40 | 1.5 | 4574.18 | 1491.39 | gpuarrays/linalg (9) | 73.68 | 1.55 | 2.1 | 14299.43 | 1578.47 | gpuarrays/random (11) | 8.23 | 0.08 | 1.0 | 1146.98 | 1366.86 | gpuarrays/constructors (6) | 16.62 | 0.15 | 0.9 | 1804.54 | 1701.91 | gpuarrays/base (4) | 17.28 | 0.44 | 2.6 | 3418.14 | 2573.62 | gpuarrays/reductions/== isequal (8) | 25.87 | 0.53 | 2.1 | 5815.18 | 1610.95 | gpuarrays/reductions/mapreducedim! (11) | 26.83 | 0.43 | 1.6 | 5249.56 | 1554.47 | gpuarrays/reductions/minimum maximum extrema (5) | 105.34 | 2.37 | 2.2 | 22907.86 | 1966.19 | gpuarrays/reductions/reduce (6) | 56.45 | 1.32 | 2.3 | 13019.02 | 2042.95 | gpuarrays/reductions/mapreduce (2) | 81.31 | 1.81 | 2.2 | 17933.78 | 2046.83 | gpuarrays/reductions/sum prod (4) | 77.72 | 1.54 | 2.0 | 16463.69 | 3235.77 | gpuarrays/broadcasting (9) | 108.09 | 1.93 | 1.8 | 19730.51 | 2196.94 | ┌ Warning: Test timed out: device/intrinsics └ @ Main ~/.julia/dev/Metal/test/runtests.jl:247 From worker 7: From worker 7: ====================================================================================== From worker 7: Information request received. A stacktrace will print followed by a 1.0 second profile From worker 7: ====================================================================================== From worker 7: From worker 7: signal (29): Information request: 29 From worker 7: __psynch_cvwait at /usr/lib/system/libsystem_kernel.dylib (unknown line) From worker 7: unknown function (ip: 0x0) From worker 7: __psynch_cvwait at /usr/lib/system/libsystem_kernel.dylib (unknown line) From worker 7: unknown function (ip: 0x0) From worker 7: From worker 7: ============================================================== From worker 7: Profile collected. A report will print at the next yield point From worker 7: ============================================================== From worker 7: ┌ Warning: Couldn't kill worker 7, killing process 23789 forcefully └ @ Main ~/.julia/dev/Metal/test/runtests.jl:256 From worker 7: From worker 7: [23789] signal 15: Terminated: 15 From worker 7: in expression starting at /Users/christian/.julia/dev/Metal/test/device/intrinsics.jl:4 From worker 7: __psynch_cvwait at /usr/lib/system/libsystem_kernel.dylib (unknown line) From worker 7: unknown function (ip: 0x0) From worker 7: __psynch_cvwait at /usr/lib/system/libsystem_kernel.dylib (unknown line) From worker 7: unknown function (ip: 0x0) From worker 7: Allocations: 35316542 (Pool: 35315514; Big: 1028); GC: 24 device/intrinsics (7) | failed at 2024-08-29T10:52:12.607 Testing finished in 8 minutes, 23 seconds, 33 milliseconds device/intrinsics: Error During Test at none:1 Got exception outside of a @test ProcessExitedException(7) Test Summary: | Pass Error Broken Total Time Overall | 9617 1 9 9627 metallib | 25 25 pool | 5 5 metal | 131 131 scripts | 0 profiling | 1 1 capturing | 24 24 execution | 35 35 mps/matrix | 75 75 mps/size | 9 9 examples | 4 4 mps/vector | 34 34 random | 817 1 818 gpuarrays/indexing scalar | 399 399 kernelabstractions | 2179 8 2187 mps/linalg | 45 45 mps/copy | 144 144 array | 374 374 gpuarrays/interface | 7 7 gpuarrays/indexing find | 45 45 gpuarrays/math/power | 60 60 gpuarrays/reductions/any all count | 101 101 gpuarrays/uniformscaling | 56 56 gpuarrays/linalg/mul!/vector-matrix | 140 140 gpuarrays/math/intrinsics | 10 10 gpuarrays/indexing multidimensional | 89 89 gpuarrays/reductions/mapreducedim!_large | 40 40 gpuarrays/linalg/norm | 264 264 gpuarrays/reductions/reducedim! | 160 160 gpuarrays/vectors | 10 10 gpuarrays/linalg/mul!/matrix-matrix | 360 360 gpuarrays/statistics | 52 52 gpuarrays/linalg | 397 397 gpuarrays/random | 52 52 gpuarrays/constructors | 832 832 gpuarrays/base | 93 93 gpuarrays/reductions/== isequal | 248 248 gpuarrays/reductions/mapreducedim! | 260 260 gpuarrays/reductions/minimum maximum extrema | 555 555 gpuarrays/reductions/reduce | 220 220 gpuarrays/reductions/mapreduce | 330 330 gpuarrays/reductions/sum prod | 636 636 gpuarrays/broadcasting | 299 299 device/intrinsics | 1 1 FAILURE Error in testset device/intrinsics: Error During Test at none:1 Got exception outside of a @test ProcessExitedException(7) ERROR: LoadError: Test run finished with errors in expression starting at /Users/christian/.julia/dev/Metal/test/runtests.jl:410 ERROR: Package Metal errored during testing ```

Maybe related to #329?

maleadt commented 2 months ago

https://buildkite.com/julialang/metal-dot-jl/builds/1187#01919a97-f080-4e37-8f60-7422426ea8ef is curious, as the metal test suite doesn't launch any kernels. So maybe this isn't due to bad codegen?

christiangnrd commented 1 month ago

I believe the offending commit is https://github.com/JuliaLang/julia/commit/5f36833b71e8bcce6ab9c89e8e98246638e465a1 from https://github.com/JuliaLang/julia/pull/50144.

I was able to reliably reproduce the hang by running

for _ in 1:100
    include("test/metal.jl")
end

in two different terminals simultaneously. (Both 1.11)

It always hangs at the second wait_completed https://github.com/JuliaGPU/Metal.jl/blob/04b5481f58f393538ed5efac1408c5ee98c616b9/test/metal.jl#L317 of the testset.

I don't know if the offending Julia commit is the cause of the bug or if it just uncovered it, but hopefully this is a good starting point for someone with a better understanding of this part of the codebase to investigate this further.

christiangnrd commented 1 month ago

This might also help? Collected a sample of the hung process from activity monitor.

hang.txt

maleadt commented 1 month ago

MWE:

using Metal, .MTL

dev = first(devices())
cmdq = MTLCommandQueue(dev)

cmdbuf = MTLCommandBuffer(cmdq)
scheduled = Ref(false)
completed = Ref(false)
on_scheduled(cmdbuf) do buf
    GC.gc(true)
    scheduled[] = true
end
on_completed(cmdbuf) do buf
    GC.gc(true)
    completed[] = true
end
@assert scheduled[] == false
@assert completed[] == false
@assert cmdbuf.status == MTL.MTLCommandBufferStatusNotEnqueued
enqueue!(cmdbuf)
@assert cmdbuf.status == MTL.MTLCommandBufferStatusEnqueued
commit!(cmdbuf)
print("start... ")
wait_completed(cmdbuf)
println("stop")

Hangs on 1.10 and 1.12 on the first iteration.