Closed freemin7 closed 1 year ago
The massive failure log isn't too relevant: CUDA just gets in a broken state pretty early on, which causes every subsequent operation to fail. Try running single-threadedly, or under compute-sanitizer, to narrow the issue down.
Doing a single threaded run. No idea what a compute sanitizer is.
See the CUDA.jl and NVIDIA documentation. You can run the tests under compute sanitizer by passing --sanitize to the test suite.
I am running /home/joto/.julia/artifacts/913584335ab836f9781a0325178d0949c193f50b/bin/compute-sanitizer --tool memcheck --launch-timeout=0 --target-processes=all --report-api-errors=no julia -e "using Pkg; using CUDA; Pkg.test(\"CUDA\")" >> gpu_log_sanatize.log
If you expect me to run something else, tell me what you want me to do. The instructions are unclear if you haven't heard of "compute-sanitizer" before.
There's a dedicated section on the use of compute-sanitizer in the CUDA.jl docs: https://cuda.juliagpu.org/stable/development/debugging/#compute-sanitizer
You can run the tests under compute sanitizer by passing --sanitize to the test suite.
help?> Pkg.test
Pkg.test(; kwargs...)
Keyword arguments:
• julia_args::Union{Cmd, Vector{String}}: options to be passed the test process.
So you run Pkg.test(; julia_args=`--sanitize`)
EDIT: sorry, meant to show test_args, not julia_args.
Also, better to pass --quickfail to the test process so that it doesn't attempt to continue, since CUDA is broken anyway.
Yes, and I read that section. I also read the Buildkite recipe. That section left me unsure whether I was doing the right thing. Saying RTFM to people who tried to RTFM and described their attempt in order to verify their assumptions is pretty discouraging.
Thank you for your actionable answers, although I am not sure whether "test process" refers to the Pkg.test call or to the process I run Pkg in.
julia -e "using Pkg; using CUDA; Pkg.test( \"CUDA\" ; julia_args=[\"--sanitize\", \"--quickfail\"] );" >> sanitize_GPU.log
Fails with:
Testing Running tests...
ERROR: unknown option `--sanitize`
ERROR: Package CUDA errored during testing
Stacktrace:
[1] pkgerror(msg::String)
@ Pkg.Types ~/.julia/juliaup/julia-1.7.3+0.x64/share/julia/stdlib/v1.7/Pkg/src/Types.jl:68
[2] test(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}; coverage::Bool, julia_args::Cmd, test_args::Cmd, test_fn::Nothing, force_latest_compatible_version::Bool, allow_earlier_backwards_compatible_versions::Bool, allow_reresolve::Bool)
@ Pkg.Operations ~/.julia/juliaup/julia-1.7.3+0.x64/share/julia/stdlib/v1.7/Pkg/src/Operations.jl:1672
[3] test(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}; coverage::Bool, test_fn::Nothing, julia_args::Vector{String}, test_args::Cmd, force_latest_compatible_version::Bool, allow_earlier_backwards_compatible_versions::Bool, allow_reresolve::Bool, kwargs::Base.Pairs{Symbol, Base.TTY, Tuple{Symbol}, NamedTuple{(:io,), Tuple{Base.TTY}}})
@ Pkg.API ~/.julia/juliaup/julia-1.7.3+0.x64/share/julia/stdlib/v1.7/Pkg/src/API.jl:421
[4] test(pkgs::Vector{Pkg.Types.PackageSpec}; io::Base.TTY, kwargs::Base.Pairs{Symbol, Vector{String}, Tuple{Symbol}, NamedTuple{(:julia_args,), Tuple{Vector{String}}}})
@ Pkg.API ~/.julia/juliaup/julia-1.7.3+0.x64/share/julia/stdlib/v1.7/Pkg/src/API.jl:149
[5] #test#87
@ ~/.julia/juliaup/julia-1.7.3+0.x64/share/julia/stdlib/v1.7/Pkg/src/API.jl:142 [inlined]
[6] #test#86
@ ~/.julia/juliaup/julia-1.7.3+0.x64/share/julia/stdlib/v1.7/Pkg/src/API.jl:141 [inlined]
[7] top-level scope
@ none:1
julia -e "using Pkg; using CUDA; Pkg.test( julia_args=[\"--sanitize\", \"--quickfail\"] );" >> sanitize_GPU.log
doesn't work either.
Neither did julia --quickfail -e "using Pkg; using CUDA; Pkg.test( \"CUDA\" ; julia_args=[\"--sanitize\"] );" >> sanitize_GPU.log.
I am trying not to be complicated, but I never ran tests beyond ]test or Pkg.test("Name"), so I appreciate your patience and understand the frustration.
Me pointing out the section in the documentation wasn't in bad faith, but instead sincerely trying to be helpful. Just ask for details if something isn't clear, I'm always happy to help as you surely know already; and consider my POV too where spending time explaining things you know already is just as well time wasted. Thanks.
Those options, --quickfail and --sanitize etc., are arguments to the test suite, i.e., to runtests.jl: https://github.com/JuliaGPU/CUDA.jl/blob/4f9a80696d12ff50622a1ed060e7577baa1f5c82/test/runtests.jl#L38-L54
You can pass those by setting the test_args argument to Pkg.test, as the snippet above shows. For example, try julia --project -e 'using Pkg; Pkg.test(; test_args=`--help`)' in a CUDA.jl check-out (or without --project but passing CUDA to Pkg.test if you don't have the package dev'ed).
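The difference between the two keywords can be sketched as follows. This is only an illustration, assuming the CUDA package is installed in the active environment; suite_args and run_cuda_tests are hypothetical helper names, not part of Pkg or CUDA.jl, and only flags mentioned in this thread are used.

```julia
using Pkg

# Hypothetical helper: build the arguments that are forwarded to
# CUDA.jl's test/runtests.jl.
function suite_args(; sanitize::Bool=false, quickfail::Bool=false,
                    tests::Vector{String}=String[])
    args = String[]
    sanitize && push!(args, "--sanitize")
    quickfail && push!(args, "--quickfail")
    append!(args, tests)  # optional test names, e.g. "sorting"
    return args
end

# Hypothetical wrapper. `test_args` is forwarded to runtests.jl, whereas
# `julia_args` is handed to the julia executable itself, which is why
# julia_args=["--sanitize"] ends in "unknown option `--sanitize`".
function run_cuda_tests(; kwargs...)
    Pkg.test("CUDA"; test_args=suite_args(; kwargs...))
end

# Not executed here, since a full run takes a long time:
# run_cuda_tests(sanitize=true, quickfail=true)
```

Calling run_cuda_tests(sanitize=true, quickfail=true) would then be equivalent to Pkg.test("CUDA"; test_args=["--sanitize", "--quickfail"]).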
I have a GTX 970 in some PC I could try and test with, to see if I can reproduce anything. Won't be right away though, I'm pretty busy at the moment.
A single-threaded run on master under compute-sanitizer doesn't seem to run into that many errors (none so far). Next I will do a multi-threaded run on master. After that I will try the new(?) 4.0 release, the release I had problems with.
How many threads are you running by default? Does a regular (i.e. not using --sanitize) single-threaded run produce errors? It's possible that an OOM somehow breaks one of CUDA's libraries; that's something we've seen in the past... (but typically resulting in INTERNAL_ERROR exceptions, not a sticky launch failure)
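To answer questions like these, the thread count and host memory of the launching Julia process can be printed before starting a run. The following is a rough sketch using only Base functions; host_summary is a hypothetical helper name, and it reports on the process it runs in, not on any worker processes the test suite may start.

```julia
# Hypothetical helper: collect the host settings relevant to the questions above.
function host_summary()
    gib = 2.0^30
    return (
        julia_threads = Threads.nthreads(),        # set via `julia --threads N`
        cpu_threads   = Sys.CPU_THREADS,           # logical CPUs seen by Julia
        total_ram_gib = Sys.total_memory() / gib,  # installed system RAM
        free_ram_gib  = Sys.free_memory() / gib,   # RAM currently free
    )
end

info = host_summary()
println("Julia threads:   ", info.julia_threads)
println("CPU threads:     ", info.cpu_threads)
println("Total RAM (GiB): ", round(info.total_ram_gib; digits=2))
println("Free RAM (GiB):  ", round(info.free_ram_gib; digits=2))
```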
In my original run I used the auto option on a 4-core machine. Running those experiments takes time. Once the sanitized single-threaded run completes, I will run the suggested single-threaded test run on master.
sanitized, quickfail, one thread: sanitized_fail_gpu.log
quickfail, one thread: no_sanitized_gpu.log
Hmm, the failure under compute-sanitizer isn't very helpful. Could you verify whether the sorting test fails in isolation? You can pass the test name to the test process, so ;test_args = `sorting`
joto@PRIMERGY-TX140-S2:~/CUDA.jl$ julia --project -e 'using Pkg; Pkg.test(;test_args=["sorting"])'
The latest version of Julia in the `release` channel is 1.8.2+0.x64. You currently have `1.7.3+0.x64` installed. Run:
juliaup update
to install Julia 1.8.2+0.x64 and update the `release` channel to that version.
┌ Warning: The active manifest file is an older format with no julia version entry. Dependencies may have been resolved with a different julia version.
└ @ ~/CUDA.jl/Manifest.toml:0
Testing CUDA
Status `/tmp/jl_t96jg3/Project.toml`
[79e6a3ab] Adapt v3.4.0
[ab4f0b2a] BFloat16s v0.4.2
[052768ef] CUDA v4.0.0 `~/CUDA.jl`
[864edb3b] DataStructures v0.18.13
[7a1cc6ca] FFTW v1.5.0
[0c68f7d7] GPUArrays v8.5.0
[a98d9a8b] Interpolations v0.14.6
[872c559c] NNlib v0.8.10
[276daf66] SpecialFunctions v2.1.7
[a759f4b9] TimerOutputs v0.5.21
[76a88914] CUDA_Runtime_jll v0.2.3+1
[ade2ca70] Dates `@stdlib/Dates`
[8ba89e20] Distributed `@stdlib/Distributed`
[37e2e46d] LinearAlgebra `@stdlib/LinearAlgebra`
[de0858da] Printf `@stdlib/Printf`
[3fa0cd96] REPL `@stdlib/REPL`
[9a3f8284] Random `@stdlib/Random`
[2f01184e] SparseArrays `@stdlib/SparseArrays`
[10745b16] Statistics `@stdlib/Statistics`
[8dfed614] Test `@stdlib/Test`
Status `/tmp/jl_t96jg3/Manifest.toml`
[621f4979] AbstractFFTs v1.2.1
[79e6a3ab] Adapt v3.4.0
[13072b0f] AxisAlgorithms v1.0.1
[ab4f0b2a] BFloat16s v0.4.2
[fa961155] CEnum v0.4.2
[052768ef] CUDA v4.0.0 `~/CUDA.jl`
[1af6417a] CUDA_Runtime_Discovery v0.1.0
[d360d2e6] ChainRulesCore v1.15.6
[9e997f8a] ChangesOfVariables v0.1.4
[34da2185] Compat v4.3.0
[864edb3b] DataStructures v0.18.13
[ffbed154] DocStringExtensions v0.9.2
[e2ba6199] ExprTools v0.1.8
[7a1cc6ca] FFTW v1.5.0
[0c68f7d7] GPUArrays v8.5.0
[46192b85] GPUArraysCore v0.1.2
[61eb1bfa] GPUCompiler v0.16.4
[a98d9a8b] Interpolations v0.14.6
[3587e190] InverseFunctions v0.1.8
[92d709cd] IrrationalConstants v0.1.1
[692b3bcd] JLLWrappers v1.4.1
[929cbde3] LLVM v4.14.0
[2ab3a3ac] LogExpFunctions v0.3.18
[872c559c] NNlib v0.8.10
[6fe1bfb0] OffsetArrays v1.12.8
[bac558e1] OrderedCollections v1.4.1
[21216c6a] Preferences v1.3.0
[74087812] Random123 v1.6.0
[e6cf234a] RandomNumbers v1.5.3
[c84ed2f1] Ratios v0.4.3
[189a3867] Reexport v1.2.2
[ae029012] Requires v1.3.0
[276daf66] SpecialFunctions v2.1.7
[90137ffa] StaticArrays v1.5.9
[1e83bf80] StaticArraysCore v1.4.0
[a759f4b9] TimerOutputs v0.5.21
[efce3f68] WoodburyMatrices v0.5.5
[4ee394cb] CUDA_Driver_jll v0.2.0+0
[76a88914] CUDA_Runtime_jll v0.2.3+1
[f5851436] FFTW_jll v3.3.10+0
[1d5cc7b8] IntelOpenMP_jll v2018.0.3+2
[dad2f222] LLVMExtra_jll v0.0.16+0
[856f044c] MKL_jll v2022.2.0+0
[efe28fd5] OpenSpecFun_jll v0.5.5+0
[0dad84c5] ArgTools `@stdlib/ArgTools`
[56f22d72] Artifacts `@stdlib/Artifacts`
[2a0f44e3] Base64 `@stdlib/Base64`
[ade2ca70] Dates `@stdlib/Dates`
[8ba89e20] Distributed `@stdlib/Distributed`
[f43a241f] Downloads `@stdlib/Downloads`
[7b1f6079] FileWatching `@stdlib/FileWatching`
[b77e0a4c] InteractiveUtils `@stdlib/InteractiveUtils`
[4af54fe1] LazyArtifacts `@stdlib/LazyArtifacts`
[b27032c2] LibCURL `@stdlib/LibCURL`
[76f85450] LibGit2 `@stdlib/LibGit2`
[8f399da3] Libdl `@stdlib/Libdl`
[37e2e46d] LinearAlgebra `@stdlib/LinearAlgebra`
[56ddb016] Logging `@stdlib/Logging`
[d6f4376e] Markdown `@stdlib/Markdown`
[a63ad114] Mmap `@stdlib/Mmap`
[ca575930] NetworkOptions `@stdlib/NetworkOptions`
[44cfe95a] Pkg `@stdlib/Pkg`
[de0858da] Printf `@stdlib/Printf`
[3fa0cd96] REPL `@stdlib/REPL`
[9a3f8284] Random `@stdlib/Random`
[ea8e919c] SHA `@stdlib/SHA`
[9e88b42a] Serialization `@stdlib/Serialization`
[1a1011a3] SharedArrays `@stdlib/SharedArrays`
[6462fe0b] Sockets `@stdlib/Sockets`
[2f01184e] SparseArrays `@stdlib/SparseArrays`
[10745b16] Statistics `@stdlib/Statistics`
[fa267f1f] TOML `@stdlib/TOML`
[a4e569a6] Tar `@stdlib/Tar`
[8dfed614] Test `@stdlib/Test`
[cf7118a7] UUIDs `@stdlib/UUIDs`
[4ec0a83e] Unicode `@stdlib/Unicode`
[e66e0078] CompilerSupportLibraries_jll `@stdlib/CompilerSupportLibraries_jll`
[deac9b47] LibCURL_jll `@stdlib/LibCURL_jll`
[29816b5a] LibSSH2_jll `@stdlib/LibSSH2_jll`
[c8ffd9c3] MbedTLS_jll `@stdlib/MbedTLS_jll`
[14a3606d] MozillaCACerts_jll `@stdlib/MozillaCACerts_jll`
[4536629a] OpenBLAS_jll `@stdlib/OpenBLAS_jll`
[05823500] OpenLibm_jll `@stdlib/OpenLibm_jll`
[83775a58] Zlib_jll `@stdlib/Zlib_jll`
[8e850b90] libblastrampoline_jll `@stdlib/libblastrampoline_jll`
[8e850ede] nghttp2_jll `@stdlib/nghttp2_jll`
[3f19e933] p7zip_jll `@stdlib/p7zip_jll`
Testing Running tests...
┌ Warning: You are running the CUDA.jl test suite with only a single thread; this will take a long time.
│ Consider launching Julia with `--threads auto` to run tests in parallel.
└ @ Main ~/CUDA.jl/test/runtests.jl:62
┌ Info: System information:
│ CUDA runtime 11.8, artifact installation
│ CUDA driver 11.7
│ NVIDIA driver 515.76.0
│
│ Libraries:
│ - CUBLAS: 11.11.3
│ - CURAND: 10.3.0
│ - CUFFT: 10.9.0
│ - CUSOLVER: 11.4.1
│ - CUSPARSE: 11.7.5
│ - CUPTI: 18.0.0
│ - NVML: 11.0.0+515.76
│
│ Toolchain:
│ - Julia: 1.7.3
│ - LLVM: 12.0.1
│ - PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
│ - Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80
│
│ 1 device:
└ 0: NVIDIA GeForce GTX 960 (sm_52, 3.594 GiB / 4.000 GiB available)
[ Info: Testing using 1 device(s): 0. NVIDIA GeForce GTX 960 (UUID bd1a7845-aa1a-2050-764b-4dd1848d0de4)
                  |          | ---------------- GPU ---------------- | ---------------- CPU ---------------- |
Test     (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) | GC (s) | GC % | Alloc (MB) | RSS (MB) |
sorting       (2) |   229.77 |   0.04 |  0.0 |     543.84 |   370.50 |   9.19 |  4.0 |   31018.11 |  3974.09 |
Testing finished in 3 minutes, 55 seconds, 87 milliseconds
Test Summary: | Pass Total
Overall | 272 272
SUCCESS
Testing CUDA tests passed
Well, that's annoying. I'm not sure how to proceed here, as the error is so vague (ProcessExited(2) doesn't make much sense; I wonder if Pkg or Distributed are swallowing errors here). Nothing in dmesg?
You could also try kicking off a regular, single-threaded run; look for the PID of the test process in top or so; attach GDB using sudo gdb --pid=$PID and just have it continue by entering c and pressing Enter. Hopefully, when that process crashes, you get to see some additional info in the GDB terminal, where you can then also do bt to get a backtrace of the error.
I finally caught what crashes the worker. I am literally running out of 16GB of memory when testing on master.
Is there a way to get a list of all test modules so i can test them one by one?
Yes, by passing --list to the test suite:
❯ julia --project -e 'using Pkg; Pkg.test(;test_args=`--list`)'
Available tests:
- apiutils
- array
- broadcast
- codegen
- cublas
- cudadrv
- ...
Note that you can also see the CPU memory usage (allocated bytes, RSS) in the output.
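Running the listed tests one by one could be scripted roughly as below. This is a sketch, not something taken from CUDA.jl: run_each is a hypothetical helper, the default runner assumes the CUDA package is installed, and the test names have to be copied by hand from the --list output (which is truncated above). Each Pkg.test call starts a fresh test process, so host memory is released between tests.

```julia
using Pkg

# Hypothetical helper: run every named test in its own Pkg.test call and
# record whether it passed. `runner` can be swapped out, e.g. to try the
# loop on a machine without a GPU.
function run_each(names::Vector{String};
                  runner = name -> Pkg.test("CUDA"; test_args=[name]))
    results = Dict{String,Bool}()
    for name in names
        try
            runner(name)
            results[name] = true
        catch err
            # Pkg.test throws when the test process reports a failure.
            println(stderr, "FAILED: ", name, " (", typeof(err), ")")
            results[name] = false
        end
    end
    return results
end

# Not executed here; names copied from the --list output:
# run_each(["apiutils", "array", "broadcast"])
```

The returned dictionary maps each test name to true or false, which makes it easy to see which single test exhausts memory or crashes its worker.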
I just tested on a GTX 970 with 32GiB of system RAM, and all tests passed.
Okay. I think we can close the issue for now.
I am confident that with the current state of master I will not have any problems. Should I get a reproducible error with the current release, the one I initially had problems with, I will reopen the issue.
Version info
Details on Julia:
STDOUT of test run available here: gpu.log