JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

Many errors running test suite on GTX 960 4GB #1650

Closed freemin7 closed 1 year ago

freemin7 commented 1 year ago
                                                  |          | ---------------- GPU ---------------- | ---------------- CPU ---------------- |
Test                                     (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) | GC (s) | GC % | Alloc (MB) | RSS (MB) |
initialization                                (2) |    10.11 |   0.00 |  0.0 |       0.00 |    37.50 |   0.13 |  1.3 |     491.47 |   851.27 |
gpuarrays/indexing scalar                     (3) |    50.90 |   0.04 |  0.1 |       0.01 |    42.25 |   2.09 |  4.1 |    4721.27 |   853.88 |
gpuarrays/math/power                          (2) |   118.16 |   0.00 |  0.0 |       0.01 |   143.25 |   7.36 |  6.2 |   12934.24 |  1058.42 |
gpuarrays/linalg/mul!/vector-matrix           (3) |   134.19 |   0.02 |  0.0 |       0.02 |   144.25 |   6.90 |  5.1 |   14463.31 |  1183.36 |
gpuarrays/interface                           (3) |    11.10 |   0.00 |  0.0 |       0.00 |    42.25 |   0.61 |  5.5 |     926.76 |  1183.36 |
gpuarrays/indexing multidimensional           (2) |    83.89 |   0.00 |  0.0 |       1.21 |    44.25 |   3.69 |  4.4 |    8060.88 |  1058.52 |
gpuarrays/linalg                              (5) |         failed at 2022-10-25T02:53:06.724
gpuarrays/reductions/reducedim!               (4) |         failed at 2022-10-25T02:53:27.328
gpuarrays/reductions/any all count            (3) |         failed at 2022-10-25T02:53:34.996
gpuarrays/uniformscaling                      (6) |    51.86 |   0.06 |  0.1 |       0.01 |    42.25 |   2.10 |  4.0 |    3060.76 |   857.83 |
gpuarrays/math/intrinsics                     (8) |    39.83 |   0.03 |  0.1 |       0.00 |    42.25 |   1.24 |  3.1 |    2343.86 |   857.90 |
gpuarrays/statistics                          (8) |         failed at 2022-10-25T02:59:20.498
gpuarrays/linalg/mul!/matrix-matrix           (7) |   460.49 |   0.10 |  0.0 |       0.12 |   145.25 |  18.95 |  4.1 |   26165.38 |  1342.38 |
gpuarrays/reductions/minimum maximum extrema  (2) |         failed at 2022-10-25T03:01:32.813
gpuarrays/constructors                        (7) |   137.33 |   0.06 |  0.0 |       0.08 |    42.25 |   5.55 |  4.0 |    7158.22 |  1394.94 |
gpuarrays/random                             (10) |   132.28 |   0.06 |  0.0 |       0.03 |    42.25 |   5.70 |  4.3 |    6900.06 |   858.31 |
gpuarrays/base                                (7) |   137.98 |   0.00 |  0.0 |       8.90 |    42.25 |   8.84 |  6.4 |   10045.55 |  1468.73 |
gpuarrays/linalg/norm                         (6) |         failed at 2022-10-25T03:07:07.603
gpuarrays/reductions/== isequal              (10) |         failed at 2022-10-25T03:09:09.918
gpuarrays/reductions/mapreduce                (9) |         failed at 2022-10-25T03:11:28.957
gpuarrays/reductions/mapreducedim!           (11) |         failed at 2022-10-25T03:17:23.615
apiutils                                     (14) |     0.53 |   0.00 |  0.0 |       0.00 |    37.50 |   0.00 |  0.0 |       1.87 |   858.24 |
gpuarrays/reductions/reduce                  (12) |         failed at 2022-10-25T03:18:45.505
broadcast                                    (15) |   112.74 |   0.06 |  0.1 |       0.00 |    42.25 |   5.93 |  5.3 |    5892.05 |   858.19 |
codegen                                      (15) |         failed at 2022-10-25T03:21:37.278
gpuarrays/broadcasting                        (7) |  1031.21 |   0.11 |  0.0 |       2.00 |    44.25 |  52.45 |  5.1 |   56720.33 |  2252.94 |
cudadrv                                       (7) |    29.95 |   0.00 |  0.0 |       0.00 |    44.25 |   1.24 |  4.1 |    1400.78 |  2335.63 |
array                                        (14) |         failed at 2022-10-25T03:24:48.050
curand                                       (17) |     1.58 |   0.00 |  0.0 |       0.00 |    43.50 |   0.02 |  1.4 |      42.12 |   858.71 |
cufft                                         (7) |    93.64 |   0.06 |  0.1 |     233.38 |   190.62 |   4.18 |  4.5 |    4256.35 |  2613.39 |
cublas                                       (16) |   416.87 |   0.40 |  0.1 |      14.39 |   160.25 |  21.61 |  5.2 |   23505.78 |  1235.55 |
examples                                      (7) |         failed at 2022-10-25T03:29:40.830
cusparse                                     (17) |   259.18 |   0.25 |  0.1 |      10.53 |    84.25 |   9.69 |  3.7 |   11291.05 |   858.71 |
exceptions                                   (18) |   216.72 |   0.00 |  0.0 |       0.00 |    37.50 |   0.03 |  0.0 |      63.54 |   859.18 |
nvml                                         (18) |     1.17 |   0.00 |  0.0 |       0.00 |    37.50 |   0.04 |  3.1 |      29.50 |   859.18 |
iterator                                     (19) |    10.98 |   0.06 |  0.5 |       1.93 |    38.50 |   0.40 |  3.6 |     683.21 |   859.59 |
nvtx                                         (18) |     1.50 |   0.00 |  0.0 |       0.00 |    37.50 |   0.10 |  6.6 |     148.05 |   859.18 |
pointer                                      (19) |     1.05 |   0.00 |  0.0 |       0.00 |    38.50 |   0.00 |  0.0 |      14.22 |   859.59 |
pool                                         (18) |     5.75 |   0.00 |  0.0 |       0.00 |    37.50 |   0.60 | 10.4 |     275.75 |   859.18 |
linalg                                       (17) |         failed at 2022-10-25T03:31:47.043
execution                                    (16) |         failed at 2022-10-25T03:32:48.519
random                                       (19) |         failed at 2022-10-25T03:32:51.551
utils                                        (22) |     3.64 |   0.00 |  0.0 |       0.00 |    37.50 |   0.06 |  1.7 |     124.85 |   859.75 |

Version info

Details on Julia:

julia> versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E3-1230 v3 @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, haswell)

julia> CUDA.versioninfo()
CUDA toolkit 11.7, artifact installation
NVIDIA driver 515.76.0, for CUDA 11.7
CUDA driver 11.7

Libraries: 

- CUBLAS: 11.10.1
- CURAND: 10.2.10
- CUFFT: 10.7.2
- CUSOLVER: 11.3.5
- CUSPARSE: 11.7.3
- CUPTI: 17.0.0
- NVML: 11.0.0+515.76
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.7.3
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

1 device:
  0: NVIDIA GeForce GTX 960 (sm_52, 3.562 GiB / 4.000 GiB available)

STDOUT of test run available here: gpu.log

maleadt commented 1 year ago

The massive failure log isn't too relevant: CUDA gets into a broken state pretty early on, which causes every subsequent operation to fail. Try running single-threaded, or under compute-sanitizer, to narrow the issue down.

freemin7 commented 1 year ago

I am doing a single-threaded run. I have no idea what compute-sanitizer is.

maleadt commented 1 year ago

See the CUDA.jl and NVIDIA documentation. You can run the tests under compute sanitizer by passing --sanitize to the test suite.

freemin7 commented 1 year ago

I am running:

/home/joto/.julia/artifacts/913584335ab836f9781a0325178d0949c193f50b/bin/compute-sanitizer --tool memcheck --launch-timeout=0 --target-processes=all --report-api-errors=no julia -e "using Pkg; using CUDA; Pkg.test(\"CUDA\")" >> gpu_log_sanatize.log

If you expect me to run something else, tell me what you want me to do. The instructions are unclear if you haven't heard of "compute-sanitizer" before.

maleadt commented 1 year ago

There's a dedicated section on the use of compute-sanitizer in the CUDA.jl docs: https://cuda.juliagpu.org/stable/development/debugging/#compute-sanitizer

maleadt commented 1 year ago

You can run the tests under compute sanitizer by passing --sanitize to the test suite.

help?> Pkg.test
  Pkg.test(; kwargs...)

  Keyword arguments:

    •  julia_args::Union{Cmd, Vector{String}}: options to be passed the test process.

So you run Pkg.test(; julia_args=`--sanitize`)

EDIT: sorry, meant to show test_args, not julia_args.

maleadt commented 1 year ago

Also, it's better to pass --quickfail to the test process so that it doesn't attempt to continue, since CUDA is broken anyway.

freemin7 commented 1 year ago

Yes, and I read that section. I also read the Buildkite recipe. That section left me unsure whether I was doing the right thing. Telling people to RTFM when they have already tried to RTFM, and have described their attempt in order to verify their assumptions, is pretty discouraging.

Thank you for your actionable answers, although I am not sure whether "test process" refers to the Pkg.test call or to the process I run Pkg in.

freemin7 commented 1 year ago

Running

julia -e "using Pkg; using CUDA; Pkg.test( \"CUDA\" ; julia_args=[\"--sanitize\", \"--quickfail\"] );" >> sanitize_GPU.log

fails with:

     Testing Running tests...
ERROR: unknown option `--sanitize`
ERROR:  Package CUDA errored during testing
Stacktrace:
 [1] pkgerror(msg::String)
   @ Pkg.Types ~/.julia/juliaup/julia-1.7.3+0.x64/share/julia/stdlib/v1.7/Pkg/src/Types.jl:68
 [2] test(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}; coverage::Bool, julia_args::Cmd, test_args::Cmd, test_fn::Nothing, force_latest_compatible_version::Bool, allow_earlier_backwards_compatible_versions::Bool, allow_reresolve::Bool)
   @ Pkg.Operations ~/.julia/juliaup/julia-1.7.3+0.x64/share/julia/stdlib/v1.7/Pkg/src/Operations.jl:1672
 [3] test(ctx::Pkg.Types.Context, pkgs::Vector{Pkg.Types.PackageSpec}; coverage::Bool, test_fn::Nothing, julia_args::Vector{String}, test_args::Cmd, force_latest_compatible_version::Bool, allow_earlier_backwards_compatible_versions::Bool, allow_reresolve::Bool, kwargs::Base.Pairs{Symbol, Base.TTY, Tuple{Symbol}, NamedTuple{(:io,), Tuple{Base.TTY}}})
   @ Pkg.API ~/.julia/juliaup/julia-1.7.3+0.x64/share/julia/stdlib/v1.7/Pkg/src/API.jl:421
 [4] test(pkgs::Vector{Pkg.Types.PackageSpec}; io::Base.TTY, kwargs::Base.Pairs{Symbol, Vector{String}, Tuple{Symbol}, NamedTuple{(:julia_args,), Tuple{Vector{String}}}})
   @ Pkg.API ~/.julia/juliaup/julia-1.7.3+0.x64/share/julia/stdlib/v1.7/Pkg/src/API.jl:149
 [5] #test#87
   @ ~/.julia/juliaup/julia-1.7.3+0.x64/share/julia/stdlib/v1.7/Pkg/src/API.jl:142 [inlined]
 [6] #test#86
   @ ~/.julia/juliaup/julia-1.7.3+0.x64/share/julia/stdlib/v1.7/Pkg/src/API.jl:141 [inlined]
 [7] top-level scope
   @ none:1

This doesn't work either:

julia -e "using Pkg; using CUDA; Pkg.test( julia_args=[\"--sanitize\", \"--quickfail\"] );" >> sanitize_GPU.log

Neither did:

julia --quickfail -e "using Pkg; using CUDA; Pkg.test( \"CUDA\" ; julia_args=[\"--sanitize\"] );" >> sanitize_GPU.log

I am trying not to be complicated, but I have never run tests beyond ]test or Pkg.test("Name"), so I appreciate your patience and understand the frustration.

maleadt commented 1 year ago

My pointing out the section in the documentation wasn't in bad faith; I was sincerely trying to be helpful. Just ask for details if something isn't clear. I'm always happy to help, as you surely know already. Please consider my point of view too: spending time explaining things you already know is time wasted as well. Thanks.


Those options (--quickfail, --sanitize, etc.) are arguments to the test suite, i.e., to runtests.jl: https://github.com/JuliaGPU/CUDA.jl/blob/4f9a80696d12ff50622a1ed060e7577baa1f5c82/test/runtests.jl#L38-L54 You can pass them by setting the test_args keyword argument of Pkg.test, as the snippet above shows. For example, try julia --project -e 'using Pkg; Pkg.test(; test_args=`--help`)' in a CUDA.jl check-out (or without --project, but passing "CUDA" to Pkg.test, if you don't have the package dev'ed).
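
To make the test_args versus julia_args distinction concrete, here is a minimal shell sketch of the corrected invocation. It assumes Julia is on the PATH and CUDA is installed in the active environment. The cuda_test_expr helper and the RUN_TESTS guard are hypothetical names introduced only for illustration; they are not part of CUDA.jl or Pkg. Flags for runtests.jl go into test_args, whereas julia_args ends up on the julia command line itself, which is why --sanitize was rejected there as an unknown option.

```shell
# Build the Julia expression that forwards flags to CUDA.jl's test/runtests.jl.
# cuda_test_expr is a hypothetical helper, not part of CUDA.jl or Pkg.
cuda_test_expr() {
  list=""
  for flag in "$@"; do
    # Quote each flag as a Julia string literal and join with ", ".
    list="${list:+$list, }\"$flag\""
  done
  # String[...] keeps the element type right even when no flags are given.
  printf 'using Pkg; Pkg.test("CUDA"; test_args=String[%s])' "$list"
}

julia_expr=$(cuda_test_expr --sanitize --quickfail)
echo "julia -e '$julia_expr'"

# Only launch the (long-running) suite when explicitly requested.
if [ "${RUN_TESTS:-0}" = 1 ]; then
  julia -e "$julia_expr" >> sanitize_GPU.log 2>&1
fi
```

Setting RUN_TESTS=1 before running the script would actually start the suite; by default it only prints the command so it can be checked first.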

maleadt commented 1 year ago

I have a GTX 970 in some PC I could try and test with, to see if I can reproduce anything. Won't be right away though, I'm pretty busy at the moment.

freemin7 commented 1 year ago

A single-threaded run on master under compute-sanitizer doesn't seem to run into as many errors (none so far). Next, I will do a multi-threaded run on master. After that, I will try the new(?) 4.0 release, which is the release I had problems with.

maleadt commented 1 year ago

How many threads are you running by default? Does a regular (i.e. not using --sanitize) single-threaded run produce errors? It's possible that an OOM somehow breaks one of CUDA's libraries; that's something we've seen in the past... (but typically resulting in INTERNAL_ERROR exceptions, not a sticky launch failure)

freemin7 commented 1 year ago

In my original run I used the auto option on a 4-core machine. Running those experiments takes time. I will do the suggested single-threaded test run on master once the sanitized single-threaded run completes.

freemin7 commented 1 year ago

Sanitized, quickfail, one thread: sanitized_fail_gpu.log
Quickfail, one thread: no_sanitized_gpu.log

maleadt commented 1 year ago

Hmm, the failure under compute-sanitizer isn't very helpful. Could you verify if the sorting test fails in isolation? You can pass the test name to the test process, so ;test_args = `sorting`

freemin7 commented 1 year ago
joto@PRIMERGY-TX140-S2:~/CUDA.jl$ julia --project -e 'using Pkg; Pkg.test(;test_args=["sorting"])'
The latest version of Julia in the `release` channel is 1.8.2+0.x64. You currently have `1.7.3+0.x64` installed. Run:

  juliaup update

to install Julia 1.8.2+0.x64 and update the `release` channel to that version.
┌ Warning: The active manifest file is an older format with no julia version entry. Dependencies may have been resolved with a different julia version.
└ @ ~/CUDA.jl/Manifest.toml:0
     Testing CUDA
      Status `/tmp/jl_t96jg3/Project.toml`
  [79e6a3ab] Adapt v3.4.0
  [ab4f0b2a] BFloat16s v0.4.2
  [052768ef] CUDA v4.0.0 `~/CUDA.jl`
  [864edb3b] DataStructures v0.18.13
  [7a1cc6ca] FFTW v1.5.0
  [0c68f7d7] GPUArrays v8.5.0
  [a98d9a8b] Interpolations v0.14.6
  [872c559c] NNlib v0.8.10
  [276daf66] SpecialFunctions v2.1.7
  [a759f4b9] TimerOutputs v0.5.21
  [76a88914] CUDA_Runtime_jll v0.2.3+1
  [ade2ca70] Dates `@stdlib/Dates`
  [8ba89e20] Distributed `@stdlib/Distributed`
  [37e2e46d] LinearAlgebra `@stdlib/LinearAlgebra`
  [de0858da] Printf `@stdlib/Printf`
  [3fa0cd96] REPL `@stdlib/REPL`
  [9a3f8284] Random `@stdlib/Random`
  [2f01184e] SparseArrays `@stdlib/SparseArrays`
  [10745b16] Statistics `@stdlib/Statistics`
  [8dfed614] Test `@stdlib/Test`
      Status `/tmp/jl_t96jg3/Manifest.toml`
  [621f4979] AbstractFFTs v1.2.1
  [79e6a3ab] Adapt v3.4.0
  [13072b0f] AxisAlgorithms v1.0.1
  [ab4f0b2a] BFloat16s v0.4.2
  [fa961155] CEnum v0.4.2
  [052768ef] CUDA v4.0.0 `~/CUDA.jl`
  [1af6417a] CUDA_Runtime_Discovery v0.1.0
  [d360d2e6] ChainRulesCore v1.15.6
  [9e997f8a] ChangesOfVariables v0.1.4
  [34da2185] Compat v4.3.0
  [864edb3b] DataStructures v0.18.13
  [ffbed154] DocStringExtensions v0.9.2
  [e2ba6199] ExprTools v0.1.8
  [7a1cc6ca] FFTW v1.5.0
  [0c68f7d7] GPUArrays v8.5.0
  [46192b85] GPUArraysCore v0.1.2
  [61eb1bfa] GPUCompiler v0.16.4
  [a98d9a8b] Interpolations v0.14.6
  [3587e190] InverseFunctions v0.1.8
  [92d709cd] IrrationalConstants v0.1.1
  [692b3bcd] JLLWrappers v1.4.1
  [929cbde3] LLVM v4.14.0
  [2ab3a3ac] LogExpFunctions v0.3.18
  [872c559c] NNlib v0.8.10
  [6fe1bfb0] OffsetArrays v1.12.8
  [bac558e1] OrderedCollections v1.4.1
  [21216c6a] Preferences v1.3.0
  [74087812] Random123 v1.6.0
  [e6cf234a] RandomNumbers v1.5.3
  [c84ed2f1] Ratios v0.4.3
  [189a3867] Reexport v1.2.2
  [ae029012] Requires v1.3.0
  [276daf66] SpecialFunctions v2.1.7
  [90137ffa] StaticArrays v1.5.9
  [1e83bf80] StaticArraysCore v1.4.0
  [a759f4b9] TimerOutputs v0.5.21
  [efce3f68] WoodburyMatrices v0.5.5
  [4ee394cb] CUDA_Driver_jll v0.2.0+0
  [76a88914] CUDA_Runtime_jll v0.2.3+1
  [f5851436] FFTW_jll v3.3.10+0
  [1d5cc7b8] IntelOpenMP_jll v2018.0.3+2
  [dad2f222] LLVMExtra_jll v0.0.16+0
  [856f044c] MKL_jll v2022.2.0+0
  [efe28fd5] OpenSpecFun_jll v0.5.5+0
  [0dad84c5] ArgTools `@stdlib/ArgTools`
  [56f22d72] Artifacts `@stdlib/Artifacts`
  [2a0f44e3] Base64 `@stdlib/Base64`
  [ade2ca70] Dates `@stdlib/Dates`
  [8ba89e20] Distributed `@stdlib/Distributed`
  [f43a241f] Downloads `@stdlib/Downloads`
  [7b1f6079] FileWatching `@stdlib/FileWatching`
  [b77e0a4c] InteractiveUtils `@stdlib/InteractiveUtils`
  [4af54fe1] LazyArtifacts `@stdlib/LazyArtifacts`
  [b27032c2] LibCURL `@stdlib/LibCURL`
  [76f85450] LibGit2 `@stdlib/LibGit2`
  [8f399da3] Libdl `@stdlib/Libdl`
  [37e2e46d] LinearAlgebra `@stdlib/LinearAlgebra`
  [56ddb016] Logging `@stdlib/Logging`
  [d6f4376e] Markdown `@stdlib/Markdown`
  [a63ad114] Mmap `@stdlib/Mmap`
  [ca575930] NetworkOptions `@stdlib/NetworkOptions`
  [44cfe95a] Pkg `@stdlib/Pkg`
  [de0858da] Printf `@stdlib/Printf`
  [3fa0cd96] REPL `@stdlib/REPL`
  [9a3f8284] Random `@stdlib/Random`
  [ea8e919c] SHA `@stdlib/SHA`
  [9e88b42a] Serialization `@stdlib/Serialization`
  [1a1011a3] SharedArrays `@stdlib/SharedArrays`
  [6462fe0b] Sockets `@stdlib/Sockets`
  [2f01184e] SparseArrays `@stdlib/SparseArrays`
  [10745b16] Statistics `@stdlib/Statistics`
  [fa267f1f] TOML `@stdlib/TOML`
  [a4e569a6] Tar `@stdlib/Tar`
  [8dfed614] Test `@stdlib/Test`
  [cf7118a7] UUIDs `@stdlib/UUIDs`
  [4ec0a83e] Unicode `@stdlib/Unicode`
  [e66e0078] CompilerSupportLibraries_jll `@stdlib/CompilerSupportLibraries_jll`
  [deac9b47] LibCURL_jll `@stdlib/LibCURL_jll`
  [29816b5a] LibSSH2_jll `@stdlib/LibSSH2_jll`
  [c8ffd9c3] MbedTLS_jll `@stdlib/MbedTLS_jll`
  [14a3606d] MozillaCACerts_jll `@stdlib/MozillaCACerts_jll`
  [4536629a] OpenBLAS_jll `@stdlib/OpenBLAS_jll`
  [05823500] OpenLibm_jll `@stdlib/OpenLibm_jll`
  [83775a58] Zlib_jll `@stdlib/Zlib_jll`
  [8e850b90] libblastrampoline_jll `@stdlib/libblastrampoline_jll`
  [8e850ede] nghttp2_jll `@stdlib/nghttp2_jll`
  [3f19e933] p7zip_jll `@stdlib/p7zip_jll`
     Testing Running tests...
┌ Warning: You are running the CUDA.jl test suite with only a single thread; this will take a long time.
│ Consider launching Julia with `--threads auto` to run tests in parallel.
└ @ Main ~/CUDA.jl/test/runtests.jl:62
┌ Info: System information:
│ CUDA runtime 11.8, artifact installation
│ CUDA driver 11.7
│ NVIDIA driver 515.76.0
│ 
│ Libraries: 
│ - CUBLAS: 11.11.3
│ - CURAND: 10.3.0
│ - CUFFT: 10.9.0
│ - CUSOLVER: 11.4.1
│ - CUSPARSE: 11.7.5
│ - CUPTI: 18.0.0
│ - NVML: 11.0.0+515.76
│ 
│ Toolchain:
│ - Julia: 1.7.3
│ - LLVM: 12.0.1
│ - PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
│ - Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80
│ 
│ 1 device:
└   0: NVIDIA GeForce GTX 960 (sm_52, 3.594 GiB / 4.000 GiB available)
[ Info: Testing using 1 device(s): 0. NVIDIA GeForce GTX 960 (UUID bd1a7845-aa1a-2050-764b-4dd1848d0de4)
               |          | ---------------- GPU ---------------- | ---------------- CPU ---------------- |
Test  (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) | GC (s) | GC % | Alloc (MB) | RSS (MB) |
sorting    (2) |   229.77 |   0.04 |  0.0 |     543.84 |   370.50 |   9.19 |  4.0 |   31018.11 |  3974.09 |
Testing finished in 3 minutes, 55 seconds, 87 milliseconds

Test Summary: | Pass  Total
  Overall     |  272    272
    SUCCESS
     Testing CUDA tests passed

maleadt commented 1 year ago

Well, that's annoying. I'm not sure how to proceed here, as the error is so vague (ProcessExited(2) doesn't make much sense; I wonder if Pkg or Distributed are swallowing errors here). Is there nothing in dmesg?

You could also try kicking off a regular, single-threaded run, looking up the PID of the test process in top or similar, attaching GDB using sudo gdb --pid=$PID, and having it continue by entering c and pressing Enter. Hopefully, when that process crashes, you get to see some additional info in the GDB terminal, where you can then also run bt to get a backtrace to the error.
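
The attach step could be scripted roughly as follows. This is only a sketch: attach_gdb_cmd is a hypothetical helper name, the PID in the example is made up, and the real PID still has to be looked up by hand. It relies on GDB's -ex option to issue continue right after attaching, instead of typing c manually.

```shell
# Print the GDB command that attaches to a PID and immediately continues it.
# attach_gdb_cmd is a hypothetical helper name used only for this sketch.
attach_gdb_cmd() {
  pid="$1"
  case "$pid" in
    # Reject empty or non-numeric input.
    ''|*[!0-9]*) echo "usage: attach_gdb_cmd PID" >&2; return 1 ;;
  esac
  printf 'sudo gdb --pid=%s -ex continue\n' "$pid"
}

# Example with a made-up PID; substitute the one shown by top.
attach_gdb_cmd 12345
```

Once the process stops on a crash, typing bt at the (gdb) prompt prints the backtrace.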

freemin7 commented 1 year ago

I finally caught what crashes the worker. I am literally running out of 16GB of memory when testing on master.

Screenshot from 2022-10-26 21-04-46
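
For anyone wanting to catch this kind of memory exhaustion without watching a system monitor, a rough Linux-only sketch is to sample the resident set size of the suspect process from /proc. The helper names rss_kib and watch_peak_rss are hypothetical, and the one-second sampling interval is an arbitrary choice.

```shell
# Print the resident set size (in KiB) of a process, read from /proc (Linux only).
# rss_kib is a hypothetical helper name.
rss_kib() {
  awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# Sample a PID once per second until it exits, then report the peak value seen.
watch_peak_rss() {
  pid="$1"
  peak=0
  while [ -r "/proc/$pid/status" ]; do
    cur=$(rss_kib "$pid" 2>/dev/null)
    if [ "${cur:-0}" -gt "$peak" ]; then
      peak="$cur"
    fi
    sleep 1
  done
  echo "peak RSS of $pid: $peak KiB"
}

# Quick check: RSS of the current shell.
rss_kib $$
```

Running watch_peak_rss with the PID of a test worker in a second terminal would print the highest value observed once that worker exits or is killed.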

Is there a way to get a list of all test modules so I can test them one by one?

maleadt commented 1 year ago

Yes, by passing --list to the test suite:

❯ julia --project -e 'using Pkg; Pkg.test(;test_args=`--list`)'
Available tests:
 - apiutils
 - array
 - broadcast
 - codegen
 - cublas
 - cudadrv
 - ...

Note that you can also see the CPU memory usage (allocated bytes, RSS) in the output.
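
With those names, running each test in its own Julia process could look like the sketch below, so that one crash or out-of-memory kill does not take the remaining tests down with it. It assumes it is run from a CUDA.jl check-out. The run_one_cmd helper, the DRY_RUN switch, and the log file naming are hypothetical choices for illustration, and the list only contains the names visible in the truncated output above.

```shell
# Test names taken from the (truncated) --list output; extend as needed.
# Names containing spaces would need a different list representation.
tests="apiutils array broadcast codegen cublas cudadrv"

# Print the command that runs a single test; run_one_cmd is a hypothetical helper.
run_one_cmd() {
  printf "julia --project -e 'using Pkg; Pkg.test(; test_args=[\"%s\"])'" "$1"
}

DRY_RUN="${DRY_RUN:-1}"   # default: only print the commands
for t in $tests; do
  cmd=$(run_one_cmd "$t")
  if [ "$DRY_RUN" = 1 ]; then
    echo "$cmd"
  else
    # Each test gets a fresh Julia process and its own log file.
    log="test_$(printf '%s' "$t" | tr '/' '_').log"
    sh -c "$cmd" > "$log" 2>&1 || echo "FAILED: $t (see $log)"
  fi
done
```

Setting DRY_RUN=0 would execute the commands one after another; the per-test logs then show which test drives memory usage up.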

maleadt commented 1 year ago

I just tested on a GTX 970 with 32GiB of system RAM, and all tests passed.

freemin7 commented 1 year ago

Okay. I think we can close the issue for now.

I am confident that with the current state of master I will not have any problems. Should I manage to get a reproducible error with the release I initially had problems with, I will reopen the issue.