JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

ERROR_COOPERATIVE_LAUNCH_TOO_LARGE during tests #247

Closed: tsela closed this issue 4 years ago

tsela commented 4 years ago

Describe the bug

When I run `] test CUDA` in an environment with CUDA 1.0.2 installed and nothing else, all tests but one succeed, but I also get a `Got exception outside of a @test` error. The full output of testing CUDA is in the collapsed section:

Click to expand ``` Testing CUDA Status `/tmp/jl_HfoXBt/Manifest.toml` [621f4979] AbstractFFTs v0.5.0 [79e6a3ab] Adapt v2.0.2 [b99e7846] BinaryProvider v0.5.10 [fa961155] CEnum v0.4.1 [052768ef] CUDA v1.0.2 [bbf7d656] CommonSubexpressions v0.2.0 [e66e0078] CompilerSupportLibraries_jll v0.3.3+0 [864edb3b] DataStructures v0.17.18 [163ba53b] DiffResults v1.0.2 [b552c78f] DiffRules v1.0.1 [e2ba6199] ExprTools v0.1.1 [7a1cc6ca] FFTW v1.2.2 [f5851436] FFTW_jll v3.3.9+5 [1a297f60] FillArrays v0.8.11 [f6369f11] ForwardDiff v0.10.10 [0c68f7d7] GPUArrays v4.0.0 [61eb1bfa] GPUCompiler v0.4.0 [1d5cc7b8] IntelOpenMP_jll v2018.0.3+0 [929cbde3] LLVM v1.6.0 [856f044c] MKL_jll v2020.1.216+0 [1914dd2f] MacroTools v0.5.5 [a6bfbf70] NNPACK_jll v2018.6.22+0 [872c559c] NNlib v0.7.0 [77ba4419] NaNMath v0.3.3 [efe28fd5] OpenSpecFun_jll v0.5.3+3 [bac558e1] OrderedCollections v1.2.0 [189a3867] Reexport v0.2.0 [ae029012] Requires v1.0.1 [276daf66] SpecialFunctions v0.10.3 [90137ffa] StaticArrays v0.12.3 [a759f4b9] TimerOutputs v0.5.6 [2a0f44e3] Base64 [ade2ca70] Dates [8ba89e20] Distributed [b77e0a4c] InteractiveUtils [76f85450] LibGit2 [8f399da3] Libdl [37e2e46d] LinearAlgebra [56ddb016] Logging [d6f4376e] Markdown [44cfe95a] Pkg [de0858da] Printf [3fa0cd96] REPL [9a3f8284] Random [ea8e919c] SHA [9e88b42a] Serialization [6462fe0b] Sockets [2f01184e] SparseArrays [10745b16] Statistics [8dfed614] Test [cf7118a7] UUIDs [4ec0a83e] Unicode [ Info: Testing using device GeForce GTX 1050 Ti (compute capability 6.1.0, 3.474 GiB available memory) on CUDA driver 10.2.0 and toolkit 10.2.89 [ Info: Skipping the following tests: cutensor, device/wmma Test (Worker) | Time (s) | GPU GC (s) | GPU GC % | GPU Alloc (MB) | CPU GC (s) | CPU GC % | CPU Alloc (MB) | RSS (MB) initialization (2) | 2.50 | 0.00 | 0.0 | 0.00 | 0.04 | 1.4 | 160.93 | 629.09 apiutils (2) | 0.16 | 0.00 | 0.0 | 0.00 | 0.00 | 0.0 | 5.70 | 629.09 array (2) | 47.52 | 0.30 | 0.6 | 5.20 | 1.77 | 3.7 | 6924.38 | 1040.95 broadcast (2) | 18.29 | 0.00 | 0.0 | 0.00 | 0.43 | 2.4 | 1646.32 | 1040.95 codegen (2) | 3.04 | 0.00 | 0.0 | 0.00 | 0.07 | 2.4 | 318.60 | 1102.64 cublas (2) | 45.37 | 0.03 | 0.1 | 11.11 | 1.56 | 3.4 | 6836.11 | 1316.70 cudnn (2) | 50.61 | 0.01 | 0.0 | 0.62 | 1.49 | 2.9 | 5939.98 | 2040.67 cufft (2) | 15.67 | 0.01 | 0.0 | 144.16 | 0.52 | 3.3 | 2047.53 | 2239.42 curand (2) | 3.61 | 0.00 | 0.0 | 0.01 | 0.11 | 2.9 | 441.04 | 2241.21 cusolver (2) | 37.94 | 0.05 | 0.1 | 1128.67 | 1.34 | 3.5 | 5399.58 | 2599.66 cusparse (2) | 38.48 | 0.01 | 0.0 | 10.73 | 0.95 | 2.5 | 4312.71 | 3036.90 examples (2) | 88.15 | 0.00 | 0.0 | 0.00 | 0.19 | 0.2 | 27.27 | 3036.90 execution (2) | failed at 2020-06-23T23:38:24.595 forwarddiff (3) | 58.78 | 0.30 | 0.5 | 0.00 | 1.00 | 1.7 | 4139.50 | 686.37 iterator (3) | 1.93 | 0.00 | 0.0 | 1.25 | 0.05 | 2.8 | 288.63 | 686.37 memory (3) | 1.13 | 0.00 | 0.0 | 0.00 | 0.19 | 16.8 | 106.56 | 686.37 nnlib (3) | 2.15 | 0.14 | 6.3 | 0.00 | 0.03 | 1.6 | 247.19 | 821.21 nvtx (3) | 0.75 | 0.00 | 0.0 | 0.00 | 0.00 | 0.0 | 73.31 | 821.70 pointer (3) | 0.10 | 0.00 | 0.0 | 0.00 | 0.01 | 11.6 | 6.74 | 823.27 statistics (3) | 10.40 | 0.00 | 0.0 | 0.00 | 0.33 | 3.2 | 1553.32 | 862.48 threading (3) | 0.16 | 0.00 | 0.0 | 0.00 | 0.00 | 0.0 | 22.29 | 862.52 utils (3) | 0.14 | 0.00 | 0.0 | 0.00 | 0.00 | 0.0 | 13.12 | 862.54 cuda/context (3) | 0.57 | 0.00 | 0.0 | 0.00 | 0.00 | 0.0 | 32.40 | 969.55 cuda/devices (3) | 0.23 | 0.00 | 0.0 | 0.00 | 0.00 | 0.0 | 32.27 | 969.55 cuda/errors (3) | 0.11 | 0.00 | 0.0 | 0.00 | 0.01 | 9.1 | 21.27 | 969.55 cuda/events (3) | 
0.11 | 0.00 | 0.0 | 0.00 | 0.00 | 0.0 | 14.63 | 969.55 cuda/execution (3) | 0.65 | 0.00 | 0.0 | 0.00 | 0.03 | 4.5 | 88.37 | 969.55 cuda/memory (3) | 1.08 | 0.00 | 0.0 | 0.00 | 0.03 | 3.0 | 172.74 | 969.55 cuda/module (3) | 0.23 | 0.00 | 0.0 | 0.00 | 0.01 | 4.7 | 31.99 | 969.55 cuda/occupancy (3) | 0.07 | 0.00 | 0.0 | 0.00 | 0.00 | 0.0 | 8.53 | 969.55 cuda/profile (3) | 0.23 | 0.00 | 0.0 | 0.00 | 0.01 | 4.7 | 51.25 | 969.55 cuda/stream (3) | 0.13 | 0.00 | 0.0 | 0.00 | 0.00 | 0.0 | 20.14 | 969.55 cuda/version (3) | 0.00 | 0.00 | 0.0 | 0.00 | 0.00 | 0.0 | 0.12 | 969.55 cusolver/cusparse (3) | 13.05 | 0.00 | 0.0 | 0.15 | 0.52 | 4.0 | 2152.45 | 1301.50 device/array (3) | 2.15 | 0.00 | 0.0 | 0.00 | 0.06 | 2.6 | 276.38 | 1303.89 device/cuda (3) | 67.80 | 0.00 | 0.0 | 0.01 | 1.15 | 1.7 | 4582.23 | 1468.15 device/pointer (3) | 4.11 | 0.00 | 0.0 | 0.00 | 0.09 | 2.2 | 477.65 | 1472.64 gpuarrays/indexing (3) | 12.05 | 0.00 | 0.0 | 0.12 | 0.39 | 3.2 | 1658.49 | 1523.34 gpuarrays/math (3) | 1.64 | 0.00 | 0.0 | 0.00 | 0.05 | 3.0 | 226.08 | 1527.56 gpuarrays/input output (3) | 0.88 | 0.00 | 0.0 | 0.00 | 0.02 | 1.9 | 114.57 | 1528.15 gpuarrays/value constructors (3) | 5.77 | 0.00 | 0.0 | 0.00 | 0.17 | 2.9 | 689.24 | 1543.17 gpuarrays/interface (3) | 1.63 | 0.00 | 0.0 | 0.00 | 0.03 | 2.1 | 189.29 | 1551.18 gpuarrays/iterator constructors (3) | 8.57 | 0.00 | 0.0 | 0.02 | 0.28 | 3.2 | 1215.82 | 1585.87 gpuarrays/uniformscaling (3) | 4.79 | 0.00 | 0.0 | 0.01 | 0.13 | 2.7 | 555.46 | 1597.46 gpuarrays/linear algebra (3) | 37.82 | 0.01 | 0.0 | 1.42 | 1.06 | 2.8 | 4612.43 | 1754.72 gpuarrays/conversions (3) | 1.96 | 0.00 | 0.0 | 0.01 | 0.07 | 3.4 | 363.61 | 1754.72 gpuarrays/fft (3) | 3.66 | 0.00 | 0.0 | 6.01 | 0.12 | 3.3 | 571.84 | 1869.81 gpuarrays/constructors (3) | 1.04 | 0.00 | 0.1 | 0.04 | 0.00 | 0.0 | 78.06 | 1871.21 gpuarrays/random (3) | 14.69 | 0.00 | 0.0 | 0.00 | 0.50 | 3.4 | 1431.27 | 1880.96 gpuarrays/base (3) | 10.36 | 0.00 | 0.0 | 17.44 | 0.42 | 4.1 | 1763.96 | 1982.53 gpuarrays/mapreduce essentials (3) | 80.34 | 0.01 | 0.0 | 3.19 | 2.32 | 2.9 | 9415.68 | 2194.48 gpuarrays/broadcasting (3) | 50.47 | 0.00 | 0.0 | 1.19 | 1.62 | 3.2 | 6103.84 | 2315.61 gpuarrays/mapreduce derivatives (3) | 142.32 | 0.01 | 0.0 | 3.06 | 3.37 | 2.4 | 11389.85 | 2643.18 gpuarrays/mapreduce (old tests) (3) | 58.47 | 0.01 | 0.0 | 130.18 | 1.40 | 2.4 | 5357.29 | 2895.67 Worker 2 failed running test execution: Some tests did not pass: 79 passed, 0 failed, 1 errored, 0 broken. 
execution: Error During Test at /home/christophe/.julia/packages/CUDA/42B9G/test/execution.jl:1028 Got exception outside of a @test CUDA error: too many blocks in cooperative launch (code 720, ERROR_COOPERATIVE_LAUNCH_TOO_LARGE) Stacktrace: [1] throw_api_error(::CUDA.cudaError_enum) at /home/christophe/.julia/packages/CUDA/42B9G/lib/cuda/error.jl:103 [2] macro expansion at /home/christophe/.julia/packages/CUDA/42B9G/lib/cuda/error.jl:110 [inlined] [3] cuLaunchCooperativeKernel(::CuFunction, ::UInt32, ::UInt32, ::UInt32, ::UInt32, ::UInt32, ::UInt32, ::Int64, ::CuStream, ::Array{Ptr{Nothing},1}) at /home/christophe/.julia/packages/CUDA/42B9G/lib/utils/call.jl:93 [4] (::CUDA.var"#566#567"{Bool,Int64,CuStream,CuFunction})(::Array{Ptr{Nothing},1}) at /home/christophe/.julia/packages/CUDA/42B9G/lib/cuda/execution.jl:62 [5] macro expansion at /home/christophe/.julia/packages/CUDA/42B9G/lib/cuda/execution.jl:33 [inlined] [6] pack_arguments(::CUDA.var"#566#567"{Bool,Int64,CuStream,CuFunction}, ::CuDeviceArray{Float32,2,CUDA.AS.Global}, ::CuDeviceArray{Float32,2,CUDA.AS.Global}, ::CuDeviceArray{Float32,2,CUDA.AS.Global}) at /home/christophe/.julia/packages/CUDA/42B9G/lib/cuda/execution.jl:10 [7] launch(::CuFunction, ::CuDeviceArray{Float32,2,CUDA.AS.Global}, ::Vararg{CuDeviceArray{Float32,2,CUDA.AS.Global},N} where N; blocks::Int64, threads::Int64, cooperative::Bool, shmem::Int64, stream::CuStream) at /home/christophe/.julia/packages/CUDA/42B9G/lib/cuda/execution.jl:60 [8] #571 at /home/christophe/.julia/packages/CUDA/42B9G/lib/cuda/execution.jl:136 [inlined] [9] macro expansion at /home/christophe/.julia/packages/CUDA/42B9G/lib/cuda/execution.jl:95 [inlined] [10] convert_arguments at /home/christophe/.julia/packages/CUDA/42B9G/lib/cuda/execution.jl:78 [inlined] [11] #cudacall#570 at /home/christophe/.julia/packages/CUDA/42B9G/lib/cuda/execution.jl:135 [inlined] [12] #cudacall#743 at /home/christophe/.julia/packages/CUDA/42B9G/src/compiler/execution.jl:217 [inlined] [13] macro expansion at /home/christophe/.julia/packages/CUDA/42B9G/src/compiler/execution.jl:198 [inlined] [14] call(::CUDA.HostKernel{var"#kernel_vadd#476"(),Tuple{CuDeviceArray{Float32,2,CUDA.AS.Global},CuDeviceArray{Float32,2,CUDA.AS.Global},CuDeviceArray{Float32,2,CUDA.AS.Global}}}, ::CuDeviceArray{Float32,2,CUDA.AS.Global}, ::CuDeviceArray{Float32,2,CUDA.AS.Global}, ::CuDeviceArray{Float32,2,CUDA.AS.Global}; call_kwargs::Base.Iterators.Pairs{Symbol,Integer,Tuple{Symbol,Symbol,Symbol},NamedTuple{(:cooperative, :threads, :blocks),Tuple{Bool,Int64,Int64}}}) at /home/christophe/.julia/packages/CUDA/42B9G/src/compiler/execution.jl:170 [15] (::CUDA.HostKernel{var"#kernel_vadd#476"(),Tuple{CuDeviceArray{Float32,2,CUDA.AS.Global},CuDeviceArray{Float32,2,CUDA.AS.Global},CuDeviceArray{Float32,2,CUDA.AS.Global}}})(::CuDeviceArray{Float32,2,CUDA.AS.Global}, ::Vararg{CuDeviceArray{Float32,2,CUDA.AS.Global},N} where N; kwargs::Base.Iterators.Pairs{Symbol,Integer,Tuple{Symbol,Symbol,Symbol},NamedTuple{(:cooperative, :threads, :blocks),Tuple{Bool,Int64,Int64}}}) at /home/christophe/.julia/packages/CUDA/42B9G/src/compiler/execution.jl:345 [16] top-level scope at /home/christophe/.julia/packages/CUDA/42B9G/src/compiler/execution.jl:109 [17] top-level scope at /home/christophe/.julia/packages/CUDA/42B9G/test/execution.jl:1044 [18] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Test/src/Test.jl:1113 [19] top-level scope at /home/christophe/.julia/packages/CUDA/42B9G/test/execution.jl:1029 [20] 
include(::String) at ./client.jl:439 [21] #11 at /home/christophe/.julia/packages/CUDA/42B9G/test/runtests.jl:73 [inlined] [22] macro expansion at /home/christophe/.julia/packages/CUDA/42B9G/test/setup.jl:43 [inlined] [23] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Test/src/Test.jl:1113 [inlined] [24] macro expansion at /home/christophe/.julia/packages/CUDA/42B9G/test/setup.jl:43 [inlined] [25] macro expansion at /home/christophe/.julia/packages/CUDA/42B9G/src/utilities.jl:14 [inlined] [26] macro expansion at /home/christophe/.julia/packages/CUDA/42B9G/src/memory.jl:415 [inlined] [27] top-level scope at /home/christophe/.julia/packages/CUDA/42B9G/test/setup.jl:42 [28] eval at ./boot.jl:331 [inlined] [29] runtests(::Function, ::String, ::CuDevice, ::Nothing) at /home/christophe/.julia/packages/CUDA/42B9G/test/setup.jl:53 [30] (::Distributed.var"#104#106"{Distributed.CallMsg{:call_fetch}})() at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/process_messages.jl:294 [31] run_work_thunk(::Distributed.var"#104#106"{Distributed.CallMsg{:call_fetch}}, ::Bool) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/process_messages.jl:79 [32] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/process_messages.jl:294 [inlined] [33] (::Distributed.var"#103#105"{Distributed.CallMsg{:call_fetch},Distributed.MsgHeader,Sockets.TCPSocket})() at ./task.jl:358 Test Summary: | Pass Error Broken Total Overall | 8166 1 1 8168 initialization | 11 11 apiutils | 15 15 array | 151 151 broadcast | 29 29 codegen | 18 18 cublas | 1884 1884 cudnn | 140 140 cufft | 150 150 curand | 101 101 cusolver | 1493 1493 cusparse | 369 369 examples | 7 7 execution | 79 1 80 forwarddiff | 107 107 iterator | 30 30 memory | 10 10 nnlib | 4 4 nvtx | No tests pointer | 13 13 statistics | 12 12 threading | No tests utils | 5 5 cuda/context | 12 12 cuda/devices | 3 3 cuda/errors | 6 6 cuda/events | 6 6 cuda/execution | 15 15 cuda/memory | 48 1 49 cuda/module | 12 12 cuda/occupancy | 1 1 cuda/profile | 2 2 cuda/stream | 7 7 cuda/version | 3 3 cusolver/cusparse | 92 92 device/array | 20 20 device/cuda | 265 265 device/pointer | 57 57 gpuarrays/indexing | 113 113 gpuarrays/math | 8 8 gpuarrays/input output | 5 5 gpuarrays/value constructors | 102 102 gpuarrays/interface | 7 7 gpuarrays/iterator constructors | 24 24 gpuarrays/uniformscaling | 56 56 gpuarrays/linear algebra | 393 393 gpuarrays/conversions | 72 72 gpuarrays/fft | 12 12 gpuarrays/constructors | 335 335 gpuarrays/random | 40 40 gpuarrays/base | 38 38 gpuarrays/mapreduce essentials | 522 522 gpuarrays/broadcasting | 155 155 gpuarrays/mapreduce derivatives | 810 810 gpuarrays/mapreduce (old tests) | 297 297 FAILURE Error in testset execution: Error During Test at /home/christophe/.julia/packages/CUDA/42B9G/test/execution.jl:1028 Got exception outside of a @test CUDA error: too many blocks in cooperative launch (code 720, ERROR_COOPERATIVE_LAUNCH_TOO_LARGE) Stacktrace: [1] throw_api_error(::CUDA.cudaError_enum) at /home/christophe/.julia/packages/CUDA/42B9G/lib/cuda/error.jl:103 [2] macro expansion at /home/christophe/.julia/packages/CUDA/42B9G/lib/cuda/error.jl:110 [inlined] [3] cuLaunchCooperativeKernel(::CuFunction, ::UInt32, ::UInt32, ::UInt32, ::UInt32, ::UInt32, ::UInt32, ::Int64, ::CuStream, ::Array{Ptr{Nothing},1}) at /home/christophe/.julia/packages/CUDA/42B9G/lib/utils/call.jl:93 [4] 
(::CUDA.var"#566#567"{Bool,Int64,CuStream,CuFunction})(::Array{Ptr{Nothing},1}) at /home/christophe/.julia/packages/CUDA/42B9G/lib/cuda/execution.jl:62 [5] macro expansion at /home/christophe/.julia/packages/CUDA/42B9G/lib/cuda/execution.jl:33 [inlined] [6] pack_arguments(::CUDA.var"#566#567"{Bool,Int64,CuStream,CuFunction}, ::CuDeviceArray{Float32,2,CUDA.AS.Global}, ::CuDeviceArray{Float32,2,CUDA.AS.Global}, ::CuDeviceArray{Float32,2,CUDA.AS.Global}) at /home/christophe/.julia/packages/CUDA/42B9G/lib/cuda/execution.jl:10 [7] launch(::CuFunction, ::CuDeviceArray{Float32,2,CUDA.AS.Global}, ::Vararg{CuDeviceArray{Float32,2,CUDA.AS.Global},N} where N; blocks::Int64, threads::Int64, cooperative::Bool, shmem::Int64, stream::CuStream) at /home/christophe/.julia/packages/CUDA/42B9G/lib/cuda/execution.jl:60 [8] #571 at /home/christophe/.julia/packages/CUDA/42B9G/lib/cuda/execution.jl:136 [inlined] [9] macro expansion at /home/christophe/.julia/packages/CUDA/42B9G/lib/cuda/execution.jl:95 [inlined] [10] convert_arguments at /home/christophe/.julia/packages/CUDA/42B9G/lib/cuda/execution.jl:78 [inlined] [11] #cudacall#570 at /home/christophe/.julia/packages/CUDA/42B9G/lib/cuda/execution.jl:135 [inlined] [12] #cudacall#743 at /home/christophe/.julia/packages/CUDA/42B9G/src/compiler/execution.jl:217 [inlined] [13] macro expansion at /home/christophe/.julia/packages/CUDA/42B9G/src/compiler/execution.jl:198 [inlined] [14] call(::CUDA.HostKernel{var"#kernel_vadd#476"(),Tuple{CuDeviceArray{Float32,2,CUDA.AS.Global},CuDeviceArray{Float32,2,CUDA.AS.Global},CuDeviceArray{Float32,2,CUDA.AS.Global}}}, ::CuDeviceArray{Float32,2,CUDA.AS.Global}, ::CuDeviceArray{Float32,2,CUDA.AS.Global}, ::CuDeviceArray{Float32,2,CUDA.AS.Global}; call_kwargs::Base.Iterators.Pairs{Symbol,Integer,Tuple{Symbol,Symbol,Symbol},NamedTuple{(:cooperative, :threads, :blocks),Tuple{Bool,Int64,Int64}}}) at /home/christophe/.julia/packages/CUDA/42B9G/src/compiler/execution.jl:170 [15] (::CUDA.HostKernel{var"#kernel_vadd#476"(),Tuple{CuDeviceArray{Float32,2,CUDA.AS.Global},CuDeviceArray{Float32,2,CUDA.AS.Global},CuDeviceArray{Float32,2,CUDA.AS.Global}}})(::CuDeviceArray{Float32,2,CUDA.AS.Global}, ::Vararg{CuDeviceArray{Float32,2,CUDA.AS.Global},N} where N; kwargs::Base.Iterators.Pairs{Symbol,Integer,Tuple{Symbol,Symbol,Symbol},NamedTuple{(:cooperative, :threads, :blocks),Tuple{Bool,Int64,Int64}}}) at /home/christophe/.julia/packages/CUDA/42B9G/src/compiler/execution.jl:345 [16] top-level scope at /home/christophe/.julia/packages/CUDA/42B9G/src/compiler/execution.jl:109 [17] top-level scope at /home/christophe/.julia/packages/CUDA/42B9G/test/execution.jl:1044 [18] top-level scope at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Test/src/Test.jl:1113 [19] top-level scope at /home/christophe/.julia/packages/CUDA/42B9G/test/execution.jl:1029 [20] include(::String) at ./client.jl:439 [21] #11 at /home/christophe/.julia/packages/CUDA/42B9G/test/runtests.jl:73 [inlined] [22] macro expansion at /home/christophe/.julia/packages/CUDA/42B9G/test/setup.jl:43 [inlined] [23] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Test/src/Test.jl:1113 [inlined] [24] macro expansion at /home/christophe/.julia/packages/CUDA/42B9G/test/setup.jl:43 [inlined] [25] macro expansion at /home/christophe/.julia/packages/CUDA/42B9G/src/utilities.jl:14 [inlined] [26] macro expansion at /home/christophe/.julia/packages/CUDA/42B9G/src/memory.jl:415 [inlined] [27] top-level scope at 
/home/christophe/.julia/packages/CUDA/42B9G/test/setup.jl:42 [28] eval at ./boot.jl:331 [inlined] [29] runtests(::Function, ::String, ::CuDevice, ::Nothing) at /home/christophe/.julia/packages/CUDA/42B9G/test/setup.jl:53 [30] (::Distributed.var"#104#106"{Distributed.CallMsg{:call_fetch}})() at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/process_messages.jl:294 [31] run_work_thunk(::Distributed.var"#104#106"{Distributed.CallMsg{:call_fetch}}, ::Bool) at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/process_messages.jl:79 [32] macro expansion at /buildworker/worker/package_linux64/build/usr/share/julia/stdlib/v1.4/Distributed/src/process_messages.jl:294 [inlined] [33] (::Distributed.var"#103#105"{Distributed.CallMsg{:call_fetch},Distributed.MsgHeader,Sockets.TCPSocket})() at ./task.jl:358 ERROR: LoadError: Test run finished with errors in expression starting at /home/christophe/.julia/packages/CUDA/42B9G/test/runtests.jl:434 ERROR: Package CUDA errored during testing ```

To reproduce

To reproduce, I simply run `] test CUDA` in the REPL after activating the environment containing CUDA. I also had to comment out `using OhMyREPL` in my `startup.jl` to avoid running into #246. A non-interactive equivalent is sketched below.
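For completeness, the same reproduction as a script. This is only a sketch: the environment path is hypothetical, and `Pkg.test("CUDA")` is the programmatic equivalent of `] test CUDA`.

```julia
# Hypothetical reproduction script; adjust the environment path to your setup.
using Pkg
Pkg.activate("/path/to/cuda-env")   # environment containing only CUDA v1.0.2
Pkg.test("CUDA")                    # same as `] test CUDA` in the Pkg REPL
```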

Expected behavior

The CUDA installation seems to have been successful: the CUDA driver and toolkit were installed correctly, and `CUDA.functional()` returns `true` (see the snippet below). I would therefore expect all tests to pass, and I would not expect an exception to be thrown outside of a `@test`.
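The sanity checks referred to above, as run in the REPL (a minimal sketch; the device name comes from the test log):

```julia
# Quick sanity checks (sketch only).
using CUDA
CUDA.functional()         # true on this machine
CUDA.name(CUDA.device())  # "GeForce GTX 1050 Ti" according to the test log
```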

Version info

Details on Julia:

```
versioninfo()
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-8700 CPU @ 3.20GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, skylake)
```

Details on CUDA:

CUDA.versioninfo()

There is no `CUDA.versioninfo()` for me to run (I get `ERROR: UndefVarError: versioninfo not defined`). The installed CUDA version is 1.0.2, as reported by `] st`. Note that `CUDA.version()` returns `v"10.2.0"`; I take it that's a typo. The two version numbers are sketched below.
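For reference, a sketch of the two different version numbers in play here, assuming `CUDA.version()` reports the CUDA driver/toolkit version rather than the CUDA.jl package version (which is what the numbers above suggest):

```julia
# Sketch: the two version numbers involved; interpretation of CUDA.version() is an assumption.
using Pkg, CUDA
Pkg.status("CUDA")   # CUDA.jl package version: v1.0.2 (same as `] st`)
CUDA.version()       # v"10.2.0" here: the CUDA version, not the package version
```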

Additional context

This is on a computer running Ubuntu 20.04, with CUDA.jl installed in a pristine Julia environment.

maleadt commented 4 years ago

The tests should just check device limits, should be an easy fix.
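For context, a minimal sketch of what such a device-limit check could look like for a cooperative launch. It assumes CUDA.jl's occupancy helpers (`@cuda launch=false`, `active_blocks`, `attribute`) and the `DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT` attribute name; the actual fix in the test suite may differ.

```julia
# Illustrative sketch, not the actual CUDA.jl fix: clamp a cooperative launch to
# the number of blocks the device can run co-residently (the limit behind
# ERROR_COOPERATIVE_LAUNCH_TOO_LARGE). Helper names below are assumptions.
using CUDA

function kernel_vadd(a, b, c)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(c)
        @inbounds c[i] = a[i] + b[i]
    end
    return
end

a = CUDA.rand(Float32, 1024)
b = CUDA.rand(Float32, 1024)
c = similar(a)

kernel  = @cuda launch=false kernel_vadd(a, b, c)   # compile, but do not launch yet
threads = 256

# A cooperative launch may not request more blocks than can be resident at once:
# active blocks per SM for this kernel/thread count, times the number of SMs.
max_blocks = CUDA.active_blocks(kernel.fun, threads) *
             attribute(device(), CUDA.DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT)

blocks = min(cld(length(c), threads), max_blocks)
kernel(a, b, c; threads=threads, blocks=blocks, cooperative=true)
```

Clamping the grid to that limit is what keeps `cuLaunchCooperativeKernel` from returning error 720 on smaller devices like the GTX 1050 Ti used here.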