JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

CUBLAS and exceptions test failures on Windows #536

Closed: gzhang closed this issue 3 years ago

gzhang commented 3 years ago
(@v1.5) pkg> test CUDA
    Testing CUDA
Status `C:\Users\gzhang\AppData\Local\Temp\jl_GYVHeQ\Project.toml`
  [621f4979] AbstractFFTs v0.5.0
  [79e6a3ab] Adapt v2.3.0
  [ab4f0b2a] BFloat16s v0.1.0
  [fa961155] CEnum v0.4.1
  [052768ef] CUDA v2.1.0
  [864edb3b] DataStructures v0.17.20
  [e2ba6199] ExprTools v0.1.3
  [7a1cc6ca] FFTW v1.2.4
  [1a297f60] FillArrays v0.8.14
  [f6369f11] ForwardDiff v0.10.12
  [0c68f7d7] GPUArrays v6.1.1
  [61eb1bfa] GPUCompiler v0.8.3
  [a98d9a8b] Interpolations v0.13.0
  [929cbde3] LLVM v3.3.0
  [1914dd2f] MacroTools v0.5.6
  [872c559c] NNlib v0.7.6
  [189a3867] Reexport v0.2.0
  [ae029012] Requires v1.1.0
  [a759f4b9] TimerOutputs v0.5.7
  [ade2ca70] Dates
  [8ba89e20] Distributed
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [44cfe95a] Pkg
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [2f01184e] SparseArrays
  [10745b16] Statistics
  [8dfed614] Test
Status `C:\Users\gzhang\AppData\Local\Temp\jl_GYVHeQ\Manifest.toml`
  [621f4979] AbstractFFTs v0.5.0
  [79e6a3ab] Adapt v2.3.0
  [56f22d72] Artifacts v1.3.0
  [13072b0f] AxisAlgorithms v1.0.0
  [ab4f0b2a] BFloat16s v0.1.0
  [fa961155] CEnum v0.4.1
  [052768ef] CUDA v2.1.0
  [bbf7d656] CommonSubexpressions v0.3.0
  [e66e0078] CompilerSupportLibraries_jll v0.3.4+0
  [864edb3b] DataStructures v0.17.20
  [163ba53b] DiffResults v1.0.2
  [b552c78f] DiffRules v1.0.1
  [e2ba6199] ExprTools v0.1.3
  [7a1cc6ca] FFTW v1.2.4
  [f5851436] FFTW_jll v3.3.9+6
  [1a297f60] FillArrays v0.8.14
  [f6369f11] ForwardDiff v0.10.12
  [0c68f7d7] GPUArrays v6.1.1
  [61eb1bfa] GPUCompiler v0.8.3
  [1d5cc7b8] IntelOpenMP_jll v2018.0.3+0
  [a98d9a8b] Interpolations v0.13.0
  [692b3bcd] JLLWrappers v1.1.3
  [929cbde3] LLVM v3.3.0
  [856f044c] MKL_jll v2020.2.254+0
  [1914dd2f] MacroTools v0.5.6
  [872c559c] NNlib v0.7.6
  [77ba4419] NaNMath v0.3.4
  [6fe1bfb0] OffsetArrays v1.4.0
  [efe28fd5] OpenSpecFun_jll v0.5.3+4
  [bac558e1] OrderedCollections v1.3.2
  [c84ed2f1] Ratios v0.4.0
  [189a3867] Reexport v0.2.0
  [ae029012] Requires v1.1.0
  [6c6a2e73] Scratch v1.0.3
  [276daf66] SpecialFunctions v0.10.3
  [90137ffa] StaticArrays v0.12.5
  [a759f4b9] TimerOutputs v0.5.7
  [efce3f68] WoodburyMatrices v0.5.3
  [2a0f44e3] Base64
  [ade2ca70] Dates
  [8ba89e20] Distributed
  [b77e0a4c] InteractiveUtils
  [76f85450] LibGit2
  [8f399da3] Libdl
  [37e2e46d] LinearAlgebra
  [56ddb016] Logging
  [d6f4376e] Markdown
  [a63ad114] Mmap
  [44cfe95a] Pkg
  [de0858da] Printf
  [3fa0cd96] REPL
  [9a3f8284] Random
  [ea8e919c] SHA
  [9e88b42a] Serialization
  [1a1011a3] SharedArrays
  [6462fe0b] Sockets
  [2f01184e] SparseArrays
  [10745b16] Statistics
  [8dfed614] Test
  [cf7118a7] UUIDs
  [4ec0a83e] Unicode
┌ Info: System information:
│ CUDA toolkit 11.1.0, artifact installation
│ CUDA driver 11.1.0
│ NVIDIA driver 457.9.0
│
│ Libraries:
│ - CUBLAS: 11.2.1
│ - CURAND: 10.2.2
│ - CUFFT: 10.3.0
│ - CUSOLVER: 11.0.0
│ - CUSPARSE: 11.2.0
│ - CUPTI: 14.0.0
│ - NVML: 11.0.0+457.9
│ - CUDNN: 8.0.4 (for CUDA 11.1.0)
│ - CUTENSOR: 1.2.1 (for CUDA 11.1.0)
│
│ Toolchain:
│ - Julia: 1.5.2
│ - LLVM: 9.0.1
│ - PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4
│ - Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75
│
│ 1 device:
└   0: Quadro P2000 (sm_61, 3.150 GiB / 4.000 GiB available)
[ Info: Testing using 1 device(s): 1. Quadro P2000 (UUID 9b0b39dd-2ad4-66d0-d456-01bb0741d565)
[ Info: Skipping the following tests: cutensor\base, cutensor\contractions, cutensor\elementwise_binary, cutensor\elementwise_trinary, cutensor\permutations, cutensor\reductions, device\wmma
                                         |          | ---------------- GPU ---------------- | ---------------- CPU ---------------- |
Test                            (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) | GC (s) | GC % | Alloc (MB) | RSS (MB) |
initialization                       (2) |     3.72 |   0.00 |  0.0 |       0.00 |      N/A |   0.12 |  3.2 |     481.80 |   621.50 |
apiutils                             (2) |     0.23 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |       5.37 |   621.50 |
array                                (2) |    57.88 |   0.04 |  0.1 |       5.20 |      N/A |   2.10 |  3.6 |    7420.30 |   739.61 |
broadcast                            (2) |    19.11 |   0.00 |  0.0 |       0.00 |      N/A |   0.49 |  2.6 |    2069.00 |   786.03 |
codegen                              (2) |     3.58 |   0.00 |  0.0 |       0.00 |      N/A |   0.18 |  5.1 |     310.42 |   786.03 |
cublas                               (2) |         failed at 2020-11-10T15:57:44.626
cudnn                                (3) |    58.88 |   0.05 |  0.1 |       0.89 |      N/A |   1.85 |  3.1 |    7356.63 |  1097.09 |
cufft                                (3) |    23.45 |   0.02 |  0.1 |     155.26 |      N/A |   0.80 |  3.4 |    2911.25 |  1106.94 |
curand                               (3) |     0.10 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |       5.94 |  1106.94 |
cusolver                             (3) |    54.32 |   0.05 |  0.1 |    1233.85 |      N/A |   1.90 |  3.5 |    7259.23 |  1479.03 |
cusparse                             (3) |    26.92 |   0.01 |  0.0 |       8.83 |      N/A |   0.87 |  3.2 |    2872.59 |  1485.80 |
examples                             (3) |    98.90 |   0.00 |  0.0 |       0.00 |      N/A |   0.01 |  0.0 |      23.14 |  1485.80 |
exceptions                           (3) |         failed at 2020-11-10T16:03:37.29
execution                            (4) |    40.05 |   0.04 |  0.1 |       0.09 |      N/A |   1.09 |  2.7 |    5260.15 |   715.48 |
forwarddiff                          (4) |    56.92 |   0.00 |  0.0 |       0.00 |      N/A |   0.96 |  1.7 |    3720.84 |   820.84 |
iterator                             (4) |     1.79 |   0.00 |  0.0 |       1.07 |      N/A |   0.05 |  2.7 |     212.86 |   820.84 |
nnlib                                (4) |     2.44 |   0.00 |  0.1 |       4.00 |      N/A |   0.06 |  2.6 |     245.58 |  1074.89 |
nvml                                 (4) |     0.45 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |      34.73 |  1074.89 |
nvtx                                 (4) |     0.97 |   0.00 |  0.0 |       0.00 |      N/A |   0.03 |  2.7 |     104.51 |  1074.89 |
pointer                              (4) |     0.23 |   0.00 |  0.3 |       0.00 |      N/A |   0.00 |  0.0 |      12.20 |  1074.89 |
pool                                 (4) |     1.83 |   0.00 |  0.0 |       0.00 |      N/A |   0.26 | 14.0 |     148.39 |  1074.89 |
random                               (4) |     7.44 |   0.00 |  0.0 |       0.02 |      N/A |   0.25 |  3.4 |    1060.49 |  1074.89 |
statistics                           (4) |    11.11 |   0.00 |  0.0 |       0.00 |      N/A |   0.46 |  4.1 |    1589.13 |  1074.89 |
texture                              (4) |    38.17 |   0.00 |  0.0 |       0.09 |      N/A |   1.66 |  4.3 |    5352.10 |  1106.73 |
threading                            (4) |     2.77 |   0.00 |  0.2 |      18.94 |      N/A |   0.10 |  3.5 |     337.01 |  1347.96 |
utils                                (4) |     1.02 |   0.00 |  0.1 |       4.00 |      N/A |   0.03 |  3.1 |      98.40 |  1347.96 |
cudadrv\context                      (4) |     0.42 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |      15.35 |  1347.96 |
cudadrv\devices                      (4) |     0.22 |   0.00 |  0.0 |       0.00 |      N/A |   0.01 |  6.7 |      23.98 |  1347.96 |
cudadrv\errors                       (4) |     0.11 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |      16.38 |  1347.96 |
cudadrv\events                       (4) |     0.10 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |       7.43 |  1347.96 |
cudadrv\execution                    (4) |     0.60 |   0.00 |  0.1 |       0.00 |      N/A |   0.01 |  2.3 |      46.64 |  1347.96 |
cudadrv\memory                       (4) |     1.63 |   0.00 |  0.0 |       0.00 |      N/A |   0.04 |  2.6 |     153.72 |  1347.96 |
cudadrv\module                       (4) |     0.56 |   0.00 |  0.0 |       0.00 |      N/A |   0.02 |  2.7 |      32.89 |  1347.96 |
cudadrv\occupancy                    (4) |     0.09 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |       6.55 |  1347.96 |
cudadrv\profile                      (4) |     0.23 |   0.00 |  0.0 |       0.00 |      N/A |   0.01 |  6.4 |      41.74 |  1347.96 |
cudadrv\stream                       (4) |     0.14 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |      12.12 |  1347.96 |
cudadrv\version                      (4) |     0.01 |   0.00 |  0.0 |       0.00 |      N/A |   0.00 |  0.0 |       0.07 |  1347.96 |
cusolver\cusparse                    (4) |    14.47 |   0.00 |  0.0 |       0.19 |      N/A |   0.49 |  3.4 |    1762.71 |  1933.47 |
device\array                         (4) |     1.21 |   0.00 |  0.1 |       0.00 |      N/A |   0.02 |  1.5 |     153.71 |  1933.47 |
device\intrinsics                    (4) |    77.59 |   0.00 |  0.0 |       0.01 |      N/A |   2.10 |  2.7 |    8983.69 |  1933.47 |
device\ldg                           (4) |     3.22 |   0.00 |  0.0 |       0.00 |      N/A |   0.11 |  3.5 |     493.10 |  1933.47 |
gpuarrays\math                       (4) |     2.00 |   0.00 |  0.0 |       0.00 |      N/A |   0.06 |  3.1 |     279.03 |  1933.47 |
gpuarrays\indexing scalar            (4) |     4.75 |   0.00 |  0.0 |       0.00 |      N/A |   0.14 |  2.9 |     637.32 |  1933.47 |
gpuarrays\input output               (4) |     1.10 |   0.00 |  0.0 |       0.00 |      N/A |   0.02 |  2.0 |     132.77 |  1933.47 |
gpuarrays\value constructors         (4) |     5.78 |   0.00 |  0.0 |       0.00 |      N/A |   0.18 |  3.2 |     774.17 |  1933.47 |
gpuarrays\indexing multidimensional  (4) |    16.53 |   0.00 |  0.0 |       0.69 |      N/A |   0.59 |  3.6 |    2179.54 |  1933.47 |
gpuarrays\interface                  (4) |     2.17 |   0.00 |  0.0 |       0.00 |      N/A |   0.09 |  4.0 |     350.75 |  1933.47 |
gpuarrays\iterator constructors      (4) |     1.67 |   0.00 |  0.1 |       0.02 |      N/A |   0.04 |  2.6 |     134.32 |  1933.47 |
gpuarrays\uniformscaling             (4) |     5.73 |   0.00 |  0.0 |       0.01 |      N/A |   0.18 |  3.1 |     710.38 |  1933.47 |
gpuarrays\linear algebra             (4) |    55.28 |   0.01 |  0.0 |       5.24 |      N/A |   1.93 |  3.5 |    6451.88 |  2190.88 |
gpuarrays\conversions                (4) |     2.19 |   0.00 |  0.0 |       0.01 |      N/A |   0.08 |  3.7 |     343.60 |  2190.88 |
gpuarrays\constructors               (4) |     0.90 |   0.00 |  0.2 |       0.03 |      N/A |   0.00 |  0.0 |      71.51 |  2190.88 |
gpuarrays\random                     (4) |    12.04 |   0.00 |  0.0 |       0.03 |      N/A |   0.38 |  3.2 |    1414.46 |  2190.88 |
gpuarrays\base                       (4) |    12.01 |   0.00 |  0.0 |      17.44 |      N/A |   0.56 |  4.7 |    1948.92 |  2190.88 |
gpuarrays\mapreduce essentials       (4) |    92.15 |   0.01 |  0.0 |       3.19 |      N/A |   3.68 |  4.0 |   13373.70 |  2275.11 |
gpuarrays\broadcasting               (4) |    53.02 |   0.00 |  0.0 |       1.19 |      N/A |   2.08 |  3.9 |    7706.26 |  2456.53 |
gpuarrays\mapreduce derivatives      (4) |   146.79 |   0.01 |  0.0 |       3.06 |      N/A |   5.06 |  3.4 |   16886.10 |  2784.34 |
Worker 2 failed running test cublas:
Some tests did not pass: 1911 passed, 9 failed, 0 errored, 0 broken.
cublas: Test Failed at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\cublas.jl:738
  Expression: C ≈ h_C
   Evaluated: Float32[-1.2305453f7 1.0461547f7 … -1.0187274f8 -9.26245f7; 6.531846f6 -5.553084f6 … 5.4074972f7 4.9165924f7; … ; 0.44415444 -0.15412481 … 0.2323666 0.5338216; 0.54715234 0.66976196 … 0.2136537 0.43614867] ≈ Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.74713624 0.28174913 … 0.3451866 0.7585906; 0.5790218 0.7087729 … 0.22609818 0.46155262]
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\Test\src\Test.jl:775
 [2] top-level scope at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\runtests.jl:452
 [3] include(::String) at .\client.jl:457
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:272
 [7] _start() at .\client.jl:506
cublas: Test Failed at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\cublas.jl:1002
  Expression: triu(final_C) ≈ triu(h_C)
   Evaluated: Float32[1.4073654 1.3944457 … 1.7368265 2.3285995; 0.0 1.7714791 … 2.097052 2.4623616; … ; 0.0 0.0 … 2.2051368 2.573707; 0.0 0.0 … 0.0 2.8112507] ≈ Float32[0.4576738 0.94403553 … 1.0581814 1.228253; 0.0 1.1114964 … 1.5998691 0.6885209; … ; 0.0 0.0 … 0.45837688 0.9996536; 0.0 0.0 … 0.0 1.6779919]
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\Test\src\Test.jl:775
 [2] top-level scope at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\runtests.jl:452
 [3] include(::String) at .\client.jl:457
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:272
 [7] _start() at .\client.jl:506
cublas: Test Failed at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\cublas.jl:1031
  Expression: triu(final_C) ≈ triu(h_C)
   Evaluated: Float32[2.2067833 1.7991242 … 2.3337712 3.2793531; 0.0 2.354903 … 2.5633497 3.9496772; … ; 0.0 0.0 … 3.6620305 3.9071617; 0.0 0.0 … 0.0 3.8049505] ≈ Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0]
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\Test\src\Test.jl:775
 [2] top-level scope at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\runtests.jl:452
 [3] include(::String) at .\client.jl:457
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:272
 [7] _start() at .\client.jl:506
cublas: Test Failed at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\cublas.jl:752
  Expression: C ≈ h_C
   Evaluated: [18.74598354555508 13.394181356954997 … 45.07852662693904 -21.972050309240295; 7.5061342951161985 7.193425962038685 … 21.870831128750787 -8.970564138881924; … ; -0.4389051400434357 -0.35860692721774223 … 0.8694647570653871 -0.1845682616576638; 0.161981898550573 0.20997429398542058 … 0.013299502499831101 0.23906496397129715] ≈ [23.568521792623002 -1.4290534849669403 … 13.17533209827122 -26.947925732909933; 10.080600131740978 -0.5441805956408505 … 5.341196720524661 -11.25495189841443; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0]
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\Test\src\Test.jl:775
 [2] top-level scope at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\runtests.jl:452
 [3] include(::String) at .\client.jl:457
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:272
 [7] _start() at .\client.jl:506
cublas: Test Failed at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\cublas.jl:1031
  Expression: triu(final_C) ≈ triu(h_C)
   Evaluated: [3.9103682270620195 3.8288504925252895 … 3.3687834797249803 2.9359582872098957; 0.0 3.770665117551075 … 2.969252994283909 3.1502630627353563; … ; 0.0 0.0 … 3.2051220376278855 2.9706958567760404; 0.0 0.0 … 0.0 1.7198429926259082] ≈ [4.415589386550115 4.808910081308539 … 4.977509654518695 4.732295935903332; 0.0 4.615519361023798 … 5.670942433976425 5.988953462643037; … ; 0.0 0.0 … 4.898591737003587 4.733092317212798; 0.0 0.0 … 0.0 3.7949248880899247]
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\Test\src\Test.jl:775
 [2] top-level scope at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\runtests.jl:452
 [3] include(::String) at .\client.jl:457
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:272
 [7] _start() at .\client.jl:506
cublas: Test Failed at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\cublas.jl:752
  Expression: C ≈ h_C
   Evaluated: Complex{Float32}[6.5716047f0 - 18.668325f0im -8.363171f0 - 40.46595f0im … 22.13506f0 - 30.618752f0im 25.809649f0 - 15.860115f0im; 4.5095954f0 + 4.1179533f0im 14.202671f0 + 4.545605f0im … 8.39682f0 + 9.449029f0im 4.782607f0 + 10.155699f0im; … ; 0.56358075f0 - 0.10530421f0im -0.27657855f0 - 0.18788697f0im … 0.9417068f0 - 0.48866725f0im 1.0063087f0 - 0.0028781295f0im; 0.75444937f0 + 0.68582594f0im 1.1453321f0 - 0.07741165f0im … 1.4386584f0 + 0.8686157f0im 0.30632406f0 + 1.0952568f0im] ≈ Complex{Float32}[0.0f0 + 0.0f0im 0.0f0 + 0.0f0im … 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im; 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im … 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im; … ; 0.47953796f0 + 0.3271507f0im 0.25177312f0 + 0.72940016f0im … 0.83835006f0 + 0.18035042f0im 0.50587046f0 + 0.021811485f0im; 0.38605535f0 + 0.5373397f0im 0.7373898f0 + 0.105807185f0im … 0.7955806f0 + 0.7459111f0im 0.046252728f0 + 0.736575f0im]
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\Test\src\Test.jl:775
 [2] top-level scope at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\runtests.jl:452
 [3] include(::String) at .\client.jl:457
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:272
 [7] _start() at .\client.jl:506
cublas: Test Failed at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\cublas.jl:1031
  Expression: triu(final_C) ≈ triu(h_C)
   Evaluated: Complex{Float32}[0.07527118f0 + 6.4920063f0im 1.0060724f0 + 5.3555646f0im … 1.6183192f0 + 6.8529677f0im -0.37093878f0 + 6.4395094f0im; 0.0f0 + 0.0f0im 1.1858541f0 + 5.296947f0im … 0.42090783f0 + 6.8958907f0im -0.27542973f0 + 6.415653f0im; … ; 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im … 1.3251941f0 + 7.0397844f0im -0.1773653f0 + 6.3724833f0im; 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im … 0.0f0 + 0.0f0im 0.9996095f0 + 6.0821276f0im] ≈ Complex{Float32}[0.0f0 + 0.0f0im 0.0f0 + 0.0f0im … 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im; 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im … 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im; … ; 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im … 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im; 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im … 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im]
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\Test\src\Test.jl:775
 [2] top-level scope at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\runtests.jl:452
 [3] include(::String) at .\client.jl:457
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:272
 [7] _start() at .\client.jl:506
cublas: Test Failed at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\cublas.jl:1200
  Expression: C ≈ h_C
   Evaluated: Complex{Float32}[17.230198f0 + 0.0f0im 12.078209f0 + 0.20376357f0im … 13.515779f0 - 0.10561065f0im 10.038786f0 - 2.2866707f0im; 0.0f0 + 0.0f0im 15.325577f0 + 0.0f0im … 12.912725f0 - 1.3274695f0im 9.342388f0 - 1.075415f0im; … ; 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im … 17.28791f0 + 0.0f0im 10.818294f0 - 1.8379226f0im; 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im … 0.0f0 + 0.0f0im 11.699403f0 + 0.0f0im] ≈ Complex{Float32}[36.4495f0 + 0.0f0im 19.158358f0 - 1.668592f0im … 20.20843f0 - 2.6782985f0im 15.660368f0 - 5.7176094f0im; 0.0f0 + 0.0f0im 31.994246f0 + 0.0f0im … 19.699236f0 - 4.4516745f0im 14.077913f0 - 4.5710325f0im; … ; 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im … 34.594345f0 + 0.0f0im 15.8223715f0 - 4.543624f0im; 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im … 0.0f0 + 0.0f0im 24.019485f0 + 0.0f0im]
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\Test\src\Test.jl:775
 [2] top-level scope at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\runtests.jl:452
 [3] include(::String) at .\client.jl:457
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:272
 [7] _start() at .\client.jl:506
cublas: Test Failed at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\cublas.jl:1031
  Expression: triu(final_C) ≈ triu(h_C)
   Evaluated: Complex{Float64}[-0.4455701797520782 + 4.702448983975259im 1.9942241198526371 + 6.530007499629968im … -0.6246010463986564 + 5.272877345424119im 0.6158951282339262 + 5.468338273135506im; 0.0 + 0.0im 0.8713696768913176 + 5.896850816407803im … -1.1914421137911912 + 4.834202763756903im 0.536857352740741 + 5.54919926205237im; … ; 0.0 + 0.0im 0.0 + 0.0im … -1.2713241512460756 + 5.513760382990043im 0.08961118592216222 + 5.415265793894324im; 0.0 + 0.0im 0.0 + 0.0im … 0.0 + 0.0im 1.1330187770188154 + 6.57336780099169im] ≈ Complex{Float64}[-1.2571901062489157 + 11.434009521886804im 0.3523720552787533 + 9.458099092654427im … 0.9041587455573924 + 7.613933932538402im 2.1215642239324386 + 9.741971202210799im; 0.0 + 0.0im -1.066241675413568 + 9.042800694730591im … 1.7908109694067527 + 8.159788709968103im 0.7322235400544855 + 9.973680964117122im; … ; 0.0 + 0.0im 0.0 + 0.0im … 0.558666511767051 + 8.162169351733095im 1.7088958287908496 + 10.278177807632787im; 0.0 + 0.0im 0.0 + 0.0im … 0.0 + 0.0im 1.9974750203197098 + 9.188777687541759im]
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\Test\src\Test.jl:775
 [2] top-level scope at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\runtests.jl:452
 [3] include(::String) at .\client.jl:457
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:272
 [7] _start() at .\client.jl:506
Worker 3 failed running test exceptions:
Some tests did not pass: 13 passed, 4 failed, 0 errored, 0 broken.
exceptions: Test Failed at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\exceptions.jl:28
  Expression: occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
   Evaluated: occursin("ERROR: KernelException: exception thrown during kernel execution on device", "ERROR: CUDA error: unspecified launch failure (code 719, ERROR_LAUNCH_FAILED)\nStacktrace:\n [1] top-level scope at none:12\nerror in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n")
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\Test\src\Test.jl:775
 [2] top-level scope at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\runtests.jl:452
 [3] include(::String) at .\client.jl:457
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:272
 [7] _start() at .\client.jl:506
exceptions: Test Failed at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\exceptions.jl:35
  Expression: occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
   Evaluated: occursin("ERROR: KernelException: exception thrown during kernel execution on device", "ERROR: CUDA error: unspecified launch failure (code 719, ERROR_LAUNCH_FAILED)\nStacktrace:\n [1] throw_api_error(::CUDA.cudaError_enum) at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\lib\\cudadrv\\error.jl:97\n [2] macro expansion at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\lib\\cudadrv\\error.jl:104 [inlined]\n [3] cuMemcpyDtoH_v2(::Ptr{Int64}, ::CuPtr{Int64}, ::Int64) at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\lib\\utils\\call.jl:93\n [4] #unsafe_copyto!#6 at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\lib\\cudadrv\\memory.jl:395 [inlined]\n [5] unsafe_copyto! at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\lib\\cudadrv\\memory.jl:388 [inlined]\n [6] unsafe_copyto! at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\src\\array.jl:299 [inlined]\n [7] copyto! at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\src\\array.jl:268 [inlined]\n [8] copyto! at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\src\\array.jl:272 [inlined]\n [9] copyto_axcheck! at .\\abstractarray.jl:946 [inlined]\n [10] Array at .\\array.jl:562 [inlined]\n [11] Array(::CuArray{Int64,0}) at .\\boot.jl:430\n [12] top-level scope at none:12\nerror in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n")
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\Test\src\Test.jl:775
 [2] top-level scope at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\runtests.jl:452
 [3] include(::String) at .\client.jl:457
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:272
 [7] _start() at .\client.jl:506
exceptions: Test Failed at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\exceptions.jl:42
  Expression: occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
   Evaluated: occursin("ERROR: KernelException: exception thrown during kernel execution on device", "ERROR: CUDA error: unspecified launch failure (code 719, ERROR_LAUNCH_FAILED)\nStacktrace:\n [1] throw_api_error(::CUDA.cudaError_enum) at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\lib\\cudadrv\\error.jl:97\n [2] macro expansion at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\lib\\cudadrv\\error.jl:104 [inlined]\n [3] cuMemcpyDtoH_v2(::Ptr{Int64}, ::CuPtr{Int64}, ::Int64) at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\lib\\utils\\call.jl:93\n [4] #unsafe_copyto!#6 at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\lib\\cudadrv\\memory.jl:395 [inlined]\n [5] unsafe_copyto! at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\lib\\cudadrv\\memory.jl:388 [inlined]\n [6] unsafe_copyto! at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\src\\array.jl:299 [inlined]\n [7] copyto! at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\src\\array.jl:268 [inlined]\n [8] copyto! at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\src\\array.jl:272 [inlined]\n [9] copyto_axcheck! at .\\abstractarray.jl:946 [inlined]\n [10] Array at .\\array.jl:562 [inlined]\n [11] Array(::CuArray{Int64,0}) at .\\boot.jl:430\n [12] top-level scope at none:12\nerror in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n")
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\Test\src\Test.jl:775
 [2] top-level scope at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\runtests.jl:452
 [3] include(::String) at .\client.jl:457
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:272
 [7] _start() at .\client.jl:506
exceptions: Test Failed at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\exceptions.jl:69
  Expression: occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
   Evaluated: occursin("ERROR: KernelException: exception thrown during kernel execution on device", "ERROR: CUDA error: unspecified launch failure (code 719, ERROR_LAUNCH_FAILED)\nStacktrace:\n [1] throw_api_error(::CUDA.cudaError_enum) at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\lib\\cudadrv\\error.jl:97\n [2] macro expansion at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\lib\\cudadrv\\error.jl:104 [inlined]\n [3] cuCtxSynchronize at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\lib\\utils\\call.jl:93 [inlined]\n [4] synchronize() at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\lib\\cudadrv\\context.jl:173\n [5] top-level scope at none:11\nerror in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n")
Stacktrace:
 [1] record(::Test.DefaultTestSet, ::Union{Test.Error, Test.Fail}) at D:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.5\Test\src\Test.jl:775
 [2] top-level scope at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\runtests.jl:452
 [3] include(::String) at .\client.jl:457
 [4] top-level scope at none:6
 [5] eval(::Module, ::Any) at .\boot.jl:331
 [6] exec_options(::Base.JLOptions) at .\client.jl:272
 [7] _start() at .\client.jl:506

Test Summary:                         | Pass  Fail  Broken  Total
  Overall                             | 8233    13       5   8251
    initialization                    |   25                   25
    apiutils                          |   15                   15
    array                             |  180                  180
    broadcast                         |   29                   29
    codegen                           |    9                    9
    cublas                            | 1911     9           1920
    cudnn                             |  147                  147
    cufft                             |  175                  175
    curand                            |    1                    1
    cusolver                          | 1492                 1492
    cusparse                          |  497                  497
    examples                          |    7                    7
    exceptions                        |   13     4             17
    execution                         |   66                   66
    forwarddiff                       |  107                  107
    iterator                          |   30                   30
    nnlib                             |    4                    4
    nvml                              |    7                    7
    nvtx                              |                     No tests
    pointer                           |   35                   35
    pool                              |   10                   10
    random                            |  101                  101
    statistics                        |   18                   18
    texture                           |   38             4     42
    threading                         |                     No tests
    utils                             |    5                    5
    cudadrv\context                   |   12                   12
    cudadrv\devices                   |    6                    6
    cudadrv\errors                    |    6                    6
    cudadrv\events                    |    6                    6
    cudadrv\execution                 |   15                   15
    cudadrv\memory                    |   49             1     50
    cudadrv\module                    |   11                   11
    cudadrv\occupancy                 |    1                    1
    cudadrv\profile                   |    2                    2
    cudadrv\stream                    |    7                    7
    cudadrv\version                   |    3                    3
    cusolver\cusparse                 |   84                   84
    device\array                      |   18                   18
    device\intrinsics                 |  266                  266
    device\ldg                        |   21                   21
    gpuarrays\math                    |    8                    8
    gpuarrays\indexing scalar         |  249                  249
    gpuarrays\input output            |    5                    5
    gpuarrays\value constructors      |   36                   36
    gpuarrays\indexing multidimensional |   34                   34
    gpuarrays\interface               |    7                    7
    gpuarrays\iterator constructors   |   24                   24
    gpuarrays\uniformscaling          |   56                   56
    gpuarrays\linear algebra          |  389                  389
    gpuarrays\conversions             |   72                   72
    gpuarrays\constructors            |  335                  335
    gpuarrays\random                  |   46                   46
    gpuarrays\base                    |   39                   39
    gpuarrays\mapreduce essentials    |  522                  522
    gpuarrays\broadcasting            |  155                  155
    gpuarrays\mapreduce derivatives   |  827                  827
    FAILURE

Error in testset cublas:
Test Failed at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\cublas.jl:738
  Expression: C ≈ h_C
   Evaluated: Float32[-1.2305453f7 1.0461547f7 … -1.0187274f8 -9.26245f7; 6.531846f6 -5.553084f6 … 5.4074972f7 4.9165924f7; … ; 0.44415444 -0.15412481 … 0.2323666 0.5338216; 0.54715234 0.66976196 … 0.2136537 0.43614867] ≈ Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.74713624 0.28174913 … 0.3451866 0.7585906; 0.5790218 0.7087729 … 0.22609818 0.46155262]
Error in testset cublas:
Test Failed at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\cublas.jl:1002
  Expression: triu(final_C) ≈ triu(h_C)
   Evaluated: Float32[1.4073654 1.3944457 … 1.7368265 2.3285995; 0.0 1.7714791 … 2.097052 2.4623616; … ; 0.0 0.0 … 2.2051368 2.573707; 0.0 0.0 … 0.0 2.8112507] ≈ Float32[0.4576738 0.94403553 … 1.0581814 1.228253; 0.0 1.1114964 … 1.5998691 0.6885209; … ; 0.0 0.0 … 0.45837688 0.9996536; 0.0 0.0 … 0.0 1.6779919]
Error in testset cublas:
Test Failed at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\cublas.jl:1031
  Expression: triu(final_C) ≈ triu(h_C)
   Evaluated: Float32[2.2067833 1.7991242 … 2.3337712 3.2793531; 0.0 2.354903 … 2.5633497 3.9496772; … ; 0.0 0.0 … 3.6620305 3.9071617; 0.0 0.0 … 0.0 3.8049505] ≈ Float32[0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0]
Error in testset cublas:
Test Failed at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\cublas.jl:752
  Expression: C ≈ h_C
   Evaluated: [18.74598354555508 13.394181356954997 … 45.07852662693904 -21.972050309240295; 7.5061342951161985 7.193425962038685 … 21.870831128750787 -8.970564138881924; … ; -0.4389051400434357 -0.35860692721774223 … 0.8694647570653871 -0.1845682616576638; 0.161981898550573 0.20997429398542058 … 0.013299502499831101 0.23906496397129715] ≈ [23.568521792623002 -1.4290534849669403 … 13.17533209827122 -26.947925732909933; 10.080600131740978 -0.5441805956408505 … 5.341196720524661 -11.25495189841443; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0]
Error in testset cublas:
Test Failed at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\cublas.jl:1031
  Expression: triu(final_C) ≈ triu(h_C)
   Evaluated: [3.9103682270620195 3.8288504925252895 … 3.3687834797249803 2.9359582872098957; 0.0 3.770665117551075 … 2.969252994283909 3.1502630627353563; … ; 0.0 0.0 … 3.2051220376278855 2.9706958567760404; 0.0 0.0 … 0.0 1.7198429926259082] ≈ [4.415589386550115 4.808910081308539 … 4.977509654518695 4.732295935903332; 0.0 4.615519361023798 … 5.670942433976425 5.988953462643037; … ; 0.0 0.0 … 4.898591737003587 4.733092317212798; 0.0 0.0 … 0.0 3.7949248880899247]
Error in testset cublas:
Test Failed at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\cublas.jl:752
  Expression: C ≈ h_C
   Evaluated: Complex{Float32}[6.5716047f0 - 18.668325f0im -8.363171f0 - 40.46595f0im … 22.13506f0 - 30.618752f0im 25.809649f0 - 15.860115f0im; 4.5095954f0 + 4.1179533f0im 14.202671f0 + 4.545605f0im … 8.39682f0 + 9.449029f0im 4.782607f0 + 10.155699f0im; … ; 0.56358075f0 - 0.10530421f0im -0.27657855f0 - 0.18788697f0im … 0.9417068f0 - 0.48866725f0im 1.0063087f0 - 0.0028781295f0im; 0.75444937f0 + 0.68582594f0im 1.1453321f0 - 0.07741165f0im … 1.4386584f0 + 0.8686157f0im 0.30632406f0 + 1.0952568f0im] ≈ Complex{Float32}[0.0f0 + 0.0f0im 0.0f0 + 0.0f0im … 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im; 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im … 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im; … ; 0.47953796f0 + 0.3271507f0im 0.25177312f0 + 0.72940016f0im … 0.83835006f0 + 0.18035042f0im 0.50587046f0 + 0.021811485f0im; 0.38605535f0 + 0.5373397f0im 0.7373898f0 + 0.105807185f0im … 0.7955806f0 + 0.7459111f0im 0.046252728f0 + 0.736575f0im]
Error in testset cublas:
Test Failed at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\cublas.jl:1031
  Expression: triu(final_C) ≈ triu(h_C)
   Evaluated: Complex{Float32}[0.07527118f0 + 6.4920063f0im 1.0060724f0 + 5.3555646f0im … 1.6183192f0 + 6.8529677f0im -0.37093878f0 + 6.4395094f0im; 0.0f0 + 0.0f0im 1.1858541f0 + 5.296947f0im … 0.42090783f0 + 6.8958907f0im -0.27542973f0 + 6.415653f0im; … ; 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im … 1.3251941f0 + 7.0397844f0im -0.1773653f0 + 6.3724833f0im; 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im … 0.0f0 + 0.0f0im 0.9996095f0 + 6.0821276f0im] ≈ Complex{Float32}[0.0f0 + 0.0f0im 0.0f0 + 0.0f0im … 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im; 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im … 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im; … ; 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im … 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im; 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im … 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im]
Error in testset cublas:
Test Failed at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\cublas.jl:1200
  Expression: C ≈ h_C
   Evaluated: Complex{Float32}[17.230198f0 + 0.0f0im 12.078209f0 + 0.20376357f0im … 13.515779f0 - 0.10561065f0im 10.038786f0 - 2.2866707f0im; 0.0f0 + 0.0f0im 15.325577f0 + 0.0f0im … 12.912725f0 - 1.3274695f0im 9.342388f0 - 1.075415f0im; … ; 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im … 17.28791f0 + 0.0f0im 10.818294f0 - 1.8379226f0im; 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im … 0.0f0 + 0.0f0im 11.699403f0 + 0.0f0im] ≈ Complex{Float32}[36.4495f0 + 0.0f0im 19.158358f0 - 1.668592f0im … 20.20843f0 - 2.6782985f0im 15.660368f0 - 5.7176094f0im; 0.0f0 + 0.0f0im 31.994246f0 + 0.0f0im … 19.699236f0 - 4.4516745f0im 14.077913f0 - 4.5710325f0im; … ; 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im … 34.594345f0 + 0.0f0im 15.8223715f0 - 4.543624f0im; 0.0f0 + 0.0f0im 0.0f0 + 0.0f0im … 0.0f0 + 0.0f0im 24.019485f0 + 0.0f0im]
Error in testset cublas:
Test Failed at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\cublas.jl:1031
  Expression: triu(final_C) ≈ triu(h_C)
   Evaluated: Complex{Float64}[-0.4455701797520782 + 4.702448983975259im 1.9942241198526371 + 6.530007499629968im … -0.6246010463986564 + 5.272877345424119im 0.6158951282339262 + 5.468338273135506im; 0.0 + 0.0im 0.8713696768913176 + 5.896850816407803im … -1.1914421137911912 + 4.834202763756903im 0.536857352740741 + 5.54919926205237im; … ; 0.0 + 0.0im 0.0 + 0.0im … -1.2713241512460756 + 5.513760382990043im 0.08961118592216222 + 5.415265793894324im; 0.0 + 0.0im 0.0 + 0.0im … 0.0 + 0.0im 1.1330187770188154 + 6.57336780099169im] ≈ Complex{Float64}[-1.2571901062489157 + 11.434009521886804im 0.3523720552787533 + 9.458099092654427im … 0.9041587455573924 + 7.613933932538402im 2.1215642239324386 + 9.741971202210799im; 0.0 + 0.0im -1.066241675413568 + 9.042800694730591im … 1.7908109694067527 + 8.159788709968103im 0.7322235400544855 + 9.973680964117122im; … ; 0.0 + 0.0im 0.0 + 0.0im … 0.558666511767051 + 8.162169351733095im 1.7088958287908496 + 10.278177807632787im; 0.0 + 0.0im 0.0 + 0.0im … 0.0 + 0.0im 1.9974750203197098 + 9.188777687541759im]
Error in testset exceptions:
Test Failed at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\exceptions.jl:28
  Expression: occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
   Evaluated: occursin("ERROR: KernelException: exception thrown during kernel execution on device", "ERROR: CUDA error: unspecified launch failure (code 719, ERROR_LAUNCH_FAILED)\nStacktrace:\n [1] top-level scope at none:12\nerror in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n")
Error in testset exceptions:
Test Failed at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\exceptions.jl:35
  Expression: occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
   Evaluated: occursin("ERROR: KernelException: exception thrown during kernel execution on device", "ERROR: CUDA error: unspecified launch failure (code 719, ERROR_LAUNCH_FAILED)\nStacktrace:\n [1] throw_api_error(::CUDA.cudaError_enum) at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\lib\\cudadrv\\error.jl:97\n [2] macro expansion at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\lib\\cudadrv\\error.jl:104 [inlined]\n [3] cuMemcpyDtoH_v2(::Ptr{Int64}, ::CuPtr{Int64}, ::Int64) at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\lib\\utils\\call.jl:93\n [4] #unsafe_copyto!#6 at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\lib\\cudadrv\\memory.jl:395 [inlined]\n [5] unsafe_copyto! at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\lib\\cudadrv\\memory.jl:388 [inlined]\n [6] unsafe_copyto! at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\src\\array.jl:299 [inlined]\n [7] copyto! at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\src\\array.jl:268 [inlined]\n [8] copyto! at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\src\\array.jl:272 [inlined]\n [9] copyto_axcheck! at .\\abstractarray.jl:946 [inlined]\n [10] Array at .\\array.jl:562 [inlined]\n [11] Array(::CuArray{Int64,0}) at .\\boot.jl:430\n [12] top-level scope at none:12\nerror in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n")
Error in testset exceptions:
Test Failed at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\exceptions.jl:42
  Expression: occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
   Evaluated: occursin("ERROR: KernelException: exception thrown during kernel execution on device", "ERROR: CUDA error: unspecified launch failure (code 719, ERROR_LAUNCH_FAILED)\nStacktrace:\n [1] throw_api_error(::CUDA.cudaError_enum) at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\lib\\cudadrv\\error.jl:97\n [2] macro expansion at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\lib\\cudadrv\\error.jl:104 [inlined]\n [3] cuMemcpyDtoH_v2(::Ptr{Int64}, ::CuPtr{Int64}, ::Int64) at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\lib\\utils\\call.jl:93\n [4] #unsafe_copyto!#6 at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\lib\\cudadrv\\memory.jl:395 [inlined]\n [5] unsafe_copyto! at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\lib\\cudadrv\\memory.jl:388 [inlined]\n [6] unsafe_copyto! at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\src\\array.jl:299 [inlined]\n [7] copyto! at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\src\\array.jl:268 [inlined]\n [8] copyto! at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\src\\array.jl:272 [inlined]\n [9] copyto_axcheck! at .\\abstractarray.jl:946 [inlined]\n [10] Array at .\\array.jl:562 [inlined]\n [11] Array(::CuArray{Int64,0}) at .\\boot.jl:430\n [12] top-level scope at none:12\nerror in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n")
Error in testset exceptions:
Test Failed at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\exceptions.jl:69
  Expression: occursin("ERROR: KernelException: exception thrown during kernel execution on device", err)
   Evaluated: occursin("ERROR: KernelException: exception thrown during kernel execution on device", "ERROR: CUDA error: unspecified launch failure (code 719, ERROR_LAUNCH_FAILED)\nStacktrace:\n [1] throw_api_error(::CUDA.cudaError_enum) at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\lib\\cudadrv\\error.jl:97\n [2] macro expansion at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\lib\\cudadrv\\error.jl:104 [inlined]\n [3] cuCtxSynchronize at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\lib\\utils\\call.jl:93 [inlined]\n [4] synchronize() at C:\\Users\\gzhang\\.julia\\packages\\CUDA\\0p5fn\\lib\\cudadrv\\context.jl:173\n [5] top-level scope at none:11\nerror in running finalizer: CUDA.KernelException(dev=CUDA.CuDevice(handle=0))\n")
ERROR: LoadError: Test run finished with errors
in expression starting at C:\Users\gzhang\.julia\packages\CUDA\0p5fn\test\runtests.jl:483
ERROR: Package CUDA errored during testing
maleadt commented 3 years ago

Are these failures reproducible?

And please don't just dump error output in an issue; that's rude. Instead, add some details about your system (what version of Windows, how you installed Julia), the errors (e.g. are they reproducible, can you isolate them, ...), format your post, etc.
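
For anyone filing a similar report, a minimal sketch of how to gather those details from the Julia side (assuming a CUDA.jl version that provides CUDA.versioninfo()):

using InteractiveUtils, CUDA

# Julia, OS and CPU details
versioninfo()

# CUDA toolkit, driver, library and device details, similar to the
# "System information" block printed at the start of the test suite
CUDA.versioninfo()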

gzhang commented 3 years ago

system: win 10
Julia version: 1.5.2
CUDA version: 11.1.1
NVIDIA drivers: 457.9

Mark-314 commented 3 years ago

Encountered a very similar problem; the failures are consistent and repeatable.

system: win 10
CPU: Intel(R) Core(TM) i7
RAM: 64GB
GPU: RTX 2060
Julia version: 1.5.3/1.5.2

Test Summary:                         |  Pass  Fail  Broken  Total
  Overall                             | 10682     8       5  10695
    cublas                            |  1914     6           1920
    exceptions                        |    15     2             17

┌ Info: System information:
│ CUDA toolkit 11.1.1, artifact installation
│ CUDA driver 11.2.0
│ NVIDIA driver 460.20.0
│
│ Libraries:
│ - CUBLAS: 11.3.0
│ - CURAND: 10.2.2
│ - CUFFT: 10.3.0
│ - CUSOLVER: 11.0.1
│ - CUSPARSE: 11.3.0
│ - CUPTI: 14.0.0
│ - NVML: 11.0.0+460.20
│ - CUDNN: 8.0.4 (for CUDA 11.1.0)
│ - CUTENSOR: 1.2.1 (for CUDA 11.1.0)
│
│ Toolchain:
│ - Julia: 1.5.3
│ - LLVM: 9.0.1
│ - PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4
│ - Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75
│
│ 1 device:
└   0: GeForce RTX 2060 (sm_75, 5.052 GiB / 6.000 GiB available)
[ Info: Testing using 1 device(s): 1. GeForce RTX 2060
gzhang commented 3 years ago
┌ Info: System information:
│ CUDA toolkit 11.1.1, artifact installation
│ CUDA driver 11.1.0
│ NVIDIA driver 457.9.0
│
│ Libraries:
│ - CUBLAS: 11.3.0
│ - CURAND: 10.2.2
│ - CUFFT: 10.3.0
│ - CUSOLVER: 11.0.1
│ - CUSPARSE: 11.3.0
│ - CUPTI: 14.0.0
│ - NVML: 11.0.0+457.9
│ - CUDNN: 8.0.4 (for CUDA 11.1.0)
│ - CUTENSOR: 1.2.1 (for CUDA 11.1.0)
│
│ Toolchain:
│ - Julia: 1.5.3
│ - LLVM: 9.0.1
│ - PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4
│ - Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75
│
│ 1 device:
└   0: Quadro P2000 (sm_61, 3.526 GiB / 4.000 GiB available)
[ Info: Testing using 1 device(s): 1. Quadro P2000 (UUID 9b0b39dd-2ad4-66d0-d456-01bb0741d565)
[ Info: Skipping the following tests: cutensor\base, cutensor\contractions, cutensor\elementwise_binary, cutensor\elementwise_trinary, cutensor\permutations, cutensor\reductions, device\wmma
system: Windows 10 Enterprise (20H2, 19042.630)
Visual Studio 2019: 16.8.2
NVIDIA Nsight Compute 2020.2.1
NVIDIA Nsight Visual Studio Edition 2020.2.1.20303
NVIDIA CUDA Runtime 11.1
NVIDIA Graphical Drivers: 457.09

Problems persist after upgrading CUDA.jl to version 2.3.0.

Test Summary:                         | Pass  Fail  Broken  Total
  Overall                             | 8468    13       5   8486
    cublas                            | 1911     9           1920
    exceptions                        |   13     4             17
    nvtx                              |                     No tests
    texture                           |   38             4     42
    threading                         |                     No tests
    cudadrv\memory                    |   49             1     50
    FAILURE

(Note: please use triple backticks to denote code listings)

maleadt commented 3 years ago

I haven't been able to reproduce any of these failures. So it would be useful if somebody who can reproduce them could reduce the failure to a single test, narrow it down further, add some additional specifics, etc.
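
For reference, a sketch of how one could narrow things down to a single test file (assuming the test runner accepts test names as arguments, as recent versions of the suite do):

using Pkg

# Run only the cublas tests (test names match the entries in the summary table,
# e.g. "cublas" or "exceptions") instead of the full suite
Pkg.test("CUDA"; test_args=["cublas"])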

clintonTE commented 3 years ago

Here are a couple of these I was able to reproduce by copying and modifying some of the code from the test file:


using Revise, CUDA, LinearAlgebra

using CUDA.CUBLAS
using CUDA.CUBLAS: band, bandex

using LinearAlgebra

m = 20
n = 35
k = 13

elty=Float32

alpha = rand(elty)
beta = rand(elty)

A = rand(elty,m,k)
B = rand(elty,k,n)
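# NOTE: dA and dB are never defined in this snippet, and with these sizes
# A\B itself throws the DimensionMismatch reported below; see the discussion
# and the corrected MWE further down in the thread.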

try
    C = alpha*(A\B)
    dC = copy(dB)
    CUBLAS.xt_trsm!('L','U','N','N',alpha,dA,dC)
    # move to host and compare
    h_C = Array(dC)
    @assert C ≈ h_C
catch err
    @warn "xt_trsm! gpu failed!! error: $err"
end

try
    C  = alpha*(A\B)
    h_C = CUBLAS.xt_trsm('L','U','N','N',alpha,Array(dA),Array(dB))
    @assert C ≈ h_C
catch err
    @warn "xt_trsm cpu failed!! error: $err"
end

Output:

┌ Warning: xt_trsm! gpu failed!! error: DimensionMismatch("Both inputs should have the same number of rows")
└ @ Main C:\Users\Clinton\Dropbox\Projects\Tutorial\cudatest.jl:29
┌ Warning: xt_trsm cpu failed!! error: DimensionMismatch("Both inputs should have the same number of rows")
└ @ Main C:\Users\Clinton\Dropbox\Projects\Tutorial\cudatest.jl:37
gzhang commented 3 years ago

Here are a couple of these I was able to reproduce by copying and modifying some of the code from the test file:

using Revise, CUDA, LinearAlgebra

using CUDA.CUBLAS
using CUDA.CUBLAS: band, bandex

using LinearAlgebra

m = 20
n = 35
k = 13

elty=Float32

alpha = rand(elty)
beta = rand(elty)

A = rand(elty,m,k)
B = rand(elty,k,n)

try
    C = alpha*(A\B)
    dC = copy(dB)
    CUBLAS.xt_trsm!('L','U','N','N',alpha,dA,dC)
    # move to host and compare
    h_C = Array(dC)
    @assert C ≈ h_C
catch err
    @warn "xt_trsm! gpu failed!! error: $err"
end

try
    C  = alpha*(A\B)
    h_C = CUBLAS.xt_trsm('L','U','N','N',alpha,Array(dA),Array(dB))
    @assert C ≈ h_C
catch err
    @warn "xt_trsm cpu failed!! error: $err"
end

Output:

┌ Warning: xt_trsm! gpu failed!! error: DimensionMismatch("Both inputs should have the same number of rows")
└ @ Main C:\Users\Clinton\Dropbox\Projects\Tutorial\cudatest.jl:29
┌ Warning: xt_trsm cpu failed!! error: DimensionMismatch("Both inputs should have the same number of rows")
└ @ Main C:\Users\Clinton\Dropbox\Projects\Tutorial\cudatest.jl:37

I get the same output:

┌ Warning: xt_trsm! gpu failed!! error: DimensionMismatch("Both inputs should have the same number of rows")
└ @ Main F:\Code\Julia_projects\testCuda.jl:23
┌ Warning: xt_trsm cpu failed!! error: DimensionMismatch("Both inputs should have the same number of rows")
└ @ Main F:\Code\Julia_projects\testCuda.jl:30
maleadt commented 3 years ago

OK, so only cublasXt tests fail? We might be doing something legitimately wrong then, because I remember cuda-memcheck complaining about how we pin our host memory there, which might behave differently on Windows.

That said, if you don't actively use those xt_ functions, the failures are harmless.

maleadt commented 3 years ago

We might be doing something legitimately wrong then, because I remember cuda-memcheck complaining about how we pin our host memory there, which might behave differently on Windows.

I verified, and that doesn't hold true anymore with CUDA 11.1.

clintonTE commented 3 years ago

OK, so only cublasXt tests fail? We might be doing something legitimately wrong then, because I remember cuda-memcheck complaining about how we pin our host memory there, which might behave differently on Windows.

That said, if you don't actively use those xt_ functions, the failures are harmless.

Good to know. Speaking for myself, I don't use these, but I'm happy to help test if a Windows box is needed.

maleadt commented 3 years ago
┌ Warning: xt_trsm! gpu failed!! error: DimensionMismatch("Both inputs should have the same number of rows")

It looks like your reduction is invalid, though: the failure it produces isn't like what happened originally.

Anyway, if you can still reproduce the original failure, could you try adding a call to synchronize() after every call to CUBLAS.xt_* in test/cublas.jl and see if that fixes the problems?
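
For concreteness, the suggestion amounts to following each cublasXt call with an explicit synchronization before the result is read back. A minimal sketch of the pattern (with inputs shaped like the trsm tests; this mirrors the MWE posted further down, not the actual test/cublas.jl edit):

using CUDA, LinearAlgebra
using CUDA.CUBLAS

m, n  = 20, 35
alpha = rand(Float32)
A  = triu(rand(Float32, m, m))   # upper-triangular system, as in the trsm tests
B  = rand(Float32, m, n)
dA = CuArray(A)
dC = CuArray(B)                  # xt_trsm! overwrites this argument in place

CUBLAS.xt_trsm!('L', 'U', 'N', 'N', alpha, dA, dC)
CUDA.synchronize()               # the added synchronization point
h_C = Array(dC)

@show alpha * (A \ B) ≈ h_C      # reported to fail intermittently on affected Windows setups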

clintonTE commented 3 years ago

If you are saying it is a coincidence, that's fine, but the above code was pulled together by cross-referencing the OP's original stack trace line numbers with the test failures on my system.

I checked using the above example and it didn't help. I'll try it on the master-branch tests once I make the jump to 1.6.

maleadt commented 3 years ago

If you are saying it is a coincidence, that's fine, but the above code was pulled based on cross-referencing the op's original stack trace line numbers and the test failures on my system.

The tests are very stateful, so you probably copied the wrong definitions for some of the inputs: the tests there never throw a DimensionMismatch, and neither does your original error report. So adding a synchronization point there is also not expected to do anything.

maleadt commented 3 years ago

I've implemented the above suggestion here: https://github.com/JuliaGPU/CUDA.jl/pull/572

Do note this branch needs Julia#master, and the Windows nightlies are lagging, so you need an up-to-date build like https://julialangnightlies-s3.julialang.org/pretesting/winnt/x64/1.6/julia-377aa809eb-win64.exe. Also note the required GPUCompiler dependency isn't tagged yet, so you need to launch using julia --project from CUDA.jl's checkout.
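
Roughly equivalently, from within a (nightly) Julia session one can activate the checkout's project and run its test suite; a sketch, where the checkout path is hypothetical:

using Pkg

# Activate the project of the local CUDA.jl checkout (path is an example),
# resolve its untagged dependencies, and run the tests
Pkg.activate(raw"C:\src\CUDA.jl")
Pkg.instantiate()
Pkg.test()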

clintonTE commented 3 years ago

Ah I see what you mean. Here is hopefully a valid MWE:

using Revise, CUDA, LinearAlgebra, Random

using CUDA.CUBLAS
using CUDA.CUBLAS: band, bandex

Random.seed!(11)
function mwe()

  local m = 20
  local n = 35
  local k = 13

  elty=Float32

  local alpha = rand(elty)
  local beta = rand(elty)

  local A = triu(rand(elty, m, m))
  local B = rand(elty,m,n)
  local C = zeros(elty,m,n)
  local dA = CuArray(A)
  local dB = CuArray(B)
  local dC = CuArray(C)
  local failed=false

  try
    C = alpha*(A\B)
    dC = copy(dB)
    CUBLAS.xt_trsm!('L','U','N','N',alpha,dA,dC)
    CUDA.synchronize()
    # move to host and compare
    h_C = Array(dC)
    @assert C ≈ h_C
  catch err
    @warn "xt_trsm! gpu failed!! error: $err"
    failed=true
  end

  try
    C  = alpha*(A\B)
    h_C = CUBLAS.xt_trsm('L','U','N','N',alpha,Array(dA),Array(dB))
    CUDA.synchronize()
    @assert C ≈ h_C
  catch err
    @warn "xt_trsm cpu failed!! error: $err"
    failed=true
  end

  return failed
end

for i ∈ 1:10^3
  if mwe()
    @info "Failed on iteration $i"
    break
  end
end

Output:

┌ Warning: xt_trsm! gpu failed!! error: AssertionError("C ≈ h_C")
└ @ Main C:\Users\Clinton\Dropbox\Projects\Tutorial\cudatest.jl:39
[ Info: Failed on iteration 7

I'll give the PR a shot.

maleadt commented 3 years ago

If that snippet failed with the synchronize in there, the PR won't help. Still, I'd appreciate it if you could test it. Even with multiple iterations, I can't reproduce this (haven't tried on Windows, though).

Is there anything special about your set-up? Are you using the driver in WDDM or TCC mode? Hardware-accelerated GPU scheduling? Other special settings related to the GPU or CUDA?
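
For anyone unsure how to answer that, a quick sketch for checking the driver model (assuming nvidia-smi is on the PATH; on Windows its detailed query includes a "Driver Model" field showing WDDM or TCC):

# Print the detailed per-GPU report and look for the "Driver Model" entry
run(`nvidia-smi -q`)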

EDIT: ah, I can finally reproduce this on my Windows system by doing multiple iterations.

clintonTE commented 3 years ago

Ah OK, I tried it and, as you predicted, it didn't work; basically the same results as the OP. Nothing particularly special about my system: just a Razer laptop with an RTX 2070 running in WDDM mode with a bunch of monitors hooked up.

Edit: Note I tried it on the original #572 (67dfd4a)

clintonTE commented 3 years ago

I just tried it on #577 and everything worked!