JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

nvprof does not detect kernel launches #371

Closed mkarikom closed 4 years ago

mkarikom commented 4 years ago

nvprof runs without error and CUDA.jl behaves as expected, but nvprof does not record any kernel launches or API activity.

Julia environment:

(@v1.4) pkg> status
Status `~/.julia/environments/v1.4/Project.toml`
  [c52e3926] Atom v0.12.19
  [052768ef] CUDA v1.2.1
  [e5e0dc1b] Juno v0.8.3
  [14b8a8f1] PkgTemplates v0.7.8
  [295af30f] Revise v2.7.3

nvprof version:

(base) au@a1:~$ nvprof --version
nvprof: NVIDIA (R) Cuda command line profiler
Copyright (c) 2012 - 2019 NVIDIA Corporation
Release version 10.1.243 (21)

hardware and library support:

(@v1.4) pkg> test CUDA
    Testing CUDA
Status `/tmp/jl_c4Y0CF/Manifest.toml`
  [621f4979] AbstractFFTs v0.5.0
  [79e6a3ab] Adapt v2.0.2
  [b99e7846] BinaryProvider v0.5.10
  [fa961155] CEnum v0.4.1
  [052768ef] CUDA v1.2.1
  [bbf7d656] CommonSubexpressions v0.3.0
  [e66e0078] CompilerSupportLibraries_jll v0.3.3+0
  [864edb3b] DataStructures v0.17.20
  [163ba53b] DiffResults v1.0.2
  [b552c78f] DiffRules v1.0.1
  [e2ba6199] ExprTools v0.1.1
  [7a1cc6ca] FFTW v1.2.2
  [f5851436] FFTW_jll v3.3.9+5
  [1a297f60] FillArrays v0.9.4
  [f6369f11] ForwardDiff v0.10.12
  [0c68f7d7] GPUArrays v5.0.0
  [61eb1bfa] GPUCompiler v0.5.5
  [1d5cc7b8] IntelOpenMP_jll v2018.0.3+0
  [929cbde3] LLVM v2.0.0
  [856f044c] MKL_jll v2020.2.254+0
  [1914dd2f] MacroTools v0.5.5
  [872c559c] NNlib v0.7.4
  [77ba4419] NaNMath v0.3.4
  [efe28fd5] OpenSpecFun_jll v0.5.3+3
  [bac558e1] OrderedCollections v1.3.0
  [189a3867] Reexport v0.2.0
  [ae029012] Requires v1.0.1
  [276daf66] SpecialFunctions v0.10.3
  [90137ffa] StaticArrays v0.12.4
  [a759f4b9] TimerOutputs v0.5.6
  [2a0f44e3] Base64 
  [ade2ca70] Dates 
  [8ba89e20] Distributed 
  [b77e0a4c] InteractiveUtils 
  [76f85450] LibGit2 
  [8f399da3] Libdl 
  [37e2e46d] LinearAlgebra 
  [56ddb016] Logging 
  [d6f4376e] Markdown 
  [44cfe95a] Pkg 
  [de0858da] Printf 
  [3fa0cd96] REPL 
  [9a3f8284] Random 
  [ea8e919c] SHA 
  [9e88b42a] Serialization 
  [6462fe0b] Sockets 
  [2f01184e] SparseArrays 
  [10745b16] Statistics 
  [8dfed614] Test 
  [cf7118a7] UUIDs 
  [4ec0a83e] Unicode 
┌ Info: System information:
│ CUDA toolkit 10.2.89, artifact installation
│ CUDA driver 10.2.0
│ NVIDIA driver 440.100.0
│ 
│ Libraries: 
│ - CUBLAS: 10.2.2
│ - CURAND: 10.1.2
│ - CUFFT: 10.1.2
│ - CUSOLVER: 10.3.0
│ - CUSPARSE: 10.3.1
│ - CUPTI: 12.0.0
│ - NVML: 10.0.0+440.100
│ - CUDNN: 8.0.1 (for CUDA 10.2.0)
│ - CUTENSOR: 1.2.0 (for CUDA 10.2.0)
│ 
│ Toolchain:
│ - Julia: 1.4.2
│ - LLVM: 8.0.1
│ - PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3
│ - Device support: sm_30, sm_32, sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75
│ 
│ 1 device(s):
└ - GeForce GTX 1080 Ti (sm_61, 8.982 GiB / 10.913 GiB available)
[ Info: Testing using 1 device(s): 1. GeForce GTX 1080 Ti (UUID ad1d87a4-88f9-0a82-edf0-3931aa888c68)
[ Info: Skipping the following tests: cutensor, device/wmma
                                     |          | ---------------- GPU ---------------- | ---------------- CPU ---------------- |
Test                        (Worker) | Time (s) | GC (s) | GC % | Alloc (MB) | RSS (MB) | GC (s) | GC % | Alloc (MB) | RSS (MB) |
initialization                   (2) |     2.95 |   0.00 |  0.0 |       0.00 |   135.00 |   0.05 |  1.7 |     160.93 |   868.43 |
apiutils                         (3) |     0.71 |   0.00 |  0.0 |       0.00 |   135.00 |   0.03 |  4.4 |      90.76 |   876.82 |
curand                           (2) |     0.25 |   0.00 |  0.0 |       0.00 |   141.00 |   0.00 |  0.0 |      28.72 |   876.84 |
codegen                          (6) |    16.40 |   0.26 |  1.6 |       0.00 |   175.00 |   0.94 |  5.7 |    1818.62 |  1049.93 |
broadcast                        (5) |    34.20 |   0.33 |  1.0 |       0.00 |   149.00 |   1.51 |  4.4 |    3419.01 |   998.46 |
cufft                            (9) |    35.26 |   0.31 |  0.9 |     144.16 |   303.00 |   1.93 |  5.5 |    4372.68 |  1208.13 |
cusparse                         (2) |    50.29 |   0.29 |  0.6 |       4.46 |   209.00 |   2.43 |  4.8 |    6067.47 |  1276.17 |
iterator                         (2) |     2.03 |   0.00 |  0.0 |       1.25 |   211.00 |   0.07 |  3.3 |     227.10 |  1276.34 |
memory                           (2) |     1.43 |   0.00 |  0.0 |       0.00 |   209.00 |   0.36 | 25.2 |     110.34 |  1276.36 |
array                            (4) |    56.83 |   0.33 |  0.6 |       5.20 |   155.00 |   2.73 |  4.8 |    6732.05 |  1109.42 |
nvml                             (4) |     0.46 |   0.00 |  0.0 |       0.00 |   155.00 |   0.00 |  0.0 |      49.10 |  1113.06 |
nvtx                             (4) |     0.46 |   0.00 |  0.0 |       0.00 |   155.00 |   0.03 |  5.7 |      73.85 |  1113.19 |
pointer                          (4) |     0.10 |   0.00 |  0.0 |       0.00 |   155.00 |   0.00 |  0.0 |       6.40 |  1113.26 |
nnlib                            (2) |     3.23 |   0.16 |  5.0 |       0.00 |   253.00 |   0.13 |  4.2 |     411.07 |  1408.39 |
random                           (4) |     4.63 |   0.00 |  0.0 |       0.02 |   155.00 |   0.17 |  3.6 |     492.31 |  1116.78 |
cublas                           (7) |    68.23 |   0.38 |  0.6 |      11.12 |   211.00 |   3.34 |  4.9 |    8936.22 |  1277.85 |
cudnn                            (8) |    68.74 |   0.32 |  0.5 |       0.60 |   261.00 |   2.81 |  4.1 |    7509.71 |  1547.84 |
cusolver                         (3) |    68.11 |   0.36 |  0.5 |    1128.68 |   321.00 |   3.46 |  5.1 |    8741.84 |  1404.84 |
cudadrv/context                  (3) |     0.65 |   0.00 |  0.0 |       0.00 |   321.00 |   0.00 |  0.0 |      32.95 |  1526.70 |
utils                            (8) |     1.21 |   0.00 |  0.0 |       0.00 |   261.00 |   0.05 |  4.4 |     141.72 |  1547.84 |
cudadrv/devices                  (3) |     0.31 |   0.00 |  0.0 |       0.00 |   321.00 |   0.01 |  4.8 |      39.47 |  1526.70 |
cudadrv/errors                   (8) |     0.18 |   0.00 |  0.0 |       0.00 |   261.00 |   0.00 |  0.0 |      22.18 |  1547.84 |
threading                        (7) |     2.10 |   0.00 |  0.1 |       4.69 |   221.00 |   0.06 |  3.0 |     198.21 |  1291.28 |
cudadrv/events                   (3) |     0.14 |   0.00 |  0.0 |       0.00 |   321.00 |   0.00 |  0.0 |      14.38 |  1526.70 |
cudadrv/module                   (3) |     0.37 |   0.00 |  0.0 |       0.00 |   321.00 |   0.02 |  4.4 |      46.39 |  1526.70 |
cudadrv/occupancy                (3) |     0.11 |   0.00 |  0.0 |       0.00 |   321.00 |   0.00 |  0.0 |       8.27 |  1526.70 |
cudadrv/execution                (8) |     0.90 |   0.00 |  0.0 |       0.00 |   261.00 |   0.04 |  4.2 |     107.33 |  1547.84 |
cudadrv/profile                  (3) |     0.25 |   0.00 |  0.0 |       0.00 |   321.00 |   0.00 |  0.0 |      48.15 |  1526.70 |
cudadrv/version                  (3) |     0.01 |   0.00 |  0.0 |       0.00 |   321.00 |   0.00 |  0.0 |       0.07 |  1526.70 |
cudadrv/stream                   (8) |     0.20 |   0.00 |  0.0 |       0.00 |   261.00 |   0.00 |  0.0 |      23.67 |  1547.84 |
cudadrv/memory                   (7) |     1.85 |   0.00 |  0.0 |       0.00 |   213.00 |   0.08 |  4.2 |     206.28 |  1292.60 |
statistics                       (2) |    13.65 |   0.00 |  0.0 |       0.00 |   253.00 |   0.61 |  4.5 |    1656.35 |  1452.93 |
device/array                     (8) |     3.35 |   0.00 |  0.0 |       0.00 |   261.00 |   0.13 |  4.0 |     355.86 |  1547.84 |
cusolver/cusparse                (3) |     6.51 |   0.00 |  0.0 |       0.19 |   387.00 |   0.19 |  2.9 |     583.47 |  1614.06 |
device/pointer                   (2) |     5.85 |   0.00 |  0.0 |       0.00 |   253.00 |   0.20 |  3.4 |     640.21 |  1459.73 |
gpuarrays/math                   (3) |     1.97 |   0.00 |  0.0 |       0.00 |   387.00 |   0.07 |  3.5 |     245.62 |  1620.85 |
texture                          (4) |    17.47 |   0.00 |  0.0 |       0.08 |   159.00 |   0.92 |  5.3 |    2414.18 |  1118.00 |
gpuarrays/input output           (2) |     2.79 |   0.00 |  0.0 |       0.00 |   253.00 |   0.22 |  8.0 |     535.03 |  1467.05 |
gpuarrays/interface              (4) |     1.74 |   0.00 |  0.0 |       0.00 |   159.00 |   0.06 |  3.2 |     191.04 |  1118.62 |
gpuarrays/value constructors     (3) |     3.92 |   0.00 |  0.0 |       0.00 |   389.00 |   0.12 |  3.0 |     367.18 |  1631.52 |
gpuarrays/uniformscaling         (4) |     5.93 |   0.00 |  0.0 |       0.01 |   187.00 |   0.21 |  3.6 |     626.72 |  1134.04 |
gpuarrays/indexing               (8) |    13.93 |   0.00 |  0.0 |       0.13 |   261.00 |   0.60 |  4.3 |    1715.34 |  1547.84 |
gpuarrays/iterator constructors  (2) |    10.05 |   0.00 |  0.0 |       0.02 |   253.00 |   0.46 |  4.6 |    1423.34 |  1539.64 |
gpuarrays/conversions            (4) |     3.72 |   0.00 |  0.0 |       0.01 |   183.00 |   0.18 |  4.8 |     596.07 |  1143.72 |
gpuarrays/constructors           (2) |     1.20 |   0.00 |  0.3 |       0.04 |   253.00 |   0.00 |  0.0 |      72.61 |  1541.23 |
gpuarrays/fft                    (8) |     5.85 |   0.00 |  0.0 |       6.01 |   339.00 |   0.26 |  4.5 |     769.86 |  1726.85 |
forwarddiff                      (9) |    63.56 |   0.20 |  0.3 |       0.00 |   305.00 |   0.89 |  1.4 |    2826.93 |  1373.03 |
gpuarrays/base                   (2) |    12.76 |   0.00 |  0.0 |      17.61 |   277.00 |   0.90 |  7.1 |    1878.78 |  1610.77 |
gpuarrays/random                 (4) |    14.56 |   0.00 |  0.0 |       0.02 |   183.00 |   0.42 |  2.9 |    1243.95 |  1203.35 |
examples                         (6) |    96.05 |   0.00 |  0.0 |       0.00 |   175.00 |   0.06 |  0.1 |      29.75 |  1056.34 |
gpuarrays/linear algebra         (3) |    49.42 |   0.01 |  0.0 |       1.43 |   383.00 |   1.42 |  2.9 |    4547.99 |  1810.02 |
execution                        (5) |   106.79 |   0.00 |  0.0 |       0.15 |   219.00 |   0.93 |  0.9 |    2890.57 |  1240.08 |
device/intrinsics                (7) |    73.27 |   0.00 |  0.0 |       0.01 |   747.00 |   1.35 |  1.8 |    4934.04 |  1470.76 |
gpuarrays/broadcasting           (9) |    54.74 |   0.00 |  0.0 |       1.19 |   297.00 |   2.27 |  4.1 |    7386.15 |  1506.79 |
gpuarrays/mapreduce essentials   (8) |    83.10 |   0.01 |  0.0 |       3.19 |   351.00 |   3.44 |  4.1 |   11834.02 |  1962.21 |
gpuarrays/mapreduce derivatives  (2) |   125.38 |   0.01 |  0.0 |       3.06 |   309.00 |   3.78 |  3.0 |   14215.29 |  1942.89 |

Test Summary: | Pass  Broken  Total
  Overall     | 8008       2   8010
    SUCCESS
    Testing CUDA tests passed 

script tested: scratch.jl (taken from the mapreduce test in CUDA.jl/test)

using Pkg
Pkg.activate("./")

using CUDA

function mapreduce_gpu(f::Function, op::Function, A::CuArray{T, N}) where {T, N}
    OT = Int
    v0 = 0

    out = CuArray{OT}(undef, (1,))
    @cuda threads=64 reduce_kernel(f, op, v0, A, out)
    Array(out)[1]
end

function reduce_kernel(f, op, v0::T, A, result) where {T}
    tmp_local = @cuStaticSharedMem(T, 64)
    acc = v0

    # Loop sequentially over chunks of input vector
    i = threadIdx().x
    while i <= length(A)
        element = f(A[i])
        acc = op(acc, element)
        i += blockDim().x
    end

    return
end

A = rand(1:10, 100)
dA = CuArray(A)

mapreduce(identity, +, A)

result of running scratch.jl in repl:

julia> include("/mnt/evo512/insync/Software_a1/testCUDA/scratch.jl")
 Activating new environment at `~/Project.toml`
502

result of running nvprof on scratch.jl:

(base) au@a1:~$ nvprof --profile-from-start off julia /mnt/evo512/insync/Software_a1/testCUDA/scratch.jl 
 Activating new environment at `~/~/Project.toml`
==275468== NVPROF is profiling process 275468, command: julia /mnt/evo512/insync/Software_a1/testCUDA/scratch.jl
==275468== Profiling application: julia /mnt/evo512/insync/Software_a1/testCUDA/scratch.jl
==275468== Profiling result:
No kernels were profiled.
No API activities were profiled.

The expected result is something along the lines of the output shown in the CUDA.jl documentation's "Introduction to profiling":

==2574== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  247.61ms         1  247.61ms  247.61ms  247.61ms  ptxcall_gpu_add1__1
      API calls:   99.54%  247.83ms         1  247.83ms  247.83ms  247.83ms  cuEventSynchronize
                    0.46%  1.1343ms         1  1.1343ms  1.1343ms  1.1343ms  cuLaunchKernel
                    0.00%  4.9490us         1  4.9490us  4.9490us  4.9490us  cuEventRecord
                    0.00%  4.4190us         1  4.4190us  4.4190us  4.4190us  cuEventCreate
                    0.00%     960ns         2     480ns     358ns     602ns  cuCtxGetCurrent

maleadt commented 4 years ago

If you use --profile-from-start off, you need to activate the profiler again from within the script using CUDA.@profile.
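
A minimal sketch of what that could look like in a script, assuming a CUDA.jl release from around the time of this issue (v1.x), where CUDA.@profile starts and stops the external profiler around the wrapped code. The kernel add1_kernel! is a hypothetical stand-in and is not part of the original script. Note also that scratch.jl as posted calls mapreduce on the host array A, so mapreduce_gpu is never invoked and there would be no kernel launch to record; the sketch launches a kernel explicitly.

```julia
using CUDA

# Hypothetical example kernel (not from the original script):
# a single block of threads strides over the array and adds 1 to each element.
function add1_kernel!(a)
    i = threadIdx().x
    while i <= length(a)
        @inbounds a[i] += 1
        i += blockDim().x
    end
    return
end

dA = CuArray(rand(1:10, 100))

# Warm-up launch outside the profiled region, so that kernel compilation
# does not end up in the recorded trace.
CUDA.@sync @cuda threads=64 add1_kernel!(dA)

# With `nvprof --profile-from-start off`, only the work inside
# CUDA.@profile is recorded by the profiler.
CUDA.@profile begin
    CUDA.@sync @cuda threads=64 add1_kernel!(dA)
end
```

This would be run with the same command as in the report, i.e. nvprof --profile-from-start off julia followed by the script path. The behavior of CUDA.@profile may differ in later CUDA.jl releases, so it is worth checking the profiling documentation for the installed version.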