PaddleJitLab / CUDATutorial

A self-learning tutorial for CUDA High Performance Programming.
Apache License 2.0

Nvprof Cannot be used with compute capability 8.0 and higher #5

Closed: Liyulingyue closed this issue 4 months ago

Liyulingyue commented 6 months ago

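To use nvprof on Windows, its lib directory has to be added to the environment variables. In addition, nvprof cannot be used on some newer GPUs. The error message is as follows: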

D:\Codes\CUDATutorial\02_first_kernel>nvprof add.exe
======== Warning: nvprof is not supported on devices with compute capability 8.0 and higher.
                  Use NVIDIA Nsight Systems for GPU tracing and CPU sampling and NVIDIA Nsight Compute for GPU profiling.
                  Refer https://developer.nvidia.com/tools-overview for more details.

It may be worth adding instructions for using Nsight to 03_nvprof_usage.
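
The warning above amounts to a rule keyed on compute capability. A minimal sketch of that rule, assuming a bash-like shell (for example Git Bash on Windows); the helper name `pick_profiler` is hypothetical and not part of the tutorial, and the `nvidia-smi` query in the final comment is an assumption that only holds on drivers recent enough to report `compute_cap`:

```shell
#!/usr/bin/env bash
# Hypothetical helper: choose a profiler from a compute capability string such as "8.6".
# nvprof warns that it is unsupported at compute capability 8.0 and higher,
# so this sketch switches to Nsight Compute (ncu) from major version 8 onward.
pick_profiler() {
    local cc="$1"
    local major="${cc%%.*}"   # text before the first dot, e.g. "8" from "8.6"
    if [ "$major" -ge 8 ]; then
        echo "ncu"
    else
        echo "nvprof"
    fi
}

pick_profiler 8.6   # prints: ncu
pick_profiler 7.5   # prints: nvprof

# Assumption: on recent drivers the capability can be read from nvidia-smi, e.g.
#   pick_profiler "$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)"
```

Only kernel-level profiling is covered here; the warning itself points to Nsight Systems for GPU tracing and CPU sampling.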

AndSonder commented 6 months ago

@Liyulingyue Could you check whether nv-nsight-cu-cli works properly on Windows? On Linux, nv-nsight-cu-cli can be used as a replacement for nvprof.
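
Newer Nsight Compute releases name the command-line tool `ncu`, while older ones use `nv-nsight-cu-cli`, so which name is available depends on the installed version. A minimal sketch for detecting whichever is on `PATH`, assuming a bash-like shell; the helper name `find_ncu` is hypothetical, and the snippet only prints the command it would run rather than launching the profiler:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first Nsight Compute CLI name found on PATH.
find_ncu() {
    local name
    for name in ncu nv-nsight-cu-cli; do
        if command -v "$name" >/dev/null 2>&1; then
            echo "$name"
            return 0
        fi
    done
    echo "Nsight Compute CLI not found on PATH" >&2
    return 1
}

# Example: show the profiling command for ./vector_add when a CLI is available.
if cli="$(find_ncu)"; then
    echo "would run: $cli --print-summary=per-kernel ./vector_add"
fi
```

If neither name is found, the install directory (such as the versioned path under /opt/nvidia/nsight-compute on Linux) may simply be missing from `PATH`.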

AndSonder commented 6 months ago

Here is a Linux command for Nsight Compute, for reference:

sudo /opt/nvidia/nsight-compute/2023.2.2/ncu --target-processes all --print-summary=per-kernel ./vector_add
==PROF== Profiling "add_kernel" - 0: 0%....50%....100% - 9 passes
out[0] = 3.000
out[1] = 3.000
out[2] = 3.000
out[3] = 3.000
out[4] = 3.000
out[5] = 3.000
out[6] = 3.000
out[7] = 3.000
out[8] = 3.000
out[9] = 3.000
==PROF== Disconnected from process 2446176
[2446176] vector_add@127.0.0.1
  add_kernel(float *, float *, float *, int) (1, 1, 1)x(1, 1, 1), Device 0, CC 8.6, Invocations 1
    Section: GPU Speed Of Light Throughput
    ----------------------- ------------- ---------------- ---------------- ----------------
    Metric Name               Metric Unit          Minimum          Maximum          Average
    ----------------------- ------------- ---------------- ---------------- ----------------
    DRAM Frequency          cycle/nsecond             9.49             9.49             9.49
    SM Frequency            cycle/nsecond             1.39             1.39             1.39
    Elapsed Cycles                  cycle 1,022,671,264.00 1,022,671,264.00 1,022,671,264.00
    Memory Throughput                   %             0.07             0.07             0.07
    DRAM Throughput                     %             0.02             0.02             0.02
    Duration                      msecond           733.12           733.12           733.12
    L1/TEX Cache Throughput             %             5.87             5.87             5.87
    L2 Cache Throughput                 %             0.06             0.06             0.06
    SM Active Cycles                cycle    12,471,523.54    12,471,523.54    12,471,523.54
    Compute (SM) Throughput             %             0.07             0.07             0.07
    ----------------------- ------------- ---------------- ---------------- ----------------

    Section: Launch Statistics
    -------------------------------- --------------- ------- ------- -------
    Metric Name                          Metric Unit Minimum Maximum Average
    -------------------------------- --------------- ------- ------- -------
    Block Size                                          1.00    1.00    1.00
    Grid Size                                           1.00    1.00    1.00
    Registers Per Thread             register/thread   22.00   22.00   22.00
    Shared Memory Configuration Size           Kbyte   16.38   16.38   16.38
    Driver Shared Memory Per Block       Kbyte/block    1.02    1.02    1.02
    Dynamic Shared Memory Per Block       byte/block    0.00    0.00    0.00
    Static Shared Memory Per Block        byte/block    0.00    0.00    0.00
    Threads                                   thread    1.00    1.00    1.00
    Waves Per SM                                        0.00    0.00    0.00
    -------------------------------- --------------- ------- ------- -------

    Section: Occupancy
    ------------------------------- ----------- ------- ------- -------
    Metric Name                     Metric Unit Minimum Maximum Average
    ------------------------------- ----------- ------- ------- -------
    Block Limit SM                        block   16.00   16.00   16.00
    Block Limit Registers                 block   84.00   84.00   84.00
    Block Limit Shared Mem                block   16.00   16.00   16.00
    Block Limit Warps                     block   48.00   48.00   48.00
    Theoretical Active Warps per SM        warp   16.00   16.00   16.00
    Theoretical Occupancy                     %   33.33   33.33   33.33
    Achieved Occupancy                        %    2.08    2.08    2.08
    Achieved Active Warps Per SM           warp    1.00    1.00    1.00
    ------------------------------- ----------- ------- ------- -------

  Note: The shown averages are calculated as the arithmetic mean of the metric values after the evaluation of the    
  metrics for each individual kernel launch.                                                                         
  If aggregating across varying launch configurations (like shared memory, cache config settings), the arithmetic    
  mean can be misleading and looking at the individual results is recommended instead.                               
  This output mode is backwards compatible to the per-kernel summary output of nvprof
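
The occupancy figures in the report can be reproduced by hand. The kernel is launched as (1, 1, 1)x(1, 1, 1), a single thread, so each block holds one warp. A minimal sketch of the arithmetic, assuming the maximum of 48 warps per SM can be read off the "Block Limit Warps" row (an inference from the report, not something it states directly); the helper name `occupancy_pct` is hypothetical:

```shell
#!/usr/bin/env bash
# Values copied from the Occupancy section of the report above.
max_warps_per_sm=48     # Block Limit Warps, assuming one warp per block
theoretical_warps=16    # Theoretical Active Warps per SM
achieved_warps=1        # Achieved Active Warps Per SM

# Hypothetical helper: occupancy as a percentage with two decimals.
occupancy_pct() {
    awk -v active="$1" -v max="$2" 'BEGIN { printf "%.2f\n", 100 * active / max }'
}

occupancy_pct "$theoretical_warps" "$max_warps_per_sm"   # prints: 33.33
occupancy_pct "$achieved_warps" "$max_warps_per_sm"      # prints: 2.08
```

Both results match the "Theoretical Occupancy" and "Achieved Occupancy" rows, which suggests the low throughput numbers come from the one-thread launch configuration rather than from the profiler.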

Aurelius84 commented 5 months ago

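Thanks to @Liyulingyue for the feedback. We will follow up with work targeted at Windows environment setup and expand the documentation accordingly. Thanks also to @Liyulingyue for the PR contributions on the Windows environment 🍻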