Closed (Liyulingyue closed 4 months ago)
@Liyulingyue Could you help check whether nv-nsight-cu-cli works properly on Windows? On Linux, nv-nsight-cu-cli can be used as a replacement for nvprof.
Here is an Nsight Compute command for Linux, for reference:
sudo /opt/nvidia/nsight-compute/2023.2.2/ncu --target-processes all --print-summary=per-kernel ./vector_add
==PROF== Profiling "add_kernel" - 0:
0%....50%....100% - 9 passes
out[0] = 3.000
out[1] = 3.000
out[2] = 3.000
out[3] = 3.000
out[4] = 3.000
out[5] = 3.000
out[6] = 3.000
out[7] = 3.000
out[8] = 3.000
out[9] = 3.000
==PROF== Disconnected from process 2446176
[2446176] vector_add@127.0.0.1
add_kernel(float *, float *, float *, int) (1, 1, 1)x(1, 1, 1), Device 0, CC 8.6, Invocations 1
Section: GPU Speed Of Light Throughput
----------------------- ------------- ---------------- ---------------- ----------------
Metric Name Metric Unit Minimum Maximum Average
----------------------- ------------- ---------------- ---------------- ----------------
DRAM Frequency cycle/nsecond 9.49 9.49 9.49
SM Frequency cycle/nsecond 1.39 1.39 1.39
Elapsed Cycles cycle 1,022,671,264.00 1,022,671,264.00 1,022,671,264.00
Memory Throughput % 0.07 0.07 0.07
DRAM Throughput % 0.02 0.02 0.02
Duration msecond 733.12 733.12 733.12
L1/TEX Cache Throughput % 5.87 5.87 5.87
L2 Cache Throughput % 0.06 0.06 0.06
SM Active Cycles cycle 12,471,523.54 12,471,523.54 12,471,523.54
Compute (SM) Throughput % 0.07 0.07 0.07
----------------------- ------------- ---------------- ---------------- ----------------
Section: Launch Statistics
-------------------------------- --------------- ------- ------- -------
Metric Name Metric Unit Minimum Maximum Average
-------------------------------- --------------- ------- ------- -------
Block Size 1.00 1.00 1.00
Grid Size 1.00 1.00 1.00
Registers Per Thread register/thread 22.00 22.00 22.00
Shared Memory Configuration Size Kbyte 16.38 16.38 16.38
Driver Shared Memory Per Block Kbyte/block 1.02 1.02 1.02
Dynamic Shared Memory Per Block byte/block 0.00 0.00 0.00
Static Shared Memory Per Block byte/block 0.00 0.00 0.00
Threads thread 1.00 1.00 1.00
Waves Per SM 0.00 0.00 0.00
-------------------------------- --------------- ------- ------- -------
Section: Occupancy
------------------------------- ----------- ------- ------- -------
Metric Name Metric Unit Minimum Maximum Average
------------------------------- ----------- ------- ------- -------
Block Limit SM block 16.00 16.00 16.00
Block Limit Registers block 84.00 84.00 84.00
Block Limit Shared Mem block 16.00 16.00 16.00
Block Limit Warps block 48.00 48.00 48.00
Theoretical Active Warps per SM warp 16.00 16.00 16.00
Theoretical Occupancy % 33.33 33.33 33.33
Achieved Occupancy % 2.08 2.08 2.08
Achieved Active Warps Per SM warp 1.00 1.00 1.00
------------------------------- ----------- ------- ------- -------
Note: The shown averages are calculated as the arithmetic mean of the metric values after the evaluation of the
metrics for each individual kernel launch.
If aggregating across varying launch configurations (like shared memory, cache config settings), the arithmetic
mean can be misleading and looking at the individual results is recommended instead.
This output mode is backwards compatible to the per-kernel summary output of nvprof
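As a quick sanity check on the Occupancy section above, here is a small Python sketch that reproduces the two occupancy percentages from the other numbers in the report. It assumes that a block with a single thread still occupies one full warp, and that the maximum number of resident warps per SM on this CC 8.6 device is 48 (which matches the "Block Limit Warps" row here). The variable and function names are made up for illustration and are not part of ncu.

```python
# Values copied from the Occupancy section of the ncu report above.
block_limits = {"SM": 16, "Registers": 84, "Shared Mem": 16, "Warps": 48}
achieved_active_warps = 1.0  # "Achieved Active Warps Per SM"

# Assumptions (not printed by ncu in this form):
warps_per_block = 1      # a 1-thread block still takes up one whole warp
max_warps_per_sm = 48    # resident-warp limit per SM assumed for CC 8.6


def occupancy_percent(active_warps, max_warps):
    """Occupancy as the percentage of the SM's warp slots that are in use."""
    return 100.0 * active_warps / max_warps


# The tightest block limit decides how many blocks can be resident on one SM.
blocks_per_sm = min(block_limits.values())
theoretical_warps = blocks_per_sm * warps_per_block

theoretical = occupancy_percent(theoretical_warps, max_warps_per_sm)
achieved = occupancy_percent(achieved_active_warps, max_warps_per_sm)

print(f"Theoretical Active Warps per SM: {theoretical_warps}")
print(f"Theoretical Occupancy: {theoretical:.2f}%")
print(f"Achieved Occupancy: {achieved:.2f}%")
```

Under these assumptions the sketch gives 16 warps, 33.33% and 2.08%, matching the report. The low achieved value follows from the (1, 1, 1)x(1, 1, 1) launch: only one warp is ever active on the SM.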
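Thanks @Liyulingyue for the feedback. We will follow up by expanding the documentation specifically for environment setup on Windows. Thanks also to @Liyulingyue for the PR contributions for the Windows environment 🍻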
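Using nvprof on Windows requires adding the lib directory to the environment variables, and nvprof does not work on some newer GPUs. The error message is as follows: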
It may be necessary to add Nsight usage instructions to 03_nvprof_usage.