GPUOpen-Tools / radeon_compute_profiler

The Radeon Compute Profiler (RCP) is a performance analysis tool that gathers data from the API run-time and GPU for OpenCL™ and ROCm/HSA applications. This information can be used by developers to discover bottlenecks in the application and to find ways to optimize the application's performance.
MIT License

Missing basic counters from full list of performance counters #30

Open daviddpruitt opened 5 years ago

daviddpruitt commented 5 years ago

I'm running the current version of RCP (5.6) on a Radeon VII. When I ask for the list of available performance counters, the list is incomplete: it only gives derived counters. The basic counters are nowhere to be found, although they are clearly needed to compute the derived ones. However, when I ask rocprofiler (also the current version), which I understand is what RCP is based on, for a list of metrics, they're all there.

rcprof -l
OpenCL performance counters:
The list of valid counters for Graphics IP v6 based graphics cards:
Wavefronts, VALUInsts, SALUInsts, VFetchInsts, SFetchInsts,
VWriteInsts, LDSInsts, GDSInsts, VALUUtilization, VALUBusy,
SALUBusy, FetchSize, WriteSize, CacheHit, MemUnitBusy,
MemUnitStalled, WriteUnitStalled, LDSBankConflict

...

HSA performance counters:
The list of valid counters for Graphics IP v8 based graphics cards:
Wavefronts, VALUInsts, SALUInsts, VFetchInsts, SFetchInsts,
VWriteInsts, FlatVMemInsts, LDSInsts, FlatLDSInsts, GDSInsts,
VALUUtilization, VALUBusy, SALUBusy, FetchSize, WriteSize,
CacheHit, MemUnitBusy, MemUnitStalled, WriteUnitStalled, LDSBankConflict

The list of valid counters for Vega based graphics cards:
Wavefronts, VALUInsts, SALUInsts, VFetchInsts, SFetchInsts,
VWriteInsts, FlatVMemInsts, LDSInsts, FlatLDSInsts, GDSInsts,
VALUUtilization, VALUBusy, SALUBusy, FetchSize, WriteSize,
L2CacheHit, MemUnitBusy, MemUnitStalled, WriteUnitStalled, LDSBankConflict

rpl_run.sh --list-basic
RPL: on '190801_110408' from '/home/ddpruitt/rocm' in '/home/ddpruitt/HIP/samples/0_Intro/square'
ROCProfiler: rc-file '/home/ddpruitt/rpl_rc.xml'
Basic HW counters:

  gpu-agent0 : GRBM_COUNT : Tie High - Count Number of Clocks
      block GRBM has 2 counters

  gpu-agent0 : GRBM_GUI_ACTIVE : The GUI is Active
      block GRBM has 2 counters

  gpu-agent0 : SQ_WAVES : Count number of waves sent to SQs. (per-simd, emulated, global)
      block SQ has 8 counters

  gpu-agent0 : SQ_INSTS_VALU : Number of VALU instructions issued. (per-simd, emulated)
      block SQ has 8 counters

  gpu-agent0 : SQ_INSTS_VMEM_WR : Number of VMEM write instructions issued (including FLAT). (per-simd, emulated)
      block SQ has 8 counters

  gpu-agent0 : SQ_INSTS_VMEM_RD : Number of VMEM read instructions issued (including FLAT). (per-simd, emulated)
      block SQ has 8 counters

  gpu-agent0 : SQ_INSTS_SALU : Number of SALU instructions issued. (per-simd, emulated)
      block SQ has 8 counters

  gpu-agent0 : SQ_INSTS_SMEM : Number of SMEM instructions issued. (per-simd, emulated)
      block SQ has 8 counters

  gpu-agent0 : SQ_INSTS_FLAT : Number of FLAT instructions issued. (per-simd, emulated)
      block SQ has 8 counters

...

rpl_run.sh --list-derived
RPL: on '190801_110411' from '/home/ddpruitt/rocm' in '/home/ddpruitt/HIP/samples/0_Intro/square'
ROCProfiler: rc-file '/home/ddpruitt/rpl_rc.xml'
Derived metrics:

  gpu-agent0 : TA_BUSY_avr : TA block is busy. Average over TA instances.
      TA_BUSY_avr = avr(TA_TA_BUSY,16)

  gpu-agent0 : TA_BUSY_max : TA block is busy. Max over TA instances.
      TA_BUSY_max = max(TA_TA_BUSY,16)

  gpu-agent0 : TA_BUSY_min : TA block is busy. Min over TA instances.
      TA_BUSY_min = min(TA_TA_BUSY,16)

  gpu-agent0 : TA_FLAT_READ_WAVEFRONTS_sum : Number of flat opcode reads processed by the TA. Sum over TA instances.
      TA_FLAT_READ_WAVEFRONTS_sum = sum(TA_FLAT_READ_WAVEFRONTS,16)

  gpu-agent0 : TA_FLAT_WRITE_WAVEFRONTS_sum : Number of flat opcode writes processed by the TA. Sum over TA instances.
      TA_FLAT_WRITE_WAVEFRONTS_sum = sum(TA_FLAT_WRITE_WAVEFRONTS,16)

  gpu-agent0 : TCC_HIT_sum : Number of cache hits. Sum over TCC instances.
      TCC_HIT_sum = sum(TCC_HIT,16)

  gpu-agent0 : TCC_MISS_sum : Number of cache misses. Sum over TCC instances.
      TCC_MISS_sum = sum(TCC_MISS,16)

  gpu-agent0 : TCC_EA_RDREQ_32B_sum : Number of 32-byte TCC/EA read requests. Sum over TCC instances.
      TCC_EA_RDREQ_32B_sum = sum(TCC_EA_RDREQ_32B,16)

  gpu-agent0 : TCC_EA_RDREQ_sum : Number of TCC/EA read requests (either 32-byte or 64-byte). Sum over TCC instances.
      TCC_EA_RDREQ_sum = sum(TCC_EA_RDREQ,16)

  gpu-agent0 : TCC_EA_WRREQ_sum : Number of transactions (either 32-byte or 64-byte) going over the TC_EA_wrreq interface. Sum over TCC instances.
      TCC_EA_WRREQ_sum = sum(TCC_EA_WRREQ,16)
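
To illustrate why the derived metrics depend on the basic counters, here is a minimal Python sketch that evaluates expressions of the form shown in the `--list-derived` output. It assumes that the second argument (16 in the listing) is the number of block instances and that each derived metric aggregates one basic-counter value per instance, which is how the descriptions ("Sum over TCC instances") read. The function names `parse_derived` and `evaluate` and the sample values are hypothetical and are not part of rocprofiler or RCP.

```python
import re
from statistics import mean

# Aggregation functions, named as they appear in the derived-metric listing.
AGGREGATORS = {"sum": sum, "avr": mean, "max": max, "min": min}

# Matches lines such as "TCC_HIT_sum = sum(TCC_HIT,16)".
EXPR_RE = re.compile(r"^\s*(\w+)\s*=\s*(\w+)\((\w+),\s*(\d+)\)\s*$")


def parse_derived(line):
    """Split a derived-metric line into (name, function, basic counter, instances)."""
    match = EXPR_RE.match(line)
    if match is None:
        raise ValueError(f"not a derived-metric expression: {line!r}")
    name, func, counter, count = match.groups()
    if func not in AGGREGATORS:
        raise ValueError(f"unknown aggregation function: {func}")
    return name, func, counter, int(count)


def evaluate(line, samples):
    """Evaluate one derived metric from per-instance basic-counter values.

    `samples` maps a basic counter name to a list with one value per
    block instance (an assumption about how the counters are laid out).
    """
    name, func, counter, count = parse_derived(line)
    values = samples[counter]  # KeyError here means the basic counter is missing
    if len(values) != count:
        raise ValueError(f"{counter}: expected {count} values, got {len(values)}")
    return name, AGGREGATORS[func](values)


if __name__ == "__main__":
    # Hypothetical per-instance readings, not real hardware data.
    samples = {"TCC_HIT": list(range(16))}
    print(evaluate("TCC_HIT_sum = sum(TCC_HIT,16)", samples))
```

In this sketch, a derived metric cannot be evaluated unless its basic counter is present in `samples`, which mirrors the point above: the basic counters must be available somewhere, even if `rcprof -l` does not list them.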