The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. THIS INFORMATION IS PROVIDED “AS IS.” AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.
© 2022 Advanced Micro Devices, Inc. All Rights Reserved.
ROCProfiler is AMD’s tooling infrastructure that provides a hardware-specific, low-level performance analysis interface for profiling and tracing GPU compute applications.
It supports profiling with metrics and traces based on perfcounters (PMC) and traces (SPM). The implementation is based on the AqlProfile HSA extension. The last API library version for ROCProfiler v1 is 8.0.0.
The library source tree:
Roctracer & Rocprofiler need to be installed in the same directory.
export CMAKE_PREFIX_PATH=<path_to_hsa-runtime_includes>:<path_to_hsa-runtime_library>
export CMAKE_BUILD_TYPE=<debug|release> # release by default
export CMAKE_DEBUG_TRACE=1 # 1 to enable debug tracing
To build with the current installed ROCM:
cd .../rocprofiler
./build.sh ## (for clean build use `-cb`)
To run the test:
cd .../rocprofiler/build
export LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH # paths to ROC profiler and other libraries
export HSA_TOOLS_LIB=librocprofiler64.so.1 # ROC profiler library loaded by HSA runtime
export ROCP_TOOL_LIB=test/librocprof-tool.so # tool library loaded by ROC profiler
export ROCP_METRICS=metrics.xml # ROC profiler metrics config file
export ROCP_INPUT=input.xml # input file for the tool library
export ROCP_OUTPUT_DIR=./ # output directory for the tool library, for metrics results file 'results.txt' and trace files
./<your_test>
Internal 'simple_convolution' test run script:
cd .../rocprofiler/build
./run.sh
export ROCPROFILER_LOG=1
export ROCPROFILER_TRACE=1
The following AMD GPU architectures are supported with ROCprofiler V1:
Note: ROCProfiler V1 tool usage documentation is available at Click Here
The first API library version for ROCProfiler v2 is 9.0.0
Note: ROCProfilerV2 is currently considered a beta version and is subject to change in future releases.
Python packages can be installed using:
pip3 install -r requirements.txt
The user has two options for building:
Option 1 (installs to the path stored in the ROCM_PATH environment variable, or to /opt/rocm if ROCM_PATH is empty):
Normal Build
./build.sh --build OR ./build.sh -b
Clean Build
./build.sh --clean-build OR ./build.sh -cb
Option 2 (the ROCM_PATH environment variable needs to be set to the current ROCm installation directory); run the following:
mkdir build && cd build
cmake -DCMAKE_PREFIX_PATH=$ROCM_PATH -DCMAKE_MODULE_PATH=$ROCM_PATH/hip/cmake -DROCPROFILER_BUILD_TESTS=1 -DROCPROFILER_BUILD_SAMPLES=1 <CMAKE_OPTIONS> ..
cmake --build . -- -j
cmake --build . -- -j doc
cmake --build . -- -j package
Optionally, run the following to install
cd build
cmake --build . -- -j install
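The prefix rule described under Option 1 (use ROCM_PATH if it is set and non-empty, otherwise /opt/rocm) can be sketched in shell as follows. The helper name resolve_rocm_path is hypothetical and is not part of build.sh; the cmake line is only printed, not executed, and reuses flags shown under Option 2.

```shell
#!/bin/sh
# Hypothetical helper (not part of build.sh): print the ROCm prefix,
# using ROCM_PATH when it is set and non-empty, otherwise /opt/rocm.
resolve_rocm_path() {
    printf '%s\n' "${ROCM_PATH:-/opt/rocm}"
}

prefix="$(resolve_rocm_path)"

# Print (do not run) a configure command for that prefix.
echo "cmake -DCMAKE_PREFIX_PATH=$prefix -DCMAKE_MODULE_PATH=$prefix/hip/cmake .."
```

Printing the command first makes it easy to confirm which installation the build will pick up before configuring.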
rocsys is a command-line utility to control a session (launch/start/stop/exit) for an application that is to be traced or profiled in a rocprofv2 context. Usage:
Launch the application with the required profiling and tracing options, giving a session identifier to be used later
rocsys --session session_name launch mpiexec -n 2 rocprofv2 -i samples/input.txt Histogram
Start a session with a given identifier created at launch
rocsys --session session_name start
Stop a session with a given identifier created at launch
rocsys --session session_name stop
Exit a session with a given identifier created at launch
rocsys --session session_name exit
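As a minimal sketch of how the control commands share one session identifier, the following prints the start, stop, and exit commands for a single session. The session name my_session and the helper rocsys_cmd are assumptions made for illustration; nothing here invokes rocsys itself.

```shell
#!/bin/sh
# SESSION is a hypothetical session name chosen for this sketch.
SESSION="${SESSION:-my_session}"

# Build the rocsys control command for one action (start, stop, or exit).
rocsys_cmd() {
    printf 'rocsys --session %s %s\n' "$SESSION" "$1"
}

# Print the control commands in the order they are listed above. They only
# make sense after the application was launched under the same session name.
for action in start stop exit; do
    rocsys_cmd "$action"
done
```

Keeping the identifier in one variable avoids a mismatch between the name given at launch and the name used by later control commands.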
rocprof currently supports both tool versions, rocprof (V1) and rocprofv2. Select one with --tool-version:
rocprof --tool-version <VERSION_REQUIRED> <rocprof/v2_options> <app_relative_path>
--tool-version 1 means it will use only rocprof V1.
--tool-version 2 means it will use only rocprofv2.
To see which version you are currently using, along with more information about the ROCm version, use the following:
rocprof --version
HW counters and derived metrics can be collected using the following option:
rocprofv2 -i samples/input.txt <app_relative_path>
Example input.txt content (details of what is needed inside input.txt are given with each feature):
pmc: SQ_WAVES GRBM_COUNT GRBM_GUI_ACTIVE SQ_INSTS_VALU
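A minimal sketch of the counter-collection workflow: write the input file shown above, then run rocprofv2 only if it is installed and an application was provided. The APP variable is a hypothetical placeholder for your own binary; the counter names are copied from the example above.

```shell
#!/bin/sh
INPUT_FILE="input.txt"
APP="${APP:-}"   # hypothetical: path to the application to profile

# Write the counter list from the example above.
cat > "$INPUT_FILE" <<'EOF'
pmc: SQ_WAVES GRBM_COUNT GRBM_GUI_ACTIVE SQ_INSTS_VALU
EOF

# Collect counters only when the tool and an application are both available.
if command -v rocprofv2 >/dev/null 2>&1 && [ -n "$APP" ]; then
    rocprofv2 -i "$INPUT_FILE" "$APP"
else
    echo "wrote $INPUT_FILE; set APP and install rocprofv2 to collect counters"
fi
```

Example invocation, assuming the sketch is saved as collect.sh: APP=./my_app sh collect.sh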
Different trace options are available while profiling an app:
HIP API & asynchronous activity tracing
rocprofv2 --hip-api <app_relative_path> ## For synchronous HIP API Activity tracing
rocprofv2 --hip-activity <app_relative_path> ## For both synchronous & asynchronous HIP API Activity tracing
rocprofv2 --hip-trace <app_relative_path> ## Same as --hip-activity, added for backward compatibility
HSA API & asynchronous activity tracing
rocprofv2 --hsa-api <app_relative_path> ## For synchronous HSA API Activity tracing
rocprofv2 --hsa-activity <app_relative_path> ## For both synchronous & asynchronous HSA API Activity tracing
rocprofv2 --hsa-trace <app_relative_path> ## Same as --hsa-activity, added for backward compatibility
Kernel dispatches tracing
rocprofv2 --kernel-trace <app_relative_path> ## Kernel Dispatch Tracing
HIP & HSA API and asynchronous activity and kernel dispatches tracing
rocprofv2 --sys-trace <app_relative_path> ## Same as combining --hip-trace & --hsa-trace & ROCtx trace
For complete usage options, run the rocprofv2 help:
rocprofv2 --help
We have a template for adding new plugins. New plugins can be written on top of rocprofv2 to support the desired output format using include/rocprofiler/v2/rocprofiler_plugins.h header file. These plugins are modular in nature and can easily be decoupled from the code based on need. Installation files:
rocprofiler-plugins_2.0.0-local_amd64.deb
rocprofiler-plugins-2.0.0-local.x86_64.rpm
Plugins may have multiple versions. The user can specify which version of a plugin to use by running the following command:
rocprofv2 --plugin <plugin_name> --plugin-version <plugin_version_required> <rocprofv2_options> <app_relative_path>
File plugin: outputs the data in txt files. The file plugin has two versions; version 2 is the current default. Usage:
rocprofv2 --plugin file -i samples/input.txt -d output_dir <app_relative_path> # -d is optional; it sets the output directory for results
File plugin version 1 output header will be similar to the legacy rocprof v1 output:
Index,KernelName,gpu-id,queue-id,queue-index,pid,tid,grd,wgr,lds,scr,arch_vgpr,accum_vgpr,sgpr,wave_size,sig,obj,DispatchNs,BeginNs,EndNs,CompleteNs,Counters
File plugin version 2 output header:
Dispatch_ID,GPU_ID,Queue_ID,PID,TID,Grid_Size,Workgroup_Size,LDS_Per_Workgroup,Scratch_Per_Workitem,Arch_VGPR,Accum_VGPR,SGPR,Wave_Size,Kernel_Name,Start_Timestamp,End_Timestamp,Correlation_ID,Counters
Perfetto plugin: outputs the data in protobuf format. Protobuf files can be viewed using ui.perfetto.dev or using trace_processor. Usage:
rocprofv2 --plugin perfetto --hsa-trace -d output_dir <app_relative_path> # -d is optional; it sets the output directory for results
CTF plugin: outputs the data in CTF format (a binary trace format). CTF binary output can be viewed using TraceCompass or babeltrace. Usage:
rocprofv2 --plugin ctf --hip-trace -d output_dir <app_relative_path> # -d is optional; it sets the output directory for results
JSON plugin: outputs a .json file. The JSON file matches the Google Trace Format, so it should load easily in perfetto, chrome tracing, or speedscope. For speedscope, the --disable-json-data-flows option is needed, as speedscope doesn't work with data flows.
Usage:
rocprofv2 --plugin json --hip-trace -d output_dir <app_relative_path>
ATT (Advanced Thread Tracer) plugin: outputs advanced hardware trace data in binary format. Please refer to the ATT section. The tool is used to collect fine-grained hardware metrics and provides ISA-level instruction hotspot analysis via hardware tracing.
rocprofv2 -i input.txt --plugin att auto --mode csv <app_relative_path>
# Or using a user-supplied ISA:
# rocprofv2 -i input.txt --plugin att <app_assembly_file> --mode csv <app_relative_path>
app_relative_path: Path to the application to run.
ATT plugin optional parameters
--att_kernel "filename": Kernel filename(s) (glob) to use. A CSV file (or UI folder) will be generated for each kernel.txt file. Default: all in current folder.
--mode [csv, file, off (default)]
input.txt: Required. Used to select specific compute units and other trace parameters. First-time users can start with one of the following input files:
# vectoradd
att: TARGET_CU=1
SE_MASK=0x1
SIMD_SELECT=0x3
# histogram
att: TARGET_CU=0
SE_MASK=0xFF
SIMD_SELECT=0xF // 0xF for GFX9, SIMD_SELECT=0 for Navi
Possible contents:
att: TARGET_CU=1 // or some other CU [0,15] - WGP for Navi [0,8]
SE_MASK=0x1 // bitmask of shader engines. The fewer, the easier on the hardware. Default enables 1 out of 4 shader engines.
SIMD_SELECT=0xF // GFX9: bitmask of SIMDs. Navi: SIMD Index [0-3]. Recommended 0xF for GFX9 and 0x0 for Navi.
DISPATCH=ID // collect trace only for the given dispatch_ID. Multiple lines can be added.
DISPATCH=ID,RN // collect trace only for the given dispatch_ID and MPI rank RN. Multiple lines with varying combinations of RN and ID can be added.
KERNEL=kernname // Profile only kernels containing the string kernname (c++ mangled name). Multiple lines can be added.
PERFCOUNTERS_CTRL=0x3 // Multiplier period for counter collection [0~31]. 0=fastest. GFX9 only.
PERFCOUNTER_MASK=0xFFF // Bitmask for perfcounter collection. GFX9 only.
PERFCOUNTER=counter_name // Add a SQ counter to be collected with ATT; period defined by PERFCOUNTERS_CTRL. GFX9 only.
BUFFER_SIZE=[size] // Sets size of the ATT buffer collection, per dispatch, in megabytes (shared among all shader engines).
ISA_CAPTURE_MODE=[0,1,2] // Set codeobj capture mode during kernel dispatch.
DISPATCH_RANGE=[begin],[end] // Continuously collect ATT data starting at "begin" and stop at "end". Alternative to DISPATCH= and KERNEL=.
By default, kernel names are truncated for ATT. To disable, please see the kernel name truncation section below.
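A minimal sketch of preparing an ATT input file from the parameters above: it writes the vectoradd starter settings and appends one KERNEL= filter line. The file name att_input.txt, the helper add_kernel_filter, and the filter string vectoradd are assumptions for illustration. The command is only printed, not executed.

```shell
#!/bin/sh
ATT_INPUT="att_input.txt"   # hypothetical file name

# Starter parameters taken from the vectoradd example above.
cat > "$ATT_INPUT" <<'EOF'
att: TARGET_CU=1
SE_MASK=0x1
SIMD_SELECT=0x3
EOF

# Hypothetical helper: append a KERNEL= filter line (one line per filter).
add_kernel_filter() {
    printf 'KERNEL=%s\n' "$1" >> "$ATT_INPUT"
}

add_kernel_filter "vectoradd"   # assumed substring of the kernel name

# Print (do not run) the matching collection command.
echo "rocprofv2 -i $ATT_INPUT --plugin att auto --mode csv <app_relative_path>"
```

Generating the file from a script makes it easier to keep per-application variants, such as the histogram settings, side by side.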
Example for vectoradd.
# -g adds debugging symbols to the binary. Required only for tracking disassembly back to c++.
hipcc -g vectoradd_hip.cpp -o vectoradd_hip.exe
# "auto" means to use the automatically captured ISA, e.g. vectoradd_float_v0_isa.s dumped along with .att files.
# "--mode csv" dumps the result to "att_output_vectoradd_float_v0.csv".
rocprofv2 -i input.txt --plugin att auto --mode csv ./vectoradd_hip.exe
# Alternatively, using --save-temps to generate the ISA
hipcc -g --save-temps vectoradd_hip.cpp -o vectoradd_hip.exe
# Replace "auto" with <generated_gpu_isa.s> for user-supplied ISA. Typically they match the wildcards *amdgcn-amd-amdhsa*.s.
# Special attention to the correct architecture for the ISA, such as "gfx1100" (navi31).
rocprofv2 -i input.txt --plugin att vectoradd_hip-hip-amdgcn-amd-amdhsa-gfx1100.s --mode csv ./vectoradd_hip.exe
Instruction latencies will be in att_output_vectoradd_float_v0.csv
# Use the -d option to specify the generated data directory, and -o to specify the directory and filename of the csv:
rocprofv2 -d mydir -o test/mycsv -i input.txt --plugin att auto --mode csv ./vectoradd_hip.exe
# Generates raw files inside mydir/ and the parsed data in test/mycsv_vectoradd_float_v0.csv
Note: For MPI or long-running applications, we recommend running the collection first and running the parser later on the already collected data. To run only the collection (the assembly file is not used; use mpirun [...] rocprofv2 [...] if needed):
# Run only data collection, not the parser
rocprofv2 -i input.txt --plugin att auto --mode off ./vectoradd_hip.exe
To run only the parser, remove the binary/application from the command line:
# Only runs the parser on previously collected data.
rocprofv2 -i input.txt --plugin att auto --mode csv
Note 2: By default, ATT only collects a SINGLE kernel dispatch for the whole application, which is the first dispatch matching the given filters (DISPATCH=
export ROCPROFILER_MAX_ATT_PROFILES=<max_collections>
Or, alternatively, use the continuous ATT mode (DISPATCH_RANGE parameter).
Flush interval can be used to control the time interval, in milliseconds, between buffer flushes for the tool. However, if the buffers fill up, a flush is triggered automatically. It can be used as in the following example:
rocprofv2 --flush-interval <TIME_INTERVAL_IN_MILLISECONDS> <rest_of_rocprofv2_arguments> <app_relative_path>
Trace period can be used to control when profiling or tracing is enabled, using two arguments. The first is the delay time, which is the time spent idle without tracing or profiling. The second is the active time, during which profiling and tracing are running. The session therefore follows this timeline:
<DELAY_TIME> => <PROFILING_OR_TRACING_SESSION_START> => <ACTIVE_PROFILING_OR_TRACING_TIME> => <PROFILING_OR_TRACING_SESSION_STOP>
This feature can be enabled with the following command:
rocprofv2 --trace-period <delay>:<active_time>:<interval> <rest_of_rocprofv2_arguments> <app_relative_path>
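A minimal sketch for assembling the --trace-period argument in the colon-separated form shown above. The helper trace_period_arg is hypothetical, and the check that each field is a non-negative integer is an assumption of this sketch rather than documented rocprofv2 behavior. The time unit is not stated here, so the values below are placeholders; consult rocprofv2 --help.

```shell
#!/bin/sh
# Hypothetical helper: validate three non-negative integers and join them
# in the <delay>:<active_time>:<interval> form.
trace_period_arg() {
    for v in "$1" "$2" "$3"; do
        case "$v" in
            ''|*[!0-9]*) echo "invalid value: '$v'" >&2; return 1 ;;
        esac
    done
    printf '%s:%s:%s\n' "$1" "$2" "$3"
}

# Placeholder values only.
period="$(trace_period_arg 1000 2000 5000)"

# Print (do not run) the resulting command line.
echo "rocprofv2 --trace-period $period <rest_of_rocprofv2_arguments> <app_relative_path>"
```

Validating the fields up front catches a malformed argument before a long-running application is launched.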
A device profiling session allows the user to profile the GPU device for counters irrespective of the applications running on the GPU. This differs from application profiling: a device profiling session does not consider the processes and threads running on the host, and directly provides low-level profiling information.
A session is a unique identifier for a profiling/tracing/pc-sampling task. A ROCProfilerV2 session has enough information about what needs to be collected or traced, and it allows the user to start/stop profiling/tracing whenever required. More details on the API can be found in the API specification documentation, which can be installed using the rocprofiler-doc package. Samples showing how to use the API can be found in the samples directory.
We make use of the GoogleTest (Gtest) framework to automatically find and add test cases to the CMake testing environment. ROCProfilerV2 testing is categorized as follows:
unittests (Gtest based): These include tests for core classes. Any newly added functionality should have a unit test written for it.
featuretests (standalone and Gtest based): These include both API tests and tool tests. The tool is tested against different applications to make sure the output is correct in every run.
memorytests (standalone): These include running the address sanitizer to detect memory leaks and corruption.
Installation files:
rocprofiler-tests_9.0.0-local_amd64.deb
rocprofiler-tests-9.0.0-local.x86_64.rpm
./build/tests/unittests/runUnitTests
./build/tests/featuretests/profiler/runFeatureTests
./build/tests/featuretests/tracer/runTracerFeatureTests
rocprofv2 -t
OR
ctest
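A minimal sketch of a wrapper that runs the three test binaries listed above in sequence, skipping any that have not been built, and reports how many failed. The function run_tests is hypothetical and is not shipped with ROCProfiler. The paths are those given above, relative to the source root.

```shell
#!/bin/sh
# Hypothetical wrapper: run each test binary that exists and count failures.
run_tests() {
    failed=0
    for t in \
        ./build/tests/unittests/runUnitTests \
        ./build/tests/featuretests/profiler/runFeatureTests \
        ./build/tests/featuretests/tracer/runTracerFeatureTests
    do
        if [ -x "$t" ]; then
            "$t" || failed=$((failed + 1))
        else
            echo "skip: $t (not built)"
        fi
    done
    echo "failed test binaries: $failed"
    return "$failed"
}

run_tests
```

The return status equals the number of failing binaries, so the wrapper can be used directly in a CI step.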
To enable error messages logging to '/tmp/rocprofiler_log.txt':
export ROCPROFILER_LOG=1
By default kernel names are not truncated. To enable truncation for readability:
export ROCPROFILER_TRUNCATE_KERNEL_PATH=1
We make use of doxygen to automatically generate API documentation. The generated document can be found in the following path:
ROCM_PATH is /opt/rocm by default. The user can set it to a different location if needed.