GPUOpen-Tools / radeon_gpu_detective

Tool for post-mortem analysis of GPU crashes.
MIT License

radeon_gpu_detective

RGD is a tool for post-mortem analysis of GPU crashes.

The tool performs offline processing of AMD GPU crash dump files and generates crash analysis reports in text and JSON formats.

To generate AMD GPU crash dumps for your application, use Radeon Developer Panel (RDP) and follow the Crash Analysis help manual.

Build Instructions

We recommend building the tool with the "pre_build.py" script, which is located in the "build" subdirectory.

Steps:

cd build

python pre_build.py

The script supports options such as selecting a different MSVC toolset version. For the full list of options, run the script with -h.

By default, a solution is generated for VS 2022. To generate a solution for a different VS version, or to use a different MSVC toolchain, use the --vs argument. For example, to generate the solution for VS 2019 with the VS 2019 toolchain (MSVC 16), run:

python pre_build.py --vs 2019

Running

Basic usage (text output):

rgd --parse <input .rgd crash dump file> -o <output text file>

Basic usage (JSON output):

rgd --parse <input .rgd crash dump file> --json <output JSON file>

For more options, run rgd -h to print the help manual.
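When many crash dumps need processing, the two invocations above can be scripted. The following is a minimal sketch, not part of RGD itself: the helper names build_rgd_command and run_rgd are hypothetical, it assumes the rgd executable is on PATH, and it uses only the --parse, -o and --json options shown above. Whether -o and --json can be combined in a single invocation is not stated here, so the sketch requests one output per call.

```python
import subprocess


def build_rgd_command(dump_file, output_file, as_json=False, rgd_exe="rgd"):
    """Build the argument list for one rgd invocation.

    Hypothetical helper: it mirrors the two basic usages shown above,
    '-o' for text output and '--json' for JSON output.
    """
    output_flag = "--json" if as_json else "-o"
    return [rgd_exe, "--parse", str(dump_file), output_flag, str(output_file)]


def run_rgd(dump_file, output_file, as_json=False, rgd_exe="rgd"):
    """Run rgd and raise subprocess.CalledProcessError on a non-zero exit code."""
    cmd = build_rgd_command(dump_file, output_file, as_json, rgd_exe)
    return subprocess.run(cmd, check=True)


# Example (requires rgd and a real crash dump, so it is left commented out):
# run_rgd("crash.rgd", "crash_report.txt")
# run_rgd("crash.rgd", "crash_report.json", as_json=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when paths contain spaces.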

Usage

The rgd command line tool accepts AMD driver GPU crash dump files (.rgd files) as input and generates crash analysis report files with summarized information that can assist in debugging GPU crashes.

The basic usage is:

rgd --parse <full path to input .rgd file> -o <full path to text output file>

The rgd command line tool's crash analysis output files include the following information by default:

Both text and JSON output files include the same information, in different representations. For simplicity, we will refer here to the human-readable textual output. Here are some more details about the crash analysis file's contents:

Crash Analysis File Information

System Information

This section is titled SYSTEM INFO and includes information about:

Markers in Progress

This section is titled MARKERS IN PROGRESS and lists only the execution markers that were in progress at the time of the crash, for each command buffer that was determined to be in flight. Here is the matching output for the tree below (see EXECUTION MARKER TREE):

Command Buffer ID: 0x2e
=======================
Frame 268 CL0/DownSamplePS/CmdDraw
Frame 268 CL0/DownSamplePS/CmdDraw
Frame 268 CL0/DownSamplePS/CmdDraw
Frame 268 CL0/DownSamplePS/CmdDraw
Frame 268 CL0/DownSamplePS/CmdDraw
Frame 268 CL0/Bloom/BlurPS/CmdDraw

Note that marker hierarchy is denoted by "/".
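Because each line is a "/"-separated path, this section is easy to post-process. The sketch below is an illustration only: the function name parse_markers_in_progress is hypothetical, and it assumes the layout shown in the sample above (a "Command Buffer ID:" header, a row of "=" characters, then one path per line). It cannot tell a "/" inside a user marker string from a hierarchy separator.

```python
from collections import Counter

SAMPLE = """\
Command Buffer ID: 0x2e
=======================
Frame 268 CL0/DownSamplePS/CmdDraw
Frame 268 CL0/DownSamplePS/CmdDraw
Frame 268 CL0/DownSamplePS/CmdDraw
Frame 268 CL0/DownSamplePS/CmdDraw
Frame 268 CL0/DownSamplePS/CmdDraw
Frame 268 CL0/Bloom/BlurPS/CmdDraw
"""


def parse_markers_in_progress(text):
    """Map each command buffer ID to a Counter of marker paths (as tuples)."""
    result = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line or set(line) == {"="}:
            continue  # skip blank lines and the '=====' separator
        if line.startswith("Command Buffer ID:"):
            current = line.split(":", 1)[1].strip()
            result[current] = Counter()
        elif current is not None:
            # Split the path on '/' and count repeated occurrences.
            result[current][tuple(line.split("/"))] += 1
    return result


for buffer_id, paths in parse_markers_in_progress(SAMPLE).items():
    for path, count in paths.items():
        print(buffer_id, " > ".join(path), f"x{count}")
```

Counting repeated paths makes it easier to spot which pass had several draws in flight.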

Execution Marker Tree

This section is titled EXECUTION MARKER TREE and contains a tree describing the marker status for each command buffer that was determined to be in flight during the crash.

User-provided marker strings will be wrapped in "double quotes". Here is an example marker tree:

Command Buffer ID: 0x2e
=======================
[>] "Frame 268 CL0"
 ├─[X] "Depth + Normal + Motion Vector PrePass"
 ├─[X] "Shadow Cascade Pass"
 ├─[X] "TLAS Build"
 ├─[X] "Classify tiles"
 ├─[X] "Trace shadows"
 ├─[X] "Denoise shadows"
 ├─[X] CmdDispatch
 ├─[X] CmdDispatch
 ├─[X] "GltfPbrPass::DrawBatchList"
 ├─[X] "Skydome Proc"
 ├─[X] "GltfPbrPass::DrawBatchList"
 ├─[>] "DownSamplePS"
 └─[>] "Bloom"
    ├─[>] "BlurPS"
    │  ├─[>] CmdDraw
    │  └─[ ] CmdDraw
    ├─[ ] CmdDraw
    ├─[ ] "BlurPS"
    ├─[ ] CmdDraw
    ├─[ ] "BlurPS"
    ├─[ ] CmdDraw
    ├─[ ] "BlurPS"
    ├─[ ] CmdDraw
    ├─[ ] "BlurPS"
    └─[ ] CmdDraw
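The tree text can also be walked programmatically. The sketch below is a rough illustration, not RGD's own logic: the names parse_marker_tree and in_progress_leaves are hypothetical, nesting is inferred from the column of the status bracket as laid out in the example above, and "[>]" is assumed to mark a marker that was in progress (which is how the MARKERS IN PROGRESS sample lines up with this tree). It reports only what the tree text contains, using a shortened copy of the example as input.

```python
import re

# Tree-drawing prefix, one-character status, then the marker name.
NODE_RE = re.compile(r"^(?P<prefix>[^\[]*)\[(?P<status>.)\]\s+(?P<name>.+)$")

SAMPLE_TREE = """\
[>] "Frame 268 CL0"
 ├─[X] "Shadow Cascade Pass"
 ├─[>] "DownSamplePS"
 └─[>] "Bloom"
    ├─[>] "BlurPS"
    │  ├─[>] CmdDraw
    │  └─[ ] CmdDraw
    └─[ ] CmdDraw
"""


def parse_marker_tree(text):
    """Return (status, path) pairs in document order; path is a tuple of names."""
    nodes = []
    stack = []  # (column of '[', marker name) for each open ancestor
    for line in text.splitlines():
        match = NODE_RE.match(line)
        if match is None:
            continue  # header, separator or blank line
        column = len(match["prefix"])
        name = match["name"].strip().strip('"')
        while stack and stack[-1][0] >= column:
            stack.pop()  # close siblings and deeper nodes
        stack.append((column, name))
        nodes.append((match["status"], tuple(n for _, n in stack)))
    return nodes


def in_progress_leaves(nodes):
    """Keep '[>]' nodes that have no '[>]' node inside their own subtree."""
    leaves = []
    for i, (status, path) in enumerate(nodes):
        if status != ">":
            continue
        has_active_child = False
        for child_status, child_path in nodes[i + 1:]:
            if len(child_path) <= len(path):
                break  # left this node's subtree
            if child_status == ">":
                has_active_child = True
                break
        if not has_active_child:
            leaves.append(path)
    return leaves


for path in in_progress_leaves(parse_marker_tree(SAMPLE_TREE)):
    print("/".join(path))
```

Tracking the bracket column rather than counting tree-drawing characters keeps the parser independent of the exact glyphs used for the branches.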

Configuring the Execution Marker Output with RGD CLI

RGD CLI exposes a few options that impact how the marker tree is generated:

Configuring the Page Fault Summary with RGD CLI

Page Fault Summary

If the crash was determined to be caused by a page fault, a section titled PAGE FAULT SUMMARY will include useful details about the page fault such as:

A note about time tracking:

The general time format used by RGD is <hh:mm:ss.clks>, which stands for <hours, minutes, seconds, CPU clks>. The beginning of time (00:00:00.00) is when the crash analysis session started. Note that there is an expected lag between the start of the crashing process and the beginning of the crash analysis session, due to the time it takes to initialize crash analysis in the driver.
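To compare timestamps taken from a report, they can be converted to seconds since the start of the crash analysis session. This is a minimal sketch under stated assumptions: the name parse_rgd_timestamp is hypothetical, the pattern accepts either "." or ":" before the clks field, and the clks value is returned as a raw integer because converting it to a fraction of a second would need a clock frequency that is not given here.

```python
import re

# Assumption: hours, minutes, seconds, then '.' or ':' and the clks field.
TIMESTAMP_RE = re.compile(r"^(\d+):(\d{2}):(\d{2})[.:](\d+)$")


def parse_rgd_timestamp(stamp):
    """Return (whole seconds since session start, raw clks field)."""
    match = TIMESTAMP_RE.match(stamp.strip().strip("<>"))
    if match is None:
        raise ValueError(f"unrecognized timestamp: {stamp!r}")
    hours, minutes, seconds, clks = (int(group) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds, clks


print(parse_rgd_timestamp("00:00:00.00"))
print(parse_rgd_timestamp("00:01:05.12"))
```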

Capturing AMD GPU Crash Dump Files