$ omnitrace-run --help
[omnitrace-run] Usage: ./bin/omnitrace-run [ --help (count: 0, dtype: bool)
--version (count: 0, dtype: bool)
--monochrome (max: 1, dtype: bool)
--debug (max: 1, dtype: bool)
--verbose (count: 1, dtype: integral)
--ci (min: 0, dtype: boolean)
--dl-verbose (min: 1, dtype: integral)
--perfetto-annotations (min: 0, dtype: boolean)
--critical-trace-debug (min: 0, dtype: boolean)
--kokkosp-kernel-logger (min: 0, dtype: boolean)
--kokkosp-prefix (min: 0, dtype: string)
--sampling-allocator-size (min: 1, dtype: integral)
--kokkosp-name-length-max (min: 1, dtype: integral)
--critical-trace-serialize-names (min: 0, dtype: boolean)
--config (min: 1, dtype: filepath)
--output (min: 1, dtype: path [prefix])
--trace (max: 1, dtype: bool)
--profile (max: 1, dtype: bool)
--flat-profile (max: 1, dtype: bool)
--sample (min: 0, dtype: timer-type)
--host (max: 1, dtype: bool)
--device (max: 1, dtype: bool)
--wait (count: 1, dtype: seconds)
--duration (count: 1, dtype: seconds)
--periods (min: 1, dtype: period-spec(s))
--include (min: 1, dtype: [backend...])
--exclude (min: 1, dtype: [backend...])
--mode (min: 1, dtype: string)
--use-causal (min: 0, dtype: boolean)
--use-kokkosp (min: 0, dtype: boolean)
--use-mpip (min: 0, dtype: boolean)
--use-roctx (min: 0, dtype: boolean)
--critical-trace (min: 0, dtype: boolean)
--use-code-coverage (min: 0, dtype: boolean)
--use-perfetto (min: 0, dtype: boolean)
--use-process-sampling (min: 0, dtype: boolean)
--use-rcclp (min: 0, dtype: boolean)
--use-rocm-smi (min: 0, dtype: boolean)
--use-rocprofiler (min: 0, dtype: boolean)
--use-roctracer (min: 0, dtype: boolean)
--use-sampling (min: 0, dtype: boolean)
--use-timemory (min: 0, dtype: boolean)
--trace-thread-barriers (min: 0, dtype: boolean)
--trace-thread-join (min: 0, dtype: boolean)
--trace-thread-locks (min: 0, dtype: boolean)
--trace-thread-rw-locks (min: 0, dtype: boolean)
--trace-thread-spin-locks (min: 0, dtype: boolean)
--thread-pool-size (min: 1, dtype: integral)
--num-threads-hint (min: 1, dtype: integral)
--trace-file (count: 1, dtype: filepath)
--trace-buffer-size (count: 1, dtype: KB)
--trace-fill-policy (count: 1, dtype: policy)
--trace-wait (count: 1, dtype: seconds)
--trace-duration (count: 1, dtype: seconds)
--trace-periods (min: 1, dtype: period-spec(s))
--trace-clock-id (count: 1, dtype: clock-id)
--profile-format (min: 1, dtype: string)
--profile-diff (min: 1, dtype: path [prefix])
--process-freq (count: 1, dtype: floating-point)
--process-wait (count: 1, dtype: seconds)
--process-duration (count: 1, dtype: seconds)
--cpus (count: unlimited, dtype: int and/or range)
--gpus (count: unlimited, dtype: int and/or range)
--sampling-freq (count: 1, dtype: floating-point)
--tids (min: 1, dtype: int and/or range)
--sampling-wait (count: 1, dtype: seconds)
--sampling-duration (count: 1, dtype: seconds)
--sample-cputime (min: 0, dtype: [freq] [delay] [tids...])
--sample-realtime (min: 0, dtype: [freq] [delay] [tids...])
--sampling-cputime-delay (min: 1, dtype: floating-point)
--sampling-cputime-freq (min: 1, dtype: floating-point)
--sampling-cputime-tids (min: 0, dtype: string)
--sampling-include-inlines (min: 0, dtype: boolean)
--sampling-keep-internal (min: 0, dtype: boolean)
--sampling-realtime-delay (min: 1, dtype: floating-point)
--sampling-realtime-freq (min: 1, dtype: floating-point)
--sampling-realtime-offset (min: 1, dtype: integral)
--sampling-realtime-tids (min: 0, dtype: string)
--cpu-events (min: 1, dtype: [EVENT ...])
--gpu-events (min: 1, dtype: [EVENT ...])
--enable-categories (min: 1, dtype: string)
--disable-categories (min: 1, dtype: string)
--tmpdir (min: 0, dtype: string)
--use-pid (min: 0, dtype: boolean)
--time-output (min: 0, dtype: boolean)
--causal-file (min: 0, dtype: string)
--causal-file-reset (min: 0, dtype: boolean)
--use-temporary-files (min: 0, dtype: boolean)
--perfetto-backend (min: 1, dtype: string)
--perfetto-roctracer-per-stream (min: 0, dtype: boolean)
--perfetto-shmem-size-hint-kb (min: 1, dtype: integral)
--timemory-components (min: 0, dtype: string)
--roctracer-hip-activity (min: 0, dtype: boolean)
--roctracer-hip-api (min: 0, dtype: boolean)
--roctracer-hsa-activity (min: 0, dtype: boolean)
--roctracer-hsa-api (min: 0, dtype: boolean)
--roctracer-hsa-api-types (min: 0, dtype: string)
--critical-trace-buffer-count (min: 1, dtype: integral)
--critical-trace-count (min: 1, dtype: integral)
--critical-trace-per-row (min: 1, dtype: integral)
--inlines (max: 1, dtype: bool)
--hsa-interrupt (count: 1, dtype: int)
--causal-binary-exclude (min: 0, dtype: string)
--causal-binary-scope (min: 0, dtype: string)
--causal-delay (min: 1, dtype: floating-point)
--causal-duration (min: 1, dtype: floating-point)
--causal-end-to-end (min: 0, dtype: boolean)
--causal-fixed-speedup (min: 0, dtype: string)
--causal-function-exclude (min: 0, dtype: string)
--causal-function-exclude-defaults (min: 0, dtype: boolean)
--causal-function-scope (min: 0, dtype: string)
--causal-mode (min: 0, dtype: string)
--causal-random-seed (min: 1, dtype: integral)
--causal-source-exclude (min: 0, dtype: string)
--causal-source-scope (min: 0, dtype: string)
]
Command line interface to omnitrace configuration.
Options:
-h, -?, --help Shows this page (count: 0, dtype: bool)
--version Prints the version and exit (count: 0, dtype: bool)
[DEBUG OPTIONS]
--monochrome Disable colorized output (max: 1, dtype: bool)
--debug Debug output (max: 1, dtype: bool)
-v, --verbose Verbose output (count: 1, dtype: integral)
--ci Enable some runtime validation checks (typically enabled for continuous integration) (min: 0, dtype: boolean)
--dl-verbose Verbosity within the omnitrace-dl library (min: 1, dtype: integral)
--perfetto-annotations Include debug annotations in perfetto trace. When enabled, this feature will encode information such as the values of
the function arguments (when available). Disabling this feature may dramatically reduce the size of the trace (min: 0,
dtype: boolean)
--critical-trace-debug Enable debugging for critical trace (min: 0, dtype: boolean)
--kokkosp-kernel-logger Enables kernel logging (min: 0, dtype: boolean)
--kokkosp-prefix Set to [kokkos] to maintain old naming convention (min: 0, dtype: string)
--sampling-allocator-size The number of sampled threads handled by an allocator running in a background thread. Each thread that is sampled
communicates with an allocator running in a background thread which handles storing/caching the data when it's buffer
is full. Setting this value too high (i.e. equal to the number of threads when the thread count is high) may cause loss
of data -- the sampler may fill a new buffer and overwrite old buffer data before the allocator can process it. Setting
this value to 1 will result in a background allocator thread for every thread started by the application. (min: 1,
dtype: integral)
--kokkosp-name-length-max Set this to a value > 0 to help avoid unnamed Kokkos Tools callbacks. Generally, unnamed callbacks are the demangled
name of the function, which is very long (min: 1, dtype: integral)
--critical-trace-serialize-names
Include names in serialization of critical trace (mainly for debugging) (min: 0, dtype: boolean)
[GENERAL OPTIONS] These are options which are ubiquitously applied
-c, --config Configuration file (min: 1, dtype: filepath)
-o, --output Output path. Accepts 1-2 parameters corresponding to the output path and the output prefix (min: 1, dtype: path
[prefix])
-T, --trace Generate a detailed trace (perfetto output) (max: 1, dtype: bool)
-P, --profile Generate a call-stack-based profile (conflicts with --flat-profile) (max: 1, dtype: bool)
-F, --flat-profile Generate a flat profile (conflicts with --profile) (max: 1, dtype: bool)
-S, --sample [ cputime | realtime ]
Enable statistical sampling of call-stack (min: 0, dtype: timer-type)
-H, --host Enable sampling host-based metrics for the process. E.g. CPU frequency, memory usage, etc. (max: 1, dtype: bool)
-D, --device Enable sampling device-based metrics for the process. E.g. GPU temperature, memory usage, etc. (max: 1, dtype: bool)
-w, --wait This option is a combination of '--trace-wait' and '--sampling-wait'. See the descriptions for those two options.
(count: 1, dtype: seconds)
-d, --duration This option is a combination of '--trace-duration' and '--sampling-duration'. See the descriptions for those two
options. (count: 1, dtype: seconds)
--periods Similar to specifying delay and/or duration except in the form <DELAY>:<DURATION>, <DELAY>:<DURATION>:<REPEAT>, and/or
<DELAY>:<DURATION>:<REPEAT>:<CLOCK_ID> (min: 1, dtype: period-spec(s))
[BACKEND OPTIONS] These options control region information captured w/o sampling or instrumentation
-I, --include [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
Include data from these backends (min: 1, dtype: [backend...])
-E, --exclude [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
Exclude data from these backends (min: 1, dtype: [backend...])
--mode [ causal | coverage | sampling | trace ]
Data collection mode. Used to set default values for OMNITRACE_USE_* options. Typically set by omnitrace binary
instrumenter. (min: 1, dtype: string)
--use-causal Enable causal profiling analysis (min: 0, dtype: boolean)
--use-kokkosp Enable support for Kokkos Tools (min: 0, dtype: boolean)
--use-mpip Enable support for MPI functions (min: 0, dtype: boolean)
--use-roctx Enable ROCtx API. Warning! Out-of-order ranges may corrupt perfetto flamegraph (min: 0, dtype: boolean)
--critical-trace Enable generation of the critical trace (min: 0, dtype: boolean)
--use-code-coverage Enable support for code coverage (min: 0, dtype: boolean)
--use-perfetto Enable perfetto backend (min: 0, dtype: boolean)
--use-process-sampling Enable a background thread which samples process-level and system metrics such as the CPU/GPU freq, power, memory
usage, etc. (min: 0, dtype: boolean)
--use-rcclp Enable support for ROCm Communication Collectives Library (RCCL) Performance (min: 0, dtype: boolean)
--use-rocm-smi Enable sampling GPU power, temp, utilization, and memory usage (min: 0, dtype: boolean)
--use-rocprofiler Enable ROCm hardware counters (min: 0, dtype: boolean)
--use-roctracer Enable ROCm API and kernel tracing (min: 0, dtype: boolean)
--use-sampling Enable statistical sampling of call-stack (min: 0, dtype: boolean)
--use-timemory Enable timemory backend (min: 0, dtype: boolean)
--trace-thread-barriers Enable tracing calls to pthread_barrier functions. (min: 0, dtype: boolean)
--trace-thread-join Enable tracing calls to pthread_join functions. (min: 0, dtype: boolean)
--trace-thread-locks Enable tracing calls to pthread_mutex_lock, pthread_mutex_unlock, pthread_mutex_trylock (min: 0, dtype: boolean)
--trace-thread-rw-locks Enable tracing calls to pthread_rwlock_* functions. May cause deadlocks with ROCm-enabled OpenMPI. (min: 0, dtype:
boolean)
--trace-thread-spin-locks Enable tracing calls to pthread_spin_* functions. May cause deadlocks with MPI distributions. (min: 0, dtype: boolean)
[PARALLELISM OPTIONS]
--thread-pool-size Max number of threads for processing background tasks (min: 1, dtype: integral)
--num-threads-hint This is hint for how many threads are expected to be created in the application. Setting this value allows omnitrace to
preallocate resources during initialization and warn about any potential issues. For example, when call-stack sampling,
each thread has a unique sampler instance which communicates with an allocator instance running in a background thread.
Each allocator only handles N sampling instances (where N is the value of OMNITRACE_SAMPLING_ALLOCATOR_SIZE). When this
hint is set to >= the number of threads that get sampled, omnitrace can start all the background threads during
initialization (min: 1, dtype: integral)
[TRACING OPTIONS] Specific options controlling tracing (i.e. deterministic measurements of every event)
--trace-file Specify the trace output filename. Relative filepath will be with respect to output path and output prefix. (count: 1,
dtype: filepath)
--trace-buffer-size Size limit for the trace output (in KB) (count: 1, dtype: KB)
--trace-fill-policy [ discard | ring_buffer ]
Policy for new data when the buffer size limit is reached:
- discard : new data is ignored
- ring_buffer : new data overwrites oldest data (count: 1, dtype: policy)
--trace-wait Set the wait time (in seconds) before collecting trace and/or profiling data(in seconds). By default, the duration is
in seconds of realtime but that can changed via --trace-clock-id. (count: 1, dtype: seconds)
--trace-duration Set the duration of the trace and/or profile data collection (in seconds). By default, the duration is in seconds of
realtime but that can changed via --trace-clock-id. (count: 1, dtype: seconds)
--trace-periods More powerful version of specifying trace delay and/or duration. Format is one or more groups of: <DELAY>:<DURATION>,
<DELAY>:<DURATION>:<REPEAT>, and/or <DELAY>:<DURATION>:<REPEAT>:<CLOCK_ID>. (min: 1, dtype: period-spec(s))
--trace-clock-id [ 0 (realtime|CLOCK_REALTIME)
1 (monotonic|CLOCK_MONOTONIC)
2 (cputime|CLOCK_PROCESS_CPUTIME_ID)
4 (monotonic_raw|CLOCK_MONOTONIC_RAW)
5 (realtime_coarse|CLOCK_REALTIME_COARSE)
6 (monotonic_coarse|CLOCK_MONOTONIC_COARSE)
7 (boottime|CLOCK_BOOTTIME) ]
Set the default clock ID for for trace delay/duration. Note: "cputime" is the *process* CPU time and might need to be
scaled based on the number of threads, i.e. 4 seconds of CPU-time for an application with 4 fully active threads would
equate to ~1 second of realtime. If this proves to be difficult to handle in practice, please file a feature request
for omnitrace to auto-scale based on the number of threads. (count: 1, dtype: clock-id)
[PROFILE OPTIONS] Specific options controlling profiling (i.e. deterministic measurements which are aggregated into a summary)
--profile-format [ console | json | text ]
Data formats for profiling results (min: 1, dtype: string)
--profile-diff Generate a diff output b/t the profile collected and an existing profile from another run Accepts 1-2 parameters
corresponding to the input path and the input prefix (min: 1, dtype: path [prefix])
[HOST/DEVICE (PROCESS SAMPLING) OPTIONS]
Process sampling is background measurements for resources available to the entire process. These samples are not tied
to specific lines/regions of code
--process-freq Set the default host/device sampling frequency (number of interrupts per second) (count: 1, dtype: floating-point)
--process-wait Set the default wait time (i.e. delay) before taking first host/device sample (in seconds of realtime) (count: 1,
dtype: seconds)
--process-duration Set the duration of the host/device sampling (in seconds of realtime) (count: 1, dtype: seconds)
--cpus CPU IDs for frequency sampling. Supports integers and/or ranges (count: unlimited, dtype: int and/or range)
--gpus GPU IDs for SMI queries. Supports integers and/or ranges (count: unlimited, dtype: int and/or range)
[GENERAL SAMPLING OPTIONS] General options for timer-based sampling per-thread
-f, --sampling-freq Set the default sampling frequency (number of interrupts per second) (count: 1, dtype: floating-point)
-t, --tids Specify the default thread IDs for sampling, where 0 (zero) is the main thread and each thread created by the target
application is assigned an atomically incrementing value. (min: 1, dtype: int and/or range)
--sampling-wait Set the default wait time (i.e. delay) before taking first sample (in seconds). This delay time is based on the clock
of the sampler, i.e., a delay of 1 second for CPU-clock sampler may not equal 1 second of realtime (count: 1, dtype:
seconds)
--sampling-duration Set the duration of the sampling (in seconds of realtime). I.e., it is possible (currently) to set a CPU-clock time
delay that exceeds the real-time duration... resulting in zero samples being taken (count: 1, dtype: seconds)
[SAMPLING TIMER OPTIONS] These options determine the heuristic for deciding when to take a sample
--sample-cputime Sample based on a CPU-clock timer (default). Accepts zero or more arguments:
0. Enables sampling based on CPU-clock timer.
1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of CPU-time.
2. Delay (in seconds of CPU-clock time). I.e., how long each thread should wait before taking first sample.
3+ Thread IDs to target for sampling, starting at 0 (the main thread).
May be specified as index or range, e.g., '0 2-4' will be interpreted as:
sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads (min: 0, dtype: [freq] [delay] [tids...])
--sample-realtime Sample based on a real-clock timer. Accepts zero or more arguments:
0. Enables sampling based on real-clock timer.
1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of realtime.
2. Delay (in seconds of real-clock time). I.e., how long each thread should wait before taking first sample.
3+ Thread IDs to target for sampling, starting at 0 (the main thread).
May be specified as index or range, e.g., '0 2-4' will be interpreted as:
sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads
When sampling with a real-clock timer, please note that enabling this will cause threads which are typically "idle"
to consume more resources since, while idle, the real-clock time increases (and therefore triggers taking samples)
whereas the CPU-clock time does not. (min: 0, dtype: [freq] [delay] [tids...])
[ADVANCED SAMPLING OPTIONS] These options determine the heuristic for deciding when to take a sample
--sampling-cputime-delay Time (in seconds) to wait before the first CPU-time sampling signal is delivered. Defaults to OMNITRACE_SAMPLING_DELAY
when <= 0.0 (min: 1, dtype: floating-point)
--sampling-cputime-freq Number of software interrupts per second of CPU-time. Defaults to OMNITRACE_SAMPLING_FREQ when <= 0.0 (min: 1, dtype:
floating-point)
--sampling-cputime-tids Same as OMNITRACE_SAMPLING_TIDS but applies specifically to samplers whose timers are based on the CPU-time. This is
useful when both OMNITRACE_SAMPLING_CPUTIME=ON and OMNITRACE_SAMPLING_REALTIME=ON (min: 0, dtype: string)
--sampling-include-inlines Create entries for inlined functions when available (min: 0, dtype: boolean)
--sampling-keep-internal Configure whether the statistical samples should include call-stack entries from internal routines in omnitrace. E.g.
when ON, the call-stack will show functions like omnitrace_push_trace. If disabled, omnitrace will attempt to filter
out internal routines from the sampling call-stacks (min: 0, dtype: boolean)
--sampling-realtime-delay Time (in seconds) to wait before the first real (wall) time sampling signal is delivered. Defaults to
OMNITRACE_SAMPLING_DELAY when <= 0.0 (min: 1, dtype: floating-point)
--sampling-realtime-freq Number of software interrupts per second of real (wall) time. Defaults to OMNITRACE_SAMPLING_FREQ when <= 0.0 (min: 1,
dtype: floating-point)
--sampling-realtime-offset Modify this value only if the target process is also using SIGRTMIN. E.g. the signal used is SIGRTMIN + <THIS_VALUE>.
Value must be <= 30 (min: 1, dtype: integral)
--sampling-realtime-tids Same as OMNITRACE_SAMPLING_TIDS but applies specifically to samplers whose timers are based on the real (wall) time.
This is useful when both OMNITRACE_SAMPLING_CPUTIME=ON and OMNITRACE_SAMPLING_REALTIME=ON (min: 0, dtype: string)
[HARDWARE COUNTER OPTIONS] See also: omnitrace-avail -H
-C, --cpu-events Set the CPU hardware counter events to record (ref: `omnitrace-avail -H -c CPU`) (min: 1, dtype: [EVENT ...])
-G, --gpu-events Set the GPU hardware counter events to record (ref: `omnitrace-avail -H -c GPU`) (min: 1, dtype: [EVENT ...])
[CATEGORY OPTIONS]
--enable-categories [ causal
comm_data
cpu_frequency
critical-trace
device-critical-trace
device_busy
device_hip
device_hsa
device_memory_usage
device_power
device_temp
host
host-critical-trace
kernel_hardware_counter
kokkos
mpi
numa
ompt
process_context_switch
process_kernel_cpu_time
process_memory_hwm
process_page_fault
process_sampling
process_user_cpu_time
process_virtual_memory
pthread
python
rccl
rocm_hip
rocm_hsa
rocm_roctx
rocm_smi
rocprofiler
roctracer
sampling
thread_context_switch
thread_cpu_time
thread_hardware_counter
thread_page_fault
thread_peak_memory
thread_wall_time
timemory
user ]
Enable collecting profiling and trace data for these categories and disable all other categories (min: 1, dtype:
string)
--disable-categories [ causal
comm_data
cpu_frequency
critical-trace
device-critical-trace
device_busy
device_hip
device_hsa
device_memory_usage
device_power
device_temp
host
host-critical-trace
kernel_hardware_counter
kokkos
mpi
numa
ompt
process_context_switch
process_kernel_cpu_time
process_memory_hwm
process_page_fault
process_sampling
process_user_cpu_time
process_virtual_memory
pthread
python
rccl
rocm_hip
rocm_hsa
rocm_roctx
rocm_smi
rocprofiler
roctracer
sampling
thread_context_switch
thread_cpu_time
thread_hardware_counter
thread_page_fault
thread_peak_memory
thread_wall_time
timemory
user ]
Disable collecting profiling and trace data for these categories (min: 1, dtype: string)
[IO OPTIONS]
--tmpdir Base directory for temporary files (min: 0, dtype: string)
--use-pid Enable tagging filenames with process identifier (either MPI rank or pid) (min: 0, dtype: boolean)
--time-output Output data to subfolder w/ a timestamp (see also: TIME_FORMAT) (min: 0, dtype: boolean)
--causal-file Name of causal output filename (w/o extension) (min: 0, dtype: string)
--causal-file-reset Overwrite any existing causal output file instead of appending to it (min: 0, dtype: boolean)
--use-temporary-files Write data to temporary files to minimize the memory usage of omnitrace, e.g. call-stack samples will be periodically
written to a file and re-loaded during finalization (min: 0, dtype: boolean)
[PERFETTO OPTIONS]
--perfetto-backend [ all | inprocess | system ]
Specify the perfetto backend to activate. Options are: 'inprocess', 'system', or 'all' (min: 1, dtype: string)
--perfetto-roctracer-per-stream
Separate roctracer GPU side traces (copies, kernels) into separate tracks based on the stream they're enqueued into
(min: 0, dtype: boolean)
--perfetto-shmem-size-hint-kb
Hint for shared-memory buffer size in perfetto (in KB) (min: 1, dtype: integral)
[TIMEMORY OPTIONS]
--timemory-components List of components to collect via timemory (see `omnitrace-avail -C`) (min: 0, dtype: string)
[ROCM OPTIONS]
--roctracer-hip-activity Enable HIP activity tracing support (min: 0, dtype: boolean)
--roctracer-hip-api Enable HIP API tracing support (min: 0, dtype: boolean)
--roctracer-hsa-activity Enable HSA activity tracing support (min: 0, dtype: boolean)
--roctracer-hsa-api Enable HSA API tracing support (min: 0, dtype: boolean)
--roctracer-hsa-api-types HSA API type to collect (min: 0, dtype: string)
[CRITICAL_TRACE OPTIONS]
--critical-trace-buffer-count
Number of critical trace records to store in thread-local memory before submitting to shared buffer (min: 1, dtype:
integral)
--critical-trace-count Number of critical trace to export (0 == all) (min: 1, dtype: integral)
--critical-trace-per-row How many critical traces per row in perfetto (0 == all in one row) (min: 1, dtype: integral)
[MISCELLANEOUS OPTIONS]
-i, --inlines Include inline info in output when available (max: 1, dtype: bool)
--hsa-interrupt [ 0 | 1 ] Set the value of the HSA_ENABLE_INTERRUPT environment variable.
ROCm version 5.2 and older have a bug which will cause a deadlock if a sample is taken while waiting for the signal
that a kernel completed -- which happens when sampling with a real-clock timer. We require this option to be set to
when --realtime is specified to make users aware that, while this may fix the bug, it can have a negative impact on
performance.
Values:
0 avoid triggering the bug, potentially at the cost of reduced performance
1 do not modify how ROCm is notified about kernel completion (count: 1, dtype: int)
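The sampling options above describe two small conventions that are easy to misread: a frequency is given in interrupts per second (100 means one sample every 10 milliseconds), and thread IDs may be given as indices or ranges ('0 2-4' means threads 0, 2, 3, and 4). The sketch below illustrates both conventions in plain POSIX shell. The helpers `freq_to_ms` and `expand_tids` are hypothetical names introduced only for this illustration; they are not part of omnitrace and do not reflect how it parses its arguments.

```shell
# Hypothetical helper (not part of omnitrace): convert a sampling frequency
# in interrupts per second into the interval between samples in milliseconds.
freq_to_ms() {
    awk -v f="$1" 'BEGIN { printf "%g\n", 1000 / f }'
}

# Hypothetical helper (not part of omnitrace): expand a thread-ID spec made
# of indices and ranges into the individual thread IDs it selects.
expand_tids() {
    out=""
    for tok in "$@"; do
        case "$tok" in
            *-*)
                # A range such as 2-4: walk from the low end to the high end.
                lo=${tok%-*}
                hi=${tok#*-}
                i=$lo
                while [ "$i" -le "$hi" ]; do
                    out="$out $i"
                    i=$((i + 1))
                done
                ;;
            *)
                # A single index such as 0 (the main thread).
                out="$out $tok"
                ;;
        esac
    done
    # Drop the leading space before printing.
    echo "${out# }"
}

freq_to_ms 100      # prints: 10
expand_tids 0 2-4   # prints: 0 2 3 4
```

Thread 1 is absent from the expanded list, which matches the help text: the first child thread is not sampled, while the main thread and the 2nd through 4th child threads are.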
omnitrace-sample
except that it works with instrumented binaries
libomnitrace-dl.so
Usage
Binary rewrite
Sampling
The following two commands are effectively identical:
Help Menu
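The `--periods` and `--trace-periods` entries in the help output describe a colon-separated spec of the form `<DELAY>:<DURATION>`, `<DELAY>:<DURATION>:<REPEAT>`, or `<DELAY>:<DURATION>:<REPEAT>:<CLOCK_ID>`. As a reading aid, the following sketch splits one such spec into labeled fields. `describe_period` is a hypothetical helper written for this illustration, the sample values are made up, and the `unset` placeholder only marks a field that was not supplied; what omnitrace does with an omitted field is not stated in the help text.

```shell
# Hypothetical helper (not part of omnitrace): split one period spec of the
# form <DELAY>:<DURATION>[:<REPEAT>[:<CLOCK_ID>]] into labeled fields.
# The body runs in a subshell so the IFS and globbing changes stay local.
describe_period() (
    IFS=:
    set -f          # disable globbing while the spec is split
    set -- $1       # split the spec on ':' into positional parameters
    echo "delay=$1 duration=$2 repeat=${3:-unset} clock_id=${4:-unset}"
)

describe_period 10:5
# prints: delay=10 duration=5 repeat=unset clock_id=unset

# Clock ID 1 is listed as monotonic under --trace-clock-id.
describe_period 10:5:3:1
# prints: delay=10 duration=5 repeat=3 clock_id=1
```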