ROCm / omnitrace

Omnitrace: Application Profiling, Tracing, and Analysis
https://rocm.docs.amd.com/projects/omnitrace/en/latest/
MIT License

omnitrace-run executable - required for running binary writes #257

Closed jrmadsen closed 1 year ago

jrmadsen commented 1 year ago

Usage

Binary rewrite

$ omnitrace-instrument -o foo.inst -- foo
$ omnitrace-run -TPHDS -- ./foo.inst
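
As a reading aid, the bundled short flags in `-TPHDS` can be expanded to the long options listed in the help menu for omnitrace-run (`-T` is `--trace`, `-P` is `--profile`, `-H` is `--host`, `-D` is `--device`, `-S` is `--sample`). The sketch below is plain shell that only prints the equivalent long-form command and does not run omnitrace. `expand_flags` is a hypothetical helper, not part of omnitrace, and it assumes the bundle is interpreted one character per flag in the usual getopt style.

```shell
# Hypothetical helper: map each short flag in a bundle (e.g. "TPHDS")
# to the long option named in the omnitrace-run help menu.
expand_flags() {
    bundle=$1
    out=""
    while [ -n "$bundle" ]; do
        rest=${bundle#?}            # everything after the first character
        flag=${bundle%"$rest"}      # the first character only
        case $flag in
            T) long="--trace" ;;
            P) long="--profile" ;;
            H) long="--host" ;;
            D) long="--device" ;;
            S) long="--sample" ;;
            *) echo "unknown flag: -$flag" >&2; return 1 ;;
        esac
        out="${out:+$out }$long"    # append, space-separated
        bundle=$rest
    done
    echo "$out"
}

# Print (do not execute) the long-form equivalent of the command above.
echo "omnitrace-run $(expand_flags TPHDS) -- ./foo.inst"
```

Under that assumption, the printed command should behave the same as the `-TPHDS` invocation.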

Sampling

The following two commands are effectively identical:

$ omnitrace-run -S -- foo
$ omnitrace-sample -- foo
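
The sampling options in the help menu take a frequency in interrupts per second and thread IDs given as indices or ranges (the help text uses `'0 2-4'` and `100 == sample every 10 milliseconds` as examples). The sketch below shows how those two values can be read. Both functions are hypothetical illustrations and not part of omnitrace, and `sample_interval_ms` handles whole-number frequencies only, although the option itself accepts floating-point values.

```shell
# Hypothetical helper: expand a thread-ID spec such as "0 2-4" into
# one explicit ID per word (0 is the main thread).
expand_tids() {
    out=""
    for item in "$@"; do
        case $item in
            *-*)
                first=${item%-*}    # lower bound of the range
                last=${item#*-}     # upper bound of the range
                i=$first
                while [ "$i" -le "$last" ]; do
                    out="${out:+$out }$i"
                    i=$((i + 1))
                done
                ;;
            *)
                out="${out:+$out }$item"
                ;;
        esac
    done
    echo "$out"
}

# Hypothetical helper: milliseconds between samples for a given number
# of interrupts per second (integer arithmetic only).
sample_interval_ms() {
    echo $((1000 / $1))
}

expand_tids 0 2-4        # prints: 0 2 3 4
sample_interval_ms 100   # prints: 10
```

The expanded list matches the help text's reading of `'0 2-4'`: the main thread is sampled, the first child thread is skipped, and the 2nd, 3rd, and 4th child threads are sampled.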

Help Menu

$ omnitrace-run --help
[omnitrace-run] Usage: ./bin/omnitrace-run [ --help (count: 0, dtype: bool)
                                             --version (count: 0, dtype: bool)
                                             --monochrome (max: 1, dtype: bool)
                                             --debug (max: 1, dtype: bool)
                                             --verbose (count: 1, dtype: integral)
                                             --ci (min: 0, dtype: boolean)
                                             --dl-verbose (min: 1, dtype: integral)
                                             --perfetto-annotations (min: 0, dtype: boolean)
                                             --critical-trace-debug (min: 0, dtype: boolean)
                                             --kokkosp-kernel-logger (min: 0, dtype: boolean)
                                             --kokkosp-prefix (min: 0, dtype: string)
                                             --sampling-allocator-size (min: 1, dtype: integral)
                                             --kokkosp-name-length-max (min: 1, dtype: integral)
                                             --critical-trace-serialize-names (min: 0, dtype: boolean)
                                             --config (min: 1, dtype: filepath)
                                             --output (min: 1, dtype: path [prefix])
                                             --trace (max: 1, dtype: bool)
                                             --profile (max: 1, dtype: bool)
                                             --flat-profile (max: 1, dtype: bool)
                                             --sample (min: 0, dtype: timer-type)
                                             --host (max: 1, dtype: bool)
                                             --device (max: 1, dtype: bool)
                                             --wait (count: 1, dtype: seconds)
                                             --duration (count: 1, dtype: seconds)
                                             --periods (min: 1, dtype: period-spec(s))
                                             --include (min: 1, dtype: [backend...])
                                             --exclude (min: 1, dtype: [backend...])
                                             --mode (min: 1, dtype: string)
                                             --use-causal (min: 0, dtype: boolean)
                                             --use-kokkosp (min: 0, dtype: boolean)
                                             --use-mpip (min: 0, dtype: boolean)
                                             --use-roctx (min: 0, dtype: boolean)
                                             --critical-trace (min: 0, dtype: boolean)
                                             --use-code-coverage (min: 0, dtype: boolean)
                                             --use-perfetto (min: 0, dtype: boolean)
                                             --use-process-sampling (min: 0, dtype: boolean)
                                             --use-rcclp (min: 0, dtype: boolean)
                                             --use-rocm-smi (min: 0, dtype: boolean)
                                             --use-rocprofiler (min: 0, dtype: boolean)
                                             --use-roctracer (min: 0, dtype: boolean)
                                             --use-sampling (min: 0, dtype: boolean)
                                             --use-timemory (min: 0, dtype: boolean)
                                             --trace-thread-barriers (min: 0, dtype: boolean)
                                             --trace-thread-join (min: 0, dtype: boolean)
                                             --trace-thread-locks (min: 0, dtype: boolean)
                                             --trace-thread-rw-locks (min: 0, dtype: boolean)
                                             --trace-thread-spin-locks (min: 0, dtype: boolean)
                                             --thread-pool-size (min: 1, dtype: integral)
                                             --num-threads-hint (min: 1, dtype: integral)
                                             --trace-file (count: 1, dtype: filepath)
                                             --trace-buffer-size (count: 1, dtype: KB)
                                             --trace-fill-policy (count: 1, dtype: policy)
                                             --trace-wait (count: 1, dtype: seconds)
                                             --trace-duration (count: 1, dtype: seconds)
                                             --trace-periods (min: 1, dtype: period-spec(s))
                                             --trace-clock-id (count: 1, dtype: clock-id)
                                             --profile-format (min: 1, dtype: string)
                                             --profile-diff (min: 1, dtype: path [prefix])
                                             --process-freq (count: 1, dtype: floating-point)
                                             --process-wait (count: 1, dtype: seconds)
                                             --process-duration (count: 1, dtype: seconds)
                                             --cpus (count: unlimited, dtype: int and/or range)
                                             --gpus (count: unlimited, dtype: int and/or range)
                                             --sampling-freq (count: 1, dtype: floating-point)
                                             --tids (min: 1, dtype: int and/or range)
                                             --sampling-wait (count: 1, dtype: seconds)
                                             --sampling-duration (count: 1, dtype: seconds)
                                             --sample-cputime (min: 0, dtype: [freq] [delay] [tids...])
                                             --sample-realtime (min: 0, dtype: [freq] [delay] [tids...])
                                             --sampling-cputime-delay (min: 1, dtype: floating-point)
                                             --sampling-cputime-freq (min: 1, dtype: floating-point)
                                             --sampling-cputime-tids (min: 0, dtype: string)
                                             --sampling-include-inlines (min: 0, dtype: boolean)
                                             --sampling-keep-internal (min: 0, dtype: boolean)
                                             --sampling-realtime-delay (min: 1, dtype: floating-point)
                                             --sampling-realtime-freq (min: 1, dtype: floating-point)
                                             --sampling-realtime-offset (min: 1, dtype: integral)
                                             --sampling-realtime-tids (min: 0, dtype: string)
                                             --cpu-events (min: 1, dtype: [EVENT ...])
                                             --gpu-events (min: 1, dtype: [EVENT ...])
                                             --enable-categories (min: 1, dtype: string)
                                             --disable-categories (min: 1, dtype: string)
                                             --tmpdir (min: 0, dtype: string)
                                             --use-pid (min: 0, dtype: boolean)
                                             --time-output (min: 0, dtype: boolean)
                                             --causal-file (min: 0, dtype: string)
                                             --causal-file-reset (min: 0, dtype: boolean)
                                             --use-temporary-files (min: 0, dtype: boolean)
                                             --perfetto-backend (min: 1, dtype: string)
                                             --perfetto-roctracer-per-stream (min: 0, dtype: boolean)
                                             --perfetto-shmem-size-hint-kb (min: 1, dtype: integral)
                                             --timemory-components (min: 0, dtype: string)
                                             --roctracer-hip-activity (min: 0, dtype: boolean)
                                             --roctracer-hip-api (min: 0, dtype: boolean)
                                             --roctracer-hsa-activity (min: 0, dtype: boolean)
                                             --roctracer-hsa-api (min: 0, dtype: boolean)
                                             --roctracer-hsa-api-types (min: 0, dtype: string)
                                             --critical-trace-buffer-count (min: 1, dtype: integral)
                                             --critical-trace-count (min: 1, dtype: integral)
                                             --critical-trace-per-row (min: 1, dtype: integral)
                                             --inlines (max: 1, dtype: bool)
                                             --hsa-interrupt (count: 1, dtype: int)
                                             --causal-binary-exclude (min: 0, dtype: string)
                                             --causal-binary-scope (min: 0, dtype: string)
                                             --causal-delay (min: 1, dtype: floating-point)
                                             --causal-duration (min: 1, dtype: floating-point)
                                             --causal-end-to-end (min: 0, dtype: boolean)
                                             --causal-fixed-speedup (min: 0, dtype: string)
                                             --causal-function-exclude (min: 0, dtype: string)
                                             --causal-function-exclude-defaults (min: 0, dtype: boolean)
                                             --causal-function-scope (min: 0, dtype: string)
                                             --causal-mode (min: 0, dtype: string)
                                             --causal-random-seed (min: 1, dtype: integral)
                                             --causal-source-exclude (min: 0, dtype: string)
                                             --causal-source-scope (min: 0, dtype: string)
                                           ] 

    Command line interface to omnitrace configuration.

Options:
    -h, -?, --help                 Shows this page (count: 0, dtype: bool) 
    --version                      Prints the version and exits (count: 0, dtype: bool) 

    [DEBUG OPTIONS]                                  

    --monochrome                   Disable colorized output (max: 1, dtype: bool) 
    --debug                        Debug output (max: 1, dtype: bool) 
    -v, --verbose                  Verbose output (count: 1, dtype: integral) 
    --ci                           Enable some runtime validation checks (typically enabled for continuous integration) (min: 0, dtype: boolean) 
    --dl-verbose                   Verbosity within the omnitrace-dl library (min: 1, dtype: integral) 
    --perfetto-annotations         Include debug annotations in perfetto trace. When enabled, this feature will encode information such as the values of 
                                   the function arguments (when available). Disabling this feature may dramatically reduce the size of the trace (min: 0, 
                                   dtype: boolean) 
    --critical-trace-debug         Enable debugging for critical trace (min: 0, dtype: boolean) 
    --kokkosp-kernel-logger        Enables kernel logging (min: 0, dtype: boolean) 
    --kokkosp-prefix               Set to [kokkos] to maintain old naming convention (min: 0, dtype: string) 
    --sampling-allocator-size      The number of sampled threads handled by an allocator running in a background thread. Each thread that is sampled 
                                   communicates with an allocator running in a background thread which handles storing/caching the data when its buffer 
                                   is full. Setting this value too high (i.e. equal to the number of threads when the thread count is high) may cause loss 
                                   of data -- the sampler may fill a new buffer and overwrite old buffer data before the allocator can process it. Setting 
                                   this value to 1 will result in a background allocator thread for every thread started by the application. (min: 1, 
                                   dtype: integral) 
    --kokkosp-name-length-max      Set this to a value > 0 to help avoid unnamed Kokkos Tools callbacks. Generally, unnamed callbacks are the demangled 
                                   name of the function, which is very long (min: 1, dtype: integral) 
    --critical-trace-serialize-names
                                   Include names in serialization of critical trace (mainly for debugging) (min: 0, dtype: boolean) 

    [GENERAL OPTIONS]  These are options which are ubiquitously applied 

    -c, --config                   Configuration file (min: 1, dtype: filepath) 
    -o, --output                   Output path. Accepts 1-2 parameters corresponding to the output path and the output prefix (min: 1, dtype: path 
                                   [prefix]) 
    -T, --trace                    Generate a detailed trace (perfetto output) (max: 1, dtype: bool) 
    -P, --profile                  Generate a call-stack-based profile (conflicts with --flat-profile) (max: 1, dtype: bool) 
    -F, --flat-profile             Generate a flat profile (conflicts with --profile) (max: 1, dtype: bool) 
    -S, --sample [ cputime | realtime ]
                                   Enable statistical sampling of call-stack (min: 0, dtype: timer-type) 
    -H, --host                     Enable sampling host-based metrics for the process. E.g. CPU frequency, memory usage, etc. (max: 1, dtype: bool) 
    -D, --device                   Enable sampling device-based metrics for the process. E.g. GPU temperature, memory usage, etc. (max: 1, dtype: bool) 
    -w, --wait                     This option is a combination of '--trace-wait' and '--sampling-wait'. See the descriptions for those two options. 
                                   (count: 1, dtype: seconds) 
    -d, --duration                 This option is a combination of '--trace-duration' and '--sampling-duration'. See the descriptions for those two 
                                   options. (count: 1, dtype: seconds) 
    --periods                      Similar to specifying delay and/or duration except in the form <DELAY>:<DURATION>, <DELAY>:<DURATION>:<REPEAT>, and/or 
                                   <DELAY>:<DURATION>:<REPEAT>:<CLOCK_ID> (min: 1, dtype: period-spec(s)) 

    [BACKEND OPTIONS]  These options control region information captured w/o sampling or instrumentation 

    -I, --include [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
                                   Include data from these backends (min: 1, dtype: [backend...]) 
    -E, --exclude [ all | kokkosp | mpip | mutex-locks | ompt | rcclp | rocm-smi | rocprofiler | roctracer | roctx | rw-locks | spin-locks ]
                                   Exclude data from these backends (min: 1, dtype: [backend...]) 
    --mode [ causal | coverage | sampling | trace ]
                                   Data collection mode. Used to set default values for OMNITRACE_USE_* options. Typically set by omnitrace binary 
                                   instrumenter. (min: 1, dtype: string) 
    --use-causal                   Enable causal profiling analysis (min: 0, dtype: boolean) 
    --use-kokkosp                  Enable support for Kokkos Tools (min: 0, dtype: boolean) 
    --use-mpip                     Enable support for MPI functions (min: 0, dtype: boolean) 
    --use-roctx                    Enable ROCtx API. Warning! Out-of-order ranges may corrupt perfetto flamegraph (min: 0, dtype: boolean) 
    --critical-trace               Enable generation of the critical trace (min: 0, dtype: boolean) 
    --use-code-coverage            Enable support for code coverage (min: 0, dtype: boolean) 
    --use-perfetto                 Enable perfetto backend (min: 0, dtype: boolean) 
    --use-process-sampling         Enable a background thread which samples process-level and system metrics such as the CPU/GPU freq, power, memory 
                                   usage, etc. (min: 0, dtype: boolean) 
    --use-rcclp                    Enable support for ROCm Communication Collectives Library (RCCL) Performance (min: 0, dtype: boolean) 
    --use-rocm-smi                 Enable sampling GPU power, temp, utilization, and memory usage (min: 0, dtype: boolean) 
    --use-rocprofiler              Enable ROCm hardware counters (min: 0, dtype: boolean) 
    --use-roctracer                Enable ROCm API and kernel tracing (min: 0, dtype: boolean) 
    --use-sampling                 Enable statistical sampling of call-stack (min: 0, dtype: boolean) 
    --use-timemory                 Enable timemory backend (min: 0, dtype: boolean) 
    --trace-thread-barriers        Enable tracing calls to pthread_barrier functions. (min: 0, dtype: boolean) 
    --trace-thread-join            Enable tracing calls to pthread_join functions. (min: 0, dtype: boolean) 
    --trace-thread-locks           Enable tracing calls to pthread_mutex_lock, pthread_mutex_unlock, pthread_mutex_trylock (min: 0, dtype: boolean) 
    --trace-thread-rw-locks        Enable tracing calls to pthread_rwlock_* functions. May cause deadlocks with ROCm-enabled OpenMPI. (min: 0, dtype: 
                                   boolean) 
    --trace-thread-spin-locks      Enable tracing calls to pthread_spin_* functions. May cause deadlocks with MPI distributions. (min: 0, dtype: boolean) 

    [PARALLELISM OPTIONS]                               

    --thread-pool-size             Max number of threads for processing background tasks (min: 1, dtype: integral) 
    --num-threads-hint             This is a hint for how many threads are expected to be created in the application. Setting this value allows omnitrace to 
                                   preallocate resources during initialization and warn about any potential issues. For example, when call-stack sampling, 
                                   each thread has a unique sampler instance which communicates with an allocator instance running in a background thread. 
                                   Each allocator only handles N sampling instances (where N is the value of OMNITRACE_SAMPLING_ALLOCATOR_SIZE). When this 
                                   hint is set to >= the number of threads that get sampled, omnitrace can start all the background threads during 
                                   initialization (min: 1, dtype: integral) 

    [TRACING OPTIONS]  Specific options controlling tracing (i.e. deterministic measurements of every event) 

    --trace-file                   Specify the trace output filename. Relative filepath will be with respect to output path and output prefix. (count: 1, 
                                   dtype: filepath) 
    --trace-buffer-size            Size limit for the trace output (in KB) (count: 1, dtype: KB) 
    --trace-fill-policy [ discard | ring_buffer ]

                                   Policy for new data when the buffer size limit is reached:
                                       - discard     : new data is ignored
                                       - ring_buffer : new data overwrites oldest data (count: 1, dtype: policy)
    --trace-wait                   Set the wait time (in seconds) before collecting trace and/or profiling data. By default, the wait time is in 
                                   seconds of realtime but that can be changed via --trace-clock-id. (count: 1, dtype: seconds) 
    --trace-duration               Set the duration of the trace and/or profile data collection (in seconds). By default, the duration is in seconds of 
                                   realtime but that can be changed via --trace-clock-id. (count: 1, dtype: seconds) 
    --trace-periods                More powerful version of specifying trace delay and/or duration. Format is one or more groups of: <DELAY>:<DURATION>, 
                                   <DELAY>:<DURATION>:<REPEAT>, and/or <DELAY>:<DURATION>:<REPEAT>:<CLOCK_ID>. (min: 1, dtype: period-spec(s)) 
    --trace-clock-id [ 0 (realtime|CLOCK_REALTIME)
                       1 (monotonic|CLOCK_MONOTONIC)
                       2 (cputime|CLOCK_PROCESS_CPUTIME_ID)
                       4 (monotonic_raw|CLOCK_MONOTONIC_RAW)
                       5 (realtime_coarse|CLOCK_REALTIME_COARSE)
                       6 (monotonic_coarse|CLOCK_MONOTONIC_COARSE)
                       7 (boottime|CLOCK_BOOTTIME) ]
                                   Set the default clock ID for trace delay/duration. Note: "cputime" is the *process* CPU time and might need to be 
                                   scaled based on the number of threads, i.e. 4 seconds of CPU-time for an application with 4 fully active threads would 
                                   equate to ~1 second of realtime. If this proves to be difficult to handle in practice, please file a feature request 
                                   for omnitrace to auto-scale based on the number of threads. (count: 1, dtype: clock-id) 

    [PROFILE OPTIONS]  Specific options controlling profiling (i.e. deterministic measurements which are aggregated into a summary) 

    --profile-format [ console | json | text ]
                                   Data formats for profiling results (min: 1, dtype: string) 
    --profile-diff                 Generate a diff output b/t the profile collected and an existing profile from another run. Accepts 1-2 parameters 
                                   corresponding to the input path and the input prefix (min: 1, dtype: path [prefix]) 

    [HOST/DEVICE (PROCESS SAMPLING) OPTIONS]
                                   Process sampling is background measurements for resources available to the entire process. These samples are not tied 
                                   to specific lines/regions of code 

    --process-freq                 Set the default host/device sampling frequency (number of interrupts per second) (count: 1, dtype: floating-point) 
    --process-wait                 Set the default wait time (i.e. delay) before taking first host/device sample (in seconds of realtime) (count: 1, 
                                   dtype: seconds) 
    --process-duration             Set the duration of the host/device sampling (in seconds of realtime) (count: 1, dtype: seconds) 
    --cpus                         CPU IDs for frequency sampling. Supports integers and/or ranges (count: unlimited, dtype: int and/or range) 
    --gpus                         GPU IDs for SMI queries. Supports integers and/or ranges (count: unlimited, dtype: int and/or range) 

    [GENERAL SAMPLING OPTIONS] General options for timer-based sampling per-thread 

    -f, --sampling-freq            Set the default sampling frequency (number of interrupts per second) (count: 1, dtype: floating-point) 
    -t, --tids                     Specify the default thread IDs for sampling, where 0 (zero) is the main thread and each thread created by the target 
                                   application is assigned an atomically incrementing value. (min: 1, dtype: int and/or range) 
    --sampling-wait                Set the default wait time (i.e. delay) before taking first sample (in seconds). This delay time is based on the clock 
                                   of the sampler, i.e., a delay of 1 second for CPU-clock sampler may not equal 1 second of realtime (count: 1, dtype: 
                                   seconds) 
    --sampling-duration            Set the duration of the sampling (in seconds of realtime). I.e., it is possible (currently) to set a CPU-clock time 
                                   delay that exceeds the real-time duration... resulting in zero samples being taken (count: 1, dtype: seconds) 

    [SAMPLING TIMER OPTIONS] These options determine the heuristic for deciding when to take a sample 

    --sample-cputime               Sample based on a CPU-clock timer (default). Accepts zero or more arguments:
                                       0. Enables sampling based on CPU-clock timer.
                                       1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of CPU-time.
                                       2. Delay (in seconds of CPU-clock time). I.e., how long each thread should wait before taking first sample.
                                       3+ Thread IDs to target for sampling, starting at 0 (the main thread).
                                          May be specified as index or range, e.g., '0 2-4' will be interpreted as:
                                             sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads (min: 0, dtype: [freq] [delay] [tids...])
    --sample-realtime              Sample based on a real-clock timer. Accepts zero or more arguments:
                                       0. Enables sampling based on real-clock timer.
                                       1. Interrupts per second. E.g., 100 == sample every 10 milliseconds of realtime.
                                       2. Delay (in seconds of real-clock time). I.e., how long each thread should wait before taking first sample.
                                       3+ Thread IDs to target for sampling, starting at 0 (the main thread).
                                          May be specified as index or range, e.g., '0 2-4' will be interpreted as:
                                             sample the main thread (0), do not sample the first child thread but sample the 2nd, 3rd, and 4th child threads
                                          When sampling with a real-clock timer, please note that enabling this will cause threads which are typically "idle"
                                          to consume more resources since, while idle, the real-clock time increases (and therefore triggers taking samples)
                                          whereas the CPU-clock time does not. (min: 0, dtype: [freq] [delay] [tids...])

    [ADVANCED SAMPLING OPTIONS] These options determine the heuristic for deciding when to take a sample 

    --sampling-cputime-delay       Time (in seconds) to wait before the first CPU-time sampling signal is delivered. Defaults to OMNITRACE_SAMPLING_DELAY 
                                   when <= 0.0 (min: 1, dtype: floating-point) 
    --sampling-cputime-freq        Number of software interrupts per second of CPU-time. Defaults to OMNITRACE_SAMPLING_FREQ when <= 0.0 (min: 1, dtype: 
                                   floating-point) 
    --sampling-cputime-tids        Same as OMNITRACE_SAMPLING_TIDS but applies specifically to samplers whose timers are based on the CPU-time. This is 
                                   useful when both OMNITRACE_SAMPLING_CPUTIME=ON and OMNITRACE_SAMPLING_REALTIME=ON (min: 0, dtype: string) 
    --sampling-include-inlines     Create entries for inlined functions when available (min: 0, dtype: boolean) 
    --sampling-keep-internal       Configure whether the statistical samples should include call-stack entries from internal routines in omnitrace. E.g. 
                                   when ON, the call-stack will show functions like omnitrace_push_trace. If disabled, omnitrace will attempt to filter 
                                   out internal routines from the sampling call-stacks (min: 0, dtype: boolean) 
    --sampling-realtime-delay      Time (in seconds) to wait before the first real (wall) time sampling signal is delivered. Defaults to 
                                   OMNITRACE_SAMPLING_DELAY when <= 0.0 (min: 1, dtype: floating-point) 
    --sampling-realtime-freq       Number of software interrupts per second of real (wall) time. Defaults to OMNITRACE_SAMPLING_FREQ when <= 0.0 (min: 1, 
                                   dtype: floating-point) 
    --sampling-realtime-offset     Modify this value only if the target process is also using SIGRTMIN. E.g. the signal used is SIGRTMIN + <THIS_VALUE>. 
                                   Value must be <= 30 (min: 1, dtype: integral) 
    --sampling-realtime-tids       Same as OMNITRACE_SAMPLING_TIDS but applies specifically to samplers whose timers are based on the real (wall) time. 
                                   This is useful when both OMNITRACE_SAMPLING_CPUTIME=ON and OMNITRACE_SAMPLING_REALTIME=ON (min: 0, dtype: string) 

    [HARDWARE COUNTER OPTIONS] See also: omnitrace-avail -H  

    -C, --cpu-events               Set the CPU hardware counter events to record (ref: `omnitrace-avail -H -c CPU`) (min: 1, dtype: [EVENT ...]) 
    -G, --gpu-events               Set the GPU hardware counter events to record (ref: `omnitrace-avail -H -c GPU`) (min: 1, dtype: [EVENT ...]) 

    [CATEGORY OPTIONS]                               

    --enable-categories [ causal
                          comm_data
                          cpu_frequency
                          critical-trace
                          device-critical-trace
                          device_busy
                          device_hip
                          device_hsa
                          device_memory_usage
                          device_power
                          device_temp
                          host
                          host-critical-trace
                          kernel_hardware_counter
                          kokkos
                          mpi
                          numa
                          ompt
                          process_context_switch
                          process_kernel_cpu_time
                          process_memory_hwm
                          process_page_fault
                          process_sampling
                          process_user_cpu_time
                          process_virtual_memory
                          pthread
                          python
                          rccl
                          rocm_hip
                          rocm_hsa
                          rocm_roctx
                          rocm_smi
                          rocprofiler
                          roctracer
                          sampling
                          thread_context_switch
                          thread_cpu_time
                          thread_hardware_counter
                          thread_page_fault
                          thread_peak_memory
                          thread_wall_time
                          timemory
                          user ]
                                   Enable collecting profiling and trace data for these categories and disable all other categories (min: 1, dtype: 
                                   string) 
    --disable-categories [ causal
                           comm_data
                           cpu_frequency
                           critical-trace
                           device-critical-trace
                           device_busy
                           device_hip
                           device_hsa
                           device_memory_usage
                           device_power
                           device_temp
                           host
                           host-critical-trace
                           kernel_hardware_counter
                           kokkos
                           mpi
                           numa
                           ompt
                           process_context_switch
                           process_kernel_cpu_time
                           process_memory_hwm
                           process_page_fault
                           process_sampling
                           process_user_cpu_time
                           process_virtual_memory
                           pthread
                           python
                           rccl
                           rocm_hip
                           rocm_hsa
                           rocm_roctx
                           rocm_smi
                           rocprofiler
                           roctracer
                           sampling
                           thread_context_switch
                           thread_cpu_time
                           thread_hardware_counter
                           thread_page_fault
                           thread_peak_memory
                           thread_wall_time
                           timemory
                           user ]
                                   Disable collecting profiling and trace data for these categories (min: 1, dtype: string) 

    [IO OPTIONS]                                     

    --tmpdir                       Base directory for temporary files (min: 0, dtype: string) 
    --use-pid                      Enable tagging filenames with process identifier (either MPI rank or pid) (min: 0, dtype: boolean) 
    --time-output                  Output data to subfolder w/ a timestamp (see also: TIME_FORMAT) (min: 0, dtype: boolean) 
    --causal-file                  Name of causal output filename (w/o extension) (min: 0, dtype: string) 
    --causal-file-reset            Overwrite any existing causal output file instead of appending to it (min: 0, dtype: boolean) 
    --use-temporary-files          Write data to temporary files to minimize the memory usage of omnitrace, e.g. call-stack samples will be periodically 
                                   written to a file and re-loaded during finalization (min: 0, dtype: boolean) 

    [PERFETTO OPTIONS]                               

    --perfetto-backend [ all | inprocess | system ]
                                   Specify the perfetto backend to activate. Options are: 'inprocess', 'system', or 'all' (min: 1, dtype: string) 
    --perfetto-roctracer-per-stream
                                   Separate roctracer GPU side traces (copies, kernels) into separate tracks based on the stream they're enqueued into 
                                   (min: 0, dtype: boolean) 
    --perfetto-shmem-size-hint-kb 
                                   Hint for shared-memory buffer size in perfetto (in KB) (min: 1, dtype: integral) 

    [TIMEMORY OPTIONS]                               

    --timemory-components          List of components to collect via timemory (see `omnitrace-avail -C`) (min: 0, dtype: string) 

    [ROCM OPTIONS]                                   

    --roctracer-hip-activity       Enable HIP activity tracing support (min: 0, dtype: boolean) 
    --roctracer-hip-api            Enable HIP API tracing support (min: 0, dtype: boolean) 
    --roctracer-hsa-activity       Enable HSA activity tracing support (min: 0, dtype: boolean) 
    --roctracer-hsa-api            Enable HSA API tracing support (min: 0, dtype: boolean) 
    --roctracer-hsa-api-types      HSA API type to collect (min: 0, dtype: string) 

    [CRITICAL_TRACE OPTIONS]                               

    --critical-trace-buffer-count 
                                   Number of critical trace records to store in thread-local memory before submitting to shared buffer (min: 1, dtype: 
                                   integral) 
    --critical-trace-count         Number of critical traces to export (0 == all) (min: 1, dtype: integral) 
    --critical-trace-per-row       How many critical traces per row in perfetto (0 == all in one row) (min: 1, dtype: integral) 

    [MISCELLANEOUS OPTIONS]                               

    -i, --inlines                  Include inline info in output when available (max: 1, dtype: bool) 
    --hsa-interrupt [ 0 | 1 ]      Set the value of the HSA_ENABLE_INTERRUPT environment variable.
                                     ROCm version 5.2 and older have a bug which will cause a deadlock if a sample is taken while waiting for the signal
                                     that a kernel completed -- which happens when sampling with a real-clock timer. We require this option to be set
                                     when --realtime is specified to make users aware that, while this may fix the bug, it can have a negative impact on
                                     performance.
                                     Values:
                                       0     avoid triggering the bug, potentially at the cost of reduced performance
                                       1     do not modify how ROCm is notified about kernel completion (count: 1, dtype: int)
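
As a rough illustration of how a few of the options listed above might be combined, here is a minimal sketch. It assumes `./foo.inst` is the instrumented binary from the "Binary rewrite" usage, and the category names (`mpi`, `ompt`) are simply picked from the help list. The command is only printed, not executed, so the selection can be reviewed first.

```shell
#!/usr/bin/env bash
# Sketch only: ./foo.inst is assumed to be the instrumented binary
# produced in the "Binary rewrite" usage example.
args=(
    --perfetto-backend inprocess     # one of: all | inprocess | system
    --disable-categories mpi ompt    # category names taken from the help list
    --use-pid                        # tag output filenames with MPI rank or pid
    --time-output                    # write output into a timestamped subfolder
)

# Print the assembled command for review; remove 'echo' to actually run it.
echo omnitrace-run "${args[@]}" -- ./foo.inst
```

Keeping the options in an array makes it easy to comment individual flags in or out between runs without rewriting the whole command line.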
jrmadsen commented 1 year ago