Profiling execution error on 2x MI100 system

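Describe the bug I am unable to profile my workload: omniperf profile ends with "Profiling execution failed" after llama-cli rejects its arguments with "error: unknown argument: largest".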

Development Environment:

Linux Distribution: Docker Container running Ubuntu 22.04
Omniperf Version: 2.0.1 (release)
GPU: 2x MI100
Cluster (if applicable): [e.g. Crusher, ]

To Reproduce Steps to reproduce the behavior:

Set up an Ubuntu 22.04 Docker container running ROCm 6.0.
Clone and compile llama.cpp, disabling GPU peer-to-peer during compilation.
Run the following command to profile a sample LLM forward pass:
omniperf profile -V -n llama3 -k dequantize_mul_mat_vec -- ./llama.cpp/llama-cli -m llama.cpp/models/llama3/Meta-Llama-3-8B-Instruct/ggml-model-f16.gguf -ngl 50 -n $num_tokens -c $context --prompt $user_prompt

See command output and error below:

DEBUG ROC Profiler: /opt/rocm-6.0.0/bin/rocprof
DEBUG Execution mode = profile

___                  _                  __ 
/ _ \ _ __ ___  _ __ (_)_ __   ___ _ __ / _|
| | | | '_ ` _ \| '_ \| | '_ \ / _ \ '__| |_ 
| |_| | | | | | | | | | | |_) |  __/ |  |  _|
\___/|_| |_| |_|_| |_|_| .__/ \___|_|  |_|  
                    |_|                  

DEBUG [profiling] perform SoC profiling setup for gfx908
DEBUG [profiling] pre-processing using rocprofv1 profiler
DEBUG [profiling] performing profiling using rocprofv1 profiler
INFO Omniperf version: 2.0.1
INFO Profiler choice: rocprofv1
INFO Path: /root/git/rocm-llm-profile/workloads/llama3/MI100
INFO Target: MI100
INFO Command: ./llama.cpp/llama-cli -m llama.cpp/models/llama3/Meta-Llama-3-8B-Instruct/ggml-model-f16.gguf -ngl 50 -n 4 -c 2048 --prompt The largest continent is
INFO Kernel Selection: ['dequantize_mul_mat_vec']
INFO Dispatch Selection: None
INFO Hardware Blocks: All
INFO 
INFO ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
INFO Collecting Performance Counters
INFO ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
INFO 
DEBUG [subprocess] ['sed', '-i', '-r', 's%^(kernel:).*%kernel: dequantize_mul_mat_vec%g', '/root/git/rocm-llm-profile/workloads/llama3/MI100/perfmon/SQ_IFETCH_LEVEL.txt']
INFO 
DEBUG 
INFO [profiling] Current input file: /root/git/rocm-llm-profile/workloads/llama3/MI100/perfmon/SQ_IFETCH_LEVEL.txt
DEBUG pmc file: SQ_IFETCH_LEVEL.txt
DEBUG [subprocess] ['rocprof', '-i', '/root/git/rocm-llm-profile/workloads/llama3/MI100/perfmon/SQ_IFETCH_LEVEL.txt', '-m', '/git/2.0.1/libexec/omniperf/omniperf_soc/profile_configs/metrics.xml', '--timestamp', 'on', '-o', '/root/git/rocm-llm-profile/workloads/llama3/MI100/SQ_IFETCH_LEVEL.csv', '"./llama.cpp/llama-cli -m llama.cpp/models/llama3/Meta-Llama-3-8B-Instruct/ggml-model-f16.gguf -ngl 50 -n 4 -c 2048 --prompt The largest continent is"']
INFO    |-> [rocprof] RPL: on '240712_142848' from '/opt/rocm-6.0.0' in '/root/git/rocm-llm-profile'
INFO    |-> [rocprof] RPL: profiling '""./llama.cpp/llama-cli -m llama.cpp/models/llama3/Meta-Llama-3-8B-Instruct/ggml-model-f16.gguf -ngl 50 -n 4 -c 2048 --prompt The largest continent is""'
INFO    |-> [rocprof] RPL: input file '/root/git/rocm-llm-profile/workloads/llama3/MI100/perfmon/SQ_IFETCH_LEVEL.txt'
INFO    |-> [rocprof] RPL: output dir '/tmp/rpl_data_240712_142848_186089'
INFO    |-> [rocprof] RPL: result dir '/tmp/rpl_data_240712_142848_186089/input0_results_240712_142848'
INFO    |-> [rocprof] error: unknown argument: largest
INFO    |-> [rocprof] usage: ./llama.cpp/llama-cli [options]
INFO    |-> [rocprof] 
INFO    |-> [rocprof] general:
INFO    |-> [rocprof] 
INFO    |-> [rocprof] -h,    --help, --usage          print usage and exit
INFO    |-> [rocprof] --version                show version and build info
INFO    |-> [rocprof] -v,    --verbose                print verbose information
INFO    |-> [rocprof] --verbosity N            set specific verbosity level (default: 0)
INFO    |-> [rocprof] --verbose-prompt         print a verbose prompt before generation (default: false)
INFO    |-> [rocprof] --no-display-prompt      don't print prompt at generation (default: false)
INFO    |-> [rocprof] -co,   --color                  colorise output to distinguish prompt and user input from generations (default: false)
INFO    |-> [rocprof] -s,    --seed SEED              RNG seed (default: -1, use random seed for < 0)
INFO    |-> [rocprof] -t,    --threads N              number of threads to use during generation (default: 128)
INFO    |-> [rocprof] -tb,   --threads-batch N        number of threads to use during batch and prompt processing (default: same as --threads)
INFO    |-> [rocprof] -td,   --threads-draft N        number of threads to use during generation (default: same as --threads)
INFO    |-> [rocprof] -tbd,  --threads-batch-draft N  number of threads to use during batch and prompt processing (default: same as --threads-draft)
INFO    |-> [rocprof] --draft N                number of tokens to draft for speculative decoding (default: 5)
INFO    |-> [rocprof] -ps,   --p-split N              speculative decoding split probability (default: 0.1)
INFO    |-> [rocprof] -lcs,  --lookup-cache-static FNAME
INFO    |-> [rocprof] path to static lookup cache to use for lookup decoding (not updated by generation)
INFO    |-> [rocprof] -lcd,  --lookup-cache-dynamic FNAME
INFO    |-> [rocprof] path to dynamic lookup cache to use for lookup decoding (updated by generation)
INFO    |-> [rocprof] -c,    --ctx-size N             size of the prompt context (default: 0, 0 = loaded from model)
INFO    |-> [rocprof] -n,    --predict N              number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled)
INFO    |-> [rocprof] -b,    --batch-size N           logical maximum batch size (default: 2048)
INFO    |-> [rocprof] -ub,   --ubatch-size N          physical maximum batch size (default: 512)
INFO    |-> [rocprof] --keep N                 number of tokens to keep from the initial prompt (default: 0, -1 = all)
INFO    |-> [rocprof] --chunks N               max number of chunks to process (default: -1, -1 = all)
INFO    |-> [rocprof] -fa,   --flash-attn             enable Flash Attention (default: disabled)
INFO    |-> [rocprof] -p,    --prompt PROMPT          prompt to start generation with (default: '')
INFO    |-> [rocprof] -f,    --file FNAME             a file containing the prompt (default: none)
INFO    |-> [rocprof] --in-file FNAME          an input file (repeat to specify multiple files)
INFO    |-> [rocprof] -bf,   --binary-file FNAME      binary file containing the prompt (default: none)
INFO    |-> [rocprof] -e,    --escape                 process escapes sequences (\n, \r, \t, \', \", \\) (default: true)
INFO    |-> [rocprof] --no-escape              do not process escape sequences
INFO    |-> [rocprof] -ptc,  --print-token-count N    print token count every N tokens (default: -1)
INFO    |-> [rocprof] --prompt-cache FNAME     file to cache prompt state for faster startup (default: none)
INFO    |-> [rocprof] --prompt-cache-all       if specified, saves user input and generations to cache as well
INFO    |-> [rocprof] not supported with --interactive or other interactive options
INFO    |-> [rocprof] --prompt-cache-ro        if specified, uses the prompt cache but does not update it
INFO    |-> [rocprof] -r,    --reverse-prompt PROMPT  halt generation at PROMPT, return control in interactive mode
INFO    |-> [rocprof] can be specified more than once for multiple prompts
INFO    |-> [rocprof] -sp,   --special                special tokens output enabled (default: false)
INFO    |-> [rocprof] -cnv,  --conversation           run in conversation mode (does not print special tokens and suffix/prefix) (default: false)
INFO    |-> [rocprof] -i,    --interactive            run in interactive mode (default: false)
INFO    |-> [rocprof] -if,   --interactive-first      run in interactive mode and wait for input right away (default: false)
INFO    |-> [rocprof] -mli,  --multiline-input        allows you to write or paste multiple lines without ending each in '\'
INFO    |-> [rocprof] --in-prefix-bos          prefix BOS to user inputs, preceding the `--in-prefix` string
INFO    |-> [rocprof] --in-prefix STRING       string to prefix user inputs with (default: empty)
INFO    |-> [rocprof] --in-suffix STRING       string to suffix after user inputs with (default: empty)
INFO    |-> [rocprof] 
INFO    |-> [rocprof] sampling:
INFO    |-> [rocprof] 
INFO    |-> [rocprof] --samplers SAMPLERS      samplers that will be used for generation in the order, separated by ';'
INFO    |-> [rocprof] (default: top_k;tfs_z;typical_p;top_p;min_p;temperature)
INFO    |-> [rocprof] --sampling-seq SEQUENCE  simplified sequence for samplers that will be used (default: kfypmt)
INFO    |-> [rocprof] --ignore-eos             ignore end of stream token and continue generating (implies --logit-bias EOS-inf)
INFO    |-> [rocprof] --penalize-nl            penalize newline tokens (default: false)
INFO    |-> [rocprof] --temp N                 temperature (default: 0.8)
INFO    |-> [rocprof] --top-k N                top-k sampling (default: 40, 0 = disabled)
INFO    |-> [rocprof] --top-p N                top-p sampling (default: 0.9, 1.0 = disabled)
INFO    |-> [rocprof] --min-p N                min-p sampling (default: 0.1, 0.0 = disabled)
INFO    |-> [rocprof] --tfs N                  tail free sampling, parameter z (default: 1.0, 1.0 = disabled)
INFO    |-> [rocprof] --typical N              locally typical sampling, parameter p (default: 1.0, 1.0 = disabled)
INFO    |-> [rocprof] --repeat-last-n N        last n tokens to consider for penalize (default: 64, 0 = disabled, -1 = ctx_size)
INFO    |-> [rocprof] --repeat-penalty N       penalize repeat sequence of tokens (default: 1.0, 1.0 = disabled)
INFO    |-> [rocprof] --presence-penalty N     repeat alpha presence penalty (default: 0.0, 0.0 = disabled)
INFO    |-> [rocprof] --frequency-penalty N    repeat alpha frequency penalty (default: 0.0, 0.0 = disabled)
INFO    |-> [rocprof] --dynatemp-range N       dynamic temperature range (default: 0.0, 0.0 = disabled)
INFO    |-> [rocprof] --dynatemp-exp N         dynamic temperature exponent (default: 1.0)
INFO    |-> [rocprof] --mirostat N             use Mirostat sampling.
INFO    |-> [rocprof] Top K, Nucleus, Tail Free and Locally Typical samplers are ignored if used.
INFO    |-> [rocprof] (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
INFO    |-> [rocprof] --mirostat-lr N          Mirostat learning rate, parameter eta (default: 0.1)
INFO    |-> [rocprof] --mirostat-ent N         Mirostat target entropy, parameter tau (default: 5.0)
INFO    |-> [rocprof] -l TOKEN_ID(+/-)BIAS     modifies the likelihood of token appearing in the completion,
INFO    |-> [rocprof] i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello',
INFO    |-> [rocprof] or `--logit-bias 15043-1` to decrease likelihood of token ' Hello'
INFO    |-> [rocprof] --cfg-negative-prompt PROMPT
INFO    |-> [rocprof] negative prompt to use for guidance (default: '')
INFO    |-> [rocprof] --cfg-negative-prompt-file FNAME
INFO    |-> [rocprof] negative prompt file to use for guidance
INFO    |-> [rocprof] --cfg-scale N            strength of guidance (default: 1.0, 1.0 = disable)
INFO    |-> [rocprof] 
INFO    |-> [rocprof] grammar:
INFO    |-> [rocprof] 
INFO    |-> [rocprof] --grammar GRAMMAR        BNF-like grammar to constrain generations (see samples in grammars/ dir) (default: '')
INFO    |-> [rocprof] --grammar-file FNAME     file to read grammar from
INFO    |-> [rocprof] -j,    --json-schema SCHEMA     JSON schema to constrain generations (https://json-schema.org/), e.g. `{}` for any JSON object
INFO    |-> [rocprof] For schemas w/ external $refs, use --grammar + example/json_schema_to_grammar.py instead
INFO    |-> [rocprof] 
INFO    |-> [rocprof] embedding:
INFO    |-> [rocprof] 
INFO    |-> [rocprof] --pooling {none,mean,cls}
INFO    |-> [rocprof] pooling type for embeddings, use model default if unspecified
INFO    |-> [rocprof] 
INFO    |-> [rocprof] context hacking:
INFO    |-> [rocprof] 
INFO    |-> [rocprof] --rope-scaling {none,linear,yarn}
INFO    |-> [rocprof] RoPE frequency scaling method, defaults to linear unless specified by the model
INFO    |-> [rocprof] --rope-scale N           RoPE context scaling factor, expands context by a factor of N
INFO    |-> [rocprof] --rope-freq-base N       RoPE base frequency, used by NTK-aware scaling (default: loaded from model)
INFO    |-> [rocprof] --rope-freq-scale N      RoPE frequency scaling factor, expands context by a factor of 1/N
INFO    |-> [rocprof] --yarn-orig-ctx N        YaRN: original context size of model (default: 0 = model training context size)
INFO    |-> [rocprof] --yarn-ext-factor N      YaRN: extrapolation mix factor (default: -1.0, 0.0 = full interpolation)
INFO    |-> [rocprof] --yarn-attn-factor N     YaRN: scale sqrt(t) or attention magnitude (default: 1.0)
INFO    |-> [rocprof] --yarn-beta-slow N       YaRN: high correction dim or alpha (default: 1.0)
INFO    |-> [rocprof] --yarn-beta-fast N       YaRN: low correction dim or beta (default: 32.0)
INFO    |-> [rocprof] -gan,  --grp-attn-n N           group-attention factor (default: 1)
INFO    |-> [rocprof] -gaw,  --grp-attn-w N           group-attention width (default: 512.0)
INFO    |-> [rocprof] -dkvc, --dump-kv-cache          verbose print of the KV cache
INFO    |-> [rocprof] -nkvo, --no-kv-offload          disable KV offload
INFO    |-> [rocprof] -ctk,  --cache-type-k TYPE      KV cache data type for K (default: f16)
INFO    |-> [rocprof] -ctv,  --cache-type-v TYPE      KV cache data type for V (default: f16)
INFO    |-> [rocprof] 
INFO    |-> [rocprof] perplexity:
INFO    |-> [rocprof] 
INFO    |-> [rocprof] --all-logits             return logits for all tokens in the batch (default: false)
INFO    |-> [rocprof] --hellaswag              compute HellaSwag score over random tasks from datafile supplied with -f
INFO    |-> [rocprof] --hellaswag-tasks N      number of tasks to use when computing the HellaSwag score (default: 400)
INFO    |-> [rocprof] --winogrande             compute Winogrande score over random tasks from datafile supplied with -f
INFO    |-> [rocprof] --winogrande-tasks N     number of tasks to use when computing the Winogrande score (default: 0)
INFO    |-> [rocprof] --multiple-choice        compute multiple choice score over random tasks from datafile supplied with -f
INFO    |-> [rocprof] --multiple-choice-tasks N
INFO    |-> [rocprof] number of tasks to use when computing the multiple choice score (default: 0)
INFO    |-> [rocprof] --kl-divergence          computes KL-divergence to logits provided via --kl-divergence-base
INFO    |-> [rocprof] --ppl-stride N           stride for perplexity calculation (default: 0)
INFO    |-> [rocprof] --ppl-output-type {0,1}  output type for perplexity calculation (default: 0)
INFO    |-> [rocprof] 
INFO    |-> [rocprof] parallel:
INFO    |-> [rocprof] 
INFO    |-> [rocprof] -dt,   --defrag-thold N         KV cache defragmentation threshold (default: -1.0, < 0 - disabled)
INFO    |-> [rocprof] -np,   --parallel N             number of parallel sequences to decode (default: 1)
INFO    |-> [rocprof] -ns,   --sequences N            number of sequences to decode (default: 1)
INFO    |-> [rocprof] -cb,   --cont-batching          enable continuous batching (a.k.a dynamic batching) (default: enabled)
INFO    |-> [rocprof] 
INFO    |-> [rocprof] multi-modality:
INFO    |-> [rocprof] 
INFO    |-> [rocprof] --mmproj FILE            path to a multimodal projector file for LLaVA. see examples/llava/README.md
INFO    |-> [rocprof] --image FILE             path to an image file. use with multimodal models. Specify multiple times for batching
INFO    |-> [rocprof] 
INFO    |-> [rocprof] backend:
INFO    |-> [rocprof] 
INFO    |-> [rocprof] --rpc SERVERS            comma separated list of RPC servers
INFO    |-> [rocprof] --mlock                  force system to keep model in RAM rather than swapping or compressing
INFO    |-> [rocprof] --no-mmap                do not memory-map model (slower load but may reduce pageouts if not using mlock)
INFO    |-> [rocprof] --numa TYPE              attempt optimizations that help on some NUMA systems
INFO    |-> [rocprof] - distribute: spread execution evenly over all nodes
INFO    |-> [rocprof] - isolate: only spawn threads on CPUs on the node that execution started on
INFO    |-> [rocprof] - numactl: use the CPU map provided by numactl
INFO    |-> [rocprof] if run without this previously, it is recommended to drop the system page cache before using this
INFO    |-> [rocprof] see https://github.com/ggerganov/llama.cpp/issues/1437
INFO    |-> [rocprof] -ngl,  --gpu-layers N           number of layers to store in VRAM
INFO    |-> [rocprof] -ngld, --gpu-layers-draft N     number of layers to store in VRAM for the draft model
INFO    |-> [rocprof] -sm,   --split-mode SPLIT_MODE  how to split the model across multiple GPUs, one of:
INFO    |-> [rocprof] - none: use one GPU only
INFO    |-> [rocprof] - layer (default): split layers and KV across GPUs
INFO    |-> [rocprof] - row: split rows across GPUs
INFO    |-> [rocprof] -ts,   --tensor-split SPLIT     fraction of the model to offload to each GPU, comma-separated list of proportions, e.g. 3,1
INFO    |-> [rocprof] -mg,   --main-gpu i             the GPU to use for the model (with split-mode = none),
INFO    |-> [rocprof] or for intermediate results and KV (with split-mode = row) (default: 0)
INFO    |-> [rocprof] 
INFO    |-> [rocprof] model:
INFO    |-> [rocprof] 
INFO    |-> [rocprof] --check-tensors          check model tensor data for invalid values (default: false)
INFO    |-> [rocprof] --override-kv KEY=TYPE:VALUE
INFO    |-> [rocprof] advanced option to override model metadata by key. may be specified multiple times.
INFO    |-> [rocprof] types: int, float, bool, str. example: --override-kv tokenizer.ggml.add_bos_token=bool:false
INFO    |-> [rocprof] --lora FNAME             apply LoRA adapter (implies --no-mmap)
INFO    |-> [rocprof] --lora-scaled FNAME S    apply LoRA adapter with user defined scaling S (implies --no-mmap)
INFO    |-> [rocprof] --lora-base FNAME        optional model to use as a base for the layers modified by the LoRA adapter
INFO    |-> [rocprof] --control-vector FNAME   add a control vector
INFO    |-> [rocprof] --control-vector-scaled FNAME SCALE
INFO    |-> [rocprof] add a control vector with user defined scaling SCALE
INFO    |-> [rocprof] --control-vector-layer-range START END
INFO    |-> [rocprof] layer range to apply the control vector(s) to, start and end inclusive
INFO    |-> [rocprof] -m,    --model FNAME            model path (default: models/$filename with filename from --hf-file
INFO    |-> [rocprof] or --model-url if set, otherwise models/7B/ggml-model-f16.gguf)
INFO    |-> [rocprof] -md,   --model-draft FNAME      draft model for speculative decoding (default: unused)
INFO    |-> [rocprof] -mu,   --model-url MODEL_URL    model download url (default: unused)
INFO    |-> [rocprof] -hfr,  --hf-repo REPO           Hugging Face model repository (default: unused)
INFO    |-> [rocprof] -hff,  --hf-file FILE           Hugging Face model file (default: unused)
INFO    |-> [rocprof] 
INFO    |-> [rocprof] retrieval:
INFO    |-> [rocprof] 
INFO    |-> [rocprof] --context-file FNAME     file to load context from (repeat to specify multiple files)
INFO    |-> [rocprof] --chunk-size N           minimum length of embedded text chunks (default: 64)
INFO    |-> [rocprof] --chunk-separator STRING
INFO    |-> [rocprof] separator between chunks (default: '
INFO    |-> [rocprof] ')
INFO    |-> [rocprof] 
INFO    |-> [rocprof] passkey:
INFO    |-> [rocprof] 
INFO    |-> [rocprof] --junk N                 number of times to repeat the junk text (default: 250)
INFO    |-> [rocprof] --pos N                  position of the passkey in the junk text (default: -1)
INFO    |-> [rocprof] 
INFO    |-> [rocprof] imatrix:
INFO    |-> [rocprof] 
INFO    |-> [rocprof] -o,    --output FNAME           output file (default: 'imatrix.dat')
INFO    |-> [rocprof] --output-frequency N     output the imatrix every N iterations (default: 10)
INFO    |-> [rocprof] --save-frequency N       save an imatrix copy every N iterations (default: 0)
INFO    |-> [rocprof] --process-output         collect data for the output tensor (default: false)
INFO    |-> [rocprof] --no-ppl                 do not compute perplexity (default: true)
INFO    |-> [rocprof] --chunk N                start processing the input from chunk N (default: 0)
INFO    |-> [rocprof] 
INFO    |-> [rocprof] bench:
INFO    |-> [rocprof] 
INFO    |-> [rocprof] -pps                            is the prompt shared across parallel sequences (default: false)
INFO    |-> [rocprof] -npp n0,n1,...                  number of prompt tokens
INFO    |-> [rocprof] -ntg n0,n1,...                  number of text generation tokens
INFO    |-> [rocprof] -npl n0,n1,...                  number of parallel prompts
INFO    |-> [rocprof] 
INFO    |-> [rocprof] server:
INFO    |-> [rocprof] 
INFO    |-> [rocprof] --host HOST              ip address to listen (default: 127.0.0.1)
INFO    |-> [rocprof] --port PORT              port to listen (default: 8080)
INFO    |-> [rocprof] --path PATH              path to serve static files from (default: )
INFO    |-> [rocprof] --embedding(s)           enable embedding endpoint (default: disabled)
INFO    |-> [rocprof] --api-key KEY            API key to use for authentication (default: none)
INFO    |-> [rocprof] --api-key-file FNAME     path to file containing API keys (default: none)
INFO    |-> [rocprof] --ssl-key-file FNAME     path to file a PEM-encoded SSL private key
INFO    |-> [rocprof] --ssl-cert-file FNAME    path to file a PEM-encoded SSL certificate
INFO    |-> [rocprof] --timeout N              server read/write timeout in seconds (default: 600)
INFO    |-> [rocprof] --threads-http N         number of threads used to process HTTP requests (default: -1)
INFO    |-> [rocprof] --system-prompt-file FNAME
INFO    |-> [rocprof] set a file to load a system prompt (initial prompt of all slots), this is useful for chat applications
INFO    |-> [rocprof] --log-format {text,json}
INFO    |-> [rocprof] log output format: json or text (default: json)
INFO    |-> [rocprof] --metrics                enable prometheus compatible metrics endpoint (default: disabled)
INFO    |-> [rocprof] --no-slots               disables slots monitoring endpoint (default: enabled)
INFO    |-> [rocprof] --slot-save-path PATH    path to save slot kv cache (default: disabled)
INFO    |-> [rocprof] --chat-template JINJA_TEMPLATE
INFO    |-> [rocprof] set custom jinja chat template (default: template taken from model's metadata)
INFO    |-> [rocprof] only commonly used templates are accepted:
INFO    |-> [rocprof] https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template
INFO    |-> [rocprof] -sps,  --slot-prompt-similarity SIMILARITY
INFO    |-> [rocprof] how much the prompt of a request must match the prompt of a slot in order to use that slot (default: 0.50, 0.0 = disabled)
INFO    |-> [rocprof] 
INFO    |-> [rocprof] 
INFO    |-> [rocprof] logging:
INFO    |-> [rocprof] 
INFO    |-> [rocprof] --simple-io              use basic IO for better compatibility in subprocesses and limited consoles
INFO    |-> [rocprof] -ld,   --logdir LOGDIR          path under which to save YAML logs (no logging if unset)
INFO    |-> [rocprof] --log-test               Run simple logging test
INFO    |-> [rocprof] --log-disable            Disable trace logs
INFO    |-> [rocprof] --log-enable             Enable trace logs
INFO    |-> [rocprof] --log-file FNAME         Specify a log filename (without extension)
INFO    |-> [rocprof] --log-new                Create a separate new log file on start. Each log file will have unique name: "<name>.<ID>.log"
INFO    |-> [rocprof] --log-append             Don't truncate the old log file.
INFO    |-> [rocprof] 
INFO    |-> [rocprof] File '/root/git/rocm-llm-profile/workloads/llama3/MI100/SQ_IFETCH_LEVEL.csv' is generating
INFO    |-> [rocprof] 
ERROR Profiling execution failed.
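
One plausible reading of the log above, offered as a sketch rather than a confirmed diagnosis: the subprocess line shows the whole application command handed to rocprof as a single space-joined string, and the "RPL: profiling" line echoes it inside doubled quotes, so the boundary around the prompt is gone by the time llama-cli parses its arguments. The Python snippet below re-tokenizes the logged string with shlex (POSIX-style splitting is an assumption about what the rocprof launcher does) and walks the tokens with a toy scanner. The names first_stray_token and VALUE_FLAGS are hypothetical and exist only for illustration; they are not part of llama.cpp or omniperf. The flags listed are the ones the usage text shows taking a value.

```python
import shlex

# Application command as echoed by the "RPL: profiling" log line,
# including the doubled quotes that appear there.
LOGGED = (
    '""./llama.cpp/llama-cli -m llama.cpp/models/llama3/'
    'Meta-Llama-3-8B-Instruct/ggml-model-f16.gguf '
    '-ngl 50 -n 4 -c 2048 --prompt The largest continent is""'
)

# Flags used in this command that consume exactly one value
# (per the llama-cli usage text in the log).
VALUE_FLAGS = {"-m", "-ngl", "-n", "-c", "--prompt"}


def first_stray_token(tokens, value_flags):
    """Return the first token that is neither a known flag nor a flag's value.

    Toy scanner for illustration only; this is not llama.cpp's real parser.
    """
    expect_value = False
    for tok in tokens[1:]:  # tokens[0] is the executable
        if expect_value:
            expect_value = False  # consumed as the previous flag's value
        elif tok in value_flags:
            expect_value = True
        else:
            return tok
    return None


# Re-tokenizing the logged string: the doubled quotes cancel out and the
# prompt becomes four separate words.
print(first_stray_token(shlex.split(LOGGED), VALUE_FLAGS))  # prints: largest

# The intended argument list keeps the prompt as one element; shlex.join
# quotes it so that it survives a join/split round trip.
intended = shlex.split(LOGGED)[:10] + ["The largest continent is"]
round_trip = shlex.split(shlex.join(intended))
print(first_stray_token(round_trip, VALUE_FLAGS))  # prints: None
```

Under these assumptions, --prompt consumes only "The", and the next word, "largest", is the one reported as an unknown argument, which matches the error in the log.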

Expected behavior I expect omniperf to run without any errors.
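
A possible workaround, untested against this setup: the usage text above lists -f/--file as another way to supply the prompt, so placing the prompt in a file keeps every argument free of whitespace, and the command then survives being flattened to one string and split again. The sketch below only builds and checks such a command; it does not launch omniperf. The function name build_app_command and the file name prompt.txt are hypothetical, and the model path and numeric values are copied from the log.

```python
import tempfile
from pathlib import Path


def build_app_command(prompt, prompt_dir):
    """Write the prompt to a file and return a llama-cli argv that uses -f."""
    prompt_file = Path(prompt_dir) / "prompt.txt"  # hypothetical file name
    prompt_file.write_text(prompt)
    return [
        "./llama.cpp/llama-cli",
        "-m", "llama.cpp/models/llama3/Meta-Llama-3-8B-Instruct/ggml-model-f16.gguf",
        "-ngl", "50",
        "-n", "4",
        "-c", "2048",
        "-f", str(prompt_file),  # prompt read from file instead of --prompt
    ]


# mkdtemp leaves the directory in place, so the prompt file would still
# exist when a profiler launches the application later.
prompt_dir = tempfile.mkdtemp()
argv = build_app_command("The largest continent is", prompt_dir)

# Flatten to a single string, as the subprocess line in the log suggests
# happens, then split on whitespace again.
flattened = " ".join(argv)
print(flattened)
print(flattened.split() == argv)
```

If the final check prints True, no argument contains whitespace (this also depends on the temporary directory path having no spaces), so the argument boundaries no longer depend on quoting being preserved between omniperf, rocprof, and llama-cli.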

Additional context The directory listing below shows that my omniperf 2.0.1 installation is in place.

root@febc8da47e3f:~/git/rocm-llm-profile# ls $INSTALL_DIR
2.0.1  modulefiles  python-libs

ROCm / rocprofiler-compute

Profiling execution error on 2x MI100 system #381