ROCm / rocprofiler-compute

Advanced Profiling and Analytics for AMD Hardware
https://rocm.docs.amd.com/projects/omniperf/en/latest/
MIT License

Profiling execution error on 2x MI100 system #381

Closed aymane-eljerari closed 3 months ago

aymane-eljerari commented 3 months ago

Describe the bug: I am unable to profile my workload.

Development Environment:

To Reproduce (steps to reproduce the behavior):

  1. Set up an Ubuntu 22.04 Docker container running ROCm 6.0.
  2. Clone and compile llama.cpp, disabling GPU peer-to-peer during compilation.
  3. Run the following command to profile a sample LLM forward pass: omniperf profile -V -n llama3 -k dequantize_mul_mat_vec -- ./llama.cpp/llama-cli -m llama.cpp/models/llama3/Meta-Llama-3-8B-Instruct/ggml-model-f16.gguf -ngl 50 -n $num_tokens -c $context --prompt $user_prompt
  4. See command output and error below:

    DEBUG ROC Profiler: /opt/rocm-6.0.0/bin/rocprof
    DEBUG Execution mode = profile
    
    ___                  _                  __ 
    / _ \ _ __ ___  _ __ (_)_ __   ___ _ __ / _|
    | | | | '_ ` _ \| '_ \| | '_ \ / _ \ '__| |_ 
    | |_| | | | | | | | | | | |_) |  __/ |  |  _|
    \___/|_| |_| |_|_| |_|_| .__/ \___|_|  |_|  
                        |_|                  
    
    DEBUG [profiling] perform SoC profiling setup for gfx908
    DEBUG [profiling] pre-processing using rocprofv1 profiler
    DEBUG [profiling] performing profiling using rocprofv1 profiler
    INFO Omniperf version: 2.0.1
    INFO Profiler choice: rocprofv1
    INFO Path: /root/git/rocm-llm-profile/workloads/llama3/MI100
    INFO Target: MI100
    INFO Command: ./llama.cpp/llama-cli -m llama.cpp/models/llama3/Meta-Llama-3-8B-Instruct/ggml-model-f16.gguf -ngl 50 -n 4 -c 2048 --prompt The largest continent is
    INFO Kernel Selection: ['dequantize_mul_mat_vec']
    INFO Dispatch Selection: None
    INFO Hardware Blocks: All
    INFO 
    INFO ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    INFO Collecting Performance Counters
    INFO ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    INFO 
    DEBUG [subprocess] ['sed', '-i', '-r', 's%^(kernel:).*%kernel: dequantize_mul_mat_vec%g', '/root/git/rocm-llm-profile/workloads/llama3/MI100/perfmon/SQ_IFETCH_LEVEL.txt']
    INFO 
    DEBUG 
    INFO [profiling] Current input file: /root/git/rocm-llm-profile/workloads/llama3/MI100/perfmon/SQ_IFETCH_LEVEL.txt
    DEBUG pmc file: SQ_IFETCH_LEVEL.txt
    DEBUG [subprocess] ['rocprof', '-i', '/root/git/rocm-llm-profile/workloads/llama3/MI100/perfmon/SQ_IFETCH_LEVEL.txt', '-m', '/git/2.0.1/libexec/omniperf/omniperf_soc/profile_configs/metrics.xml', '--timestamp', 'on', '-o', '/root/git/rocm-llm-profile/workloads/llama3/MI100/SQ_IFETCH_LEVEL.csv', '"./llama.cpp/llama-cli -m llama.cpp/models/llama3/Meta-Llama-3-8B-Instruct/ggml-model-f16.gguf -ngl 50 -n 4 -c 2048 --prompt The largest continent is"']
    INFO    |-> [rocprof] RPL: on '240712_142848' from '/opt/rocm-6.0.0' in '/root/git/rocm-llm-profile'
    INFO    |-> [rocprof] RPL: profiling '""./llama.cpp/llama-cli -m llama.cpp/models/llama3/Meta-Llama-3-8B-Instruct/ggml-model-f16.gguf -ngl 50 -n 4 -c 2048 --prompt The largest continent is""'
    INFO    |-> [rocprof] RPL: input file '/root/git/rocm-llm-profile/workloads/llama3/MI100/perfmon/SQ_IFETCH_LEVEL.txt'
    INFO    |-> [rocprof] RPL: output dir '/tmp/rpl_data_240712_142848_186089'
    INFO    |-> [rocprof] RPL: result dir '/tmp/rpl_data_240712_142848_186089/input0_results_240712_142848'
    INFO    |-> [rocprof] error: unknown argument: largest
    INFO    |-> [rocprof] usage: ./llama.cpp/llama-cli [options]
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] general:
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] -h,    --help, --usage          print usage and exit
    INFO    |-> [rocprof] --version                show version and build info
    INFO    |-> [rocprof] -v,    --verbose                print verbose information
    INFO    |-> [rocprof] --verbosity N            set specific verbosity level (default: 0)
    INFO    |-> [rocprof] --verbose-prompt         print a verbose prompt before generation (default: false)
    INFO    |-> [rocprof] --no-display-prompt      don't print prompt at generation (default: false)
    INFO    |-> [rocprof] -co,   --color                  colorise output to distinguish prompt and user input from generations (default: false)
    INFO    |-> [rocprof] -s,    --seed SEED              RNG seed (default: -1, use random seed for < 0)
    INFO    |-> [rocprof] -t,    --threads N              number of threads to use during generation (default: 128)
    INFO    |-> [rocprof] -tb,   --threads-batch N        number of threads to use during batch and prompt processing (default: same as --threads)
    INFO    |-> [rocprof] -td,   --threads-draft N        number of threads to use during generation (default: same as --threads)
    INFO    |-> [rocprof] -tbd,  --threads-batch-draft N  number of threads to use during batch and prompt processing (default: same as --threads-draft)
    INFO    |-> [rocprof] --draft N                number of tokens to draft for speculative decoding (default: 5)
    INFO    |-> [rocprof] -ps,   --p-split N              speculative decoding split probability (default: 0.1)
    INFO    |-> [rocprof] -lcs,  --lookup-cache-static FNAME
    INFO    |-> [rocprof] path to static lookup cache to use for lookup decoding (not updated by generation)
    INFO    |-> [rocprof] -lcd,  --lookup-cache-dynamic FNAME
    INFO    |-> [rocprof] path to dynamic lookup cache to use for lookup decoding (updated by generation)
    INFO    |-> [rocprof] -c,    --ctx-size N             size of the prompt context (default: 0, 0 = loaded from model)
    INFO    |-> [rocprof] -n,    --predict N              number of tokens to predict (default: -1, -1 = infinity, -2 = until context filled)
    INFO    |-> [rocprof] -b,    --batch-size N           logical maximum batch size (default: 2048)
    INFO    |-> [rocprof] -ub,   --ubatch-size N          physical maximum batch size (default: 512)
    INFO    |-> [rocprof] --keep N                 number of tokens to keep from the initial prompt (default: 0, -1 = all)
    INFO    |-> [rocprof] --chunks N               max number of chunks to process (default: -1, -1 = all)
    INFO    |-> [rocprof] -fa,   --flash-attn             enable Flash Attention (default: disabled)
    INFO    |-> [rocprof] -p,    --prompt PROMPT          prompt to start generation with (default: '')
    INFO    |-> [rocprof] -f,    --file FNAME             a file containing the prompt (default: none)
    INFO    |-> [rocprof] --in-file FNAME          an input file (repeat to specify multiple files)
    INFO    |-> [rocprof] -bf,   --binary-file FNAME      binary file containing the prompt (default: none)
    INFO    |-> [rocprof] -e,    --escape                 process escapes sequences (\n, \r, \t, \', \", \\) (default: true)
    INFO    |-> [rocprof] --no-escape              do not process escape sequences
    INFO    |-> [rocprof] -ptc,  --print-token-count N    print token count every N tokens (default: -1)
    INFO    |-> [rocprof] --prompt-cache FNAME     file to cache prompt state for faster startup (default: none)
    INFO    |-> [rocprof] --prompt-cache-all       if specified, saves user input and generations to cache as well
    INFO    |-> [rocprof] not supported with --interactive or other interactive options
    INFO    |-> [rocprof] --prompt-cache-ro        if specified, uses the prompt cache but does not update it
    INFO    |-> [rocprof] -r,    --reverse-prompt PROMPT  halt generation at PROMPT, return control in interactive mode
    INFO    |-> [rocprof] can be specified more than once for multiple prompts
    INFO    |-> [rocprof] -sp,   --special                special tokens output enabled (default: false)
    INFO    |-> [rocprof] -cnv,  --conversation           run in conversation mode (does not print special tokens and suffix/prefix) (default: false)
    INFO    |-> [rocprof] -i,    --interactive            run in interactive mode (default: false)
    INFO    |-> [rocprof] -if,   --interactive-first      run in interactive mode and wait for input right away (default: false)
    INFO    |-> [rocprof] -mli,  --multiline-input        allows you to write or paste multiple lines without ending each in '\'
    INFO    |-> [rocprof] --in-prefix-bos          prefix BOS to user inputs, preceding the `--in-prefix` string
    INFO    |-> [rocprof] --in-prefix STRING       string to prefix user inputs with (default: empty)
    INFO    |-> [rocprof] --in-suffix STRING       string to suffix after user inputs with (default: empty)
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] sampling:
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] --samplers SAMPLERS      samplers that will be used for generation in the order, separated by ';'
    INFO    |-> [rocprof] (default: top_k;tfs_z;typical_p;top_p;min_p;temperature)
    INFO    |-> [rocprof] --sampling-seq SEQUENCE  simplified sequence for samplers that will be used (default: kfypmt)
    INFO    |-> [rocprof] --ignore-eos             ignore end of stream token and continue generating (implies --logit-bias EOS-inf)
    INFO    |-> [rocprof] --penalize-nl            penalize newline tokens (default: false)
    INFO    |-> [rocprof] --temp N                 temperature (default: 0.8)
    INFO    |-> [rocprof] --top-k N                top-k sampling (default: 40, 0 = disabled)
    INFO    |-> [rocprof] --top-p N                top-p sampling (default: 0.9, 1.0 = disabled)
    INFO    |-> [rocprof] --min-p N                min-p sampling (default: 0.1, 0.0 = disabled)
    INFO    |-> [rocprof] --tfs N                  tail free sampling, parameter z (default: 1.0, 1.0 = disabled)
    INFO    |-> [rocprof] --typical N              locally typical sampling, parameter p (default: 1.0, 1.0 = disabled)
    INFO    |-> [rocprof] --repeat-last-n N        last n tokens to consider for penalize (default: 64, 0 = disabled, -1 = ctx_size)
    INFO    |-> [rocprof] --repeat-penalty N       penalize repeat sequence of tokens (default: 1.0, 1.0 = disabled)
    INFO    |-> [rocprof] --presence-penalty N     repeat alpha presence penalty (default: 0.0, 0.0 = disabled)
    INFO    |-> [rocprof] --frequency-penalty N    repeat alpha frequency penalty (default: 0.0, 0.0 = disabled)
    INFO    |-> [rocprof] --dynatemp-range N       dynamic temperature range (default: 0.0, 0.0 = disabled)
    INFO    |-> [rocprof] --dynatemp-exp N         dynamic temperature exponent (default: 1.0)
    INFO    |-> [rocprof] --mirostat N             use Mirostat sampling.
    INFO    |-> [rocprof] Top K, Nucleus, Tail Free and Locally Typical samplers are ignored if used.
    INFO    |-> [rocprof] (default: 0, 0 = disabled, 1 = Mirostat, 2 = Mirostat 2.0)
    INFO    |-> [rocprof] --mirostat-lr N          Mirostat learning rate, parameter eta (default: 0.1)
    INFO    |-> [rocprof] --mirostat-ent N         Mirostat target entropy, parameter tau (default: 5.0)
    INFO    |-> [rocprof] -l TOKEN_ID(+/-)BIAS     modifies the likelihood of token appearing in the completion,
    INFO    |-> [rocprof] i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello',
    INFO    |-> [rocprof] or `--logit-bias 15043-1` to decrease likelihood of token ' Hello'
    INFO    |-> [rocprof] --cfg-negative-prompt PROMPT
    INFO    |-> [rocprof] negative prompt to use for guidance (default: '')
    INFO    |-> [rocprof] --cfg-negative-prompt-file FNAME
    INFO    |-> [rocprof] negative prompt file to use for guidance
    INFO    |-> [rocprof] --cfg-scale N            strength of guidance (default: 1.0, 1.0 = disable)
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] grammar:
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] --grammar GRAMMAR        BNF-like grammar to constrain generations (see samples in grammars/ dir) (default: '')
    INFO    |-> [rocprof] --grammar-file FNAME     file to read grammar from
    INFO    |-> [rocprof] -j,    --json-schema SCHEMA     JSON schema to constrain generations (https://json-schema.org/), e.g. `{}` for any JSON object
    INFO    |-> [rocprof] For schemas w/ external $refs, use --grammar + example/json_schema_to_grammar.py instead
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] embedding:
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] --pooling {none,mean,cls}
    INFO    |-> [rocprof] pooling type for embeddings, use model default if unspecified
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] context hacking:
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] --rope-scaling {none,linear,yarn}
    INFO    |-> [rocprof] RoPE frequency scaling method, defaults to linear unless specified by the model
    INFO    |-> [rocprof] --rope-scale N           RoPE context scaling factor, expands context by a factor of N
    INFO    |-> [rocprof] --rope-freq-base N       RoPE base frequency, used by NTK-aware scaling (default: loaded from model)
    INFO    |-> [rocprof] --rope-freq-scale N      RoPE frequency scaling factor, expands context by a factor of 1/N
    INFO    |-> [rocprof] --yarn-orig-ctx N        YaRN: original context size of model (default: 0 = model training context size)
    INFO    |-> [rocprof] --yarn-ext-factor N      YaRN: extrapolation mix factor (default: -1.0, 0.0 = full interpolation)
    INFO    |-> [rocprof] --yarn-attn-factor N     YaRN: scale sqrt(t) or attention magnitude (default: 1.0)
    INFO    |-> [rocprof] --yarn-beta-slow N       YaRN: high correction dim or alpha (default: 1.0)
    INFO    |-> [rocprof] --yarn-beta-fast N       YaRN: low correction dim or beta (default: 32.0)
    INFO    |-> [rocprof] -gan,  --grp-attn-n N           group-attention factor (default: 1)
    INFO    |-> [rocprof] -gaw,  --grp-attn-w N           group-attention width (default: 512.0)
    INFO    |-> [rocprof] -dkvc, --dump-kv-cache          verbose print of the KV cache
    INFO    |-> [rocprof] -nkvo, --no-kv-offload          disable KV offload
    INFO    |-> [rocprof] -ctk,  --cache-type-k TYPE      KV cache data type for K (default: f16)
    INFO    |-> [rocprof] -ctv,  --cache-type-v TYPE      KV cache data type for V (default: f16)
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] perplexity:
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] --all-logits             return logits for all tokens in the batch (default: false)
    INFO    |-> [rocprof] --hellaswag              compute HellaSwag score over random tasks from datafile supplied with -f
    INFO    |-> [rocprof] --hellaswag-tasks N      number of tasks to use when computing the HellaSwag score (default: 400)
    INFO    |-> [rocprof] --winogrande             compute Winogrande score over random tasks from datafile supplied with -f
    INFO    |-> [rocprof] --winogrande-tasks N     number of tasks to use when computing the Winogrande score (default: 0)
    INFO    |-> [rocprof] --multiple-choice        compute multiple choice score over random tasks from datafile supplied with -f
    INFO    |-> [rocprof] --multiple-choice-tasks N
    INFO    |-> [rocprof] number of tasks to use when computing the multiple choice score (default: 0)
    INFO    |-> [rocprof] --kl-divergence          computes KL-divergence to logits provided via --kl-divergence-base
    INFO    |-> [rocprof] --ppl-stride N           stride for perplexity calculation (default: 0)
    INFO    |-> [rocprof] --ppl-output-type {0,1}  output type for perplexity calculation (default: 0)
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] parallel:
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] -dt,   --defrag-thold N         KV cache defragmentation threshold (default: -1.0, < 0 - disabled)
    INFO    |-> [rocprof] -np,   --parallel N             number of parallel sequences to decode (default: 1)
    INFO    |-> [rocprof] -ns,   --sequences N            number of sequences to decode (default: 1)
    INFO    |-> [rocprof] -cb,   --cont-batching          enable continuous batching (a.k.a dynamic batching) (default: enabled)
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] multi-modality:
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] --mmproj FILE            path to a multimodal projector file for LLaVA. see examples/llava/README.md
    INFO    |-> [rocprof] --image FILE             path to an image file. use with multimodal models. Specify multiple times for batching
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] backend:
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] --rpc SERVERS            comma separated list of RPC servers
    INFO    |-> [rocprof] --mlock                  force system to keep model in RAM rather than swapping or compressing
    INFO    |-> [rocprof] --no-mmap                do not memory-map model (slower load but may reduce pageouts if not using mlock)
    INFO    |-> [rocprof] --numa TYPE              attempt optimizations that help on some NUMA systems
    INFO    |-> [rocprof] - distribute: spread execution evenly over all nodes
    INFO    |-> [rocprof] - isolate: only spawn threads on CPUs on the node that execution started on
    INFO    |-> [rocprof] - numactl: use the CPU map provided by numactl
    INFO    |-> [rocprof] if run without this previously, it is recommended to drop the system page cache before using this
    INFO    |-> [rocprof] see https://github.com/ggerganov/llama.cpp/issues/1437
    INFO    |-> [rocprof] -ngl,  --gpu-layers N           number of layers to store in VRAM
    INFO    |-> [rocprof] -ngld, --gpu-layers-draft N     number of layers to store in VRAM for the draft model
    INFO    |-> [rocprof] -sm,   --split-mode SPLIT_MODE  how to split the model across multiple GPUs, one of:
    INFO    |-> [rocprof] - none: use one GPU only
    INFO    |-> [rocprof] - layer (default): split layers and KV across GPUs
    INFO    |-> [rocprof] - row: split rows across GPUs
    INFO    |-> [rocprof] -ts,   --tensor-split SPLIT     fraction of the model to offload to each GPU, comma-separated list of proportions, e.g. 3,1
    INFO    |-> [rocprof] -mg,   --main-gpu i             the GPU to use for the model (with split-mode = none),
    INFO    |-> [rocprof] or for intermediate results and KV (with split-mode = row) (default: 0)
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] model:
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] --check-tensors          check model tensor data for invalid values (default: false)
    INFO    |-> [rocprof] --override-kv KEY=TYPE:VALUE
    INFO    |-> [rocprof] advanced option to override model metadata by key. may be specified multiple times.
    INFO    |-> [rocprof] types: int, float, bool, str. example: --override-kv tokenizer.ggml.add_bos_token=bool:false
    INFO    |-> [rocprof] --lora FNAME             apply LoRA adapter (implies --no-mmap)
    INFO    |-> [rocprof] --lora-scaled FNAME S    apply LoRA adapter with user defined scaling S (implies --no-mmap)
    INFO    |-> [rocprof] --lora-base FNAME        optional model to use as a base for the layers modified by the LoRA adapter
    INFO    |-> [rocprof] --control-vector FNAME   add a control vector
    INFO    |-> [rocprof] --control-vector-scaled FNAME SCALE
    INFO    |-> [rocprof] add a control vector with user defined scaling SCALE
    INFO    |-> [rocprof] --control-vector-layer-range START END
    INFO    |-> [rocprof] layer range to apply the control vector(s) to, start and end inclusive
    INFO    |-> [rocprof] -m,    --model FNAME            model path (default: models/$filename with filename from --hf-file
    INFO    |-> [rocprof] or --model-url if set, otherwise models/7B/ggml-model-f16.gguf)
    INFO    |-> [rocprof] -md,   --model-draft FNAME      draft model for speculative decoding (default: unused)
    INFO    |-> [rocprof] -mu,   --model-url MODEL_URL    model download url (default: unused)
    INFO    |-> [rocprof] -hfr,  --hf-repo REPO           Hugging Face model repository (default: unused)
    INFO    |-> [rocprof] -hff,  --hf-file FILE           Hugging Face model file (default: unused)
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] retrieval:
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] --context-file FNAME     file to load context from (repeat to specify multiple files)
    INFO    |-> [rocprof] --chunk-size N           minimum length of embedded text chunks (default: 64)
    INFO    |-> [rocprof] --chunk-separator STRING
    INFO    |-> [rocprof] separator between chunks (default: '
    INFO    |-> [rocprof] ')
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] passkey:
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] --junk N                 number of times to repeat the junk text (default: 250)
    INFO    |-> [rocprof] --pos N                  position of the passkey in the junk text (default: -1)
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] imatrix:
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] -o,    --output FNAME           output file (default: 'imatrix.dat')
    INFO    |-> [rocprof] --output-frequency N     output the imatrix every N iterations (default: 10)
    INFO    |-> [rocprof] --save-frequency N       save an imatrix copy every N iterations (default: 0)
    INFO    |-> [rocprof] --process-output         collect data for the output tensor (default: false)
    INFO    |-> [rocprof] --no-ppl                 do not compute perplexity (default: true)
    INFO    |-> [rocprof] --chunk N                start processing the input from chunk N (default: 0)
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] bench:
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] -pps                            is the prompt shared across parallel sequences (default: false)
    INFO    |-> [rocprof] -npp n0,n1,...                  number of prompt tokens
    INFO    |-> [rocprof] -ntg n0,n1,...                  number of text generation tokens
    INFO    |-> [rocprof] -npl n0,n1,...                  number of parallel prompts
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] server:
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] --host HOST              ip address to listen (default: 127.0.0.1)
    INFO    |-> [rocprof] --port PORT              port to listen (default: 8080)
    INFO    |-> [rocprof] --path PATH              path to serve static files from (default: )
    INFO    |-> [rocprof] --embedding(s)           enable embedding endpoint (default: disabled)
    INFO    |-> [rocprof] --api-key KEY            API key to use for authentication (default: none)
    INFO    |-> [rocprof] --api-key-file FNAME     path to file containing API keys (default: none)
    INFO    |-> [rocprof] --ssl-key-file FNAME     path to file a PEM-encoded SSL private key
    INFO    |-> [rocprof] --ssl-cert-file FNAME    path to file a PEM-encoded SSL certificate
    INFO    |-> [rocprof] --timeout N              server read/write timeout in seconds (default: 600)
    INFO    |-> [rocprof] --threads-http N         number of threads used to process HTTP requests (default: -1)
    INFO    |-> [rocprof] --system-prompt-file FNAME
    INFO    |-> [rocprof] set a file to load a system prompt (initial prompt of all slots), this is useful for chat applications
    INFO    |-> [rocprof] --log-format {text,json}
    INFO    |-> [rocprof] log output format: json or text (default: json)
    INFO    |-> [rocprof] --metrics                enable prometheus compatible metrics endpoint (default: disabled)
    INFO    |-> [rocprof] --no-slots               disables slots monitoring endpoint (default: enabled)
    INFO    |-> [rocprof] --slot-save-path PATH    path to save slot kv cache (default: disabled)
    INFO    |-> [rocprof] --chat-template JINJA_TEMPLATE
    INFO    |-> [rocprof] set custom jinja chat template (default: template taken from model's metadata)
    INFO    |-> [rocprof] only commonly used templates are accepted:
    INFO    |-> [rocprof] https://github.com/ggerganov/llama.cpp/wiki/Templates-supported-by-llama_chat_apply_template
    INFO    |-> [rocprof] -sps,  --slot-prompt-similarity SIMILARITY
    INFO    |-> [rocprof] how much the prompt of a request must match the prompt of a slot in order to use that slot (default: 0.50, 0.0 = disabled)
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] logging:
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] --simple-io              use basic IO for better compatibility in subprocesses and limited consoles
    INFO    |-> [rocprof] -ld,   --logdir LOGDIR          path under which to save YAML logs (no logging if unset)
    INFO    |-> [rocprof] --log-test               Run simple logging test
    INFO    |-> [rocprof] --log-disable            Disable trace logs
    INFO    |-> [rocprof] --log-enable             Enable trace logs
    INFO    |-> [rocprof] --log-file FNAME         Specify a log filename (without extension)
    INFO    |-> [rocprof] --log-new                Create a separate new log file on start. Each log file will have unique name: "<name>.<ID>.log"
    INFO    |-> [rocprof] --log-append             Don't truncate the old log file.
    INFO    |-> [rocprof] 
    INFO    |-> [rocprof] File '/root/git/rocm-llm-profile/workloads/llama3/MI100/SQ_IFETCH_LEVEL.csv' is generating
    INFO    |-> [rocprof] 
    ERROR Profiling execution failed.

Expected behavior: I expect omniperf to run without any errors.


Additional context: The command below shows that my omniperf installation is correct.

root@febc8da47e3f:~/git/rocm-llm-profile# ls $INSTALL_DIR
2.0.1  modulefiles  python-libs

coleramos425 commented 3 months ago

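Hi @aymane-eljerari. It looks to me like this is an issue with your application launch parameters: in the log, llama-cli rejects the prompt with "error: unknown argument: largest" and prints its usage text.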

Please first confirm that the application runs on its own before profiling with Omniperf. If your issue persists, please re-open this ticket.
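
For reference, the failure mode suggested by the log (the whole application command passed to rocprof as one string, followed by llama-cli reporting "unknown argument: largest") can be reproduced in plain shell. Note that `$user_prompt` is also unquoted in the command from step 3, so the shell would already split it there. The sketch below only illustrates word splitting; it is not Omniperf's or rocprof's actual code, and `count_args` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: print how many arguments it received.
count_args() { echo "$#"; }

user_prompt="The largest continent is"

# Quoted: the prompt reaches the program as a single argument.
quoted=$(count_args --prompt "$user_prompt")

# Flattened into one string and split again on whitespace, as appears to
# happen in the log: the prompt becomes four separate words.
flat="--prompt $user_prompt"
# shellcheck disable=SC2086
resplit=$(count_args $flat)

echo "quoted: $quoted args, re-split: $resplit args"
```

In the re-split case the program receives `--prompt The` followed by the stray word `largest`, which matches the error in the log.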

coleramos425 commented 3 months ago

See #383. I believe encapsulating your application arguments in a wrapper script could also be a potential solution.
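
A minimal sketch of such a wrapper, assuming the paths and values shown in the log above. The file name `run_llama.sh` is hypothetical. The idea is to keep the quoting of the prompt inside the script, so that the profiler is given a command with no arguments of its own to re-split.

```shell
#!/bin/sh
# Write a hypothetical wrapper script, run_llama.sh, that holds all of the
# application arguments.
cat > run_llama.sh <<'EOF'
#!/bin/sh
# Values taken from the log in the report; adjust as needed.
num_tokens=4
context=2048
user_prompt="The largest continent is"

# "$user_prompt" is quoted so the prompt stays a single argument.
exec ./llama.cpp/llama-cli \
  -m llama.cpp/models/llama3/Meta-Llama-3-8B-Instruct/ggml-model-f16.gguf \
  -ngl 50 -n "$num_tokens" -c "$context" --prompt "$user_prompt"
EOF
chmod +x run_llama.sh

# Then profile the wrapper instead of the raw command line:
#   omniperf profile -V -n llama3 -k dequantize_mul_mat_vec -- ./run_llama.sh
```

Running `./run_llama.sh` directly first is a quick way to confirm that the application itself accepts the arguments before involving the profiler.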