aws-samples / awsome-inference

MIT No Attribution

genai-perf in benchmark_concurrency.sh with multi-lora got HTTP 404 #18

Closed Ryan-ZL-Lin closed 1 month ago

Ryan-ZL-Lin commented 2 months ago

Hi, first of all, thanks for this "awsome" repo that makes the integration between NVIDIA NIM and AWS EKS much easier. Based on this blog post, I was able to set up NIM with EKS.

However, when running genai-perf, I get HTTP code 404 and can't figure out where the problem is, even though the API endpoint was tested successfully beforehand. Could anyone provide some hints on how to address this issue?

Here is the process to reproduce the error:

  1. A NodePort service is created (screenshot)

  2. NIM and genai-perf Pods are both scheduled without error (screenshot)

  3. multiple LoRAs are hosted; hence, there are one base model name and four LoRA model names:

    • meta/llama3-8b-instruct
    • llama3-8b-instruct-lora_vnemo-squad-v1
    • llama3-8b-instruct-lora_vnemo-math-v1
    • llama3-8b-instruct-lora_vhf-math-v1
    • llama3-8b-instruct-lora_vhf-squad-v1
  4. API testing inside the genai-perf pod is successful (screenshot)

  5. parameters used in the genai-perf pod:

    root@genai-perf-5cdc688bb8-x45m9:/workspace# export MODEL_NAME=meta/llama-3-8b-instruct
    root@genai-perf-5cdc688bb8-x45m9:/workspace# export TOKENIZER=meta-llama/Meta-Llama-3-8B-instruct
    root@genai-perf-5cdc688bb8-x45m9:/workspace# export OUTPUT_DIR=artifacts
    root@genai-perf-5cdc688bb8-x45m9:/workspace# export BENCHMARK_OUTPUT_DIR_ROOT=benchmarks
    root@genai-perf-5cdc688bb8-x45m9:/workspace# export LOCAL_PORTNUMBER=8000
    root@genai-perf-5cdc688bb8-x45m9:/workspace# export concurrency=50
    root@genai-perf-5cdc688bb8-x45m9:/workspace# export input_seq_len=7000
    root@genai-perf-5cdc688bb8-x45m9:/workspace# export output_seq_len=1000
  6. run the genai-perf command inside the genai-perf pod:

    root@genai-perf-5cdc688bb8-x45m9:/workspace# genai-perf -m ${MODEL_NAME} \
      --service-kind openai \
      --url openai-service:${LOCAL_PORTNUMBER} \
      --endpoint v1/chat/completions \
      --endpoint-type chat \
      --concurrency ${concurrency} \
      --num-prompts 100 \
      --tokenizer ${TOKENIZER} \
      --synthetic-input-tokens-mean $input_seq_len \
      --synthetic-input-tokens-stddev 0 \
      --streaming \
      --extra-inputs max_tokens:$output_seq_len \
      --extra-inputs ignore_eos:true \
      --measurement-interval 4000 \
      --generate-plots -v

  7. a 404 error occurred

    
    2024-08-27 03:50 [INFO] genai_perf.parser:166 - Model name 'meta/llama-3-8b-instruct' cannot be used to create artifact directory. Instead, 'meta_llama-3-8b-instruct' will be used.
    2024-08-27 03:50 [INFO] genai_perf.wrapper:137 - Running Perf Analyzer : 'perf_analyzer -m meta/llama-3-8b-instruct --async --input-data artifacts/meta_llama-3-8b-instruct-openai-chat-concurrency50/llm_inputs.json --endpoint v1/chat/completions --service-kind openai -u openai-service:8000 --measurement-interval 4000 --stability-percentage 999 --profile-export-file artifacts/meta_llama-3-8b-instruct-openai-chat-concurrency50/profile_export.json --verbose -i http --concurrency-range 50'
    Successfully read data for 1 stream/streams with 100 step/steps.
    *** Measurement Settings ***
    Service Kind: OPENAI
    Using "time_windows" mode for stabilization
    Stabilizing using average latency
    Measurement window: 4000 msec
    Using asynchronous calls for inference

    Request concurrency: 50
    Failed to retrieve results from inference request.
    Thread [0] had error: OpenAI response returns HTTP code 404
    Thread [1] had error: OpenAI response returns HTTP code 404
    Thread [2] had error: OpenAI response returns HTTP code 404
    Thread [3] had error: OpenAI response returns HTTP code 404
    Thread [4] had error: OpenAI response returns HTTP code 404
    Thread [5] had error: OpenAI response returns HTTP code 404
    Thread [6] had error: OpenAI response returns HTTP code 404
    Thread [7] had error: OpenAI response returns HTTP code 404
    Thread [8] had error: OpenAI response returns HTTP code 404
    Thread [9] had error: OpenAI response returns HTTP code 404
    Thread [10] had error: OpenAI response returns HTTP code 404
    Thread [11] had error: OpenAI response returns HTTP code 404
    Thread [12] had error: OpenAI response returns HTTP code 404
    Thread [13] had error: OpenAI response returns HTTP code 404
    Thread [14] had error: OpenAI response returns HTTP code 404
    Thread [15] had error: OpenAI response returns HTTP code 404

    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/dist-packages/genai_perf/main.py", line 143, in run
        args.func(args, extra_args)
      File "/usr/local/lib/python3.10/dist-packages/genai_perf/parser.py", line 570, in profile_handler
        Profiler.run(args=args, extra_args=extra_args)
      File "/usr/local/lib/python3.10/dist-packages/genai_perf/wrapper.py", line 139, in run
        subprocess.run(cmd, check=True, stdout=None)
      File "/usr/lib/python3.10/subprocess.py", line 526, in run
        raise CalledProcessError(retcode, process.args,
    subprocess.CalledProcessError: Command '['perf_analyzer', '-m', 'meta/llama-3-8b-instruct', '--async', '--input-data', 'artifacts/meta_llama-3-8b-instruct-openai-chat-concurrency50/llm_inputs.json', '--endpoint', 'v1/chat/completions', '--service-kind', 'openai', '-u', 'openai-service:8000', '--measurement-interval', '4000', '--stability-percentage', '999', '--profile-export-file', 'artifacts/meta_llama-3-8b-instruct-openai-chat-concurrency50/profile_export.json', '--verbose', '-i', 'http', '--concurrency-range', '50']' returned non-zero exit status 99.

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/usr/local/lib/python3.10/dist-packages/genai_perf/main.py", line 154, in main
        run()
      File "/usr/local/lib/python3.10/dist-packages/genai_perf/main.py", line 147, in run
        raise GenAIPerfException(e)
    genai_perf.exceptions.GenAIPerfException: Command '['perf_analyzer', '-m', 'meta/llama-3-8b-instruct', '--async', '--input-data', 'artifacts/meta_llama-3-8b-instruct-openai-chat-concurrency50/llm_inputs.json', '--endpoint', 'v1/chat/completions', '--service-kind', 'openai', '-u', 'openai-service:8000', '--measurement-interval', '4000', '--stability-percentage', '999', '--profile-export-file', 'artifacts/meta_llama-3-8b-instruct-openai-chat-concurrency50/profile_export.json', '--verbose', '-i', 'http', '--concurrency-range', '50']' returned non-zero exit status 99.
    2024-08-27 03:50 [ERROR] genai_perf.main:158 - Command '['perf_analyzer', '-m', 'meta/llama-3-8b-instruct', '--async', '--input-data', 'artifacts/meta_llama-3-8b-instruct-openai-chat-concurrency50/llm_inputs.json', '--endpoint', 'v1/chat/completions', '--service-kind', 'openai', '-u', 'openai-service:8000', '--measurement-interval', '4000', '--stability-percentage', '999', '--profile-export-file', 'artifacts/meta_llama-3-8b-instruct-openai-chat-concurrency50/profile_export.json', '--verbose', '-i', 'http', '--concurrency-range', '50']' returned non-zero exit status 99.
    root@genai-perf-5cdc688bb8-x45m9:/workspace#
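A quick way to narrow down a 404 like this is to compare the name passed to `-m` against the names the server actually advertises on the `/v1/models` endpoint. A minimal sketch, using a trimmed, hypothetical sample of the JSON an OpenAI-compatible server returns (inside the pod you would pipe `curl -s http://openai-service:8000/v1/models` instead of the echo):

```shell
# Hypothetical, trimmed /v1/models response; in the pod, replace the echo with:
#   curl -s http://openai-service:8000/v1/models
MODELS_JSON='{"data":[{"id":"meta/llama3-8b-instruct"},{"id":"llama3-8b-instruct-lora_vnemo-squad-v1"}]}'

# Print every served model id, one per line; these are the only values
# that a request's "model" field (genai-perf's -m flag) may use.
echo "$MODELS_JSON" | python3 -c 'import json,sys; [print(m["id"]) for m in json.load(sys.stdin)["data"]]'
```

Any `-m` value not in that list is answered with HTTP 404 by an OpenAI-compatible endpoint.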

amanshanbhag commented 2 months ago

Hey Ryan, thanks for your detailed logs. Can you try switching the service name from the openai-service one to the service created as part of your initial helm deployment (my-nim-nim-llm by default)? I want to isolate whether the issue is with the service or with the genai-perf.yaml configuration.
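A minimal sketch of that isolation test: rerun the same benchmark with only the `--url` target swapped to the helm-created service. `my-nim-nim-llm` is assumed to be the chart default here; confirm the actual name with `kubectl get svc` before rerunning.

```shell
# Swap the benchmark target from the NodePort service to the helm-created one.
# "my-nim-nim-llm" is an assumed default; verify with `kubectl get svc`.
NIM_SERVICE="my-nim-nim-llm"
LOCAL_PORTNUMBER=8000

# This is the flag to substitute into the genai-perf command from step 6.
echo "--url ${NIM_SERVICE}:${LOCAL_PORTNUMBER}"
```

If the 404 persists against both services, the service routing is likely fine and the problem sits in the request itself.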

JoeyTPChou commented 2 months ago

Hi @Ryan-ZL-Lin, sorry for the late response. In the curl command the model name is meta/llama3-8b-instruct, while the YAML is using meta/llama-3-8b-instruct. Please try meta/llama3-8b-instruct and see if it works.
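In other words, the two names differ by a single hyphen, which is enough for the server to treat the requested model as unknown and return 404. A small sketch of the mismatch and the corrected export for step 5:

```shell
# The name the successful curl test used (served by NIM) vs. the one exported
# as MODEL_NAME in step 5 -- note the extra hyphen after "llama".
SERVED_MODEL="meta/llama3-8b-instruct"
EXPORTED_MODEL="meta/llama-3-8b-instruct"

# An exact string match is required; anything else yields HTTP 404.
[ "$SERVED_MODEL" = "$EXPORTED_MODEL" ] || echo "mismatch: server will 404 for $EXPORTED_MODEL"

# Corrected export for step 5:
export MODEL_NAME=meta/llama3-8b-instruct
```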

Ryan-ZL-Lin commented 1 month ago

Thanks @JoeyTPChou, the problem is solved.