dstackai / dstack

dstack is a lightweight, open-source alternative to Kubernetes & Slurm, simplifying AI container orchestration with multi-cloud & on-prem support. It natively supports NVIDIA, AMD, & TPU.
https://dstack.ai/docs
Mozilla Public License 2.0

[Bug]: NIM example with dstack does not support Llama 3.1-8B-Instruct #1999

Open Bihan opened 1 day ago

Bihan commented 1 day ago

Steps to reproduce

Configure a NIM task in examples/deployment/nim/task.dstack.yml with Llama 3.1-8B-Instruct:

type: task

name: llama31-nim-task
image: nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

env:
  - NGC_API_KEY
registry_auth:
  username: $oauthtoken
  password: ${{ env.NGC_API_KEY }}

ports:
  - 8000

spot_policy: auto

resources:
  gpu: 24GB

backends: ["aws", "azure", "cudo", "datacrunch", "gcp", "lambda", "oci", "tensordock"]
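
For reference, because NGC_API_KEY is listed under env without a value, dstack is expected to read it from the shell that runs dstack apply, so it has to be set there first (a minimal sketch; the key value is intentionally left out):

$ export NGC_API_KEY=    # NGC personal API key, value omitted
$ dstack apply -f examples/deployment/nim/task.dstack.yml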

Actual behaviour

Run the configuration with dstack apply -f examples/deployment/nim/task.dstack.yml:

 Project                bihan                                        
 User                   admin                                        
 Configuration          examples/deployment/nim/task.dstack.yml      
 Type                   task                                         
 Resources              2..xCPU, 8GB.., 1xGPU (24GB), 100GB.. (disk) 
 Max price              -                                            
 Max duration           72h                                          
 Spot policy            auto                                         
 Retry policy           no                                           
 Creation policy        reuse-or-create                              
 Termination policy     destroy-after-idle                           
 Termination idle time  5m                                           

 #  BACKEND  REGION           INSTANCE       RESOURCES                                 SPOT  PRICE       
 1  gcp      asia-northeast3  g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100.0GB (disk)  yes   $0.173268   
 2  gcp      asia-northeast3  g2-standard-8  8xCPU, 32GB, 1xL4 (24GB), 100.0GB (disk)  yes   $0.206236   
 3  gcp      asia-east1       g2-standard-4  4xCPU, 16GB, 1xL4 (24GB), 100.0GB (disk)  yes   $0.219016   
    ...                                                                                                  
 Shown 3 of 225 offers, $2.42622 max

Submit the run llama31-nim-task? [y/n]: y
llama31-nim-task provisioning completed (terminating)
Run failed with error code CONTAINER_EXITED_WITH_ERROR.
Error: /run/sshd must be owned by root and not group or world-writable.
Check CLI, server, and run logs for more details.
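
For context, this error is emitted by OpenSSH's sshd, which dstack appears to start inside the run container for attach and port forwarding; sshd refuses to start when its privilege-separation directory /run/sshd is missing or not owned by root with strict permissions. A plausible factor (an assumption, not confirmed here) is that the NIM image is built to run as a non-root user, which is also why the working local repro below passes -u $(id -u). The directory's state in the stock image can be checked directly (illustrative one-off command; assumes bash is present in the image):

$ docker run --rm --entrypoint bash \
    nvcr.io/nim/meta/llama-3.1-8b-instruct:latest \
    -c 'id && (ls -ld /run/sshd || echo "/run/sshd does not exist")'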

Expected behaviour

When run directly on the host, NIM works with Llama 3.1-8B-Instruct:

$ export NGC_API_KEY=            # value omitted
$ export NIM_MAX_MODEL_LEN=      # max supported by KV cache

$ docker run -it --rm \
    --gpus all \
    --shm-size=16GB \
    -e NGC_API_KEY \
    -e NIM_MAX_MODEL_LEN \
    -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
    -u $(id -u) \
    -p 8000:8000 \
    nvcr.io/nim/meta/llama-3.1-8b-instruct:latest

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.2.2
Model: meta/llama-3.1-8b-instruct

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

The use of this model is governed by the NVIDIA AI Foundation Models Community License Agreement (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-ai-foundation-models-community-license-agreement/#:~:text=This%20license%20agreement%20(%E2%80%9CAgreement%E2%80%9D,algorithms%2C%20parameters%2C%20configuration%20files%2C).

ADDITIONAL INFORMATION: Llama 3.1 Community License Agreement, Built with Llama.

2024-11-14 16:49:01,192 [INFO] PyTorch version 2.3.1 available.
INFO 2024-11-14 16:49:04.699 ngc_profile.py:231] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 2024-11-14 16:49:04.699 ngc_profile.py:233] Detected 1 compatible profile(s).
INFO 2024-11-14 16:49:04.699 ngc_injector.py:152] Valid profile: 3bb4e8fe78e5037b05dd618cebb1053347325ad6a1e709e0eb18bb8558362ac5 (vllm-bf16-tp1) on GPUs [0]
INFO 2024-11-14 16:49:04.700 ngc_injector.py:206] Selected profile: 3bb4e8fe78e5037b05dd618cebb1053347325ad6a1e709e0eb18bb8558362ac5 (vllm-bf16-tp1)
INFO 2024-11-14 16:49:04.702 ngc_injector.py:214] Profile metadata: feat_lora: false
INFO 2024-11-14 16:49:04.702 ngc_injector.py:214] Profile metadata: llm_engine: vllm
INFO 2024-11-14 16:49:04.702 ngc_injector.py:214] Profile metadata: precision: bf16
INFO 2024-11-14 16:49:04.702 ngc_injector.py:214] Profile metadata: tp: 1
INFO 2024-11-14 16:49:04.702 ngc_injector.py:245] Preparing model workspace. This step might download additional files to run the model.
INFO 2024-11-14 16:49:04.704 ngc_injector.py:260] Model workspace is now ready. It took 0.001 seconds
INFO 2024-11-14 16:49:04.704 launch.py:46] engine_world_size=1
INFO 2024-11-14 16:49:04.705 launch.py:92] running command ['/opt/nim/llm/.venv/bin/python3', '-m', 'vllm_nvext.entrypoints.openai.api_server', '--served-model-name', 'meta/llama-3.1-8b-instruct', '--async-engine-args', '{"model": "/tmp/LLM-88ox0xbt", "served_model_name": ["meta/llama-3.1-8b-instruct"], "tokenizer": "/tmp/LLM-88ox0xbt", "skip_tokenizer_init": false, "tokenizer_mode": "auto", "trust_remote_code": false, "download_dir": null, "load_format": "auto", "dtype": "bfloat16", "kv_cache_dtype": "auto", "quantization_param_path": null, "seed": 0, "max_model_len": 4096, "worker_use_ray": false, "distributed_executor_backend": "mp", "pipeline_parallel_size": 1, "tensor_parallel_size": 1, "max_parallel_loading_workers": null, "block_size": 16, "enable_prefix_caching": false, "disable_sliding_window": false, "use_v2_block_manager": false, "swap_space": 4, "cpu_offload_gb": 0, "gpu_memory_utilization": 0.9, "max_num_batched_tokens": null, "max_num_seqs": 256, "max_logprobs": 20, "disable_log_stats": false, "revision": null, "code_revision": null, "rope_scaling": null, "rope_theta": null, "tokenizer_revision": null, "quantization": null, "enforce_eager": false, "max_context_len_to_capture": null, "max_seq_len_to_capture": 8192, "disable_custom_all_reduce": false, "tokenizer_pool_size": 0, "tokenizer_pool_type": "ray", "tokenizer_pool_extra_config": null, "enable_lora": false, "max_loras": 8, "max_lora_rank": 32, "enable_prompt_adapter": false, "max_prompt_adapters": 1, "max_prompt_adapter_token": 0, "fully_sharded_loras": false, "lora_extra_vocab_size": 256, "long_lora_scaling_factors": null, "lora_dtype": "auto", "max_cpu_loras": 16, "peft_source": null, "peft_refresh_interval": null, "device": "auto", "ray_workers_use_nsight": false, "num_gpu_blocks_override": null, "num_lookahead_slots": 0, "model_loader_extra_config": null, "ignore_patterns": [], "preemption_mode": null, "scheduler_delay_factor": 0.0, "enable_chunked_prefill": null, "guided_decoding_backend": "lm-format-enforcer", "speculative_model": null, "speculative_draft_tensor_parallel_size": null, "num_speculative_tokens": null, "speculative_max_model_len": null, "speculative_disable_by_batch_size": null, "ngram_prompt_lookup_max": null, "ngram_prompt_lookup_min": null, "spec_decoding_acceptance_method": "rejection_sampler", "typical_acceptance_sampler_posterior_threshold": null, "typical_acceptance_sampler_posterior_alpha": null, "qlora_adapter_name_or_path": null, "disable_logprobs_during_spec_decoding": null, "otlp_traces_endpoint": null, "engine_use_ray": false, "disable_log_requests": true, "selected_gpus": [{"name": "NVIDIA L4", "device_index": 0, "device_id": "27b8:10de", "total_memory": 24152899584, "free_memory": 23580573696, "used_memory": 3211264, "reserved_memory": 569114624, "family": null}]}']
[1731602949.090863] [00e2be6260b9:49   :0]          parser.c:2305 UCX  WARN  unused environment variables: UCX_HOME; UCX_DIR (maybe: UCX_TLS?)
[1731602949.090863] [00e2be6260b9:49   :0]          parser.c:2305 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2024-11-14 16:49:14,939 [INFO] PyTorch version 2.3.1 available.
2024-11-14 16:49:22,306 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2024-11-14 16:49:22,306 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
2024-11-14 16:49:22,319 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.11.1.dev20240809
INFO 2024-11-14 16:49:22.395 api_server.py:644] NIM LLM API version 1.1.2
INFO 2024-11-14 16:49:22.412 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/tmp/LLM-88ox0xbt', speculative_config=None, tokenizer='/tmp/LLM-88ox0xbt', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='lm-format-enforcer'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=meta/llama-3.1-8b-instruct, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 2024-11-14 16:49:23.137 model_runner.py:680] Starting to load model /tmp/LLM-88ox0xbt...
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:28<01:24, 28.12s/it]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:56<00:56, 28.27s/it]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [01:03<00:18, 18.43s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [01:23<00:00, 18.99s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [01:23<00:00, 20.76s/it]

INFO 2024-11-14 16:50:46.783 model_runner.py:692] Loading model weights took 14.9888 GB
INFO 2024-11-14 16:50:49.196 distributed_gpu_executor.py:56] # GPU blocks: 1750, # CPU blocks: 2048
INFO 2024-11-14 16:50:51.823 model_runner.py:980] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 2024-11-14 16:50:51.826 model_runner.py:984] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 2024-11-14 16:51:13.770 model_runner.py:1181] Graph capturing finished in 22 secs.
INFO 2024-11-14 16:51:13.804 serving_chat.py:94] Using supplied tool use configs
INFO 2024-11-14 16:51:13.804 serving_chat.py:94] Using supplied tool use configs
INFO 2024-11-14 16:51:13.804 api_server.py:596] Serving endpoints:
  0.0.0.0:8000/openapi.json
  0.0.0.0:8000/docs
  0.0.0.0:8000/docs/oauth2-redirect
  0.0.0.0:8000/metrics
  0.0.0.0:8000/v1/health/ready
  0.0.0.0:8000/v1/health/live
  0.0.0.0:8000/v1/models
  0.0.0.0:8000/v1/license
  0.0.0.0:8000/v1/metadata
  0.0.0.0:8000/v1/version
  0.0.0.0:8000/v1/chat/completions
  0.0.0.0:8000/v1/completions
  0.0.0.0:8000/experimental/ls/inference/chat_completion
  0.0.0.0:8000/experimental/ls/inference/completion
INFO 2024-11-14 16:51:13.804 api_server.py:600] An example cURL request:
curl -X 'POST' \
  'http://0.0.0.0:8000/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "meta/llama-3.1-8b-instruct",
    "messages": [
      {
        "role":"user",
        "content":"Hello! How are you?"
      },
      {
        "role":"assistant",
        "content":"Hi! I am quite well, how can I help you today?"
      },
      {
        "role":"user",
        "content":"Can you write me a song?"
      }
    ],
    "top_p": 1,
    "n": 1,
    "max_tokens": 15,
    "stream": true,
    "frequency_penalty": 1.0,
    "stop": ["hello"]
  }'

INFO 2024-11-14 16:51:13.915 server.py:82] Started server process [49]
INFO 2024-11-14 16:51:13.916 on.py:48] Waiting for application startup.
INFO 2024-11-14 16:51:13.916 on.py:62] Application startup complete.
INFO 2024-11-14 16:51:13.947 server.py:214] Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 2024-11-14 16:51:23.925 metrics.py:396] Avg prompt throughput: 0.3 tokens/s, Avg generation throughput: 1.6 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 2024-11-14 16:51:33.926 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
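
Once the log reports Uvicorn running on port 8000, the endpoints listed above can be probed to confirm the server is healthy (illustrative commands, run from the same host); the same probes are what one would expect to succeed against the dstack task once its port is forwarded:

$ curl -s http://localhost:8000/v1/health/ready
$ curl -s http://localhost:8000/v1/models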

dstack version

master

Server logs

[22:54:41] DEBUG    dstack._internal.server.background.tasks.process_instances:610 Check instance llama31-nim-task-0 status. shim health: Service is OK       
           DEBUG    dstack._internal.server.background.tasks.process_runs:87 run(7f2d1a)llama31-nim-task: processing run                                      
[22:54:43] DEBUG    dstack._internal.server.background.tasks.process_runs:87 run(7f2d1a)llama31-nim-task: processing run                                      
           DEBUG    dstack._internal.server.app:213 Processed request POST http://127.0.0.1:3000/api/project/bihan/runs/get in 0.016294s                      
[22:54:45] DEBUG    dstack._internal.server.background.tasks.process_running_jobs:242 job(ee4b0c)llama31-nim-task-0-0: process pulling job with shim,         
                    age=0:04:00.312403                                                                                                                        
[22:54:47] DEBUG    dstack._internal.server.background.tasks.process_instances:610 Check instance llama31-nim-task-0 status. shim health: Service is OK       
[22:54:48] DEBUG    dstack._internal.server.app:213 Processed request POST http://127.0.0.1:3000/api/project/bihan/runs/get in 0.020562s                      
[22:54:49] DEBUG    dstack._internal.server.background.tasks.process_running_jobs:242 job(ee4b0c)llama31-nim-task-0-0: process pulling job with shim,         
                    age=0:04:04.757083                                                                                                                        
[22:54:51] WARNING  dstack._internal.server.background.tasks.process_running_jobs:475 shim failed to execute job llama31-nim-task-0-0:                        
                    CONTAINER_EXITED_WITH_ERROR (/run/sshd must be owned by root and not group or world-writable.)                                            
           DEBUG    dstack._internal.server.background.tasks.process_running_jobs:481 shim status: {'state': 'pending', 'executor_error': '',                 
                    'container_name': 'llama31-nim-task-0-0', 'status': 'exited', 'running': False, 'oom_killed': False, 'dead': False, 'exit_code': 255,     
                    'error': '', 'result': {'reason': 'CONTAINER_EXITED_WITH_ERROR', 'reason_message': '/run/sshd must be owned by root and not group or      
                    world-writable.'}}                                                                                                                        
[22:54:52] WARNING  dstack._internal.server.background.tasks.process_running_jobs:276 job(ee4b0c)llama31-nim-task-0-0: failed because runner is not available 
                    or return an error,  age=0:04:07.832166                                                                                                   
           DEBUG    dstack._internal.server.background.tasks.process_instances:610 Check instance llama31-nim-task-0 status. shim health: Service is OK       
[22:54:53] DEBUG    dstack._internal.server.background.tasks.process_runs:87 run(7f2d1a)llama31-nim-task: processing run                                      
           INFO     dstack._internal.server.background.tasks.process_runs:330 run(7f2d1a)llama31-nim-task: run status has changed PROVISIONING -> TERMINATING 
           DEBUG    dstack._internal.server.app:213 Processed request POST http://127.0.0.1:3000/api/project/bihan/runs/get in 0.018031s                      
           DEBUG    dstack._internal.server.app:213 Processed request POST http://127.0.0.1:3000/api/project/bihan/runs/get in 0.009912s                      
[22:54:54] DEBUG    dstack._internal.server.app:213 Processed request POST http://127.0.0.1:3000/api/project/bihan/runs/get in 0.018990s                      
[22:54:55] DEBUG    dstack._internal.server.background.tasks.process_runs:87 run(7f2d1a)llama31-nim-task: processing run                                      
           DEBUG    dstack._internal.server.app:213 Processed request POST http://127.0.0.1:3000/api/project/bihan/runs/get in 0.019299s                      
[22:54:56] DEBUG    dstack._internal.server.background.tasks.process_instances:610 Check instance llama31-nim-task-0 status. shim health: Service is OK       
           DEBUG    dstack._internal.server.app:213 Processed request POST http://127.0.0.1:3000/api/project/bihan/runs/get in 0.023599s                      
[22:54:57] DEBUG    dstack._internal.server.background.tasks.process_runs:87 run(7f2d1a)llama31-nim-task: processing run                                      
           DEBUG    dstack._internal.server.app:213 Processed request POST http://127.0.0.1:3000/api/project/bihan/runs/get in 0.014791s                      
[22:54:58] DEBUG    dstack._internal.server.services.jobs:234 job(ee4b0c)llama31-nim-task-0-0: stopping container                                             
           DEBUG    dstack._internal.server.app:213 Processed request POST http://127.0.0.1:3000/api/project/bihan/runs/get in 0.023412s                      
[22:55:00] DEBUG    dstack._internal.server.app:213 Processed request POST http://127.0.0.1:3000/api/project/bihan/runs/get in 0.030719s                      
           INFO     dstack._internal.server.services.jobs:268 job(ee4b0c)llama31-nim-task-0-0: instance 'llama31-nim-task-0' has been released, new status is 
                    IDLE                                                                                                                                      
           INFO     dstack._internal.server.services.jobs:283 job(ee4b0c)llama31-nim-task-0-0: job status is FAILED, reason: CONTAINER_EXITED_WITH_ERROR      
[22:55:01] DEBUG    dstack._internal.server.app:213 Processed request POST http://127.0.0.1:3000/api/project/bihan/runs/get in 0.019889s                      
           DEBUG    dstack._internal.server.background.tasks.process_runs:87 run(7f2d1a)llama31-nim-task: processing run                                      
           INFO     dstack._internal.server.services.runs:952 run(7f2d1a)llama31-nim-task: run status has changed TERMINATING -> FAILED, reason: JOB_FAILED   
[22:55:02] DEBUG    dstack._internal.server.app:213 Processed request POST http://127.0.0.1:3000/api/project/bihan/runs/get in 0.015102s
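
Since the run fails with CONTAINER_EXITED_WITH_ERROR, the container's own output can also be pulled through the CLI, which is usually where the sshd message appears first (illustrative):

$ dstack logs llama31-nim-task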

Additional information

No response

Bihan commented 1 day ago

@deep-diver The NIM example is not working with Llama 3.1-8B-Instruct (llama-3.1-8b-instruct:latest).

peterschmidt85 commented 1 day ago

@jvstme @un-def cc