Steps to reproduce
Configure a NIM task in examples/deployment/nim/task.dstack.yml with Llama 3.1-8b-instruct; a rough sketch of such a configuration is shown below.
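For illustration only, this is a reconstruction pieced together from the values visible in the logs below (image, NGC_API_KEY, port 8000, 24GB GPU, 100GB disk); the actual file in the dstack repository may differ:
type: task
# name matches the run shown in the apply output below
name: llama31-nim-task
# image and env var taken from the working docker run under Expected behaviour
image: nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
env:
  - NGC_API_KEY
# pulling from nvcr.io also requires NGC registry credentials; how the real
# example supplies them is not shown here
ports:
  - 8000
resources:
  gpu: 24GB
  disk: 100GB..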
Actual behaviour
Run the configuration:
dstack apply -f examples/deployment/nim/task.dstack.yml
Project bihan
User admin
Configuration examples/deployment/nim/task.dstack.yml
Type task
Resources 2..xCPU, 8GB.., 1xGPU (24GB), 100GB.. (disk)
Max price -
Max duration 72h
Spot policy auto
Retry policy no
Creation policy reuse-or-create
Termination policy destroy-after-idle
Termination idle time 5m
# BACKEND REGION INSTANCE RESOURCES SPOT PRICE
1 gcp asia-northeast3 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100.0GB (disk) yes $0.173268
2 gcp asia-northeast3 g2-standard-8 8xCPU, 32GB, 1xL4 (24GB), 100.0GB (disk) yes $0.206236
3 gcp asia-east1 g2-standard-4 4xCPU, 16GB, 1xL4 (24GB), 100.0GB (disk) yes $0.219016
...
Shown 3 of 225 offers, $2.42622 max
Submit the run llama31-nim-task? [y/n]: y
llama31-nim-task provisioning completed (terminating)
Run failed with error code CONTAINER_EXITED_WITH_ERROR.
Error: /run/sshd must be owned by root and not group or world-writable.
Check CLI, server, and run logs for more details.
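For context, this message is OpenSSH's startup check on its privilege-separation directory and appears to come from the sshd process that dstack's runner launches inside the job container; the plain docker run under Expected behaviour starts no sshd and runs as a non-root user, which would explain why the host run is unaffected. A hedged way to check what the image itself ships (the /run/sshd directory may only be created by the runner at container start):
$ docker run --rm --entrypoint bash nvcr.io/nim/meta/llama-3.1-8b-instruct:latest \
  -c 'ls -ld /run/sshd 2>/dev/null || echo "/run/sshd not present in the image"'
$ # sshd refuses to start unless the directory looks like:
$ #   drwxr-xr-x root root /run/sshd   (i.e. chown root:root /run/sshd && chmod 0755 /run/sshd)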
Expected behaviour
When tried directly on the host, NIM works with Llama 3.1-8b:
$ export NGC_API_KEY=
$ export NIM_MAX_MODEL_LEN=   # max supported by KV cache
$ # LOCAL_NIM_CACHE should point to a writable host directory used as the NIM model cache
$ docker run -it --rm \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY \
  -e NIM_MAX_MODEL_LEN \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  nvcr.io/nim/meta/llama-3.1-8b-instruct:latest
===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================
NVIDIA Inference Microservice LLM NIM Version 1.2.2
Model: meta/llama-3.1-8b-instruct
Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
The use of this model is governed by the NVIDIA AI Foundation Models Community License Agreement (found at https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-ai-foundation-models-community-license-agreement/#:~:text=This%20license%20agreement%20(%E2%80%9CAgreement%E2%80%9D,algorithms%2C%20parameters%2C%20configuration%20files%2C).
ADDITIONAL INFORMATION: Llama 3.1 Community License Agreement, Built with Llama.
2024-11-14 16:49:01,192 [INFO] PyTorch version 2.3.1 available.
INFO 2024-11-14 16:49:04.699 ngc_profile.py:231] Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.
INFO 2024-11-14 16:49:04.699 ngc_profile.py:233] Detected 1 compatible profile(s).
INFO 2024-11-14 16:49:04.699 ngc_injector.py:152] Valid profile: 3bb4e8fe78e5037b05dd618cebb1053347325ad6a1e709e0eb18bb8558362ac5 (vllm-bf16-tp1) on GPUs [0]
INFO 2024-11-14 16:49:04.700 ngc_injector.py:206] Selected profile: 3bb4e8fe78e5037b05dd618cebb1053347325ad6a1e709e0eb18bb8558362ac5 (vllm-bf16-tp1)
INFO 2024-11-14 16:49:04.702 ngc_injector.py:214] Profile metadata: feat_lora: false
INFO 2024-11-14 16:49:04.702 ngc_injector.py:214] Profile metadata: llm_engine: vllm
INFO 2024-11-14 16:49:04.702 ngc_injector.py:214] Profile metadata: precision: bf16
INFO 2024-11-14 16:49:04.702 ngc_injector.py:214] Profile metadata: tp: 1
INFO 2024-11-14 16:49:04.702 ngc_injector.py:245] Preparing model workspace. This step might download additional files to run the model.
INFO 2024-11-14 16:49:04.704 ngc_injector.py:260] Model workspace is now ready. It took 0.001 seconds
INFO 2024-11-14 16:49:04.704 launch.py:46] engine_world_size=1
INFO 2024-11-14 16:49:04.705 launch.py:92] running command ['/opt/nim/llm/.venv/bin/python3', '-m', 'vllm_nvext.entrypoints.openai.api_server', '--served-model-name', 'meta/llama-3.1-8b-instruct', '--async-engine-args', '{"model": "/tmp/LLM-88ox0xbt", "served_model_name": ["meta/llama-3.1-8b-instruct"], "tokenizer": "/tmp/LLM-88ox0xbt", "skip_tokenizer_init": false, "tokenizer_mode": "auto", "trust_remote_code": false, "download_dir": null, "load_format": "auto", "dtype": "bfloat16", "kv_cache_dtype": "auto", "quantization_param_path": null, "seed": 0, "max_model_len": 4096, "worker_use_ray": false, "distributed_executor_backend": "mp", "pipeline_parallel_size": 1, "tensor_parallel_size": 1, "max_parallel_loading_workers": null, "block_size": 16, "enable_prefix_caching": false, "disable_sliding_window": false, "use_v2_block_manager": false, "swap_space": 4, "cpu_offload_gb": 0, "gpu_memory_utilization": 0.9, "max_num_batched_tokens": null, "max_num_seqs": 256, "max_logprobs": 20, "disable_log_stats": false, "revision": null, "code_revision": null, "rope_scaling": null, "rope_theta": null, "tokenizer_revision": null, "quantization": null, "enforce_eager": false, "max_context_len_to_capture": null, "max_seq_len_to_capture": 8192, "disable_custom_all_reduce": false, "tokenizer_pool_size": 0, "tokenizer_pool_type": "ray", "tokenizer_pool_extra_config": null, "enable_lora": false, "max_loras": 8, "max_lora_rank": 32, "enable_prompt_adapter": false, "max_prompt_adapters": 1, "max_prompt_adapter_token": 0, "fully_sharded_loras": false, "lora_extra_vocab_size": 256, "long_lora_scaling_factors": null, "lora_dtype": "auto", "max_cpu_loras": 16, "peft_source": null, "peft_refresh_interval": null, "device": "auto", "ray_workers_use_nsight": false, "num_gpu_blocks_override": null, "num_lookahead_slots": 0, "model_loader_extra_config": null, "ignore_patterns": [], "preemption_mode": null, "scheduler_delay_factor": 0.0, "enable_chunked_prefill": null, "guided_decoding_backend": "lm-format-enforcer", "speculative_model": null, "speculative_draft_tensor_parallel_size": null, "num_speculative_tokens": null, "speculative_max_model_len": null, "speculative_disable_by_batch_size": null, "ngram_prompt_lookup_max": null, "ngram_prompt_lookup_min": null, "spec_decoding_acceptance_method": "rejection_sampler", "typical_acceptance_sampler_posterior_threshold": null, "typical_acceptance_sampler_posterior_alpha": null, "qlora_adapter_name_or_path": null, "disable_logprobs_during_spec_decoding": null, "otlp_traces_endpoint": null, "engine_use_ray": false, "disable_log_requests": true, "selected_gpus": [{"name": "NVIDIA L4", "device_index": 0, "device_id": "27b8:10de", "total_memory": 24152899584, "free_memory": 23580573696, "used_memory": 3211264, "reserved_memory": 569114624, "family": null}]}']
[1731602949.090863] [00e2be6260b9:49 :0] parser.c:2305 UCX WARN unused environment variables: UCX_HOME; UCX_DIR (maybe: UCX_TLS?)
[1731602949.090863] [00e2be6260b9:49 :0] parser.c:2305 UCX WARN (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2024-11-14 16:49:14,939 [INFO] PyTorch version 2.3.1 available.
2024-11-14 16:49:22,306 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2024-11-14 16:49:22,306 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
2024-11-14 16:49:22,319 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
[TensorRT-LLM] TensorRT-LLM version: 0.11.1.dev20240809
INFO 2024-11-14 16:49:22.395 api_server.py:644] NIM LLM API version 1.1.2
INFO 2024-11-14 16:49:22.412 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='/tmp/LLM-88ox0xbt', speculative_config=None, tokenizer='/tmp/LLM-88ox0xbt', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='lm-format-enforcer'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=meta/llama-3.1-8b-instruct, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 2024-11-14 16:49:23.137 model_runner.py:680] Starting to load model /tmp/LLM-88ox0xbt...
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:28<01:24, 28.12s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:56<00:56, 28.27s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [01:03<00:18, 18.43s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [01:23<00:00, 18.99s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [01:23<00:00, 20.76s/it]
INFO 2024-11-14 16:50:46.783 model_runner.py:692] Loading model weights took 14.9888 GB
INFO 2024-11-14 16:50:49.196 distributed_gpu_executor.py:56] # GPU blocks: 1750, # CPU blocks: 2048
INFO 2024-11-14 16:50:51.823 model_runner.py:980] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 2024-11-14 16:50:51.826 model_runner.py:984] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 2024-11-14 16:51:13.770 model_runner.py:1181] Graph capturing finished in 22 secs.
INFO 2024-11-14 16:51:13.804 serving_chat.py:94] Using supplied tool use configs
INFO 2024-11-14 16:51:13.804 serving_chat.py:94] Using supplied tool use configs
INFO 2024-11-14 16:51:13.804 api_server.py:596] Serving endpoints:
0.0.0.0:8000/openapi.json
0.0.0.0:8000/docs
0.0.0.0:8000/docs/oauth2-redirect
0.0.0.0:8000/metrics
0.0.0.0:8000/v1/health/ready
0.0.0.0:8000/v1/health/live
0.0.0.0:8000/v1/models
0.0.0.0:8000/v1/license
0.0.0.0:8000/v1/metadata
0.0.0.0:8000/v1/version
0.0.0.0:8000/v1/chat/completions
0.0.0.0:8000/v1/completions
0.0.0.0:8000/experimental/ls/inference/chat_completion
0.0.0.0:8000/experimental/ls/inference/completion
INFO 2024-11-14 16:51:13.804 api_server.py:600] An example cURL request:
curl -X 'POST' \
'http://0.0.0.0:8000/v1/chat/completions' \
-H 'accept: application/json' \
-H 'Content-Type: application/json' \
-d '{
"model": "meta/llama-3.1-8b-instruct",
"messages": [
{
"role":"user",
"content":"Hello! How are you?"
},
{
"role":"assistant",
"content":"Hi! I am quite well, how can I help you today?"
},
{
"role":"user",
"content":"Can you write me a song?"
}
],
"top_p": 1,
"n": 1,
"max_tokens": 15,
"stream": true,
"frequency_penalty": 1.0,
"stop": ["hello"]
}'
INFO 2024-11-14 16:51:13.915 server.py:82] Started server process [49]
INFO 2024-11-14 16:51:13.916 on.py:48] Waiting for application startup.
INFO 2024-11-14 16:51:13.916 on.py:62] Application startup complete.
INFO 2024-11-14 16:51:13.947 server.py:214] Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO 2024-11-14 16:51:23.925 metrics.py:396] Avg prompt throughput: 0.3 tokens/s, Avg generation throughput: 1.6 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 2024-11-14 16:51:33.926 metrics.py:396] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
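Once Uvicorn reports the server is running, the endpoints listed in the log can be probed from the host to confirm it is healthy, for example:
$ curl http://localhost:8000/v1/health/ready   # readiness probe listed above
$ curl http://localhost:8000/v1/models         # should list meta/llama-3.1-8b-instruct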
dstack version
master
Server logs
[22:54:41] DEBUG dstack._internal.server.background.tasks.process_instances:610 Check instance llama31-nim-task-0 status. shim health: Service is OK
DEBUG dstack._internal.server.background.tasks.process_runs:87 run(7f2d1a)llama31-nim-task: processing run
[22:54:43] DEBUG dstack._internal.server.background.tasks.process_runs:87 run(7f2d1a)llama31-nim-task: processing run
DEBUG dstack._internal.server.app:213 Processed request POST http://127.0.0.1:3000/api/project/bihan/runs/get in 0.016294s
[22:54:45] DEBUG dstack._internal.server.background.tasks.process_running_jobs:242 job(ee4b0c)llama31-nim-task-0-0: process pulling job with shim,
age=0:04:00.312403
[22:54:47] DEBUG dstack._internal.server.background.tasks.process_instances:610 Check instance llama31-nim-task-0 status. shim health: Service is OK
[22:54:48] DEBUG dstack._internal.server.app:213 Processed request POST http://127.0.0.1:3000/api/project/bihan/runs/get in 0.020562s
[22:54:49] DEBUG dstack._internal.server.background.tasks.process_running_jobs:242 job(ee4b0c)llama31-nim-task-0-0: process pulling job with shim,
age=0:04:04.757083
[22:54:51] WARNING dstack._internal.server.background.tasks.process_running_jobs:475 shim failed to execute job llama31-nim-task-0-0:
CONTAINER_EXITED_WITH_ERROR (/run/sshd must be owned by root and not group or world-writable.)
DEBUG dstack._internal.server.background.tasks.process_running_jobs:481 shim status: {'state': 'pending', 'executor_error': '',
'container_name': 'llama31-nim-task-0-0', 'status': 'exited', 'running': False, 'oom_killed': False, 'dead': False, 'exit_code': 255,
'error': '', 'result': {'reason': 'CONTAINER_EXITED_WITH_ERROR', 'reason_message': '/run/sshd must be owned by root and not group or
world-writable.'}}
[22:54:52] WARNING dstack._internal.server.background.tasks.process_running_jobs:276 job(ee4b0c)llama31-nim-task-0-0: failed because runner is not available
or return an error, age=0:04:07.832166
DEBUG dstack._internal.server.background.tasks.process_instances:610 Check instance llama31-nim-task-0 status. shim health: Service is OK
[22:54:53] DEBUG dstack._internal.server.background.tasks.process_runs:87 run(7f2d1a)llama31-nim-task: processing run
INFO dstack._internal.server.background.tasks.process_runs:330 run(7f2d1a)llama31-nim-task: run status has changed PROVISIONING -> TERMINATING
DEBUG dstack._internal.server.app:213 Processed request POST http://127.0.0.1:3000/api/project/bihan/runs/get in 0.018031s
DEBUG dstack._internal.server.app:213 Processed request POST http://127.0.0.1:3000/api/project/bihan/runs/get in 0.009912s
[22:54:54] DEBUG dstack._internal.server.app:213 Processed request POST http://127.0.0.1:3000/api/project/bihan/runs/get in 0.018990s
[22:54:55] DEBUG dstack._internal.server.background.tasks.process_runs:87 run(7f2d1a)llama31-nim-task: processing run
DEBUG dstack._internal.server.app:213 Processed request POST http://127.0.0.1:3000/api/project/bihan/runs/get in 0.019299s
[22:54:56] DEBUG dstack._internal.server.background.tasks.process_instances:610 Check instance llama31-nim-task-0 status. shim health: Service is OK
DEBUG dstack._internal.server.app:213 Processed request POST http://127.0.0.1:3000/api/project/bihan/runs/get in 0.023599s
[22:54:57] DEBUG dstack._internal.server.background.tasks.process_runs:87 run(7f2d1a)llama31-nim-task: processing run
DEBUG dstack._internal.server.app:213 Processed request POST http://127.0.0.1:3000/api/project/bihan/runs/get in 0.014791s
[22:54:58] DEBUG dstack._internal.server.services.jobs:234 job(ee4b0c)llama31-nim-task-0-0: stopping container
DEBUG dstack._internal.server.app:213 Processed request POST http://127.0.0.1:3000/api/project/bihan/runs/get in 0.023412s
[22:55:00] DEBUG dstack._internal.server.app:213 Processed request POST http://127.0.0.1:3000/api/project/bihan/runs/get in 0.030719s
INFO dstack._internal.server.services.jobs:268 job(ee4b0c)llama31-nim-task-0-0: instance 'llama31-nim-task-0' has been released, new status is
IDLE
INFO dstack._internal.server.services.jobs:283 job(ee4b0c)llama31-nim-task-0-0: job status is FAILED, reason: CONTAINER_EXITED_WITH_ERROR
[22:55:01] DEBUG dstack._internal.server.app:213 Processed request POST http://127.0.0.1:3000/api/project/bihan/runs/get in 0.019889s
DEBUG dstack._internal.server.background.tasks.process_runs:87 run(7f2d1a)llama31-nim-task: processing run
INFO dstack._internal.server.services.runs:952 run(7f2d1a)llama31-nim-task: run status has changed TERMINATING -> FAILED, reason: JOB_FAILED
[22:55:02] DEBUG dstack._internal.server.app:213 Processed request POST http://127.0.0.1:3000/api/project/bihan/runs/get in 0.015102s
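For completeness, a hedged sketch of how the failed job could be inspected further; the container name comes from the shim status above, and the docker commands assume SSH access to the still-idle instance llama31-nim-task-0:
$ dstack logs llama31-nim-task                       # run logs via the dstack CLI
$ docker ps -a --filter name=llama31-nim-task-0-0    # on the instance: the exited job container
$ docker logs llama31-nim-task-0-0                   # full container output, including the sshd error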
Additional information
No response