Which path is actually used for model caching?

Hi team,

I am working on NIM deploy on Amazon EKS pattern. ref: https://github.com/awslabs/data-on-eks/issues/560

I tried to deploy the NIM container with helm chart, and I am using a shared storage (EFS) volume mount to /model-store to share between pods.

I know the first time it needs to download model files from NGC, but later even I launch new pods with the volume, the NIM pods takes very long time (5+ minutes) to be ready to serve the request.

What I have done?

Prefeteched the container image at the host level, so that the image already present and eliminate the time to pull image for every time.
Create shared volume between NIM pods on /model-store path
Checked the path/model-store/ngc/hub/models--nim--meta--llama3-8b-instruct/blobs, from the timestamp, the files inside are write at the very first time a NIM pod start and write to it.
Checked /opt/nim/.cache/ is empty
Found the container logs will keep on below status for a while, I assume it starts download models?

===========================================
== NVIDIA Inference Microservice LLM NIM ==
===========================================

NVIDIA Inference Microservice LLM NIM Version 1.0.0
Model: nim/meta/llama3-8b-instruct

Container image Copyright (c) 2016-2024, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This NIM container is governed by the NVIDIA AI Product Agreement here:
https://www.nvidia.com/en-us/data-center/products/nvidia-ai-enterprise/eula/.
A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.

ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.

What I expect? I would like to know the real path that the NIM container could use for model caching, so that as long as the pod starts, the container inside can start very fast.

More ideally, if someone from NVIDIA could explain the process how the NIM container start?

Below are some captured logs for reference

Pod Events

Events:
  Type     Reason            Age   From               Message
  ----     ------            ----  ----               -------
  Warning  FailedScheduling  72s   default-scheduler  0/3 nodes are available: 3 Insufficient nvidia.com/gpu. preemption: 0/3 nodes are available: 3 No preemption victims found for incoming pod.
  Normal   Nominated         71s   karpenter          Pod should schedule on: nodeclaim/g5-gpu-hlh6s
  Normal   Scheduled         31s   default-scheduler  Successfully assigned nim/nim-llm-1 to ip-100-64-125-179.us-west-2.compute.internal
  Normal   Pulled            29s   kubelet            Container image "xxxxx/nim/meta/llama3-8b-instruct:latest" already present on machine  # masked image path
  Normal   Created           28s   kubelet            Created container nim-llm
  Normal   Started           8s    kubelet            Started container nim-llm

Full NIM Pod Logs

A copy of this license can be found under /opt/nim/LICENSE.

The use of this model is governed by the AI Foundation Models Community License
here: https://docs.nvidia.com/ai-foundation-models-community-license.pdf.

ADDITIONAL INFORMATION: Meta Llama 3 Community License, Built with Meta Llama 3.
A copy of the Llama 3 license can be found under /opt/nim/MODEL_LICENSE.

2024-06-28 08:23:17,522 [INFO] PyTorch version 2.2.2 available.
2024-06-28 08:23:27,025 [WARNING] [TRT-LLM] [W] Logger level already set from environment. Discard new verbosity: error
2024-06-28 08:23:27,025 [INFO] [TRT-LLM] [I] Starting TensorRT-LLM init.
[TensorRT-LLM][INFO] Set logger level by INFO
[TensorRT-LLM] TensorRT-LLM version: 0.10.1.dev2024053000
2024-06-28 08:23:28,555 [INFO] [TRT-LLM] [I] TensorRT-LLM inited.
{"level": "INFO", "time": "06-28 08:23:50.035", "file_path": "/usr/local/lib/python3.10/dist-packages/vllm_nvext/entrypoints/openai/api_server.py", "line_number": "489", "message": "NIM LLM API version 1.0.0", "exc_info": "None", "stack_info": "None"}
{"level": "INFO", "time": "06-28 08:23:50.040", "file_path": "/usr/local/lib/python3.10/dist-packages/vllm_nvext/hub/ngc_profile.py", "line_number": "217", "message": "Running NIM without LoRA. Only looking for compatible profiles that do not support LoRA.", "exc_info": "None", "stack_info": "None"}
{"level": "INFO", "time": "06-28 08:23:50.040", "file_path": "/usr/local/lib/python3.10/dist-packages/vllm_nvext/hub/ngc_profile.py", "line_number": "219", "message": "Detected 2 compatible profile(s).", "exc_info": "None", "stack_info": "None"}
{"level": "INFO", "time": "06-28 08:23:50.040", "file_path": "/usr/local/lib/python3.10/dist-packages/vllm_nvext/hub/ngc_injector.py", "line_number": "106", "message": "Valid profile: c334b76d50783655bdf62b8138511456f7b23083553d310268d0d05f254c012b (tensorrt_llm-a10g-fp16-tp1-throughput) on GPUs [0]", "exc_info": "None", "stack_info": "None"}
{"level": "INFO", "time": "06-28 08:23:50.041", "file_path": "/usr/local/lib/python3.10/dist-packages/vllm_nvext/hub/ngc_injector.py", "line_number": "106", "message": "Valid profile: 8835c31752fbc67ef658b20a9f78e056914fdef0660206d82f252d62fd96064d (vllm-fp16-tp1) on GPUs [0]", "exc_info": "None", "stack_info": "None"}
{"level": "INFO", "time": "06-28 08:23:50.041", "file_path": "/usr/local/lib/python3.10/dist-packages/vllm_nvext/hub/ngc_injector.py", "line_number": "141", "message": "Selected profile: c334b76d50783655bdf62b8138511456f7b23083553d310268d0d05f254c012b (tensorrt_llm-a10g-fp16-tp1-throughput)", "exc_info": "None", "stack_info": "None"}
{"level": "INFO", "time": "06-28 08:23:51.435", "file_path": "/usr/local/lib/python3.10/dist-packages/vllm_nvext/hub/ngc_injector.py", "line_number": "146", "message": "Profile metadata: feat_lora: false", "exc_info": "None", "stack_info": "None"}
{"level": "INFO", "time": "06-28 08:23:51.435", "file_path": "/usr/local/lib/python3.10/dist-packages/vllm_nvext/hub/ngc_injector.py", "line_number": "146", "message": "Profile metadata: gpu_device: 2237:10de", "exc_info": "None", "stack_info": "None"}
{"level": "INFO", "time": "06-28 08:23:51.435", "file_path": "/usr/local/lib/python3.10/dist-packages/vllm_nvext/hub/ngc_injector.py", "line_number": "146", "message": "Profile metadata: gpu: A10G", "exc_info": "None", "stack_info": "None"}
{"level": "INFO", "time": "06-28 08:23:51.436", "file_path": "/usr/local/lib/python3.10/dist-packages/vllm_nvext/hub/ngc_injector.py", "line_number": "146", "message": "Profile metadata: pp: 1", "exc_info": "None", "stack_info": "None"}
{"level": "INFO", "time": "06-28 08:23:51.436", "file_path": "/usr/local/lib/python3.10/dist-packages/vllm_nvext/hub/ngc_injector.py", "line_number": "146", "message": "Profile metadata: llm_engine: tensorrt_llm", "exc_info": "None", "stack_info": "None"}
{"level": "INFO", "time": "06-28 08:23:51.436", "file_path": "/usr/local/lib/python3.10/dist-packages/vllm_nvext/hub/ngc_injector.py", "line_number": "146", "message": "Profile metadata: precision: fp16", "exc_info": "None", "stack_info": "None"}
{"level": "INFO", "time": "06-28 08:23:51.436", "file_path": "/usr/local/lib/python3.10/dist-packages/vllm_nvext/hub/ngc_injector.py", "line_number": "146", "message": "Profile metadata: profile: throughput", "exc_info": "None", "stack_info": "None"}
{"level": "INFO", "time": "06-28 08:23:51.436", "file_path": "/usr/local/lib/python3.10/dist-packages/vllm_nvext/hub/ngc_injector.py", "line_number": "146", "message": "Profile metadata: tp: 1", "exc_info": "None", "stack_info": "None"}
{"level": "INFO", "time": "06-28 08:23:51.436", "file_path": "/usr/local/lib/python3.10/dist-packages/vllm_nvext/hub/ngc_injector.py", "line_number": "166", "message": "Preparing model workspace. This step might download additional files to run the model.", "exc_info": "None", "stack_info": "None"}
{"level": "INFO", "time": "06-28 08:23:53.768", "file_path": "/usr/local/lib/python3.10/dist-packages/vllm_nvext/hub/ngc_injector.py", "line_number": "172", "message": "Model workspace is now ready. It took 2.332 seconds", "exc_info": "None", "stack_info": "None"}
{"level": "INFO", "time": "06-28 08:23:53.774", "file_path": "/usr/local/lib/python3.10/dist-packages/vllm_nvext/engine/async_trtllm_engine.py", "line_number": "74", "message": "Initializing an LLM engine (v1.0.0) with config: model='/tmp/meta--llama3-8b-instruct-m53wsd6t', speculative_config=None, tokenizer='/tmp/meta--llama3-8b-instruct-m53wsd6t', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)", "exc_info": "None", "stack_info": "None"}
{"level": "WARNING", "time": "06-28 08:23:54.383", "file_path": "/usr/local/lib/python3.10/dist-packages/transformers/utils/logging.py", "line_number": "314", "message": "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.", "exc_info": "None", "stack_info": "None"}
{"level": "INFO", "time": "06-28 08:23:54.400", "file_path": "/usr/local/lib/python3.10/dist-packages/vllm_nvext/trtllm/utils.py", "line_number": "201", "message": "Using 0 bytes of gpu memory for PEFT cache", "exc_info": "None", "stack_info": "None"}
{"level": "INFO", "time": "06-28 08:23:54.400", "file_path": "/usr/local/lib/python3.10/dist-packages/vllm_nvext/trtllm/utils.py", "line_number": "207", "message": "Engine size in bytes 16067779716", "exc_info": "None", "stack_info": "None"}
{"level": "INFO", "time": "06-28 08:23:54.401", "file_path": "/usr/local/lib/python3.10/dist-packages/vllm_nvext/trtllm/utils.py", "line_number": "211", "message": "available device memory 23606263808", "exc_info": "None", "stack_info": "None"}
{"level": "INFO", "time": "06-28 08:23:54.401", "file_path": "/usr/local/lib/python3.10/dist-packages/vllm_nvext/trtllm/utils.py", "line_number": "218", "message": "Setting free_gpu_memory_fraction to 0.9", "exc_info": "None", "stack_info": "None"}
[TensorRT-LLM][INFO] Engine version 0.10.0.dev2024051400 found in the config file, assuming engine(s) built by new builder API.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'cross_attention' not found
[TensorRT-LLM][WARNING] Optional value for parameter cross_attention will not be set.
[TensorRT-LLM][WARNING] Parameter layer_types cannot be read from json:
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'layer_types' not found
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.type_error.302] type must be string, but is null
[TensorRT-LLM][WARNING] Optional value for parameter kv_cache_quant_algo will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'num_medusa_heads' not found
[TensorRT-LLM][WARNING] Optional value for parameter num_medusa_heads will not be set.
[TensorRT-LLM][WARNING] [json.exception.out_of_range.403] key 'max_draft_len' not found
[TensorRT-LLM][WARNING] Optional value for parameter max_draft_len will not be set.
[TensorRT-LLM][INFO] MPI size: 1, rank: 0
[TensorRT-LLM][INFO] Rank 0 is using GPU 0
[TensorRT-LLM][INFO] TRTGptModel maxNumSequences: 32
[TensorRT-LLM][INFO] TRTGptModel maxBatchSize: 32
[TensorRT-LLM][INFO] TRTGptModel mMaxAttentionWindowSize: 16384
[TensorRT-LLM][INFO] TRTGptModel enableTrtOverlap: 0
[TensorRT-LLM][INFO] TRTGptModel normalizeLogProbs: 0
[TensorRT-LLM][INFO] Loaded engine size: 15323 MiB
[TensorRT-LLM][WARNING] Using an engine plan file across different models of devices is not recommended and is likely to affect performance or even cause errors.
[TensorRT-LLM][INFO] Allocated 1504.00 MiB for execution context memory.
[TensorRT-LLM][INFO] [MemUsageChange] TensorRT-managed allocation in IExecutionContext creation: CPU +0, GPU +0, now: CPU 0, GPU 15320 (MiB)
[TensorRT-LLM][INFO] Max KV cache pages per sequence: 128
[TensorRT-LLM][INFO] Max tokens in paged KV cache: 38400. Allocating 5033164800 bytes.
{"level": "WARNING", "time": "06-28 08:25:07.343", "file_path": "/usr/local/lib/python3.10/dist-packages/transformers/utils/logging.py", "line_number": "314", "message": "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.", "exc_info": "None", "stack_info": "None"}
{"level": "INFO", "time": "06-28 08:25:07.356", "file_path": "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/serving_chat.py", "line_number": "347", "message": "Using default chat template:\n{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}", "exc_info": "None", "stack_info": "None"}
{"level": "WARNING", "time": "06-28 08:25:07.645", "file_path": "/usr/local/lib/python3.10/dist-packages/transformers/utils/logging.py", "line_number": "314", "message": "Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.", "exc_info": "None", "stack_info": "None"}
{"level": "INFO", "time": "06-28 08:25:07.657", "file_path": "/usr/local/lib/python3.10/dist-packages/vllm_nvext/entrypoints/openai/api_server.py", "line_number": "456", "message": "Serving endpoints:\n  0.0.0.0:8000/openapi.json\n  0.0.0.0:8000/docs\n  0.0.0.0:8000/docs/oauth2-redirect\n  0.0.0.0:8000/metrics\n  0.0.0.0:8000/v1/health/ready\n  0.0.0.0:8000/v1/health/live\n  0.0.0.0:8000/v1/models\n  0.0.0.0:8000/v1/version\n  0.0.0.0:8000/v1/chat/completions\n  0.0.0.0:8000/v1/completions", "exc_info": "None", "stack_info": "None"}
{"level": "INFO", "time": "06-28 08:25:07.658", "file_path": "/usr/local/lib/python3.10/dist-packages/vllm_nvext/entrypoints/openai/api_server.py", "line_number": "460", "message": "An example cURL request:\ncurl -X 'POST' \\\n  'http://0.0.0.0:8000/v1/chat/completions' \\\n  -H 'accept: application/json' \\\n  -H 'Content-Type: application/json' \\\n  -d '{\n    \"model\": \"meta/llama3-8b-instruct\",\n    \"messages\": [\n      {\n        \"role\":\"user\",\n        \"content\":\"Hello! How are you?\"\n      },\n      {\n        \"role\":\"assistant\",\n        \"content\":\"Hi! I am quite well, how can I help you today?\"\n      },\n      {\n        \"role\":\"user\",\n        \"content\":\"Can you write me a song?\"\n      }\n    ],\n    \"top_p\": 1,\n    \"n\": 1,\n    \"max_tokens\": 15,\n    \"stream\": true,\n    \"frequency_penalty\": 1.0,\n    \"stop\": [\"hello\"]\n  }'\n", "exc_info": "None", "stack_info": "None"}
{"level": "INFO", "time": "06-28 08:25:08.143", "file_path": "/usr/local/lib/python3.10/dist-packages/uvicorn/server.py", "line_number": "82", "message": "Started server process [32]", "exc_info": "None", "stack_info": "None"}
{"level": "INFO", "time": "06-28 08:25:08.143", "file_path": "/usr/local/lib/python3.10/dist-packages/uvicorn/lifespan/on.py", "line_number": "48", "message": "Waiting for application startup.", "exc_info": "None", "stack_info": "None"}
{"level": "INFO", "time": "06-28 08:25:08.149", "file_path": "/usr/local/lib/python3.10/dist-packages/uvicorn/lifespan/on.py", "line_number": "62", "message": "Application startup complete.", "exc_info": "None", "stack_info": "None"}
[TensorRT-LLM][INFO] Set logger level by INFO
[TensorRT-LLM][INFO] Set logger level by INFO
{"level": "INFO", "time": "06-28 08:25:08.150", "file_path": "/usr/local/lib/python3.10/dist-packages/uvicorn/server.py", "line_number": "214", "message": "Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)", "exc_info": "None", "stack_info": "None"}
{"level": "INFO", "time": "06-28 08:25:13.065", "file_path": "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", "line_number": "481", "message": "100.64.86.7:44088 - \"GET /v1/health/ready HTTP/1.1\" 503", "exc_info": "None", "stack_info": "None"}
[TensorRT-LLM][INFO] Set logger level by INFO
[TensorRT-LLM][INFO] Ignoring already terminated request 1
{"level": "INFO", "time": "06-28 08:25:23.065", "file_path": "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", "line_number": "481", "message": "100.64.86.7:41670 - \"GET /v1/health/ready HTTP/1.1\" 200", "exc_info": "None", "stack_info": "None"}
{"level": "INFO", "time": "06-28 08:25:23.066", "file_path": "/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/http/httptools_impl.py", "line_number": "481", "message": "100.64.86.7:41686 - \"GET /v1/health/ready HTTP/1.1\" 200", "exc_info": "None", "stack_info": "None"}

NVIDIA / nim-deploy

Which path is actually used for model caching? #24