KServe deployment fails with error when using cloud shared disk as PV

lhx692135353 commented 5 months ago

When following the deployment steps of KServe, I used Alibaba Cloud’s shared disk as PV and downloaded the model files to the corresponding directory. However, the system attempts to download the model files and reports an error regardless of whether the model files are already present in the directory. The error message is as follows:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/entrypoints/openai/api_server.py", line 492, in <module>
    engine_args, extracted_name = inject_ngc_hub(engine_args)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/hub/ngc_injector.py", line 168, in inject_ngc_hub
    cached = repo.download_all()
Exception: "Errors fetching files: \nissue a download request for file: \"tokenizer_config.json\"; the repo file list has not been validated. \nissue a download request for file: \"model.safetensors.index.json\"; the repo file list has not been validated. \nissue a download request for file: \"model-00004-of-00004.safetensors\"; the repo file list has not been validated. \nissue a download request for file: \"model-00002-of-00004.safetensors\"; the repo file list has not been validated. \nissue a download request for file: \"special_tokens_map.json\"; the repo file list has not been validated. \nissue a download request for file: \"generation_config.json\"; the repo file list has not been validated. \nissue a download request for file: \"tokenizer.json\"; the repo file list has not been validated. \nissue a download request for file: \"config.json\"; the repo file list has not been validated. \nissue a download request for file: \"model-00001-of-00004.safetensors\"; the repo file list has not been validated. \nissue a download request for file: \"model-00003-of-00004.safetensors\"; the repo file list has not been validated. "
reason: Error

supertetelman commented 4 months ago

Because of an open KServe issue (linked below), you must first run the NIM outside of KServe, as a Docker command, K8s Job, or K8s Pod with the PV mounted, so that it downloads the NIM models and creates a reusable cache.

After this cache is populated with the NIM artifacts, the above workflow should work as expected.

I have added some notes around this to the README and am hoping to have a more comprehensive guide published shortly to ease this workflow with some automation and example YAML.

If you are encountering this outside of the KServe InferenceService, please let me know.

https://github.com/kserve/kserve/issues/3687
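
As a rough illustration of the "K8s Pod outside of KServe" option, the sketch below builds a Pod manifest that mounts the PVC at the NIM cache path. It is not taken from the README. The PVC name, Pod name, secret name, and image tag are placeholders, and the image also needs pull credentials for nvcr.io, which are omitted here. `NIM_CACHE_PATH=/model-store` and the GPU limit match values that appear elsewhere in this thread.

```python
import json


def cache_warmup_pod(pvc_name, image, secret_name="ngc-api-secret", cache_path="/model-store"):
    """Build a Pod manifest that runs the NIM container with the PVC mounted at its cache path.

    pvc_name, image, and secret_name are hypothetical placeholders; adjust them to your cluster.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": "nim-cache-warmup"},  # placeholder Pod name
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "nim",
                "image": image,
                "env": [
                    # Point the NIM at the mounted volume so downloads land on the PV.
                    {"name": "NIM_CACHE_PATH", "value": cache_path},
                    # Assumes a Secret holding the NGC API key under the key NGC_API_KEY.
                    {"name": "NGC_API_KEY",
                     "valueFrom": {"secretKeyRef": {"name": secret_name, "key": "NGC_API_KEY"}}},
                ],
                "resources": {"limits": {"nvidia.com/gpu": 1}},
                "volumeMounts": [{"name": "nim-cache", "mountPath": cache_path}],
            }],
            "volumes": [{"name": "nim-cache",
                         "persistentVolumeClaim": {"claimName": pvc_name}}],
        },
    }


if __name__ == "__main__":
    # Both arguments are example values, not confirmed settings.
    manifest = cache_warmup_pod("nim-pvc", "nvcr.io/nim/meta/llama3-8b-instruct:1.0.0")
    print(json.dumps(manifest, indent=2))
```

Since kubectl accepts JSON manifests, the output can be piped to `kubectl apply -f -`. Once the container logs show that the model has been downloaded, the Pod can be deleted and the same PVC reused by the InferenceService.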

inksnw commented 4 months ago

resources:
  limits:
    nvidia.com/gpu: 1
persistence:
  enabled: true
  existingClaim: ''
  storageClass: openebs-zfspv
  accessMode: ReadWriteOnce
  stsPersistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Retain
  size: 50Gi

Installing the Helm chart with this values.yaml and openebs-zfspv (a local storageClass) works fine.
I then used kubectl cp to copy the model from the PV to the host disk at /share/nim (network storage).

Then, create an instance using the already existing model directory.

resources:
  limits:
    nvidia.com/gpu: 1
hostPath:
  enabled: true
  path: /share/nim

Error log:

issue a download request for file: \"model-00004-of-00004.safetensors\"; the repo file list has not been validated. "

supertetelman commented 4 months ago

It looks like you may be copying the NIM cache incorrectly, so it is not being recognized by the KServe NIM.

Try posting the full log along with the tree output of the NIM cache path.

It may also be worth opening a ticket with NVIDIA enterprise support if you are having issues with the NIM container itself and not just the reference KServe documentation.
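
If `tree` is not installed in the container, a listing can be produced with the Python interpreter that the NIM image already ships (the traceback above shows Python 3.10). The following is a minimal sketch, not an official tool; `render_tree` is a made-up helper name. It prints symlink targets without following them, so the `snapshots` to `blobs` links stay visible.

```python
import os


def render_tree(root):
    """Return tree-style lines for everything under root, showing symlink targets."""
    lines = [root]

    def walk(path, prefix):
        entries = sorted(os.listdir(path))
        for i, name in enumerate(entries):
            full = os.path.join(path, name)
            last = i == len(entries) - 1
            branch = "`-- " if last else "|-- "
            if os.path.islink(full):
                # Show where the link points, but do not descend into it.
                lines.append(f"{prefix}{branch}{name} -> {os.readlink(full)}")
            else:
                lines.append(f"{prefix}{branch}{name}")
                if os.path.isdir(full):
                    walk(full, prefix + ("    " if last else "|   "))

    walk(root, "")
    return lines


if __name__ == "__main__":
    # Only prints when NIM_CACHE_PATH is set, as it is inside the NIM container.
    root = os.environ.get("NIM_CACHE_PATH")
    if root:
        print("\n".join(render_tree(root)))
```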

inksnw commented 4 months ago

 {"level": "INFO", "time": "07-29 09:21:27.100", "file_path": "/usr/local/lib/python3.10/dist-packages/vllm_nvext/hub/ngc_injector.py", "line_number": "166", "message": "Preparing model workspace. This step might download additional files to run the model.", "exc_info": "None", "stack_info": "None"}

 Traceback (most recent call last):
   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
     return _run_code(code, main_globals, None,
   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
     exec(code, run_globals)
   File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/entrypoints/openai/api_server.py", line 492, in <module>
     engine_args, extracted_name = inject_ngc_hub(engine_args)
   File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/hub/ngc_injector.py", line 168, in inject_ngc_hub
     cached = repo.download_all()
 Exception: "Errors fetching files: \nissue a download request for file: \"model-00004-of-00004.safetensors\"; the repo file list has not been validated. \nissue a download request for file: \"config.json\"; the repo file list has not been validated. \nissue a download request for file: \"model-00002-of-00004.safetensors\"; the repo file list has not been validated. \nissue a download request for file: \"tokenizer.json\"; the repo file list has not been validated. \nissue a download request for file: \"tokenizer_config.json\"; the repo file list has not been validated. \nissue a download request for file: \"model-00003-of-00004.safetensors\"; the repo file list has not been validated. \nissue a download request for file: \"model-00001-of-00004.safetensors\"; the repo file list has not been validated. \nissue a download request for file: \"generation_config.json\"; the repo file list has not been validated. \nissue a download request for file: \"special_tokens_map.json\"; the repo file list has not been validated. \nissue a download request for file: \"model.safetensors.index.json\"; the repo file list has not been validated. "

NIM Cache path

I have no name!@nim-llm-7qz9l-q5cdy4j89n-0:/model-store$ pwd
/model-store
I have no name!@nim-llm-7qz9l-q5cdy4j89n-0:/model-store$ ls -a
.  ..  huggingface  ngc
I have no name!@nim-llm-7qz9l-q5cdy4j89n-0:/model-store$ du -sh *
1.5K    huggingface
19G     ngc
I have no name!@nim-llm-7qz9l-q5cdy4j89n-0:/model-store$

inksnw commented 4 months ago

I have no name!@nim-llm-7qz9l-88364eiye2c-0:/model-store$ env|grep NIM_CACHE_PATH
NIM_CACHE_PATH=/model-store
I have no name!@nim-llm-7qz9l-88364eiye2c-0:/model-store$ /exec/tree
.
|-- huggingface
|   `-- hub
|       `-- version.txt
`-- ngc
    `-- hub
        |-- models--nim--meta--llama3-8b-instruct
        |   |-- blobs
        |   |   |-- 116ba5d2cd83996786437f70017cbd24
        |   |   |-- 37e8986d8dfb638589a002ef7439184f
        |   |   |-- 3cd03b50b92a3fb6336e697ae10d9d34
        |   |   |-- 40c7647302ea8d49cc6694d8cd5573b7
        |   |   |-- 41bfb74c3bc5c7dc5823f089a9e268ca
        |   |   |-- 63a6311e76ee19b8f09f44da3962bc53
        |   |   |-- 6dae736d5cc6f142f521378dc0852bc5
        |   |   |-- 8d96e78ff88732d4ccefb0992889d85b
        |   |   |-- 8f5764e46d8818196c0ca82b37c4e5bc
        |   |   `-- 9d5825e3a47c1af3102cc90cc11358b5
        |   |-- refs
        |   |   `-- hf
        |   `-- snapshots
        |       `-- hf
        |           |-- config.json -> ../../blobs/6dae736d5cc6f142f521378dc0852bc5
        |           |-- generation_config.json -> ../../blobs/41bfb74c3bc5c7dc5823f089a9e268ca
        |           |-- model-00001-of-00004.safetensors -> ../../blobs/8f5764e46d8818196c0ca82b37c4e5bc
        |           |-- model-00002-of-00004.safetensors -> ../../blobs/9d5825e3a47c1af3102cc90cc11358b5
        |           |-- model-00003-of-00004.safetensors -> ../../blobs/40c7647302ea8d49cc6694d8cd5573b7
        |           |-- model-00004-of-00004.safetensors -> ../../blobs/63a6311e76ee19b8f09f44da3962bc53
        |           |-- model.safetensors.index.json -> ../../blobs/37e8986d8dfb638589a002ef7439184f
        |           |-- special_tokens_map.json -> ../../blobs/8d96e78ff88732d4ccefb0992889d85b
        |           |-- tokenizer.json -> ../../blobs/3cd03b50b92a3fb6336e697ae10d9d34
        |           `-- tokenizer_config.json -> ../../blobs/116ba5d2cd83996786437f70017cbd24
        `-- tmp
            |-- MDuPcfP
            `-- vjRf2rp

10 directories, 24 files
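
The listing above shows each file under `snapshots/hf` as a relative symlink into `blobs`, and the `I have no name!` prompt indicates that the shell runs under a UID with no passwd entry. Neither is confirmed as the cause of the error, but both are cheap to rule out after copying a cache between volumes. The sketch below (the helper name `check_cache` is hypothetical) reports symlinks whose targets are missing and directories that the current user cannot write to.

```python
import os


def check_cache(root):
    """Return (dangling_links, unwritable_dirs) found under root."""
    dangling, unwritable = [], []
    # os.walk does not follow directory symlinks by default.
    for dirpath, dirnames, filenames in os.walk(root):
        if not os.access(dirpath, os.W_OK):
            unwritable.append(dirpath)
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            # os.path.exists follows the link, so it is False for a broken symlink.
            if os.path.islink(full) and not os.path.exists(full):
                dangling.append(full)
    return dangling, unwritable


if __name__ == "__main__":
    # Only runs when NIM_CACHE_PATH is set, as it is inside the NIM container.
    root = os.environ.get("NIM_CACHE_PATH")
    if root:
        dangling, unwritable = check_cache(root)
        print("dangling symlinks:", dangling)
        print("directories not writable by this user:", unwritable)
```

Two empty lists would suggest that the copy preserved the layout and that permissions are not the obstacle, which narrows the problem to the NIM container or its interaction with KServe.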

NVIDIA / nim-deploy

KServe deployment fails with error when using cloud shared disk as PV #27