Closed: lhx692135353 closed this issue 3 months ago.
Because of this open issue with KServe, you must first run the NIM as a `docker` command, K8s Job, or K8s Pod outside of KServe, mounted to the PV, so that it downloads the NIM models and creates a reusable cache.
After populating this cache with the NIM artifacts, the above workflow should continue working.
I have added some notes about this to the README and hope to publish a more comprehensive guide shortly that eases this workflow with some automation and example YAML.
If you are encountering this outside of the KServe InferenceService, please let me know.
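A minimal sketch of that cache warm-up step as a one-shot K8s Job mounted to the same PVC. The image tag, secret name, claim name, and the `download-to-cache` entrypoint below are assumptions; verify them against your NIM image and deployment before using:

```yaml
# Hypothetical warm-up Job: populates the NIM cache on the PVC once,
# so a later KServe InferenceService can reuse it. All names are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: nim-cache-warmup
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: nim
          image: nvcr.io/nim/meta/llama3-8b-instruct:latest  # placeholder tag
          command: ["download-to-cache"]  # check this helper exists in your image
          env:
            - name: NIM_CACHE_PATH
              value: /model-store
            - name: NGC_API_KEY
              valueFrom:
                secretKeyRef:
                  name: ngc-api-key   # placeholder secret
                  key: NGC_API_KEY
          volumeMounts:
            - name: cache
              mountPath: /model-store
      volumes:
        - name: cache
          persistentVolumeClaim:
            claimName: nim-pvc        # the PVC the chart will later mount
```

Once the Job completes, the PVC holds the downloaded model artifacts and subsequent KServe-managed pods can start without re-downloading.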
```yaml
resources:
  limits:
    nvidia.com/gpu: 1
persistence:
  enabled: true
  existingClaim: ''
  storageClass: openebs-zfspv
  accessMode: ReadWriteOnce
  stsPersistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain
    whenScaled: Retain
  size: 50Gi
```
When I use these values (with `openebs-zfspv`, a local storageClass) to install the Helm chart, everything works fine. I then used `kubectl cp` to copy the model from the PV to the host disk at /share/nim (a network storage path) and used the following values instead:

```yaml
resources:
  limits:
    nvidia.com/gpu: 1
hostPath:
  enabled: true
  path: /share/nim
```
Error log:

```
issue a download request for file: "model-00004-of-00004.safetensors"; the repo file list has not been validated.
```
Looks like you may be copying the NIM cache incorrectly and it is not being recognized by the KServe NIM.
Try posting the full log along with the `tree` output of the NIM cache path.
It may also be beneficial to open a ticket with NVIDIA enterprise support if you are having issues with the NIM container itself and not just the reference KServe documentation.
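For reference, one common way such a copy goes wrong (an assumption here, not confirmed from this log): the `snapshots/` entries in the NGC hub cache are symlinks into `blobs/`, and a naive file copy can dereference or drop them. A tar-based copy preserves the links; the pod name and paths below are placeholders:

```shell
# Stream the cache out of the pod with tar so blob symlinks survive intact.
# "nim-pod" and the target path are placeholders for your environment.
kubectl exec nim-pod -- tar -C /model-store -cf - . | tar -C /share/nim -xf -
```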
Full log:

```
{"level": "INFO", "time": "07-29 09:21:27.100", "file_path": "/usr/local/lib/python3.10/dist-packages/vllm_nvext/hub/ngc_injector.py", "line_number": "166", "message": "Preparing model workspace. This step might download additional files to run the model.", "exc_info": "None", "stack_info": "None"}
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/entrypoints/openai/api_server.py", line 492, in <module>
    engine_args, extracted_name = inject_ngc_hub(engine_args)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/hub/ngc_injector.py", line 168, in inject_ngc_hub
    cached = repo.download_all()
Exception: "Errors fetching files: \nissue a download request for file: \"model-00004-of-00004.safetensors\"; the repo file list has not been validated. \nissue a download request for file: \"config.json\"; the repo file list has not been validated. \nissue a download request for file: \"model-00002-of-00004.safetensors\"; the repo file list has not been validated. \nissue a download request for file: \"tokenizer.json\"; the repo file list has not been validated. \nissue a download request for file: \"tokenizer_config.json\"; the repo file list has not been validated. \nissue a download request for file: \"model-00003-of-00004.safetensors\"; the repo file list has not been validated. \nissue a download request for file: \"model-00001-of-00004.safetensors\"; the repo file list has not been validated. \nissue a download request for file: \"generation_config.json\"; the repo file list has not been validated. \nissue a download request for file: \"special_tokens_map.json\"; the repo file list has not been validated. \nissue a download request for file: \"model.safetensors.index.json\"; the repo file list has not been validated. "
```
NIM cache path:

```
I have no name!@nim-llm-7qz9l-q5cdy4j89n-0:/model-store$ pwd
/model-store
I have no name!@nim-llm-7qz9l-q5cdy4j89n-0:/model-store$ ls -a
.  ..  huggingface  ngc
I have no name!@nim-llm-7qz9l-q5cdy4j89n-0:/model-store$ du -sh *
1.5K	huggingface
19G	ngc
```
```
I have no name!@nim-llm-7qz9l-88364eiye2c-0:/model-store$ env | grep NIM_CACHE_PATH
NIM_CACHE_PATH=/model-store
I have no name!@nim-llm-7qz9l-88364eiye2c-0:/model-store$ /exec/tree
.
|-- huggingface
|   `-- hub
|       `-- version.txt
`-- ngc
    `-- hub
        |-- models--nim--meta--llama3-8b-instruct
        |   |-- blobs
        |   |   |-- 116ba5d2cd83996786437f70017cbd24
        |   |   |-- 37e8986d8dfb638589a002ef7439184f
        |   |   |-- 3cd03b50b92a3fb6336e697ae10d9d34
        |   |   |-- 40c7647302ea8d49cc6694d8cd5573b7
        |   |   |-- 41bfb74c3bc5c7dc5823f089a9e268ca
        |   |   |-- 63a6311e76ee19b8f09f44da3962bc53
        |   |   |-- 6dae736d5cc6f142f521378dc0852bc5
        |   |   |-- 8d96e78ff88732d4ccefb0992889d85b
        |   |   |-- 8f5764e46d8818196c0ca82b37c4e5bc
        |   |   `-- 9d5825e3a47c1af3102cc90cc11358b5
        |   |-- refs
        |   |   `-- hf
        |   `-- snapshots
        |       `-- hf
        |           |-- config.json -> ../../blobs/6dae736d5cc6f142f521378dc0852bc5
        |           |-- generation_config.json -> ../../blobs/41bfb74c3bc5c7dc5823f089a9e268ca
        |           |-- model-00001-of-00004.safetensors -> ../../blobs/8f5764e46d8818196c0ca82b37c4e5bc
        |           |-- model-00002-of-00004.safetensors -> ../../blobs/9d5825e3a47c1af3102cc90cc11358b5
        |           |-- model-00003-of-00004.safetensors -> ../../blobs/40c7647302ea8d49cc6694d8cd5573b7
        |           |-- model-00004-of-00004.safetensors -> ../../blobs/63a6311e76ee19b8f09f44da3962bc53
        |           |-- model.safetensors.index.json -> ../../blobs/37e8986d8dfb638589a002ef7439184f
        |           |-- special_tokens_map.json -> ../../blobs/8d96e78ff88732d4ccefb0992889d85b
        |           |-- tokenizer.json -> ../../blobs/3cd03b50b92a3fb6336e697ae10d9d34
        |           `-- tokenizer_config.json -> ../../blobs/116ba5d2cd83996786437f70017cbd24
        `-- tmp
            |-- MDuPcfP
            `-- vjRf2rp

10 directories, 24 files
```
When following the deployment steps of KServe, I used Alibaba Cloud's shared disk as the PV and downloaded the model files to the corresponding directory. However, the system attempts to download the model files and reports an error regardless of whether the model files are already present in the directory. The error message is as follows:

```
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/entrypoints/openai/api_server.py", line 492, in <module>
    engine_args, extracted_name = inject_ngc_hub(engine_args)
  File "/usr/local/lib/python3.10/dist-packages/vllm_nvext/hub/ngc_injector.py", line 168, in inject_ngc_hub
    cached = repo.download_all()
Exception: "Errors fetching files: \nissue a download request for file: \"tokenizer_config.json\"; the repo file list has not been validated. \nissue a download request for file: \"model.safetensors.index.json\"; the repo file list has not been validated. \nissue a download request for file: \"model-00004-of-00004.safetensors\"; the repo file list has not been validated. \nissue a download request for file: \"model-00002-of-00004.safetensors\"; the repo file list has not been validated. \nissue a download request for file: \"special_tokens_map.json\"; the repo file list has not been validated. \nissue a download request for file: \"generation_config.json\"; the repo file list has not been validated. \nissue a download request for file: \"tokenizer.json\"; the repo file list has not been validated. \nissue a download request for file: \"config.json\"; the repo file list has not been validated. \nissue a download request for file: \"model-00001-of-00004.safetensors\"; the repo file list has not been validated. \nissue a download request for file: \"model-00003-of-00004.safetensors\"; the repo file list has not been validated. "
```

Pod status reason: Error