We followed the steps here: https://docs.nvidia.com/nim/large-language-models/latest/deploy-helm.html
After helm install ...., the pod keeps restarting. Logs from the failed container:

kubectl logs my-nim-0 --previous
...
{"level": "None", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": "[09-25 19:03:45.989 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:119] error sending request for url (https://api.ngc.nvidia.com/v2/org/nim/team/meta/models/llama-3_1-70b-instruct/hf-1d54af3-nim1.2/files)", "exc_info": "None", "stack_info": "None"}
{"level": "ERROR", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": "", "exc_info": "Traceback (most recent call last):\n File \"/usr/lib/python3.10/runpy.py\", line 196, in _run_module_as_main\n return _run_code(code, main_globals, None,\n File \"/usr/lib/python3.10/runpy.py\", line 86, in _run_code\n exec(code, run_globals)\n File \"/opt/nim/llm/vllm_nvext/entrypoints/launch.py\", line 99, in \n main()\n File \"/opt/nim/llm/vllm_nvext/entrypoints/launch.py\", line 42, in main\n inference_env = prepare_environment()\n File \"/opt/nim/llm/vllm_nvext/entrypoints/args.py\", line 155, in prepare_environment\n engine_args, extracted_name = inject_ngc_hub(engine_args)\n File \"/opt/nim/llm/vllm_nvext/hub/ngc_injector.py\", line 247, in inject_ngc_hub\n cached = repo.get_all()\nException: error sending request for url (https://api.ngc.nvidia.com/v2/org/nim/team/meta/models/llama-3_1-70b-instruct/hf-1d54af3-nim1.2/files)", "stack_info": "None"}
kubectl describe pod my-nim-0
...
Events:
  Type     Reason   Age                    From     Message
  Warning  BackOff  5m3s (x90 over 102m)   kubelet  Back-off restarting failed container nim-llm in pod my-nim-0_default(ce8f1e3a-f0e6-4a95-9086-2901091b7a57)
  Normal   Pulled   4m52s (x15 over 116m)  kubelet  Container image "nvcr.io/nim/meta/llama-3.1-70b-instruct:latest" already present on machine
kubectl get pods -A
NAMESPACE   NAME       READY   STATUS    RESTARTS         AGE
default     my-nim-0   0/1     Running   14 (6m46s ago)   117m
vim custom-value.yaml
image:
  repository: "nvcr.io/nim/meta/llama-3.1-70b-instruct" # container location
  tag: latest # NIM version you want to deploy
model:
  ngcAPISecret: ngc-api # name of a secret in the cluster that includes a key named NGC_API_KEY and is an NGC API key
imagePullSecrets:
  - name: ngc-secret # name of a secret used to pull nvcr.io images, see https://kubernetes.io/docs/tasks/configure-pod-container/pull-image-private-registry/
persistence:
  enabled: true
  size: 800Gi
  accessMode: ReadWriteMany
  storageClass: ""
  annotations:
    helm.sh/resource-policy: "keep"
livenessProbe:
  initialDelaySeconds: 600
  periodSeconds: 60
  timeoutSeconds: 10
startupProbe:
  initialDelaySeconds: 600
  periodSeconds: 60
  timeoutSeconds: 10
  failureThreshold: 1500
resources:
  limits:
    nvidia.com/gpu: 4 # much more GPU memory is required
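For reference, the two secrets referenced in custom-value.yaml (ngc-api for the model download inside the container, ngc-secret for pulling the image from nvcr.io) can be created along the lines of the NIM Helm docs. A sketch, assuming the NGC key is exported as NGC_API_KEY in the shell:

# Image pull secret for nvcr.io (referenced by imagePullSecrets)
kubectl create secret docker-registry ngc-secret \
  --docker-server=nvcr.io --docker-username='$oauthtoken' --docker-password="$NGC_API_KEY"

# Secret the NIM container reads the API key from (referenced by model.ngcAPISecret)
kubectl create secret generic ngc-api --from-literal=NGC_API_KEY="$NGC_API_KEY"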
... {"level": "None", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": "[09-25 19:03:45.989 ERROR nim_sdk::hub::repo rust/nim-sdk/src/hub/repo.rs:119] error sending request for url (https://api.ngc.nvidia.com/v2/org/nim/team/meta/models/llama-3_1-70b-instruct/hf-1d54af3-nim1.2/files)", "exc_info": "None", "stack_info": "None"} {"level": "ERROR", "time": "None", "file_name": "None", "file_path": "None", "line_number": "-1", "message": "", "exc_info": "Traceback (most recent call last):\n File \"/usr/lib/python3.10/runpy.py\", line 196, in _run_module_as_main\n return _run_code(code, main_globals, None,\n File \"/usr/lib/python3.10/runpy.py\", line 86, in _run_code\n exec(code, run_globals)\n File \"/opt/nim/llm/vllm_nvext/entrypoints/launch.py\", line 99, in\n main()\n File \"/opt/nim/llm/vllm_nvext/entrypoints/launch.py\", line 42, in main\n inference_env = prepare_environment()\n File \"/opt/nim/llm/vllm_nvext/entrypoints/args.py\", line 155, in prepare_environment\n engine_args, extracted_name = inject_ngc_hub(engine_args)\n File \"/opt/nim/llm/vllm_nvext/hub/ngc_injector.py\", line 247, in inject_ngc_hub\n cached = repo.get_all()\nException: error sending request for url (https://api.ngc.nvidia.com/v2/org/nim/team/meta/models/llama-3_1-70b-instruct/hf-1d54af3-nim1.2/files)", "stack_info": "None"}
kubectl describe pod my-nim-01
... Events: Type Reason Age From MessageWarning BackOff 5m3s (x90 over 102m) kubelet Back-off restarting failed container nim-llm in pod my-nim-0_default(ce8f1e3a-f0e6-4a95-9086-2901091b7a57) Normal Pulled 4m52s (x15 over 116m) kubelet Container image "nvcr.io/nim/meta/llama-3.1-70b-instruct:latest" already present on machine
kubectl get pods -A NAMESPACE NAME READY STATUS RESTARTS AGE default my-nim-0 0/1 Running 14 (6m46s ago) 117m
vim custom-value.yaml
image: repository: "nvcr.io/nim/meta/llama-3.1-70b-instruct" # container location tag: latest # NIM version you want to deploy model: ngcAPISecret: ngc-api # name of a secret in the cluster that includes a key named NGC_API_KEY and is an NGC API key imagePullSecrets: