NVIDIA / nim-deploy

A collection of YAML files, Helm Charts, Operator code, and guides to act as an example reference implementation for NVIDIA NIM deployment.
https://build.nvidia.com/
Apache License 2.0
141 stars 64 forks

Unable to deploy NIM via helm charts #7

Closed tuninger closed 3 months ago

tuninger commented 5 months ago

The following error is reported when deploying via helm charts.

WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/rancher/k3s/k3s.yaml
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /etc/rancher/k3s/k3s.yaml
E0603 15:26:34.744113 4011485 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:34.754524 4011485 memcache.go:255] couldn't get resource list for subresources.kubevirt.io/v1: the server is currently unable to handle the request
E0603 15:26:34.754592 4011485 memcache.go:255] couldn't get resource list for subresources.kubevirt.io/v1alpha3: the server is currently unable to handle the request
E0603 15:26:34.756254 4011485 memcache.go:255] couldn't get resource list for upload.cdi.kubevirt.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:34.756499 4011485 memcache.go:255] couldn't get resource list for upload.cdi.kubevirt.io/v1alpha1: the server is currently unable to handle the request
E0603 15:26:34.760913 4011485 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:34.764458 4011485 memcache.go:106] couldn't get resource list for subresources.kubevirt.io/v1: the server is currently unable to handle the request
E0603 15:26:34.779575 4011485 memcache.go:106] couldn't get resource list for upload.cdi.kubevirt.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:34.792693 4011485 memcache.go:106] couldn't get resource list for upload.cdi.kubevirt.io/v1alpha1: the server is currently unable to handle the request
E0603 15:26:34.795258 4011485 memcache.go:106] couldn't get resource list for subresources.kubevirt.io/v1alpha3: the server is currently unable to handle the request
E0603 15:26:35.078813 4011485 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:35.085645 4011485 memcache.go:255] couldn't get resource list for subresources.kubevirt.io/v1alpha3: the server is currently unable to handle the request
E0603 15:26:35.088455 4011485 memcache.go:255] couldn't get resource list for subresources.kubevirt.io/v1: the server is currently unable to handle the request
E0603 15:26:35.091571 4011485 memcache.go:255] couldn't get resource list for upload.cdi.kubevirt.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:35.093234 4011485 memcache.go:255] couldn't get resource list for upload.cdi.kubevirt.io/v1alpha1: the server is currently unable to handle the request
E0603 15:26:35.094846 4011485 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:35.105210 4011485 memcache.go:106] couldn't get resource list for upload.cdi.kubevirt.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:35.119386 4011485 memcache.go:106] couldn't get resource list for upload.cdi.kubevirt.io/v1alpha1: the server is currently unable to handle the request
E0603 15:26:35.121750 4011485 memcache.go:106] couldn't get resource list for subresources.kubevirt.io/v1: the server is currently unable to handle the request
E0603 15:26:35.124135 4011485 memcache.go:106] couldn't get resource list for subresources.kubevirt.io/v1alpha3: the server is currently unable to handle the request
E0603 15:26:35.149343 4011485 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:35.151760 4011485 memcache.go:255] couldn't get resource list for subresources.kubevirt.io/v1: the server is currently unable to handle the request
E0603 15:26:35.151812 4011485 memcache.go:255] couldn't get resource list for subresources.kubevirt.io/v1alpha3: the server is currently unable to handle the request
E0603 15:26:35.152251 4011485 memcache.go:255] couldn't get resource list for upload.cdi.kubevirt.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:35.155927 4011485 memcache.go:255] couldn't get resource list for upload.cdi.kubevirt.io/v1alpha1: the server is currently unable to handle the request
E0603 15:26:35.157864 4011485 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:35.160211 4011485 memcache.go:106] couldn't get resource list for subresources.kubevirt.io/v1alpha3: the server is currently unable to handle the request
E0603 15:26:35.171382 4011485 memcache.go:106] couldn't get resource list for upload.cdi.kubevirt.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:35.183259 4011485 memcache.go:106] couldn't get resource list for upload.cdi.kubevirt.io/v1alpha1: the server is currently unable to handle the request
E0603 15:26:35.185508 4011485 memcache.go:106] couldn't get resource list for subresources.kubevirt.io/v1: the server is currently unable to handle the request
E0603 15:26:35.212870 4011485 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:35.213350 4011485 memcache.go:255] couldn't get resource list for subresources.kubevirt.io/v1alpha3: the server is currently unable to handle the request
E0603 15:26:35.213617 4011485 memcache.go:255] couldn't get resource list for subresources.kubevirt.io/v1: the server is currently unable to handle the request
E0603 15:26:35.215707 4011485 memcache.go:255] couldn't get resource list for upload.cdi.kubevirt.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:35.220770 4011485 memcache.go:255] couldn't get resource list for upload.cdi.kubevirt.io/v1alpha1: the server is currently unable to handle the request
E0603 15:26:35.223051 4011485 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:35.234824 4011485 memcache.go:106] couldn't get resource list for upload.cdi.kubevirt.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:35.246733 4011485 memcache.go:106] couldn't get resource list for upload.cdi.kubevirt.io/v1alpha1: the server is currently unable to handle the request
E0603 15:26:35.249252 4011485 memcache.go:106] couldn't get resource list for subresources.kubevirt.io/v1alpha3: the server is currently unable to handle the request
E0603 15:26:35.251595 4011485 memcache.go:106] couldn't get resource list for subresources.kubevirt.io/v1: the server is currently unable to handle the request
Error: INSTALLATION FAILED: request declared a Content-Length of 3363 but only wrote 0 bytes
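The log above repeats the same few discovery errors many times, which makes it hard to see how many distinct API groups are affected. The following is a minimal sketch (not part of the original report) that counts the group/version names in such output. The function name `failing_api_groups` and the sample text are illustrative assumptions. Messages of this form usually point at aggregated APIs whose backing services are not responding, but this thread does not establish whether they are related to the final `INSTALLATION FAILED` error.

```python
import re
from collections import Counter

# Matches the client-side discovery errors seen in the log above.
# \s+ between words tolerates entries that were wrapped across lines.
DISCOVERY_ERROR = re.compile(
    r"couldn't\s+get\s+resource\s+list\s+for\s+(?P<group>[\w.\-]+/\w+):"
)


def failing_api_groups(log_text: str) -> Counter:
    """Count how often each API group/version appears in discovery errors."""
    return Counter(m.group("group") for m in DISCOVERY_ERROR.finditer(log_text))


if __name__ == "__main__":
    # Two entries copied from the log in this issue; paste the full output instead.
    sample = (
        "E0603 15:26:34.744113 4011485 memcache.go:255] couldn't get resource list "
        "for metrics.k8s.io/v1beta1: the server is currently unable to handle the request\n"
        "E0603 15:26:34.754524 4011485 memcache.go:255] couldn't get resource list "
        "for subresources.kubevirt.io/v1: the server is currently unable to handle the request\n"
    )
    for group, count in sorted(failing_api_groups(sample).items()):
        print(f"{group}: {count}")
```

Run against the full log, this would reduce the output to the handful of distinct groups (metrics.k8s.io, subresources.kubevirt.io, upload.cdi.kubevirt.io), which is the list to check on the cluster side.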

kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.7+k3s-4159ac46", GitCommit:"4159ac4638b8617d909060dbf8ea923e622a92b4", GitTreeState:"clean", BuildDate:"2023-11-02T02:40:46Z", GoVersion:"go1.19.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.7+k3s-4159ac46", GitCommit:"4159ac4638b8617d909060dbf8ea923e622a92b4", GitTreeState:"clean", BuildDate:"2023-11-02T02:40:46Z", GoVersion:"go1.19.6", Compiler:"gc", Platform:"linux/amd64"}

How can I resolve this issue? Thanks.

supertetelman commented 5 months ago

Can you provide additional debug information about your K8s, hardware, OS, etc?

tuninger commented 5 months ago

Can you provide additional debug information about your K8s, hardware, OS, etc?

@supertetelman, thanks for your response.

OS: CentOS Linux release 7.9.2009 (Core)
hardware: 64C, 784G memory, 500G disk
kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.7+k3s-4159ac46", GitCommit:"4159ac4638b8617d909060dbf8ea923e622a92b4", GitTreeState:"clean", BuildDate:"2023-11-02T02:46:44Z", GoVersion:"go1.19.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.7+k3s-4159ac46", GitCommit:"4159ac4638b8617d909060dbf8ea923e622a92b4", GitTreeState:"clean", BuildDate:"2023-11-02T02:46:44Z", GoVersion:"go1.19.6", Compiler:"gc", Platform:"linux/amd64"}

Which operating systems and Kubernetes versions are currently supported? Also, which hardware configurations are supported? Thanks.

supertetelman commented 5 months ago

Do you have any GPUs in your cluster or a valid running copy of the GPU Operator?

The prerequisites for the different NIMs are listed in the official NVIDIA docs: https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html

I don't know that we have tested this helm chart on CentOS, but the GPU Operator fully supports CentOS and I don't see any reason that the NIM LLM Helm chart should not work there.

tuninger commented 5 months ago

Yes, I followed the documentation (https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html) for the NIM LLM deployment, but a new issue has arisen. Logging in to nvcr.io with $oauthtoken and my API key succeeds, but I can't pull the image llama3-8b-instruct:1.0.0. Is it not open for pulling yet?

I generated the API key by following the steps at https://org.ngc.nvidia.com/setup/personal-keys.

root@root:~# echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
WARNING! Your password will be stored unencrypted in /root/.docker/config.json.
Configure a credential helper to remove this warning. See https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded

root@root:~# docker run -it --rm --name=$CONTAINER_NAME --runtime=nvidia --gpus all --shm-size=16GB -e NGC_API_KEY -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" -u $(id -u) -p 8000:8000 $IMG_NAME
Unable to find image 'nvcr.io/nim/meta/llama3-8b-instruct:1.0.0' locally
docker: Error response from daemon: Head "https://nvcr.io/v2/nim/meta/llama3-8b-instruct/manifests/1.0.0": unauthorized:
401 Authorization Required
401 Authorization Required
nginx/1.22.1
. See 'docker run --help'.
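The `docker login` output above notes that the credential is written to `/root/.docker/config.json`. The Docker CLI reads that file from the home directory of the user who invokes it, so a login performed as one user is not necessarily visible when `docker run` is started as another. The following is a minimal sketch (not from the original thread) that checks which username, if any, is stored inline for `nvcr.io`. The function name `stored_username` is an illustrative assumption, and the sketch returns `None` when a credential helper holds the secret instead of the file. It does not check whether the key is entitled to pull a given image, which is a separate possible cause of a 401.

```python
import base64
import json
from pathlib import Path
from typing import Optional


def stored_username(config: dict, registry: str = "nvcr.io") -> Optional[str]:
    """Return the username stored inline for `registry`, or None if there is none."""
    entry = config.get("auths", {}).get(registry, {})
    token = entry.get("auth")
    if not token:
        # Either no login for this registry, or a credential helper stores it.
        return None
    # The "auth" field is base64 of "username:password".
    username, _, _ = base64.b64decode(token).decode("utf-8").partition(":")
    return username


if __name__ == "__main__":
    # Per-user Docker CLI config; differs between root and non-root users.
    path = Path.home() / ".docker" / "config.json"
    if path.exists():
        print(path, "->", stored_username(json.loads(path.read_text())))
    else:
        print(path, "does not exist for this user")
```

Running it as each user that invokes `docker` (for example with and without `sudo`) shows whether both see a stored login; for NGC the expected username is the literal string `$oauthtoken`.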

supertetelman commented 4 months ago

Were you able to resolve your access issues?

rofinn commented 2 months ago

I am also hitting this issue. What's weird is that the same personal access key appears to work on a different machine.

rofinn commented 2 months ago

Never mind, for me it was a local Docker permissions issue.