NVIDIA / nim-deploy

A collection of YAML files, Helm Charts, Operator code, and guides to act as an example reference implementation for NVIDIA NIM deployment.
Apache License 2.0

Unable to deploy NIM via helm charts #7

Open tuninger opened 2 weeks ago

tuninger commented 2 weeks ago

The following error is reported when deploying via helm charts.

WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /etc/rancher/k3s/k3s.yaml
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /etc/rancher/k3s/k3s.yaml
E0603 15:26:34.744113 4011485 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:34.754524 4011485 memcache.go:255] couldn't get resource list for subresources.kubevirt.io/v1: the server is currently unable to handle the request
E0603 15:26:34.754592 4011485 memcache.go:255] couldn't get resource list for subresources.kubevirt.io/v1alpha3: the server is currently unable to handle the request
E0603 15:26:34.756254 4011485 memcache.go:255] couldn't get resource list for upload.cdi.kubevirt.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:34.756499 4011485 memcache.go:255] couldn't get resource list for upload.cdi.kubevirt.io/v1alpha1: the server is currently unable to handle the request
E0603 15:26:34.760913 4011485 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:34.764458 4011485 memcache.go:106] couldn't get resource list for subresources.kubevirt.io/v1: the server is currently unable to handle the request
E0603 15:26:34.779575 4011485 memcache.go:106] couldn't get resource list for upload.cdi.kubevirt.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:34.792693 4011485 memcache.go:106] couldn't get resource list for upload.cdi.kubevirt.io/v1alpha1: the server is currently unable to handle the request
E0603 15:26:34.795258 4011485 memcache.go:106] couldn't get resource list for subresources.kubevirt.io/v1alpha3: the server is currently unable to handle the request
E0603 15:26:35.078813 4011485 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:35.085645 4011485 memcache.go:255] couldn't get resource list for subresources.kubevirt.io/v1alpha3: the server is currently unable to handle the request
E0603 15:26:35.088455 4011485 memcache.go:255] couldn't get resource list for subresources.kubevirt.io/v1: the server is currently unable to handle the request
E0603 15:26:35.091571 4011485 memcache.go:255] couldn't get resource list for upload.cdi.kubevirt.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:35.093234 4011485 memcache.go:255] couldn't get resource list for upload.cdi.kubevirt.io/v1alpha1: the server is currently unable to handle the request
E0603 15:26:35.094846 4011485 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:35.105210 4011485 memcache.go:106] couldn't get resource list for upload.cdi.kubevirt.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:35.119386 4011485 memcache.go:106] couldn't get resource list for upload.cdi.kubevirt.io/v1alpha1: the server is currently unable to handle the request
E0603 15:26:35.121750 4011485 memcache.go:106] couldn't get resource list for subresources.kubevirt.io/v1: the server is currently unable to handle the request
E0603 15:26:35.124135 4011485 memcache.go:106] couldn't get resource list for subresources.kubevirt.io/v1alpha3: the server is currently unable to handle the request
E0603 15:26:35.149343 4011485 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:35.151760 4011485 memcache.go:255] couldn't get resource list for subresources.kubevirt.io/v1: the server is currently unable to handle the request
E0603 15:26:35.151812 4011485 memcache.go:255] couldn't get resource list for subresources.kubevirt.io/v1alpha3: the server is currently unable to handle the request
E0603 15:26:35.152251 4011485 memcache.go:255] couldn't get resource list for upload.cdi.kubevirt.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:35.155927 4011485 memcache.go:255] couldn't get resource list for upload.cdi.kubevirt.io/v1alpha1: the server is currently unable to handle the request
E0603 15:26:35.157864 4011485 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:35.160211 4011485 memcache.go:106] couldn't get resource list for subresources.kubevirt.io/v1alpha3: the server is currently unable to handle the request
E0603 15:26:35.171382 4011485 memcache.go:106] couldn't get resource list for upload.cdi.kubevirt.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:35.183259 4011485 memcache.go:106] couldn't get resource list for upload.cdi.kubevirt.io/v1alpha1: the server is currently unable to handle the request
E0603 15:26:35.185508 4011485 memcache.go:106] couldn't get resource list for subresources.kubevirt.io/v1: the server is currently unable to handle the request
E0603 15:26:35.212870 4011485 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:35.213350 4011485 memcache.go:255] couldn't get resource list for subresources.kubevirt.io/v1alpha3: the server is currently unable to handle the request
E0603 15:26:35.213617 4011485 memcache.go:255] couldn't get resource list for subresources.kubevirt.io/v1: the server is currently unable to handle the request
E0603 15:26:35.215707 4011485 memcache.go:255] couldn't get resource list for upload.cdi.kubevirt.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:35.220770 4011485 memcache.go:255] couldn't get resource list for upload.cdi.kubevirt.io/v1alpha1: the server is currently unable to handle the request
E0603 15:26:35.223051 4011485 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:35.234824 4011485 memcache.go:106] couldn't get resource list for upload.cdi.kubevirt.io/v1beta1: the server is currently unable to handle the request
E0603 15:26:35.246733 4011485 memcache.go:106] couldn't get resource list for upload.cdi.kubevirt.io/v1alpha1: the server is currently unable to handle the request
E0603 15:26:35.249252 4011485 memcache.go:106] couldn't get resource list for subresources.kubevirt.io/v1alpha3: the server is currently unable to handle the request
E0603 15:26:35.251595 4011485 memcache.go:106] couldn't get resource list for subresources.kubevirt.io/v1: the server is currently unable to handle the request
Error: INSTALLATION FAILED: request declared a Content-Length of 3363 but only wrote 0 bytes

kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.7+k3s-4159ac46", GitCommit:"4159ac4638b8617d909060dbf8ea923e622a92b4", GitTreeState:"clean", BuildDate:"2023-11-02T02:40:46Z", GoVersion:"go1.19.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.7+k3s-4159ac46", GitCommit:"4159ac4638b8617d909060dbf8ea923e622a92b4", GitTreeState:"clean", BuildDate:"2023-11-02T02:40:46Z", GoVersion:"go1.19.6", Compiler:"gc", Platform:"linux/amd64"}

How can I resolve this issue? Thanks.
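
The log above repeats the same few "couldn't get resource list" errors many times, so a summary makes it easier to see which API groups are involved. The following is a minimal sketch, not part of the original report: `failing_api_groups` is a hypothetical helper name, and it assumes the Helm output has been captured as text in the format shown above.

```shell
# Print each API group/version that the client reported as unavailable,
# followed by how many times it appears in the log read from stdin.
failing_api_groups() {
  grep -o "couldn't get resource list for [^:]*" \
    | sed "s/^couldn't get resource list for //" \
    | sort \
    | uniq -c \
    | awk '{ print $2, $1 }'
}

# Example (helm-install.log is a hypothetical file holding the captured output):
# failing_api_groups < helm-install.log
```

Each group it prints can then be compared against the AVAILABLE column of `kubectl get apiservices`. Whether those unavailable aggregated APIs are related to the final "Content-Length" failure is not established in this thread.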

supertetelman commented 2 weeks ago

Can you provide additional debug information about your K8s, hardware, OS, etc?

tuninger commented 2 weeks ago

> Can you provide additional debug information about your K8s, hardware, OS, etc?

@supertetelman, thanks for your response.

OS: CentOS Linux release 7.9.2009 (Core)
Hardware: 64C, 784G memory, 500G disk

kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short. Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.7+k3s-4159ac46", GitCommit:"4159ac4638b8617d909060dbf8ea923e622a92b4", GitTreeState:"clean", BuildDate:"2023-11-02T02:46:44Z", GoVersion:"go1.19.6", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v4.5.7
Server Version: version.Info{Major:"1", Minor:"25", GitVersion:"v1.25.7+k3s-4159ac46", GitCommit:"4159ac4638b8617d909060dbf8ea923e622a92b4", GitTreeState:"clean", BuildDate:"2023-11-02T02:46:44Z", GoVersion:"go1.19.6", Compiler:"gc", Platform:"linux/amd64"}

Which operating systems and Kubernetes versions are currently supported? Also, which hardware configurations are supported? Thanks.

supertetelman commented 1 week ago

Do you have any GPUs in your cluster or a valid running copy of the GPU Operator?

The prerequisites for the different NIMs are listed in the official NVIDIA docs here: https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html.

I don't know that we have tested this helm chart on CentOS, but the GPU Operator fully supports CentOS and I don't see any reason that the NIM LLM Helm chart should not work there.
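
For the GPU question above, one way to check is to list which nodes advertise the `nvidia.com/gpu` resource. This is a minimal sketch under stated assumptions: `gpu_nodes` is a hypothetical helper, and it expects the two-column output of the `kubectl get nodes` command shown in the comment.

```shell
# Read `kubectl get nodes` custom-columns output (NODE and GPU columns) from
# stdin and print the nodes that advertise at least one nvidia.com/gpu.
gpu_nodes() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 > 0 { print $1 }'
}

# Example pipeline (needs cluster access, so it is left commented out):
# kubectl get nodes \
#   -o 'custom-columns=NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' \
#   | gpu_nodes
```

An empty result would suggest that no node currently exposes a GPU to Kubernetes. If the GPU Operator is installed, its pods can be listed with `kubectl get pods -n gpu-operator`, assuming it was installed into the `gpu-operator` namespace.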

tuninger commented 1 week ago

Yes, I followed the documentation (https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html) to deploy the NIM LLM, but a new issue has arisen: logging in with $oauthtoken and the API key succeeds, but I can't pull the image llama3-8b-instruct:1.0.0. Is it not available for pulling yet?

I generated the API key by following the steps at https://org.ngc.nvidia.com/setup/personal-keys.

root@root:~# echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
WARNING! Your password will be stored unencrypted in /root/.docker/config.json.
Configure a credential helper to remove this warning. See
https://docs.docker.com/engine/reference/commandline/login/#credentials-store

Login Succeeded

root@root:~# docker run -it --rm --name=$CONTAINER_NAME --runtime=nvidia --gpus all --shm-size=16GB -e NGC_API_KEY -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" -u $(id -u) -p 8000:8000 $IMG_NAME
Unable to find image 'nvcr.io/nim/meta/llama3-8b-instruct:1.0.0' locally
docker: Error response from daemon: Head "https://nvcr.io/v2/nim/meta/llama3-8b-instruct/manifests/1.0.0": unauthorized:

401 Authorization Required

401 Authorization Required


. See 'docker run --help'.
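
To read the error above, it can help to see how the image name maps to the URL that returned the 401. The sketch below is illustrative only: `manifest_url` is a hypothetical helper, and it assumes an image reference with an explicit registry host and tag (no digest and no port in the host).

```shell
# Build the registry manifest URL that a pull requests for an image
# reference of the form <registry>/<repository>:<tag>.
manifest_url() {
  ref=$1
  registry=${ref%%/*}   # text before the first slash, e.g. nvcr.io
  rest=${ref#*/}        # repository and tag
  repo=${rest%:*}       # repository path without the tag
  tag=${rest##*:}       # tag after the last colon
  printf 'https://%s/v2/%s/manifests/%s\n' "$registry" "$repo" "$tag"
}

# Example:
# manifest_url nvcr.io/nim/meta/llama3-8b-instruct:1.0.0
```

The result matches the URL in the Docker error, so the registry rejected the request for that specific repository even though `docker login` reported success. The thread does not establish why the request was rejected.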