NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes
Apache License 2.0

RFE - Support for GPU Operator on ARM (Specifically Nvidia Jetson AGX Xavier) #230

Open schmaustech opened 2 years ago

schmaustech commented 2 years ago

I have been able to deploy a development release of Red Hat OpenShift 4.9 on RHCOS as a single-node cluster on my Nvidia Jetson AGX Xavier:

$ oc get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
master-0.kni7.schmaustech.com Ready master,worker 43h v1.21.0-rc.0+ec0996b Red Hat Enterprise Linux CoreOS 49.84.202106272247-0 (Ootpa) 4.18.0-305.3.1.el8_4.aarch64 cri-o://1.21.0-88.rhaos4.8.gitfd485de.el8

$ oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
authentication 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 123m
baremetal 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 43h
cloud-credential 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 43h
cluster-autoscaler 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 43h
config-operator 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 43h
console 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 122m
csi-snapshot-controller 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 27h
dns 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 125m
etcd 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 43h
image-registry 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 43h
ingress 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 43h
insights 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 43h
kube-apiserver 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 43h
kube-controller-manager 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 43h
kube-scheduler 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 43h
kube-storage-version-migrator 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 43h
machine-api 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 43h
machine-approver 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 43h
machine-config 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 43h
marketplace 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 43h
monitoring 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 122m
network 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 43h
node-tuning 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 43h
openshift-apiserver 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 123m
openshift-controller-manager 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 20h
openshift-samples 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 43h
operator-lifecycle-manager 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 43h
operator-lifecycle-manager-catalog 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 43h
operator-lifecycle-manager-packageserver 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 123m
service-ca 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 43h
storage 4.9.0-0.nightly-arm64-2021-06-29-064214 True False False 43h

I would like to use the GPU Operator to access the GPU in the AGX Xavier, but I believe that is not currently possible. When I tried to deploy it, I got the following:

$ oc get all -n gpu-operator-resources
No resources found in gpu-operator-resources namespace.

$ oc get all | egrep 'node|gpu'
pod/gpu-operator-64df558567-r6zr8 0/1 CrashLoopBackOff 6 8m54s
deployment.apps/gpu-operator 0/1 1 0 8m54s
replicaset.apps/gpu-operator-64df558567 1 1 0 8m54s

$ oc logs gpu-operator-64df558567-r6zr8
standard_init_linux.go:219: exec user process caused: exec format error
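For context, `exec format error` generally means the container's binary was built for a different CPU architecture than the node, which here would suggest the operator image had no arm64 build. A minimal sketch of that check, assuming the image is published as an OCI image index (manifest list); the `example_index` data and both function names are hypothetical and are not taken from the real gpu-operator image:

```python
# Sketch: does a multi-arch image index offer a build for the node's architecture?
# The index below is made-up illustration data, not real registry output.

def supported_architectures(index: dict) -> set:
    """Collect the architectures listed in an OCI image index / manifest list."""
    return {
        entry["platform"]["architecture"]
        for entry in index.get("manifests", [])
        if "platform" in entry
    }


def can_run_on(index: dict, node_arch: str) -> bool:
    """True if the index lists an image for node_arch (e.g. 'arm64', 'amd64')."""
    return node_arch in supported_architectures(index)


# Hypothetical index that only ships an amd64 image.
example_index = {
    "manifests": [
        {"platform": {"architecture": "amd64", "os": "linux"}},
    ]
}

# An aarch64 kernel corresponds to the 'arm64' architecture name in image indexes.
print(can_run_on(example_index, "arm64"))
```

If the node's architecture is missing from the index, the runtime may still pull an image for another architecture, and the container then fails at exec time as in the log above.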

Is this something planned in the future?

shivamerla commented 2 years ago

@schmaustech I will get back to you on this.

schmaustech commented 2 years ago

@shivamerla Any movement or update on this?

shivamerla commented 2 years ago

@schmaustech Support for GPU Operator on ARM is currently targeted for Q1 2022.

David-VTUK commented 2 years ago

@shivamerla - Any update on this, please?

jasonbarbee commented 2 years ago

@shivamerla How's this going?

shivamerla commented 2 years ago

@jasonbarbee @David-VTUK While GPU Operator v1.10.x added support for the ARM platform, support for Jetson devices is not there yet. It needs changes in k8s-device-plugin and container-toolkit, which are on the roadmap.

gaolingminhhh commented 5 months ago

Any update here in 2024?