NVIDIA / gpu-operator

NVIDIA GPU Operator creates/configures/manages GPUs atop Kubernetes

Custom build GPU operator to eliminate cuda-validator container - OCP 4.8 - GPU-Operator 1.7 and 1.8 #268

Closed: robynellis-zz closed this issue 2 years ago

robynellis-zz commented 2 years ago

The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.

Quick Debug Checklist

1. Issue or feature description

Non-urgent issue.

Is it possible to stop the GPU Operator from running the cuda-validator container on GPU Operator 1.7 and 1.8 in OpenShift 4.7 and 4.8? I am using older GPUs in a POC environment. Operator v1.6 works just fine: nvidia-smi in a separate pod with a GPU request runs and reports back on the GPU successfully, and I verified CUDA functionality using a third-party CUDA-enabled app (FaH). As expected it is very slow and inefficient, but still sufficient for a POC and an integration/automation example.

On operator v1.7 and v1.8, the driver pod compiles and runs fine, but the default DCGM image does not run and complains about the GPU. I copied the DCGM image signature from the 1.6 default policy and injected it into the ClusterPolicy for 1.7 and 1.8 to get around that, with success, and documented the process. Now the cuda-validator container does not have a binary for my GPUs, and I don't expect it to, since they are not on the supported list (K2000 and K600). I have tried many env variables in the ClusterPolicy to stop the cuda-validator container from running, without success.

Can you recommend an option for operator 1.7 and 1.8? If stopping it is not possible, and the cuda-validator container is required for the operator to function, could I rebuild a custom binary into the cuda-validator container myself and upload/pull it from a local repo? Are any old binaries/samples publicly available?

The reason for asking is education and enablement of staff, and a cheaper point of entry for learning on older tech in homelabs, etc. This would never be put into production or used in a customer environment. Thanks in advance for any help.

2. Steps to reproduce the issue

Install the operator from OperatorHub. Create the default ClusterPolicy. After the pods mentioned above fail, modify the ClusterPolicy and let the operator redeploy automatically.
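For example (illustrative commands only; the operand namespace and the ClusterPolicy name differ across operator versions and installs, and the validator pod suffix is a placeholder):

```sh
# Watch the operand pods come up after the ClusterPolicy is created
oc get pods -n gpu-operator-resources -w

# Inspect the failing validator pod (name suffix varies per node)
oc logs -n gpu-operator-resources nvidia-cuda-validator-<suffix>

# Edit the ClusterPolicy; the operator reconciles the change automatically
oc edit clusterpolicy gpu-cluster-policy
```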

3. Information to attach (optional if deemed irrelevant)

shivamerla commented 2 years ago

@robynellis since this is for non-production use, you can build a private gpu-operator-validator image and replace the vectorAdd sample with your own script or binary: https://github.com/NVIDIA/gpu-operator/blob/master/validator/Dockerfile#L35
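An untested sketch of that approach, assuming the cards are Kepler parts (K600/K2000 are compute capability 3.0, and CUDA 10.2 is the last toolkit that can target sm_30); the image tags, helper headers, and destination path below are illustrative, so check the Dockerfile linked above for the paths your operator version actually uses:

```dockerfile
# Build a vectorAdd binary that can run on compute capability 3.0 GPUs.
# vectorAdd.cu and its helper headers come from the NVIDIA/cuda-samples repo.
FROM nvidia/cuda:10.2-devel-ubuntu18.04 AS build
COPY vectorAdd.cu helper_cuda.h helper_string.h /build/
WORKDIR /build
RUN nvcc -I/build -arch=sm_30 -o vectorAdd vectorAdd.cu

# Start from the upstream validator image and swap in the sm_30 binary.
# The destination path is an assumption -- verify it against the linked
# Dockerfile for your operator version.
FROM nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2
COPY --from=build /build/vectorAdd /usr/bin/vectorAdd
```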

This image can be overridden during install with the validator.repository, validator.image, and validator.version fields in ClusterPolicy.
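For instance (the registry and tag are placeholders for wherever the private image is pushed):

```yaml
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  validator:
    repository: registry.example.com/homelab   # private registry (placeholder)
    image: gpu-operator-validator
    version: v1.8.2-sm30                       # tag of the custom build
```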

robynellis-zz commented 2 years ago

THANK YOU!!!!! Seriously, this is a great help to me! One more add-on question if you wouldn't mind: can you comment on the supportability of the T400 card? It is on the supported list on the platforms page, but I'd like to verify before I replace my old hardware with this card. A comparison with something like the P2000 would be nice if possible, as that is what I was looking at for a replacement. Expected workloads are JupyterHub notebooks for POC samples, and also VFIO, etc., to both CoreOS and Windows with containerd for DirectX graphics testing.

shivamerla commented 2 years ago

Yes, the T400 is supported with the GPU Operator. I don't think we have any documentation comparing features with the P2000, but I will ask. Maybe @dualvtable knows?