NVIDIA / cloud-native-docs

Documentation repository for NVIDIA Cloud Native Technologies
https://docs.nvidia.com/datacenter/cloud-native/
Apache License 2.0

Describe pre-installed experience #103

Closed mikemckiernan closed 2 weeks ago

mikemckiernan commented 2 weeks ago

@francisguillier, PTAL when you can. This PR attempts to address a Slack exchange.

Review HTML: https://nvidia.github.io/cloud-native-docs/review/pr-103/gpu-operator/latest/getting-started.html#pre-installed-nvidia-gpu-drivers

github-actions[bot] commented 2 weeks ago

Documentation preview

https://nvidia.github.io/cloud-native-docs/review/pr-103

francisguillier commented 2 weeks ago

I think we should split the section "Pre-Installed NVIDIA GPU Drivers" into two:

Sub-section 1: disabling the driver container entirely:

" In this scenario, the NVIDIA GPU driver is already installed on the worker nodes that have GPUs:

helm install --wait --generate-name \ -n gpu-operator --create-namespace \ nvidia/gpu-operator \ --set driver.enabled=false

The preceding command prevents the Operator from installing the GPU driver on any nodes in the cluster "
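
If it helps, that sub-section could also include a short verification step. This is only a sketch; the daemon set name nvidia-driver-daemonset is my assumption about the Operator's default naming:

# With driver.enabled=false, no driver daemon set should be created.
# Expect no nvidia-driver-daemonset entry (assumed default name); the
# container toolkit and device plugin daemon sets should still be listed.
kubectl get daemonsets -n gpu-operator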

Sub-section 2: driver container enabled, but some or all nodes have a pre-installed driver:

"If any nodes in the cluster have the GPU driver pre-installed, the GPU driver pod detects the kernel and exits. The Operator proceeds to start other pods, such as the container toolkit pod.

If all the nodes in the cluster have the GPU driver pre-installed, the Operator detects that all GPU driver pods exited and stops the GPU driver daemon set, regardless of the driver.enabled value."

Note: I cannot confirm the validity of the last statement; checking internally.

francisguillier commented 2 weeks ago

@mikemckiernan

I checked with Chris and the statement

"If all the nodes in the cluster have the GPU driver pre-installed, the Operator detects that all GPU driver pods exited and stops the GPU driver daemon set, regardless of the driver.enabled value."

may be confusing.

We don't stop or delete any GPU driver daemon set per se. The GPU driver daemon set always exists, even when the above condition is met (all nodes in the cluster have the GPU driver pre-installed).

We need to rework the statement appropriately.

Chris wrote this and it may help:

If we detect drivers are pre-installed, we disable the driver pod on the node by labelling the node with nvidia.com/gpu.deploy.driver=preinstalled
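
One way the doc could surface that behavior is a quick label check. Sketch only; the -L column flag is standard kubectl, and the label key is the one Chris mentioned:

# Show the driver-deploy label on each node; nodes with a pre-installed
# driver should report "preinstalled" once the Operator detects them.
kubectl get nodes -L nvidia.com/gpu.deploy.driver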

mikemckiernan commented 2 weeks ago

Thank you, Francis. I misunderstood the exchange: the original config used precompiled drivers, and that is what prevented the daemon set from starting, not the pre-installed drivers.

I'm cautious about typing too much about this scenario. Up to now, it hasn't proven particularly thorny.

francisguillier commented 2 weeks ago

@mikemckiernan

I chatted with Chris and he gave this very good information that I think we should put in the doc:

"The init container in the driver pod detects if drivers are preinstalled. If preinstalled, it labels the node so that the driver pod is terminated and does not get re-scheduled on to the node."