NVIDIA / deepops

Tools for building GPU clusters

GPU Operator Support #397

Closed: rmccorm4 closed this issue 4 years ago

rmccorm4 commented 4 years ago

In order to support the GPU Operator as an easy way of spinning up GPU nodes with variable driver versions in a K8s cluster on the fly, I'd like to propose decoupling the NVIDIA GPU components from the K8s cluster setup, per the GPU Operator's requirements: https://github.com/NVIDIA/gpu-operator

I think the default behavior of playbooks/k8s-cluster.yml can remain as-is, but I'd like to see how we can skip the NVIDIA components via flags, tags, or config variables for flexibility.

Personally, I would think that adding more nvidia tags to the relevant parts of the playbook and then running it with --skip-tags "nvidia" would be the simplest solution, but I don't know enough to say whether that would work out of the box.
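For example, if all of the NVIDIA driver/runtime/device-plugin tasks carried an nvidia tag, a run that leaves the GPU stack to the GPU Operator might look something like this (just a sketch; whether the existing tags actually cover everything is the part I'm unsure about):

# Hypothetical invocation: skip everything tagged "nvidia" and let the
# GPU Operator manage drivers and the container runtime inside the cluster.
ansible-playbook playbooks/k8s-cluster.yml --skip-tags nvidia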

I'm happy to discuss, as well as contribute once we have a good approach 🙂

supertetelman commented 4 years ago

As far as I can tell, the only pieces in the k8s-cluster.yml playbook that have to do with NVIDIA components are:

# Install driver and container runtime on GPU servers
- include: nvidia-driver.yml
  tags:
    - nvidia
- include: nvidia-docker.yml
  tags:
    - nvidia

# Install k8s GPU device plugin
- include: k8s-gpu-plugin.yml

We could just wrap those three includes in a block that gets skipped when a skip-nvidia flag is set.
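Something along these lines (a rough sketch; skip_nvidia is just a placeholder name until we pick one, and we'd want to confirm conditionals behave as expected on playbook-level includes):

# playbooks/k8s-cluster.yml (sketch, with a hypothetical skip_nvidia variable)
# Install driver and container runtime on GPU servers
- include: nvidia-driver.yml
  tags:
    - nvidia
  when: not (skip_nvidia | default(false))
- include: nvidia-docker.yml
  tags:
    - nvidia
  when: not (skip_nvidia | default(false))

# Install k8s GPU device plugin
- include: k8s-gpu-plugin.yml
  when: not (skip_nvidia | default(false))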

Other than that, there are some monitoring components that deploy the DCGM exporter. I believe that functionality is in, or is coming to, the GPU Operator, so we will probably want to do the same with those roles in the ./scripts/k8s_deploy_monitoring.sh script as well.

rmccorm4 commented 4 years ago

@ajdecon Thanks for the PR!

Just curious, does your gpu-operator playbook work for setting up a single node? That's our current use case, with plans to scale out later once the single-node gpu-operator setup works well.

ajdecon commented 4 years ago

@rmccorm4 : Can you clarify what you mean by a single node use case?

I've confirmed that this playbook does work to set up a single-node Kubernetes cluster on a test host with the GPU operator enabled.

I need to add better docs to the repo, but here's the short version of how to build a single-node cluster this way:

1. In config/group_vars/k8s-cluster.yml, set deepops_gpu_operator_enabled to true.
2. In config/inventory, add your node to the all, kube-master, etcd, and kube-node sections. Note that you're putting the same hostname in all of these locations (see the sketch below).
3. Run ansible-playbook playbooks/k8s-cluster.yml. Because we've set deepops_gpu_operator_enabled to true, the playbook will include nvidia-gpu-operator.yml as part of the run.
4. At the end, you should have a single-node cluster with the GPU Operator installed.
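For reference, here's roughly what that looks like (a sketch; the hostname and address are placeholders, and your inventory layout may differ slightly):

# config/group_vars/k8s-cluster.yml
deepops_gpu_operator_enabled: true

# config/inventory (the same node appears in every section)
[all]
gpu01 ansible_host=10.0.0.10

[kube-master]
gpu01

[etcd]
gpu01

[kube-node]
gpu01

# Then run the cluster playbook:
#   ansible-playbook playbooks/k8s-cluster.yml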

Let me know if you have any questions, or if you're thinking of a different use case.

michael-balint commented 4 years ago

@rmccorm4 - We've tested this several times and have not been able to reproduce a failure. Can you give additional details about your case (including output from your ansible run)?

rmccorm4 commented 4 years ago

Maybe I'm just going about this the wrong way by using my own machine as the provisioner rather than a lightweight AWS "admin/login" instance, but I would think this path should be supported.

So here's what I've done so far:


AWS Instance

1. Can't verify gpu-operator pods are running via kubectl

:x: kubectl get pods --all-namespaces

ubuntu@gpu01:~$ kubectl get pods --all-namespaces
The connection to the server localhost:8080 was refused - did you specify the right host or port?

2. Probably the cause of (1): an empty kubectl config

:x: kubectl config view empty

ubuntu@gpu01:~$ kubectl config view
apiVersion: v1
clusters: []
contexts: []
current-context: ""
kind: Config
preferences: {}
users: []

3. However, GPU Operator seems to be working:

:heavy_check_mark: No driver on the host (as expected):

ubuntu@gpu01:~$ nvidia-smi

Command 'nvidia-smi' not found, but can be installed with:

:heavy_check_mark: But can successfully execute containers with NVIDIA container runtime:

ubuntu@gpu01:~$ sudo docker run -it nvcr.io/nvidia/tensorrt:19.12-py3
=====================
== NVIDIA TensorRT ==
=====================

NVIDIA Release 19.12 (build 9143065)

NVIDIA TensorRT 6.0.1 (c) 2016-2019, NVIDIA CORPORATION.  All rights reserved.
Container image (c) 2019, NVIDIA CORPORATION.  All rights reserved.

https://developer.nvidia.com/tensorrt

To install Python sample dependencies, run /opt/tensorrt/python/python_setup.sh

To install open source parsers, plugins, and samples, run /opt/tensorrt/install_opensource.sh. See https://github.com/NVIDIA/TensorRT/tree/19.12 for more information.

NOTE: Legacy NVIDIA Driver detected.  Compatibility mode ENABLED.

root@4ee3a26bbbf0:/workspace# nvidia-smi
Mon Jan 27 22:15:12 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.04    Driver Version: 418.40.04    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:1E.0 Off |                    0 |
| N/A   30C    P0    23W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
root@4ee3a26bbbf0:/workspace# exit

On the provisioning machine

:x: kubectl get pods --all-namespaces

ryan@nvbox:~/github/deepops$ kubectl get pods --all-namespaces
Unable to connect to the server: dial tcp 172.31.10.36:6443: i/o timeout

:warning: kubectl config view (not sure if the server IP should be different here; this is the private IP)

ryan@nvbox:~/github/deepops$ kubectl config view
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: DATA+OMITTED
    server: https://172.31.10.36:6443
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    user: kubernetes-admin
  name: kubernetes-admin@kubernetes
current-context: kubernetes-admin@kubernetes
kind: Config
preferences: {}
users:
- name: kubernetes-admin
  user:
    client-certificate-data: REDACTED
    client-key-data: REDACTED

It seems like Kubernetes isn't configured correctly on the AWS instance, and kubectl on my provisioning machine is being pointed at the private IP.

dholt commented 4 years ago

@rmccorm4 did you open up access to your instance on 6443/TCP? It seems like kubectl can't connect to the API server (tcp 172.31.10.36:6443). If you don't want to expose it, you could tunnel over SSH and modify your admin.conf to point to localhost.
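If you go the tunnel route, something like this should do it (a sketch; assumes an ubuntu SSH user on the instance and that 127.0.0.1 is in the apiserver cert SANs, which it normally is for kubeadm/Kubespray-built clusters):

# On the provisioning machine: forward the API server port over SSH
ssh -N -L 6443:127.0.0.1:6443 ubuntu@<instance-public-ip>

# Then, in your local copy of admin.conf / ~/.kube/config, point kubectl at the tunnel:
#   server: https://127.0.0.1:6443
kubectl get nodes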

rmccorm4 commented 4 years ago

@dholt good catch; I didn't have 6443/TCP exposed, so I just added it to the security group. However, I still think there's an issue with:

# KUBECONFIG
    ...
    server: https://172.31.10.36:6443

My provisioner, which is outside of AWS, can't access this private IP.

If I replace the private IP with the public IP, I now get this:

$ kubectl get nodes
Unable to connect to the server: x509: certificate is valid for 10.233.0.1, 172.31.10.36, 172.31.10.36, 10.233.0.1, 127.0.0.1, 172.31.10.36, not <PUBLIC_IP>

What confuses me is that I've specified the public IP of the AWS instance in my config/inventory, yet the private IP gets used when setting up Kubernetes. Is there some flag or toggle I can set in DeepOps to fix this?

dholt commented 4 years ago

This is most likely all in the Kubespray config; try starting here: https://github.com/kubernetes-sigs/kubespray/blob/master/docs/aws.md
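For the x509 error in particular, Kubespray has a variable for adding extra SANs to the apiserver certificate; something like this in the k8s-cluster group vars should work (a sketch, assuming the vendored Kubespray version supports it):

# config/group_vars/k8s-cluster.yml (sketch)
# Add the instance's public IP to the apiserver cert so kubectl can
# connect from outside the VPC without an SSH tunnel.
supplementary_addresses_in_ssl_keys:
  - <PUBLIC_IP>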

ajdecon commented 4 years ago

Closing this as it's most likely a Kubespray config issue; feel free to reopen if it's still a problem.