As far as I can tell, the only pieces in the k8s-cluster.yml playbook that have to do with NVIDIA components are:

# Install driver and container runtime on GPU servers
- include: nvidia-driver.yml
  tags:
    - nvidia
- include: nvidia-docker.yml
  tags:
    - nvidia

# Install k8s GPU device plugin
- include: k8s-gpu-plugin.yml
We could just wrap those three includes in a block that can be skipped with a skip-nvidia flag.
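A purely illustrative sketch of that idea (the deepops_skip_nvidia variable name is made up, and whether a playbook-level include accepts a when: condition depends on the Ansible version in use; running with --skip-tags "nvidia" would achieve the same effect):

# Hypothetical: skip the NVIDIA pieces when a (made-up) deepops_skip_nvidia variable is true
- include: nvidia-driver.yml
  tags:
    - nvidia
  when: not (deepops_skip_nvidia | default(false))
- include: nvidia-docker.yml
  tags:
    - nvidia
  when: not (deepops_skip_nvidia | default(false))
- include: k8s-gpu-plugin.yml
  tags:
    - nvidia
  when: not (deepops_skip_nvidia | default(false))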
Other than that, there are some monitoring components that deploy the DCGM exporter. I believe that functionality is in, or is coming into, the GPU Operator, so we will probably want to do the same with those roles in the ./scripts/k8s_deploy_monitoring.sh script as well.
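For the monitoring script, a similarly hedged sketch (the DEEPOPS_SKIP_NVIDIA variable and the echo lines are placeholders, not part of the actual script):

#!/usr/bin/env bash
# Hypothetical guard around the GPU-specific monitoring pieces in
# scripts/k8s_deploy_monitoring.sh; the variable name is a placeholder
if [ "${DEEPOPS_SKIP_NVIDIA:-false}" != "true" ]; then
    echo "Deploying GPU monitoring components (DCGM exporter, GPU dashboards)"
    # ... existing DCGM exporter / GPU dashboard deployment steps ...
else
    echo "Skipping GPU monitoring components (DEEPOPS_SKIP_NVIDIA=true)"
fi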
@ajdecon Thanks for the PR!
Just curious, does your gpu-operator playbook work to set up a single node? That's our current use case, with plans to scale out later once the single-node gpu-operator works well.
@rmccorm4 : Can you clarify what you mean by a single node use case?
I've confirmed that this playbook does work to set up a single-node Kubernetes cluster on a test host with the GPU operator enabled.
I need to add better docs to the repo, but here's the short version of how to build a single-node cluster this way:
1. In config/group_vars/k8s-cluster.yml, set deepops_gpu_operator_enabled to true.
2. In config/inventory, add your node to the sections all, kube-master, etcd, and kube-node. Note that you're putting the same name in all these locations.
3. Run ansible-playbook playbooks/k8s-cluster.yml. Because we've set deepops_gpu_operator_enabled to true, this playbook will include nvidia-gpu-operator.yml as part of the run.
Let me know if you have any questions, or if you're thinking of a different use case.
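Collapsing those steps into one place, a minimal sketch might look like this (the node name gpu01 and <NODE_IP> are placeholders):

# config/group_vars/k8s-cluster.yml
deepops_gpu_operator_enabled: true

# config/inventory (same node name in every section)
[all]
gpu01 ansible_host=<NODE_IP>

[kube-master]
gpu01

[etcd]
gpu01

[kube-node]
gpu01

# Then run the cluster playbook
ansible-playbook playbooks/k8s-cluster.yml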
@rmccorm4 - We've tested this several times and have not been able to reproduce a failure. Can you give additional details about your case (including output from your ansible run)?
Maybe I'm just going about this the wrong way by using my machine as the provisioner rather than another lightweight AWS "admin/login" instance, but I would think this path would be supported.
So here's what I've done so far:
[x] In config/group_vars/k8s-cluster.yml, set deepops_gpu_operator_enabled: true
[x] In config/inventory, add your node to the sections all, kube-master, etcd, and kube-node. Note that you're putting the same name in all these locations.
[all]
gpu01 ansible_host=<AWS_IP>
[kube-master]
gpu01
[etcd]
gpu01
[kube-node]
gpu01
[k8s-cluster:children]
kube-master
kube-node
[all:vars]
ansible_user=ubuntu
ansible_ssh_private_key_file=~/.ssh/<AWS_KEY.pem>
[x] Run ansible-playbook -l k8s-cluster playbooks/k8s-cluster.yml --skip-tags "ssh-public" (per #419)
[x] Run kubectl get pods --all-namespaces
1. Can't verify gpu-operator pods are running via kubectl
:x: kubectl get pods --all-namespaces
ubuntu@gpu01:~$ kubectl get pods --all-namespaces
The connection to the server localhost:8080 was refused - did you specify the right host or port?
2. Probably the cause of (1): an empty kubectl config
:x: kubectl config view is empty
ubuntu@gpu01:~$ kubectl config view
apiVersion: v1
clusters: []
contexts: []
current-context: ""
kind: Config
preferences: {}
users: []
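For reference, the usual fix for this symptom on a kubeadm-based node is to copy the admin kubeconfig into place; this is a generic sketch (not a DeepOps-specific step) and assumes /etc/kubernetes/admin.conf exists on the node:

# On the node itself, make the cluster admin kubeconfig available to kubectl
mkdir -p $HOME/.kube
sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
kubectl get pods --all-namespaces   # should now reach the API server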
3. However, GPU Operator seems to be working:
:heavy_check_mark: No driver on the host (as expected):
ubuntu@gpu01:~$ nvidia-smi
Command 'nvidia-smi' not found, but can be installed with:
:heavy_check_mark: But can successfully execute containers with NVIDIA container runtime:
ubuntu@gpu01:~$ sudo docker run -it nvcr.io/nvidia/tensorrt:19.12-py3
=====================
== NVIDIA TensorRT ==
=====================
NVIDIA Release 19.12 (build 9143065)
NVIDIA TensorRT 6.0.1 (c) 2016-2019, NVIDIA CORPORATION. All rights reserved.
Container image (c) 2019, NVIDIA CORPORATION. All rights reserved.
https://developer.nvidia.com/tensorrt
To install Python sample dependencies, run /opt/tensorrt/python/python_setup.sh
To install open source parsers, plugins, and samples, run /opt/tensorrt/install_opensource.sh. See https://github.com/NVIDIA/TensorRT/tree/19.12 for more information.
NOTE: Legacy NVIDIA Driver detected. Compatibility mode ENABLED.
root@4ee3a26bbbf0:/workspace# nvidia-smi
Mon Jan 27 22:15:12 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.40.04 Driver Version: 418.40.04 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... On | 00000000:00:1E.0 Off | 0 |
| N/A 30C P0 23W / 300W | 0MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
root@4ee3a26bbbf0:/workspace# exit
4. From my provisioning machine (nvbox, outside AWS):
:x: kubectl get pods --all-namespaces
ryan@nvbox:~/github/deepops$ kubectl get pods --all-namespaces
Unable to connect to the server: dial tcp 172.31.10.36:6443: i/o timeout
:warning: kubectl config view (not sure if the server IP should be different here; this is the private IP)
ryan@nvbox:~/github/deepops$ kubectl config view
apiVersion: v1
clusters:
- cluster:
certificate-authority-data: DATA+OMITTED
server: https://172.31.10.36:6443
name: kubernetes
contexts:
- context:
cluster: kubernetes
user: kubernetes-admin
name: kubernetes-admin@kubernetes
current-context: kubernetes-admin@kubernetes
kind: Config
preferences: {}
users:
- name: kubernetes-admin
user:
client-certificate-data: REDACTED
client-key-data: REDACTED
It seems like Kubernetes isn't configured correctly on the AWS instance, and kubectl on my provisioning machine is being pointed at the private IP.
@rmccorm4 did you open up access to your AMI on 6443/TCP? seems like kubectl can't connect to the API server (tcp 172.31.10.36:6443). If you don't want to expose it, you could tunnel over SSH and modify your admin.conf to point to localhost.
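A rough sketch of the SSH-tunnel option (the key path, user, and public IP are placeholders; 127.0.0.1 is used on the assumption that the API server certificate includes it in its SANs, which kubeadm-generated certificates normally do):

# From the provisioning machine: forward local port 6443 to the API server on the node
ssh -i ~/.ssh/<AWS_KEY.pem> -N -L 6443:127.0.0.1:6443 ubuntu@<PUBLIC_IP> &

# Point the kubeconfig's cluster entry at the local end of the tunnel
kubectl config set-cluster kubernetes --server=https://127.0.0.1:6443
kubectl get nodes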
@dholt good catch, I didn't have 6443/TCP exposed, I just added that to the security group. However, I still think there's an issue with:
# KUBECONFIG
...
server: https://172.31.10.36:6443
My provisioner, which is outside of AWS, can't access this private IP.
If I replace the private IP with the public IP, I now get this:
$ kubectl get nodes
Unable to connect to the server: x509: certificate is valid for 10.233.0.1, 172.31.10.36, 172.31.10.36, 10.233.0.1, 127.0.0.1, 172.31.10.36, not <PUBLIC_IP>
What confuses me is that in my config/inventory I've specified the public IP of the AWS instance, yet the private IP gets used when setting up Kubernetes. Is there some flag or toggle I can set in DeepOps to fix this?
This is most likely handled entirely in the Kubespray config; try starting here: https://github.com/kubernetes-sigs/kubespray/blob/master/docs/aws.md
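For what it's worth, one relevant knob (if you'd rather bake the public IP into the API server certificate than tunnel) is Kubespray's supplementary_addresses_in_ssl_keys variable; this is only a sketch, and exactly where it belongs in a DeepOps checkout may differ:

# config/group_vars/k8s-cluster.yml (Kubespray variable)
# Add the instance's public IP to the API server certificate SANs so that
# kubectl can talk to https://<PUBLIC_IP>:6443 without x509 errors
supplementary_addresses_in_ssl_keys:
  - <PUBLIC_IP>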
Closing this as it's most likely a Kubespray config issue, feel free to reopen if still an issue.
In order to support the GPU Operator as an easy way of spinning up GPU nodes with variable driver versions in a K8s cluster on the fly, I'd like to propose the ability to de-couple NVIDIA GPU components from the K8s cluster setup per the GPU Operator's requirements: https://github.com/NVIDIA/gpu-operator
I think the default functionality of playbooks/k8s-cluster.yml can remain as is, but I would like to see how we can ignore NVIDIA components via flags, tags, or config variables for flexibility.
Personally, I would think adding more nvidia tags to certain parts of the playbook and then using --skip-tags "nvidia" when running the playbook would be the simplest solution, but I don't know enough to say whether that would work out of the box.
I'm happy to discuss, as well as contribute once we have a good approach 🙂
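For illustration, that approach would boil down to something like the following (assuming every NVIDIA-related include in the playbook carries the nvidia tag):

# Deploy the Kubernetes cluster while skipping all NVIDIA-tagged includes,
# leaving driver and container-toolkit installation to the GPU Operator
ansible-playbook -l k8s-cluster playbooks/k8s-cluster.yml --skip-tags "nvidia"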