DataONEorg / k8s-cluster

Documentation on the DataONE Kubernetes cluster
Apache License 2.0

Convert K8s-prod nodes k8s-node-7 and k8s-node-8 from VMs to bare-metal #47

Open nickatnceas opened 2 months ago

nickatnceas commented 2 months ago

K8s-prod nodes k8s-node-7 and k8s-node-8 are currently VMs on physical hosts host-ucsb-24 and host-ucsb-25. Deleting the node VMs and redeploying the nodes directly on the hosts will let us use memory that was previously reserved for the host, and should provide a small performance boost (~5%?) from removing the virtualization layer.
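For reference, the reclaimed memory should show up in the node's reported capacity; a minimal before/after check (node name taken from later in this thread):

# Show the node's Capacity and Allocatable resource sections
kubectl describe node k8s-node-7 | grep -A 6 Capacity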

Since these nodes do not benefit from live migration, i.e., they can be drained at any time without major interruption to services, and because the physical hosts will not be sharing resources with any other VMs, there is no benefit to running them as VMs in this case.
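Draining and removing a node ahead of the conversion would look something like this minimal sketch (node name illustrative):

# Cordon the node and evict its workloads onto other nodes
kubectl drain k8s-node-7 --ignore-daemonsets --delete-emptydir-data
# Remove the VM-based node object before redeploying on bare metal
kubectl delete node k8s-node-7
# After redeployment, confirm the node rejoined and is Ready
kubectl get nodes -o wide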

Dev nodes will move from hosts 24 and 25 to hosts 9 and 10, and increase from 16 to 32 vCPUs.

Current:

Planned:

nickatnceas commented 3 days ago

I attempted to deploy the k8s software onto host-ucsb-24 to run a bare-metal node, but hit some issues:

Instead of troubleshooting this old version, I'm going to move back to using the VMs for now. Once we have successfully upgraded K8s (#35), we can try again.
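For context, bringing a bare-metal node into an existing cluster typically looks like the sketch below. This assumes a kubeadm-managed cluster, which this thread does not confirm; the endpoint, token, and hash are placeholders:

# On a control-plane node: print a fresh join command
kubeadm token create --print-join-command
# On host-ucsb-24: join using the printed token and CA cert hash
kubeadm join <control-plane-endpoint>:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash>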

nickatnceas commented 3 days ago

Quick view of the issue:

outin@bluey:~/.kube$ kubectl get pods -A -o wide | grep host-ucsb-24
ceph-csi-cephfs   ceph-csi-cephfs-csi-cephfsplugin-jc89x         0/3     CrashLoopBackOff   18 (4m39s ago)    16m      128.111.85.154    host-ucsb-24    <none>           <none>
ceph-csi-rbd      ceph-csi-rbd-csi-cephrbdplugin-q9wn8           3/3     Running            18 (3m32s ago)    16m      128.111.85.154    host-ucsb-24    <none>           <none>
kube-system       calico-node-hdpx4                              0/1     CrashLoopBackOff   6 (112s ago)      17m      128.111.85.154    host-ucsb-24    <none>           <none>
kube-system       kube-proxy-mqdd8                               0/1     CrashLoopBackOff   5 (113s ago)      17m      128.111.85.154    host-ucsb-24    <none>           <none>
velero            node-agent-dwwp2                               0/1     CrashLoopBackOff   6 (2m26s ago)     16m      192.168.99.136    host-ucsb-24    <none>           <none>

Here is the k8s-node-7 VM after about the same amount of startup time:

outin@bluey:~/.kube$ kubectl get pods -A -o wide | grep k8s-node-7
ceph-csi-cephfs   ceph-csi-cephfs-csi-cephfsplugin-c78rc         3/3     Running      3                 16m      128.111.85.146    k8s-node-7      <none>           <none>
ceph-csi-rbd      ceph-csi-rbd-csi-cephrbdplugin-jr8c2           3/3     Running      3                 16m      128.111.85.146    k8s-node-7      <none>           <none>
kube-system       calico-node-pbchl                              1/1     Running      1                 16m      128.111.85.146    k8s-node-7      <none>           <none>
kube-system       kube-proxy-6kbc5                               1/1     Running      1                 16m      128.111.85.146    k8s-node-7      <none>           <none>
velero            node-agent-wqwn6                               1/1     Running      3 (11m ago)       16m      192.168.197.192   k8s-node-7      <none>           <none>
mbjones commented 3 days ago

For the pods in CrashLoopBackOff, you should get some helpful troubleshooting info by describing the pod status (e.g., kubectl describe -n kube-system pod kube-proxy-mqdd8).
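For reference, that plus pulling logs from the previous (crashed) container instance would be:

# Show events and container status for the crashing pod
kubectl describe -n kube-system pod kube-proxy-mqdd8
# Fetch logs from the container's previous, crashed run
kubectl logs -n kube-system kube-proxy-mqdd8 --previous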

nickatnceas commented 2 days ago

> For the pods in CrashLoopBackOff, you should get some helpful troubleshooting info by describing the pod status (e.g., kubectl describe -n kube-system pod kube-proxy-mqdd8).

I don't feel this is worth troubleshooting for a few reasons, mainly because these two versions are so old (1.23 and 1.24); the time would be better spent upgrading to the latest version and troubleshooting those issues (#35), and then fixing any issues that arise from this migration.
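As a quick check of the skew in question, the kubelet version each node runs appears in the VERSION column of:

# List nodes with kubelet version, OS image, and container runtime
kubectl get nodes -o wide
# Report client and server (control-plane) versions
kubectl version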

mbjones commented 2 days ago

Yep, totally agree on the version/upgrade stuff. Sorry for the diversion.