DataONEorg / k8s-cluster

Documentation on the DataONE Kubernetes cluster
Apache License 2.0

Convert K8s-prod nodes k8s-node-7 and k8s-node-8 from VMs to bare-metal #47

Open nickatnceas opened 2 months ago

nickatnceas commented 2 months ago

K8s-prod nodes k8s-node-7 and k8s-node-8 are currently VMs on physical hosts host-ucsb-24 and host-ucsb-25. Deleting the node VMs and redeploying the nodes directly on the hosts will let us use memory that was previously reserved for the host, and should provide a small performance boost (~5%?) from removing the virtualization layer.
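For reference, the reclaimed memory should show up in the node's reported capacity; a minimal before/after check (node name taken from later in this thread):

# Show the node's Capacity and Allocatable resource sections
kubectl describe node k8s-node-7 | grep -A 6 Capacity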

Since these nodes do not benefit from live migration, i.e., they can be drained at any time without major interruption to services, and because the physical hosts will not be sharing resources with any other VMs, there is no benefit to running them as VMs in this case.
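Draining and removing a node ahead of the conversion would look something like this minimal sketch (node name illustrative):

# Cordon the node and evict its workloads onto other nodes
kubectl drain k8s-node-7 --ignore-daemonsets --delete-emptydir-data
# Remove the VM-based node object before redeploying on bare metal
kubectl delete node k8s-node-7
# After redeployment, confirm the node rejoined and is Ready
kubectl get nodes -o wide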

Dev nodes will move from hosts 24 and 25 to hosts 9 and 10, and increase from 16 to 32 vCPUs.

Current:

Planned:

nickatnceas commented 3 days ago

I attempted to deploy the k8s software onto host-ucsb-24 to run a bare-metal node, but hit some issues:

Instead of troubleshooting this old version, I'm going to move back to using the VMs for now. Once we have successfully upgraded K8s (#35), we can try again.
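For context, bringing a bare-metal node into an existing cluster typically looks like the sketch below. This assumes a kubeadm-managed cluster, which this thread does not confirm; the endpoint, token, and hash are placeholders:

# On a control-plane node: print a fresh join command
kubeadm token create --print-join-command
# On host-ucsb-24: join using the printed token and CA cert hash
kubeadm join <control-plane-endpoint>:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash>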

nickatnceas commented 3 days ago

Quick view of the issue:

outin@bluey:~/.kube$ kubectl get pods -A -o wide | grep host-ucsb-24
ceph-csi-cephfs   ceph-csi-cephfs-csi-cephfsplugin-jc89x         0/3     CrashLoopBackOff   18 (4m39s ago)    16m      128.111.85.154    host-ucsb-24    <none>           <none>
ceph-csi-rbd      ceph-csi-rbd-csi-cephrbdplugin-q9wn8           3/3     Running            18 (3m32s ago)    16m      128.111.85.154    host-ucsb-24    <none>           <none>
kube-system       calico-node-hdpx4                              0/1     CrashLoopBackOff   6 (112s ago)      17m      128.111.85.154    host-ucsb-24    <none>           <none>
kube-system       kube-proxy-mqdd8                               0/1     CrashLoopBackOff   5 (113s ago)      17m      128.111.85.154    host-ucsb-24    <none>           <none>
velero            node-agent-dwwp2                               0/1     CrashLoopBackOff   6 (2m26s ago)     16m      192.168.99.136    host-ucsb-24    <none>           <none>

Here is the k8s-node-7 VM after about the same amount of startup time:

outin@bluey:~/.kube$ kubectl get pods -A -o wide | grep k8s-node-7
ceph-csi-cephfs   ceph-csi-cephfs-csi-cephfsplugin-c78rc         3/3     Running      3                 16m      128.111.85.146    k8s-node-7      <none>           <none>
ceph-csi-rbd      ceph-csi-rbd-csi-cephrbdplugin-jr8c2           3/3     Running      3                 16m      128.111.85.146    k8s-node-7      <none>           <none>
kube-system       calico-node-pbchl                              1/1     Running      1                 16m      128.111.85.146    k8s-node-7      <none>           <none>
kube-system       kube-proxy-6kbc5                               1/1     Running      1                 16m      128.111.85.146    k8s-node-7      <none>           <none>
velero            node-agent-wqwn6                               1/1     Running      3 (11m ago)       16m      192.168.197.192   k8s-node-7      <none>           <none>
mbjones commented 3 days ago

For the pods in CrashLoopBackOff, you should get some helpful troubleshooting info by describing the pod status (e.g., kubectl describe -n kube-system pod kube-proxy-mqdd8).
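For reference, that plus pulling logs from the previous (crashed) container instance would be:

# Show events and container status for the crashing pod
kubectl describe -n kube-system pod kube-proxy-mqdd8
# Fetch logs from the container's previous, crashed run
kubectl logs -n kube-system kube-proxy-mqdd8 --previous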

nickatnceas commented 2 days ago

> For the pods in CrashLoopBackOff, you should get some helpful troubleshooting info by describing the pod status (e.g., kubectl describe -n kube-system pod kube-proxy-mqdd8).

I don't feel this is worth troubleshooting for a few reasons, mainly because these two versions are so old (1.23 and 1.24); the time would be better spent upgrading to the latest version and troubleshooting those issues (#35), and then fixing any issues that arise from this migration.
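As a quick check of the skew in question, the kubelet version each node runs appears in the VERSION column of:

# List nodes with kubelet version, OS image, and container runtime
kubectl get nodes -o wide
# Report client and server (control-plane) versions
kubectl version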

mbjones commented 2 days ago

Yep, totally agree on the version/upgrade stuff. Sorry for the diversion.