aws / eks-anywhere

Run Amazon EKS on your own infrastructure 🚀
https://anywhere.eks.amazonaws.com
Apache License 2.0

Bare Metal: Nodes do not consistently netboot #6742

Open chrisdoherty4 opened 9 months ago

chrisdoherty4 commented 9 months ago

Summary

When provisioning a bare metal cluster, nodes are meant to netboot. Nodes are turned off by EKS-A and configured to netboot by CAPT. Occasionally a node boots from disk instead of netbooting, which requires operator intervention.

This was observed on Dell R240 machines only.

chrisdoherty4 commented 9 months ago

I've noticed this on numerous occasions. If I were to speculate, the nodes may be instructed to netboot and then turned on in such quick succession that the netboot setting doesn't actually get applied.
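To illustrate the suspected race, here is a minimal sketch of a "set netboot, confirm, then power on" sequence. Everything in it is hypothetical: the `BMC` interface, its method names, and `SlowFakeBMC` are invented for illustration and are not the CAPT, Rufio, or any vendor API. It also assumes the BMC can report its pending boot device, which may not hold on every machine.

```python
import time
from typing import Protocol


class BMC(Protocol):
    """Hypothetical BMC client interface (an assumption, not a real API)."""

    def set_next_boot_pxe(self) -> None: ...
    def get_next_boot(self) -> str: ...
    def power_on(self) -> None: ...


def netboot_then_power_on(bmc: BMC, timeout=30.0, interval=1.0,
                          sleep=time.sleep, clock=time.monotonic):
    """Request netboot, wait until the BMC reports it, then power on."""
    bmc.set_next_boot_pxe()
    deadline = clock() + timeout
    # Poll until the boot override is visible rather than assuming it
    # took effect as soon as the request returned.
    while bmc.get_next_boot() != "pxe":
        if clock() >= deadline:
            raise TimeoutError("BMC never reported PXE as the next boot device")
        sleep(interval)
    bmc.power_on()


class SlowFakeBMC:
    """Fake BMC whose boot override only becomes visible after a few reads."""

    def __init__(self, reads_until_applied=3):
        self._reads_left = reads_until_applied
        self._requested = False
        self.next_boot = "disk"
        self.powered_on_with = None

    def set_next_boot_pxe(self):
        self._requested = True

    def get_next_boot(self):
        if self._requested:
            if self._reads_left > 0:
                self._reads_left -= 1
            else:
                self.next_boot = "pxe"
        return self.next_boot

    def power_on(self):
        # Record which boot device was in effect at power-on time.
        self.powered_on_with = self.next_boot
```

With the fake above, calling `set_next_boot_pxe()` followed immediately by `power_on()` records a disk boot, while `netboot_then_power_on` waits for the override to show up first. Whether real hardware behaves like the fake is exactly the open question in this issue.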

chrisdoherty4 commented 8 months ago

CAPT deprovisions nodes by turning them off. That means it's possible they could boot again and rejoin the cluster.

If netboot isn't correctly selected and a recycled node is powered on, it boots from its existing disk, can rejoin an existing cluster, and creates provisioning issues.

The problem would be fixed by adjusting CAPI/CAPT to ensure recycled nodes cannot rejoin clusters.