Open ph-armada opened 1 year ago
Hi @ph-armada - have you looked at the Bare Metal Troubleshooting guide? https://anywhere.eks.amazonaws.com/docs/troubleshooting/troubleshooting/#bare-metal-troubleshooting
Hi @csplinter - thanks for the information. Yes, my diagnosis is stuck at step 5 of the guide linked above:
If the machine has already started provisioning the OS and it’s in irrecoverable state, get the workflow of the provisioning/provisioned machine using:
kubectl get workflows -n eksa-system
kubectl describe workflow/<workflow-name> -n eksa-system
The workflow description shows which steps succeeded or failed, but where can I obtain the detailed logs for an individual step?
For example, the following action was marked as successful, which is great:
Environment:
COMPRESSED: true
DEST_DISK: /dev/sda
IMG_URL: https://xxx/ubuntu-2004-kube-v1.25.10.gz
Image: quay.io/tinkerbell-actions/image2disk:v1.0.0
Name: stream-image-3
Seconds: 225
Started At: 2023-07-19T22:31:25Z
Status: STATE_SUCCESS
Timeout: 1200
For the failed action, however, we would like to know whether we can access the detailed logs:
Environment:
DEST_DISK: /dev/sda
DEST_PATH: /etc/netplan/config.yaml
DIRMODE: 0755
FS_TYPE: ext4
GID: 0
MODE: 0644
STATIC_NETPLAN: true
UID: 0
Image: public.ecr.aws/eks-anywhere/tinkerbell/hub/writefile:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-15
Name: write-netplan
Status: STATE_FAILED
Timeout: 90
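When a workflow has many actions, the per-action summary above can be condensed with a small filter. The following is a minimal sketch, not an official tool: it assumes the `kubectl describe` output keeps the `Name:` line before the `Status:` line within each action, as in the excerpts above, and `workflow_action_status` is a hypothetical helper name.

```shell
# Hypothetical helper: condense `kubectl describe workflow` output to
# one "<action-name> <status>" line per action.
# Assumes the "Name:" line of each action precedes its "Status:" line,
# as in the excerpts quoted in this thread.
workflow_action_status() {
  awk '
    $1 == "Name:" { name = $2 }                          # remember the most recent name
    $1 == "Status:" && $2 ~ /^STATE_/ { print name, $2 } # emit on each action status
  '
}

# Usage (needs cluster access):
#   kubectl describe workflow/<workflow-name> -n eksa-system | workflow_action_status
```

This only identifies which action failed; it does not surface the output of the action itself.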
What do the CAPT controller manager logs show? (Your pod will have a different name.)
kubectl logs -n capt-system capt-controller-manager-9f8b95b-frbq
Hi @csplinter - it shows the following workflow failure message:
E0720 17:08:45.499588 1 controller.go:326] "msg"="Reconciler error" "error"="workflow failed" "TinkerbellMachine"={"name":"mgmt01-control-plane-template-1689806247103-xpfvz","namespace":"eksa-system"} "controller"="tinkerbellmachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="TinkerbellMachine" "name"="mgmt01-control-plane-template-1689806247103-xpfvz" "namespace"="eksa-system" "reconcileID"="e19fb3fc-5543-4e3d-8e51-20ec8d4005b5"
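With several machines provisioning at once, the controller log gets noisy. Below is a minimal sketch for reducing each `Reconciler error` line to the machine name and error text, assuming the `"key"="value"` log format shown above. `capt_errors` is a hypothetical helper name, and `deploy/capt-controller-manager` in the usage comment is an assumption inferred from the pod name.

```shell
# Hypothetical helper: keep only "Reconciler error" lines and print
# "<machine-name>: <error text>" for each of them.
# Assumes every such line carries "error"="..." followed later by "name"="...".
capt_errors() {
  grep 'Reconciler error' |
    sed -E 's/.*"error"="([^"]*)".*"name"="([^"]*)".*/\2: \1/'
}

# Usage (needs cluster access; the deployment name is an assumption):
#   kubectl logs -n capt-system deploy/capt-controller-manager | capt_errors
```

As the line above shows, the controller only reports `workflow failed`, so this helps find the affected machine rather than the root cause.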
Hi @csplinter - I found the logs from the LinuxKit OS. There is a Tinkerbell worker docker container there that contains the logs.
It says the following fatal error:
got an error while the discovery request: no matching response packet received
Could you shed some light on what might have caused this error? Thanks.
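For others looking for the same logs, here is a minimal sketch of pulling container logs once you have a shell on the machine while it is still running the LinuxKit OS. It assumes the `docker` CLI is available there, as described above. `containers_for_image` is a hypothetical helper, and the image substring in the usage comment is a guess to adjust to whatever `docker ps -a` actually lists.

```shell
# Hypothetical helper: read "ID IMAGE" pairs on stdin and print the IDs
# of containers whose image name contains the substring given as $1.
containers_for_image() {
  awk -v pat="$1" 'index($2, pat) > 0 { print $1 }'
}

# Usage on the machine being provisioned (-a includes exited containers):
#   docker ps -a --format '{{.ID}} {{.Image}}' | containers_for_image writefile |
#     while read -r id; do docker logs "$id"; done
```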
May I also know whether the writefile action image is customized relative to Tinkerbell's original? If so, where is the code that we can refer to?
I ask because this is Tinkerbell's version: https://github.com/tinkerbell/hub/blob/main/actions/writefile/v1/main.go
blockDevice := os.Getenv("DEST_DISK")
filesystemType := os.Getenv("FS_TYPE")
filePath := os.Getenv("DEST_PATH")
contents := os.Getenv("CONTENTS")
uid := os.Getenv("UID")
gid := os.Getenv("GID")
mode := os.Getenv("MODE")
dirMode := os.Getenv("DIRMODE")
But it doesn't read the STATIC_NETPLAN variable that is recommended in the EKS Anywhere doc:
https://anywhere.eks.amazonaws.com/docs/getting-started/baremetal/bare-spec/#ubuntu-tinkerbelltemplateconfig-example
Quote:
- environment:
    DEST_DISK: /dev/sda2
    DEST_PATH: /etc/netplan/config.yaml
    STATIC_NETPLAN: true
    DIRMODE: "0755"
    FS_TYPE: ext4
    GID: "0"
    MODE: "0644"
    UID: "0"
  image: public.ecr.aws/eks-anywhere/tinkerbell/hub/writefile:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-15
  name: write-netplan
  timeout: 90
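The mismatch can be checked mechanically. Below is a minimal sketch that lists the environment keys in a template action that the upstream `writefile` snippet quoted above never reads. The known-variable list is copied from that snippet, and `unknown_writefile_vars` and `write-netplan-action.yaml` are hypothetical names.

```shell
# Environment variables read by the upstream tinkerbell/hub writefile
# action, copied from the main.go excerpt quoted in this thread.
UPSTREAM_VARS="DEST_DISK FS_TYPE DEST_PATH CONTENTS UID GID MODE DIRMODE"

# Hypothetical helper: read "KEY: value" lines on stdin and print every
# upper-case key that is not in UPSTREAM_VARS.
unknown_writefile_vars() {
  awk -v known="$UPSTREAM_VARS" '
    BEGIN { n = split(known, k, " "); for (i = 1; i <= n; i++) ok[k[i]] = 1 }
    $1 ~ /^[A-Z_]+:$/ {
      key = substr($1, 1, length($1) - 1)   # drop the trailing colon
      if (!(key in ok)) print key
    }
  '
}

# Usage, feeding only the write-netplan action (file name is hypothetical):
#   unknown_writefile_vars < write-netplan-action.yaml
```

A key reported here only means the upstream code ignores it; it says nothing about what the EKS Anywhere build of the image does with it.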
Hi @csplinter - it seems the netplan writefile logic is indeed customized for EKSA, and it deals with DHCP, so the error now starts to make sense. What might have caused the following error, and how can we mitigate it?
got an error while the discovery request: no matching response packet received
What happened: Tinkerbell workflow failed in the EKSA provisioning process.
What you expected to happen: The workflow succeeded.
How to reproduce it (as minimally and precisely as possible): Create an EKSA cluster with an Ubuntu image, instead of the default Bottlerocket one.
Anything else we need to know?: Logs:
Environment:
May I know how to locate the logs and debug the issue?