aws / eks-anywhere

Run Amazon EKS on your own infrastructure 🚀
https://anywhere.eks.amazonaws.com
Apache License 2.0

Where to find logs and diagnostics information for Tinkerbell workflow? #6196

Open ph-armada opened 12 months ago

ph-armada commented 12 months ago

What happened: Tinkerbell workflow failed in the EKSA provisioning process.

What you expected to happen: The workflow succeeded.

How to reproduce it (as minimally and precisely as possible): Create an EKSA cluster with an Ubuntu image, instead of the default Bottlerocket one.

Anything else we need to know?: Logs:

Environment:
        DEST_DISK:       /dev/sda2
        DEST_PATH:       /etc/netplan/config.yaml
        DIRMODE:         0755
        FS_TYPE:         ext4
        GID:             0
        MODE:            0644
        STATIC_NETPLAN:  true
        UID:             0
      Image:             public.ecr.aws/eks-anywhere/tinkerbell/hub/writefile:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-41
      Name:              write-netplan
      Pid:               host
      Seconds:           3
      Started At:        2023-07-18T22:03:30Z
      Status:            STATE_FAILED
      Timeout:           90

Environment:

May I know how to locate the logs and debug the issue?

csplinter commented 12 months ago

Hi @ph-armada - have you looked at the Bare Metal Troubleshooting guide? https://anywhere.eks.amazonaws.com/docs/troubleshooting/troubleshooting/#bare-metal-troubleshooting

ph-armada commented 12 months ago

Hi @csplinter - thanks for the information. Yes, my diagnosis is stuck at step 5 of the guide linked above:

If the machine has already started provisioning the OS and it’s in irrecoverable state, get the workflow of the provisioning/provisioned machine using:

kubectl get workflows -n eksa-system
kubectl describe workflow/<workflow-name> -n eksa-system 

The describe output shows which step succeeded or failed, but where can I obtain the detailed logs?

For example, the following one was marked as success, which is great:

      Environment:
        COMPRESSED:  true
        DEST_DISK:   /dev/sda
        IMG_URL:     https://xxx/ubuntu-2004-kube-v1.25.10.gz
      Image:         quay.io/tinkerbell-actions/image2disk:v1.0.0
      Name:          stream-image-3
      Seconds:       225
      Started At:    2023-07-19T22:31:25Z
      Status:        STATE_SUCCESS
      Timeout:       1200

But for the failed one, we would like to see if we can access the detailed logs:

      Environment:
        DEST_DISK:       /dev/sda
        DEST_PATH:       /etc/netplan/config.yaml
        DIRMODE:         0755
        FS_TYPE:         ext4
        GID:             0
        MODE:            0644
        STATIC_NETPLAN:  true
        UID:             0
      Image:             public.ecr.aws/eks-anywhere/tinkerbell/hub/writefile:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-15
      Name:              write-netplan
      Status:            STATE_FAILED
      Timeout:           90
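
The describe output lists every action with its status, so the failed steps can be picked out mechanically. Below is a minimal Go sketch (the function name `failedActions` is made up for illustration) that reads the describe text from standard input and prints the names of actions in STATE_FAILED. It assumes `Name:` appears before `Status:` inside each action block, as in the output quoted here. It only identifies the failed step; it does not retrieve the action's own logs.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
)

// failedActions scans `kubectl describe workflow` text and returns the names
// of actions whose Status is STATE_FAILED. It relies on Name appearing before
// Status within each action block (an assumption based on the quoted output).
func failedActions(describe string) []string {
	var failed []string
	var name string
	sc := bufio.NewScanner(strings.NewReader(describe))
	for sc.Scan() {
		// Split "Key:   value" at the first colon; skip lines without one.
		key, value, ok := strings.Cut(strings.TrimSpace(sc.Text()), ":")
		if !ok {
			continue
		}
		value = strings.TrimSpace(value)
		switch key {
		case "Name":
			name = value // remember the most recent name seen
		case "Status":
			if value == "STATE_FAILED" {
				failed = append(failed, name)
			}
		}
	}
	return failed
}

func main() {
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read stdin:", err)
		os.Exit(1)
	}
	for _, n := range failedActions(string(data)) {
		fmt.Println(n)
	}
}
```

Possible usage, assuming the file is saved as `main.go`: `kubectl describe workflow/<workflow-name> -n eksa-system | go run main.go`.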

csplinter commented 12 months ago

What do the CAPT controller manager logs show? (Your pod will have a different name.)

kubectl logs -n capt-system capt-controller-manager-9f8b95b-frbq

ph-armada commented 12 months ago

Hi @csplinter - it shows the following workflow failure message:

E0720 17:08:45.499588       1 controller.go:326]  "msg"="Reconciler error" "error"="workflow failed" "TinkerbellMachine"={"name":"mgmt01-control-plane-template-1689806247103-xpfvz","namespace":"eksa-system"} "controller"="tinkerbellmachine" "controllerGroup"="infrastructure.cluster.x-k8s.io" "controllerKind"="TinkerbellMachine" "name"="mgmt01-control-plane-template-1689806247103-xpfvz" "namespace"="eksa-system" "reconcileID"="e19fb3fc-5543-4e3d-8e51-20ec8d4005b5"
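
The line above is a run of `"key"="value"` pairs, so the useful fields (the error text, the machine name, the namespace) can be extracted with a small parser. This is a rough Go sketch; `parseFields` is a hypothetical helper, and the pattern assumes values contain no double quotes, so the JSON-valued `TinkerbellMachine` field is deliberately skipped.

```go
package main

import (
	"fmt"
	"regexp"
)

// pairRe matches "key"="value" pairs. Values that are not quoted strings
// (such as the JSON object after "TinkerbellMachine") do not match.
var pairRe = regexp.MustCompile(`"([^"=]+)"="([^"]*)"`)

// parseFields returns the quoted key/value pairs found in one log line.
func parseFields(line string) map[string]string {
	fields := make(map[string]string)
	for _, m := range pairRe.FindAllStringSubmatch(line, -1) {
		fields[m[1]] = m[2]
	}
	return fields
}

func main() {
	// Shortened copy of the log line from this thread.
	line := `E0720 17:08:45.499588       1 controller.go:326]  "msg"="Reconciler error" "error"="workflow failed" "TinkerbellMachine"={"name":"mgmt01-control-plane-template-1689806247103-xpfvz","namespace":"eksa-system"} "controller"="tinkerbellmachine" "name"="mgmt01-control-plane-template-1689806247103-xpfvz" "namespace"="eksa-system"`
	f := parseFields(line)
	fmt.Println(f["error"], f["name"], f["namespace"])
}
```

The extracted name and namespace identify the TinkerbellMachine whose workflow failed; the message itself carries no more detail than "workflow failed".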

ph-armada commented 12 months ago

Hi @csplinter - I found the logs on the LinuxKit OS. The Tinkerbell worker Docker container running there contains them.

It reports the following fatal error:

got an error while the discovery request: no matching response packet received

Could you shed some light on what might have caused this error? Thanks.

ph-armada commented 12 months ago

Is the writefile action image customized relative to Tinkerbell's original one? If so, where is the code that we can refer to?

I ask because this is Tinkerbell's upstream version: https://github.com/tinkerbell/hub/blob/main/actions/writefile/v1/main.go

    blockDevice := os.Getenv("DEST_DISK")
    filesystemType := os.Getenv("FS_TYPE")
    filePath := os.Getenv("DEST_PATH")

    contents := os.Getenv("CONTENTS")
    uid := os.Getenv("UID")
    gid := os.Getenv("GID")
    mode := os.Getenv("MODE")
    dirMode := os.Getenv("DIRMODE")

But it doesn't read the STATIC_NETPLAN variable that the EKS Anywhere docs recommend setting: https://anywhere.eks.amazonaws.com/docs/getting-started/baremetal/bare-spec/#ubuntu-tinkerbelltemplateconfig-example

Quote:

      - environment:
          DEST_DISK: /dev/sda2
          DEST_PATH: /etc/netplan/config.yaml
          STATIC_NETPLAN: true
          DIRMODE: "0755"
          FS_TYPE: ext4
          GID: "0"
          MODE: "0644"
          UID: "0"
        image: public.ecr.aws/eks-anywhere/tinkerbell/hub/writefile:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-15
        name: write-netplan
        timeout: 90
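
Note that the quoted action sets STATIC_NETPLAN but no CONTENTS, while the upstream code takes the file body from CONTENTS. The Go sketch below shows one way an action could branch on such a flag. It is an assumption for illustration only, not code from Tinkerbell or EKS Anywhere: `contentsSource` is a hypothetical name, and the "DHCP" branch merely stands for "derive the netplan from the network instead of CONTENTS".

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// contentsSource reports where the file body would come from:
// "CONTENTS" when STATIC_NETPLAN is unset or false, "DHCP" when it is true.
// (Hypothetical behavior; the real action may differ.)
func contentsSource(getenv func(string) string) (string, error) {
	raw := getenv("STATIC_NETPLAN")
	if raw == "" {
		return "CONTENTS", nil
	}
	static, err := strconv.ParseBool(raw)
	if err != nil {
		return "", fmt.Errorf("invalid STATIC_NETPLAN %q: %w", raw, err)
	}
	if static {
		return "DHCP", nil
	}
	return "CONTENTS", nil
}

func main() {
	src, err := contentsSource(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("file body would come from:", src)
}
```

Passing the lookup function as a parameter keeps the branch easy to exercise without touching the real environment.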

ph-armada commented 12 months ago

Here? https://github.com/aws/eks-anywhere-build-tooling/blob/main/projects/tinkerbell/hub/patches/0003-Write-DHCP-offer-to-static-Netplan-file.patch

ph-armada commented 12 months ago

Hi @csplinter - it seems the netplan writefile logic is indeed customized for EKSA, and it deals with DHCP. So the error now starts to make sense. What might have caused this error, and how can we mitigate it?

got an error while the discovery request: no matching response packet received