aws / eks-anywhere

Run Amazon EKS on your own infrastructure 🚀
https://anywhere.eks.amazonaws.com
Apache License 2.0

Sonobuoy Conformance on EKS-A on Baremetal shows failures. #3423

Open · elamaran11 opened 1 year ago

elamaran11 commented 1 year ago

I was trying to run a conformance test using Sonobuoy on EKS-A deployed on bare metal with partner hardware from Equinix. With Sonobuoy v0.56.10, the validation failed with the errors listed below.

When I ran with Sonobuoy v0.50.0, the validation never made any progress at all, which is a separate problem I also want to report.

I'd appreciate it if you could take a look into these failures and let us know whether they are safe to ignore or whether they will need a fix in an EKS-A release. Thanks in advance.
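For reference, the run itself was just the standard Sonobuoy flow, roughly like this (a sketch; my exact flags may have differed slightly, and the kubeconfig path is illustrative):

# Point sonobuoy at the workload cluster.
export KUBECONFIG=./my-eksa-cluster/my-eksa-cluster-eks-a-cluster.kubeconfig

# Run the certified conformance suite and wait for it to finish.
sonobuoy run --mode=certified-conformance --wait

# Pull the results tarball and print the pass/fail summary.
results=$(sonobuoy retrieve)
sonobuoy results "$results"

The failures below came out of that summary.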



[Fail] [sig-scheduling] SchedulerPredicates [Serial] [It] validates that NodeSelector is respected if not matching  [Conformance]
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/scheduling/predicates.go:436

[Fail] [sig-scheduling] SchedulerPredicates [Serial] [It] validates that there exists conflict between pods with same hostPort and protocol but one using 0.0.0.0 hostIP [Conformance]
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/scheduling/predicates.go:1068

[Fail] [sig-apps] Daemon set [Serial] [It] should rollback without unnecessary restarts [Conformance]
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/apps/daemon_set.go:432

[Fail] [sig-scheduling] SchedulerPredicates [Serial] [It] validates resource limits of pods that are allowed to run  [Conformance]
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/scheduling/predicates.go:323
jacobweinstock commented 1 year ago

Hey @elamaran11, thanks for reporting this. I'm not seeing any failures when testing on bare metal hardware with v0.56.10. My initial impression is that something specific to Equinix Metal, or to the deployment options, may be at play.

Would you mind sharing the details of your cluster creation (cluster spec, hardware CSV, etc.), please? Are you following the Equinix guide here?

I will test on Equinix Metal to see about reproducing. Thanks again for the report!

elamaran11 commented 1 year ago

@jacobweinstock Yes, I'm following that exact Equinix guide. Please give us an update as soon as you can reproduce and fix the issue.

Here is my hardware.csv file :

root@eksa-admin:~# cat hardware.csv
hostname,vendor,mac,ip_address,gateway,netmask,nameservers,disk,labels
eksa-node-cp-001,Equinix,10:70:fd:86:eb:f6,147.75.90.243,147.75.90.241,255.255.255.240,8.8.8.8,/dev/sda,type=cp
eksa-node-dp-001,Equinix,10:70:fd:7f:94:9e,147.75.90.244,147.75.90.241,255.255.255.240,8.8.8.8,/dev/sda,type=dp

Here is my cluster spec:

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: my-eksa-cluster
  namespace: default
spec:
  bundlesRef:
    apiVersion: anywhere.eks.amazonaws.com/v1alpha1
    name: bundles-15
    namespace: eksa-system
  clusterNetwork:
    cniConfig:
      cilium: {}
    pods:
      cidrBlocks:
      - 192.168.0.0/16
    services:
      cidrBlocks:
      - 10.96.0.0/12
  controlPlaneConfiguration:
    count: 1
    endpoint:
      host: 147.75.90.254
    machineGroupRef:
      kind: TinkerbellMachineConfig
      name: my-eksa-cluster-cp
  datacenterRef:
    kind: TinkerbellDatacenterConfig
    name: my-eksa-cluster
  kubernetesVersion: "1.23"
  managementCluster:
    name: my-eksa-cluster
  workerNodeGroupConfigurations:
  - count: 1
    machineGroupRef:
      kind: TinkerbellMachineConfig
      name: my-eksa-cluster
    name: md-0
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellDatacenterConfig
metadata:
  name: my-eksa-cluster
  namespace: default
spec:
  tinkerbellIP: 147.75.90.253
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellMachineConfig
metadata:
  annotations:
    anywhere.eks.amazonaws.com/control-plane: "true"
  name: my-eksa-cluster-cp
  namespace: default
spec:
  hardwareSelector:
    type: cp
  osFamily: bottlerocket
  templateRef:
    kind: TinkerbellTemplateConfig
    name: my-eksa-cluster
  users:
  - name: ec2-user
    sshAuthorizedKeys:
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellMachineConfig
metadata:
  name: my-eksa-cluster
  namespace: default
spec:
  hardwareSelector:
    type: dp
  osFamily: bottlerocket
  templateRef:
    kind: TinkerbellTemplateConfig
    name: my-eksa-cluster
  users:
  - name: ec2-user
    sshAuthorizedKeys:
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellTemplateConfig
metadata:
  name: my-eksa-cluster
  namespace: default
spec:
  template:
    global_timeout: 6000
    id: ""
    name: my-eksa-cluster
    tasks:
    - actions:
      - environment:
          COMPRESSED: "true"
          DEST_DISK: /dev/sda
          IMG_URL: https://anywhere-assets.eks.amazonaws.com/releases/bundles/15/artifacts/raw/1-23/bottlerocket-v1.23.7-eks-d-1-23-4-eks-a-15-amd64.img.gz
        image: public.ecr.aws/eks-anywhere/tinkerbell/hub/image2disk:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-15
        name: stream-image
        timeout: 600
      - environment:
          CONTENTS: |
            # Version is required, it will change as we support
            # additional settings
            version = 1

            # "eno1" is the interface name
            # Users may turn on dhcp4 and dhcp6 via boolean
            [enp1s0f0np0]
            dhcp4 = true
            dhcp6 = false
            # Define this interface as the "primary" interface
            # for the system.  This IP is what kubelet will use
            # as the node IP.  If none of the interfaces has
            # "primary" set, we choose the first interface in
            # the file
            primary = true
          DEST_DISK: /dev/sda12
          DEST_PATH: /net.toml
          DIRMODE: "0755"
          FS_TYPE: ext4
          GID: "0"
          MODE: "0644"
          UID: "0"
        image: public.ecr.aws/eks-anywhere/tinkerbell/hub/writefile:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-15
        name: write-netplan
        pid: host
        timeout: 90
      - environment:
          BOOTCONFIG_CONTENTS: |
            kernel {
                console = "ttyS1,115200n8"
            }
          DEST_DISK: /dev/sda12
          DEST_PATH: /bootconfig.data
          DIRMODE: "0700"
          FS_TYPE: ext4
          GID: "0"
          MODE: "0644"
          UID: "0"
        image: public.ecr.aws/eks-anywhere/tinkerbell/hub/writefile:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-15
        name: write-bootconfig
        pid: host
        timeout: 90
      - environment:
          DEST_DISK: /dev/sda12
          DEST_PATH: /user-data.toml
          DIRMODE: "0700"
          FS_TYPE: ext4
          GID: "0"
          HEGEL_URLS: http://147.75.90.242:50061,http://147.75.90.253:50061
          MODE: "0644"
          UID: "0"
        image: public.ecr.aws/eks-anywhere/tinkerbell/hub/writefile:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-15
        name: write-user-data
        pid: host
        timeout: 90
      - image: public.ecr.aws/eks-anywhere/tinkerbell/hub/reboot:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-15
        name: reboot-image
        pid: host
        timeout: 90
        volumes:
        - /worker:/worker
      name: my-eksa-cluster
      volumes:
      - /dev:/dev
      - /dev/console:/dev/console
      - /lib/firmware:/lib/firmware:ro
      worker: '{{.device_1}}'
    version: "0.1"
jacobweinstock commented 1 year ago

Hey @elamaran11. Here are the results of my conformance test. I wasn't able to reproduce the failures you posted. I did get one failure, but only because my cluster had a single worker node. One thing that did stand out was the Bottlerocket and Kubernetes versions. Yours: bottlerocket-v1.23.7-eks-d-1-23-4-eks-a-15-amd64; mine: bottlerocket-v1.23.9-eks-d-1-23-5-eks-a-17-amd64.

@elamaran11, would you mind doing another run on your side?

If you have only one worker node, Sonobuoy will report the following failure: [sig-apps] Daemon set [Serial] should rollback without unnecessary restarts [Conformance] (ref).
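
If re-running the full suite is too slow, Sonobuoy can also target just the failing specs; a sketch (the focus regex below is mine, not from your run):

# Re-run only the scheduling and daemon set specs that failed above.
sonobuoy run --wait \
  --e2e-focus 'SchedulerPredicates|Daemon set' \
  --e2e-skip ''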

I followed the guide from here, https://github.com/equinix-labs/terraform-equinix-metal-eks-anywhere, to set up the cluster.

cd terraform-equinix-metal-eks-anywhere/examples/deploy
terraform init
terraform apply

Then, on the admin node, I ran the conformance test with Sonobuoy v0.56.10.

sonobuoy run --wait
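
Once it finished, the summary below came from the usual retrieve/results flow (a sketch; the tarball name is timestamped, so yours will differ):

outfile=$(sonobuoy retrieve)
sonobuoy results "$outfile"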

Results

Plugin: e2e
Status: failed
Total: 7050
Passed: 343
Failed: 1
Skipped: 6706

Failed tests:
[sig-apps] Daemon set [Serial] should rollback without unnecessary restarts [Conformance]

Plugin: systemd-logs
Status: passed
Total: 2
Passed: 2
Failed: 0
Skipped: 0

Run Details:
API Server version: v1.23.9-eks-68c1cba
Node health: 2/2 (100%)
Pods health: 35/36 (97%)
Details for failed pods:
sonobuoy/sonobuoy-e2e-job-62d8ed75dd74406a Ready:False: ContainersNotReady: containers with unready status: [e2e sonobuoy-worker]
Errors detected in files:
Errors:
1705 podlogs/kube-system/cilium-jdlsz/logs/cilium-agent.txt
1347 podlogs/kube-system/kube-controller-manager-139.178.68.19/logs/kube-controller-manager.txt
 588 podlogs/sonobuoy/sonobuoy-e2e-job-62d8ed75dd74406a/logs/e2e.txt
 107 podlogs/kube-system/kube-apiserver-139.178.68.19/logs/kube-apiserver.txt
  70 podlogs/kube-system/kube-scheduler-139.178.68.19/logs/kube-scheduler.txt
   8 podlogs/kube-system/kube-proxy-vkptp/logs/kube-proxy.txt
   8 podlogs/kube-system/kube-proxy-tp52s/logs/kube-proxy.txt
   5 podlogs/kube-system/cilium-6thdk/logs/cilium-agent.txt
   1 podlogs/kube-system/etcd-139.178.68.19/logs/etcd.txt
   1 podlogs/kube-system/kube-vip-139.178.68.19/logs/kube-vip.txt
Warnings:
486 podlogs/kube-system/kube-controller-manager-139.178.68.19/logs/kube-controller-manager.txt
379 podlogs/kube-system/cilium-jdlsz/logs/cilium-agent.txt
103 podlogs/kube-system/kube-apiserver-139.178.68.19/logs/kube-apiserver.txt
 37 podlogs/kube-system/kube-scheduler-139.178.68.19/logs/kube-scheduler.txt
 14 podlogs/sonobuoy/sonobuoy-e2e-job-62d8ed75dd74406a/logs/e2e.txt
 10 podlogs/kube-system/cilium-6thdk/logs/cilium-agent.txt
  4 podlogs/kube-system/etcd-139.178.68.19/logs/etcd.txt
  2 podlogs/sonobuoy/sonobuoy/logs/kube-sonobuoy.txt
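
Those per-file error and warning counts refer to files inside the retrieved tarball, so any of them can be inspected directly after extracting it (paths here are from the run above):

mkdir results && tar -xzf "$outfile" -C results
less results/podlogs/kube-system/cilium-jdlsz/logs/cilium-agent.txt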

Here is the final generated EKS-A cluster config:

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: my-eksa-cluster
spec:
  clusterNetwork:
    cniConfig:
      cilium: {}
    pods:
      cidrBlocks:
        - 192.168.0.0/16
    services:
      cidrBlocks:
        - 10.96.0.0/12
  controlPlaneConfiguration:
    count: 1
    endpoint:
      host: "139.178.68.30"
    machineGroupRef:
      kind: TinkerbellMachineConfig
      name: my-eksa-cluster-cp
  datacenterRef:
    kind: TinkerbellDatacenterConfig
    name: my-eksa-cluster
  kubernetesVersion: "1.23"
  managementCluster:
    name: my-eksa-cluster
  workerNodeGroupConfigurations:
    - count: 1
      machineGroupRef:
        kind: TinkerbellMachineConfig
        name: my-eksa-cluster
      name: md-0
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellDatacenterConfig
metadata:
  name: my-eksa-cluster
spec:
  tinkerbellIP: "139.178.68.29"
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellMachineConfig
metadata:
  name: my-eksa-cluster-cp
spec:
  hardwareSelector:
    type: cp
  osFamily: bottlerocket
  templateRef:
    kind: TinkerbellTemplateConfig
    name: cp-my-eksa-cluster-m3-small-x86
  users:
    - name: ec2-user
      sshAuthorizedKeys:
        - ssh-rsa AA...
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellMachineConfig
metadata:
  name: my-eksa-cluster
spec:
  hardwareSelector:
    type: dp
  osFamily: bottlerocket
  templateRef:
    kind: TinkerbellTemplateConfig
    name: dp-my-eksa-cluster-m3-small-x86
  users:
    - name: ec2-user
      sshAuthorizedKeys:
        - ssh-rsa AA...
---
{}
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellTemplateConfig
metadata:
  name: cp-my-eksa-cluster-m3-small-x86
spec:
  template:
    global_timeout: 6000
    id: ""
    name: cp-my-eksa-cluster-m3-small-x86
    tasks:
    - actions:
      - environment:
          COMPRESSED: "true"
          DEST_DISK: /dev/sda
          IMG_URL: https://anywhere-assets.eks.amazonaws.com/releases/bundles/17/artifacts/raw/1-23/bottlerocket-v1.23.9-eks-d-1-23-5-eks-a-17-amd64.img.gz
        image: public.ecr.aws/eks-anywhere/tinkerbell/hub/image2disk:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-17
        name: stream-image
        timeout: 600
      - environment:
          CONTENTS: |
            # Version is required, it will change as we support
            # additional settings
            version = 1

            # "eno1" is the interface name
            # Users may turn on dhcp4 and dhcp6 via boolean
            [enp1s0f0np0]
            dhcp4 = true
            dhcp6 = false
            # Define this interface as the "primary" interface
            # for the system.  This IP is what kubelet will use
            # as the node IP.  If none of the interfaces has
            # "primary" set, we choose the first interface in
            # the file
            primary = true
          DEST_DISK: /dev/sda12
          DEST_PATH: /net.toml
          DIRMODE: "0755"
          FS_TYPE: ext4
          GID: "0"
          MODE: "0644"
          UID: "0"
        image: public.ecr.aws/eks-anywhere/tinkerbell/hub/writefile:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-17
        name: write-netplan
        pid: host
        timeout: 90
      - environment:
          BOOTCONFIG_CONTENTS: |
            kernel {
                console = "ttyS1,115200n8"
            }
          DEST_DISK: /dev/sda12
          DEST_PATH: /bootconfig.data
          DIRMODE: "0700"
          FS_TYPE: ext4
          GID: "0"
          MODE: "0644"
          UID: "0"
        image: public.ecr.aws/eks-anywhere/tinkerbell/hub/writefile:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-17
        name: write-bootconfig
        pid: host
        timeout: 90
      - environment:
          DEST_DISK: /dev/sda12
          DEST_PATH: /user-data.toml
          DIRMODE: "0700"
          FS_TYPE: ext4
          GID: "0"
          HEGEL_URLS: http://139.178.68.18:50061,http://139.178.68.29:50061
          MODE: "0644"
          UID: "0"
        image: public.ecr.aws/eks-anywhere/tinkerbell/hub/writefile:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-17
        name: write-user-data
        pid: host
        timeout: 90
      - image: public.ecr.aws/eks-anywhere/tinkerbell/hub/reboot:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-17
        name: reboot-image
        pid: host
        timeout: 90
        volumes:
        - /worker:/worker
      name: cp-my-eksa-cluster-m3-small-x86
      volumes:
        - /dev:/dev
        - /dev/console:/dev/console
        - /lib/firmware:/lib/firmware:ro
      worker: '{{.device_1}}'
    version: "0.1"
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: TinkerbellTemplateConfig
metadata:
  name: dp-my-eksa-cluster-m3-small-x86
spec:
  template:
    global_timeout: 6000
    id: ""
    name: dp-my-eksa-cluster-m3-small-x86
    tasks:
    - actions:
      - environment:
          COMPRESSED: "true"
          DEST_DISK: /dev/sda
          IMG_URL: https://anywhere-assets.eks.amazonaws.com/releases/bundles/17/artifacts/raw/1-23/bottlerocket-v1.23.9-eks-d-1-23-5-eks-a-17-amd64.img.gz
        image: public.ecr.aws/eks-anywhere/tinkerbell/hub/image2disk:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-17
        name: stream-image
        timeout: 600
      - environment:
          CONTENTS: |
            # Version is required, it will change as we support
            # additional settings
            version = 1

            # "eno1" is the interface name
            # Users may turn on dhcp4 and dhcp6 via boolean
            [enp1s0f0np0]
            dhcp4 = true
            dhcp6 = false
            # Define this interface as the "primary" interface
            # for the system.  This IP is what kubelet will use
            # as the node IP.  If none of the interfaces has
            # "primary" set, we choose the first interface in
            # the file
            primary = true
          DEST_DISK: /dev/sda12
          DEST_PATH: /net.toml
          DIRMODE: "0755"
          FS_TYPE: ext4
          GID: "0"
          MODE: "0644"
          UID: "0"
        image: public.ecr.aws/eks-anywhere/tinkerbell/hub/writefile:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-17
        name: write-netplan
        pid: host
        timeout: 90
      - environment:
          BOOTCONFIG_CONTENTS: |
            kernel {
                console = "ttyS1,115200n8"
            }
          DEST_DISK: /dev/sda12
          DEST_PATH: /bootconfig.data
          DIRMODE: "0700"
          FS_TYPE: ext4
          GID: "0"
          MODE: "0644"
          UID: "0"
        image: public.ecr.aws/eks-anywhere/tinkerbell/hub/writefile:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-17
        name: write-bootconfig
        pid: host
        timeout: 90
      - environment:
          DEST_DISK: /dev/sda12
          DEST_PATH: /user-data.toml
          DIRMODE: "0700"
          FS_TYPE: ext4
          GID: "0"
          HEGEL_URLS: http://139.178.68.18:50061,http://139.178.68.29:50061
          MODE: "0644"
          UID: "0"
        image: public.ecr.aws/eks-anywhere/tinkerbell/hub/writefile:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-17
        name: write-user-data
        pid: host
        timeout: 90
      - image: public.ecr.aws/eks-anywhere/tinkerbell/hub/reboot:6c0f0d437bde2c836d90b000312c8b25fa1b65e1-eks-a-17
        name: reboot-image
        pid: host
        timeout: 90
        volumes:
        - /worker:/worker
      name: dp-my-eksa-cluster-m3-small-x86
      volumes:
        - /dev:/dev
        - /dev/console:/dev/console
        - /lib/firmware:/lib/firmware:ro
      worker: '{{.device_1}}'
    version: "0.1"

Here is the hardware.csv:

hostname,vendor,mac,ip_address,gateway,netmask,nameservers,disk,labels
eksa-gi3g9q-node-cp-001,Equinix,10:70:fd:7f:99:a2,139.178.68.19,139.178.68.17,255.255.255.240,8.8.8.8,/dev/sda,type=cp
eksa-gi3g9q-node-dp-001,Equinix,10:70:fd:86:ee:aa,139.178.68.20,139.178.68.17,255.255.255.240,8.8.8.8,/dev/sda,type=dp
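
As a quick sanity check that both machines were registered before provisioning, listing the Tinkerbell hardware objects on the cluster is handy (assuming the default eksa-system namespace; yours may differ):

kubectl get hardware -n eksa-system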
jacobweinstock commented 1 year ago

Hey @displague and @cprivitere, would either of you, by chance, have any thoughts or insights on this?

displague commented 1 year ago

Looks like this was the only failed test, as you pointed out, because of the limited cluster size. What does it test?

Failed tests: [sig-apps] Daemon set [Serial] should rollback without unnecessary restarts [Conformance] 

We've released v0.3.2, but I can't think of any significant changes relative to the previous builds that you'd encounter here.

jacobweinstock commented 1 year ago

> Looks like this was the only failed test, as you pointed out, because of the limited cluster size. What does it test?
>
> Failed tests: [sig-apps] Daemon set [Serial] should rollback without unnecessary restarts [Conformance]
>
> We've released v0.3.2, but I can't think of any significant changes relative to the previous builds that you'd encounter here.

Hey @displague, thanks for the response. Any insight into @elamaran11's original failures at the very top, by chance?

elamaran11 commented 1 year ago

Team, any updates on this issue? We installed EKS-A on Dell hardware for a customer, ran Sonobuoy, and got the same failures.

elamaran11 commented 1 year ago

Team, any updates on this issue? We installed EKS-A on Intel/Dell hardware for a customer, ran Sonobuoy, and still see the following failures:

Summarizing 3 Failures:

[Fail] [sig-apps] Daemon set [Serial] [It] should rollback without unnecessary restarts [Conformance] 
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/apps/daemon_set.go:432

[Fail] [sig-scheduling] SchedulerPredicates [Serial] [It] validates that there exists conflict between pods with same hostPort and protocol but one using 0.0.0.0 hostIP [Conformance] 
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/scheduling/predicates.go:1068

[Fail] [sig-scheduling] SchedulerPredicates [Serial] [It] validates resource limits of pods that are allowed to run  [Conformance] 
/workspace/src/k8s.io/kubernetes/_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/scheduling/predicates.go:323

Ran 346 of 7044 Specs in 8646.423 seconds
FAIL! -- 343 Passed | 3 Failed | 0 Pending | 6698 Skipped
--- FAIL: TestE2E (8653.07s)
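
For anyone digging into these further, the per-test detail can be pulled out of the results tarball; a sketch (the tarball name comes from the retrieve step):

tarball=$(sonobuoy retrieve)
sonobuoy results "$tarball" --mode detailed --plugin e2e | grep '"failed"'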