aws / eks-anywhere


bug: support bundle collection fails loudly when cluster inaccessible #1578

Open danbudris opened 2 years ago

danbudris commented 2 years ago

If the EKS-A CLI fails and then attempts to collect a support bundle while the target cluster is inaccessible, the bundle collection produces loud and alarming logs indicating that it failed to collect. This should be revisited: when bundle collection fails because the cluster is inaccessible, we should make that reason clear to the customer.

We should also move more of the bundle collection logging to the debug level to reduce noise in the UX.
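For illustration, a minimal sketch of the kind of reachability guard this could use before kicking off collection, assuming a client-go kubeconfig is at hand (the function name and structure here are hypothetical, not the actual EKS-A implementation):

```go
// Hypothetical guard: verify the cluster API server is reachable before
// attempting to collect a support bundle, so an unreachable cluster yields
// one clear message instead of a cascade of collector failures.
package diagnostics

import (
	"fmt"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func clusterReachable(kubeconfigPath string) error {
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfigPath)
	if err != nil {
		return fmt.Errorf("loading kubeconfig: %v", err)
	}
	client, err := discovery.NewDiscoveryClientForConfig(config)
	if err != nil {
		return fmt.Errorf("building discovery client: %v", err)
	}
	// A single version probe is enough to distinguish "cluster unreachable"
	// from genuine per-collector errors.
	if _, err := client.ServerVersion(); err != nil {
		return fmt.Errorf("cluster is not accessible, skipping support bundle collection: %v", err)
	}
	return nil
}
```

If the probe fails, the CLI could emit one plain-language message and skip collection entirely, rather than running every collector against a dead endpoint as in the log below.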

example failure:

./eksctl-anywhere create cluster --filename ${CLUSTER_NAME}.yaml --hardwarefile ClusterB-Hardware.yaml
Warning: The recommended number of control plane nodes is 3 or 5
Performing setup and validations
Warning: The tinkerbell infrastructure provider is still in development and should not be used in production
:white_check_mark: Tinkerbell Config is valid
:white_check_mark: Connected to tinkerbell stack
:white_check_mark: BMC connectivity validated
:white_check_mark: Hardware Config file validated
:white_check_mark: Machine Configs are valid
:white_check_mark: Cluster  controlPlaneConfiguration host IP available
:white_check_mark: Tinkerbell Provider setup is valid
:white_check_mark: Validate certificate for registry mirror
:white_check_mark: Create preflight validations pass
Creating new bootstrap cluster
Installing cluster-api providers on bootstrap cluster
collecting management cluster diagnostics
WARNING: failed to create eksa-diagnostics namespace. Some collectors may fail to run.  {"err": "error creating namespace eksa-diagnostics: The connection to the server 127.0.0.1:60149 was refused - did you specify the right host or port?\n"}
:hourglass_flowing_sand: Collecting support bundle from cluster, this can take a while  {"cluster": "bootstrap-cluster", "bundle": "youngjeong-eksa-baremetal/generated/bootstrap-cluster-2022-03-22T13:03:19-05:00-bundle.yaml", "since": 1647968599455104000, "kubeconfig": "youngjeong-eksa-baremetal/generated/youngjeong-eksa-baremetal.kind.kubeconfig"}
Support bundle archive created  {"path": "support-bundle-2022-03-22T18_04_31.tar.gz"}
Analyzing support bundle    {"bundle": "youngjeong-eksa-baremetal/generated/bootstrap-cluster-2022-03-22T13:03:19-05:00-bundle.yaml", "archive": "support-bundle-2022-03-22T18_04_31.tar.gz"}
Analysis output generated   {"path": "youngjeong-eksa-baremetal/generated/bootstrap-cluster-2022-03-22T13:13:27-05:00-analysis.yaml"}
WARNING: failed to clean up roles for eksa-diagnostics. {"err": "error executing apply: error: error when deleting \"STDIN\": Delete \"https://127.0.0.1:60149/apis/rbac.authorization.k8s.io/v1/clusterroles/diagnostic-collector-crd-reader\": net/http: TLS handshake timeout\n"}
WARNING: failed to clean up eksa-diagnostics namespace. {"err": "error creating namespace eksa-diagnostics: Error from server (NotFound): namespaces \"eksa-diagnostics\" not found\n", "namespace": "eksa-diagnostics"}
Error: error initializing capi resources in cluster: error executing init: Fetching providers
Using Override="core-components.yaml" Provider="cluster-api" Version="v1.0.2+cdba5e2"
Using Override="bootstrap-components.yaml" Provider="bootstrap-kubeadm" Version="v1.0.2+e14bc68"
Using Override="bootstrap-components.yaml" Provider="bootstrap-etcdadm-bootstrap" Version="v1.0.0-rc4+ee47c17"
Using Override="bootstrap-components.yaml" Provider="bootstrap-etcdadm-controller" Version="v1.0.0-rc6+ecc0a79"
Using Override="control-plane-components.yaml" Provider="control-plane-kubeadm" Version="v1.0.2+898d2b1"
Using Override="infrastructure-components.yaml" Provider="infrastructure-tinkerbell" Version="v0.1.0+7960ca9"
Installing cert-manager Version="v1.5.3+0ff88ee"
Using Override="cert-manager.yaml" Provider="cert-manager" Version="v1.5.3+0ff88ee"
Waiting for cert-manager to be available...
Installing Provider="cluster-api" Version="v1.0.2+cdba5e2" TargetNamespace="capi-system"
Installing Provider="bootstrap-kubeadm" Version="v1.0.2+e14bc68" TargetNamespace="capi-kubeadm-bootstrap-system"
E0322 17:38:23.359568     475 request.go:1085] Unexpected error when reading response body: net/http: request canceled (Client.Timeout or context cancellation while reading body)
Error: action failed after 10 attempts: failed to connect to the management cluster: action failed after 9 attempts: Get "https://127.0.0.1:60149/api?timeout=30s": net/http: TLS handshake timeout

Something failed on the bootstrap cluster and then the diagnostic bundle collection failed. Not sure what failed on bootstrap (the reporter cleaned everything up, so we can't tell now), but what's this diag bundle error?

chrisdoherty4 commented 2 years ago

@danbudris Do we have clearly defined log levels? It seems like there's a mashup of logger.V(<some number that vaguely seemed appropriate at the time>) right now.

danbudris commented 2 years ago

So there are in fact defined log levels, which we should apply here (see https://github.com/aws/eks-anywhere/blob/main/pkg/logger/doc.go#L12).
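For illustration, one way to make those levels stick at call sites is to wrap them in named helpers rather than raw logger.V(n) calls. This is a hedged sketch: the constant names and numeric values below are hypothetical, not the actual mapping from doc.go.

```go
// Hypothetical named verbosity levels wrapping raw logger.V(n) calls,
// so call sites express intent instead of a magic number.
package logger

import "github.com/go-logr/logr"

// Illustrative level values; the real mapping lives in pkg/logger/doc.go.
const (
	levelEssential = 0 // always shown: high-level workflow progress
	levelDebug     = 4 // troubleshooting detail, e.g. bundle collection steps
	levelTrace     = 9 // very chatty output, e.g. per-request traces
)

var log logr.Logger = logr.Discard() // placeholder; real code wires a concrete sink

// Info logs messages the user should always see.
func Info(msg string, keysAndValues ...interface{}) {
	log.V(levelEssential).Info(msg, keysAndValues...)
}

// Debug logs messages that belong in verbose output, such as the
// per-collector progress this issue proposes demoting.
func Debug(msg string, keysAndValues ...interface{}) {
	log.V(levelDebug).Info(msg, keysAndValues...)
}
```

With helpers like these, moving bundle collection chatter out of the default UX becomes a one-line change from Info to Debug at each call site.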