nirs opened this issue 7 months ago (status: Open)
Example errors that are impossible to debug without collecting data:
kubectl events -A
With what we have now, we can only blindly increase the timeout, which may slow down retries that could otherwise recover from this issue.
drenv.commands.Error: Command failed:
command: ('addons/rook-operator/start', 'dr1')
exitcode: 1
error:
Traceback (most recent call last):
File "/home/nsoffer/ramen/test/addons/rook-operator/start", line 55, in <module>
wait(cluster)
File "/home/nsoffer/ramen/test/addons/rook-operator/start", line 28, in wait
kubectl.rollout(
File "/home/nsoffer/ramen/test/drenv/kubectl.py", line 134, in rollout
_watch("rollout", *args, context=context, log=log)
File "/home/nsoffer/ramen/test/drenv/kubectl.py", line 157, in _watch
for line in commands.watch(*cmd, input=input):
File "/home/nsoffer/ramen/test/drenv/commands.py", line 155, in watch
raise Error(args, error, exitcode=p.returncode)
drenv.commands.Error: Command failed:
command: ('kubectl', 'rollout', '--context', 'dr1', 'status', 'deploy/rook-ceph-operator', '--namespace=rook-ceph', '--timeout=300s')
exitcode: 1
error:
error: timed out waiting for the condition
I played with oc adm must-gather to understand what it can give us with an upstream setup.
Tested on a regional DR setup with one busybox application, after running for a few hours to reproduce another issue.
Running with the default image did not collect anything, since the image was not accessible.
$ time oc adm must-gather --context dr1
[must-gather ] OUT the server could not find the requested resource (get imagestreams.image.openshift.io must-gather)
[must-gather ] OUT
[must-gather ] OUT Using must-gather plug-in image: registry.redhat.io/openshift4/ose-must-gather:latest
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
error getting cluster version: the server could not find the requested resource (get clusterversions.config.openshift.io version)
ClusterID:
ClientVersion: 4.15.0-202403061939.p0.gd6175eb.assembly.stream.el8-d6175eb
ClusterVersion: Installing "" for <unknown>: <unknown>
error getting cluster operators: the server could not find the requested resource (get clusteroperators.config.openshift.io)
ClusterOperators:
clusteroperators are missing
[must-gather ] OUT namespace/openshift-must-gather-5tl4p created
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-jp25f created
[must-gather ] OUT pod for plug-in image registry.redhat.io/openshift4/ose-must-gather:latest created
[must-gather-c6vqn] OUT gather did not start: unable to pull image: ImagePullBackOff: Back-off pulling image "registry.redhat.io/openshift4/ose-must-gather:latest"
[must-gather ] OUT namespace/openshift-must-gather-5tl4p deleted
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-jp25f deleted
Error running must-gather collection:
gather did not start for pod must-gather-c6vqn: unable to pull image: ImagePullBackOff: Back-off pulling image "registry.redhat.io/openshift4/ose-must-gather:latest"
Falling back to `oc adm inspect clusteroperators.v1.config.openshift.io` to collect basic cluster information.
error running backup collection: the server doesn't have a resource type "clusteroperators"
Reprinting Cluster State:
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
error getting cluster version: the server could not find the requested resource (get clusterversions.config.openshift.io version)
ClusterID:
ClientVersion: 4.15.0-202403061939.p0.gd6175eb.assembly.stream.el8-d6175eb
ClusterVersion: Installing "" for <unknown>: <unknown>
error getting cluster operators: the server could not find the requested resource (get clusteroperators.config.openshift.io)
ClusterOperators:
clusteroperators are missing
error: gather did not start for pod must-gather-c6vqn: unable to pull image: ImagePullBackOff: Back-off pulling image "registry.redhat.io/openshift4/ose-must-gather:latest"
real 0m10.318s
user 0m0.206s
sys 0m0.070s
Looking at the source, I found quay.io/openshift/origin-must-gather, which works:
$ time oc adm must-gather --image=quay.io/openshift/origin-must-gather --context dr1
[must-gather ] OUT Using must-gather plug-in image: quay.io/openshift/origin-must-gather
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
error getting cluster version: the server could not find the requested resource (get clusterversions.config.openshift.io version)
ClusterID:
ClientVersion: 4.15.0-202403061939.p0.gd6175eb.assembly.stream.el8-d6175eb
ClusterVersion: Installing "" for <unknown>: <unknown>
error getting cluster operators: the server could not find the requested resource (get clusteroperators.config.openshift.io)
ClusterOperators:
clusteroperators are missing
[must-gather ] OUT namespace/openshift-must-gather-4zkbd created
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-9r7n9 created
[must-gather ] OUT pod for plug-in image quay.io/openshift/origin-must-gather created
[must-gather-pcsww] POD 2024-05-02T17:16:16.388647410Z volume percentage checker started.....
[must-gather-pcsww] POD 2024-05-02T17:16:16.424396680Z volume usage percentage 0
[must-gather-pcsww] POD 2024-05-02T17:16:16.621507357Z Error from server (NotFound): namespaces "openshift-cluster-version" not found
[must-gather-pcsww] POD 2024-05-02T17:16:16.621534644Z Error from server (NotFound): namespaces "openshift" not found
[must-gather-pcsww] POD 2024-05-02T17:16:16.621538355Z Error from server (NotFound): namespaces "openshift-etcd" not found
[must-gather-pcsww] POD 2024-05-02T17:16:18.919380951Z Waiting on subprocesses to finish execution.
[must-gather-pcsww] POD 2024-05-02T17:16:18.928886316Z INFO: Gathering HAProxy config files
[must-gather-pcsww] POD 2024-05-02T17:16:18.948180257Z WARNING: Collecting one or more kube-apiserver related logs on ALL masters in your cluster. This could take a large amount of time.
[must-gather-pcsww] POD 2024-05-02T17:16:18.956392041Z INFO: Gathering on-disk MachineConfig from degraded nodes
[must-gather-pcsww] POD 2024-05-02T17:16:18.972163514Z INFO: Collecting host service logs for crio
[must-gather-pcsww] POD 2024-05-02T17:16:18.972705417Z INFO: Collecting host service logs for kubelet
[must-gather-pcsww] POD 2024-05-02T17:16:18.973526928Z INFO: Collecting host service logs for rpm-ostreed
[must-gather-pcsww] POD 2024-05-02T17:16:18.973913424Z INFO: Collecting host service logs for ostree-finalize-staged
[must-gather-pcsww] POD 2024-05-02T17:16:18.974570133Z INFO: Collecting host service logs for machine-config-daemon-firstboot
[must-gather-pcsww] POD 2024-05-02T17:16:18.974976920Z INFO: Collecting host service logs for machine-config-daemon-host
[must-gather-pcsww] POD 2024-05-02T17:16:18.975389209Z INFO: Collecting host service logs for NetworkManager
[must-gather-pcsww] POD 2024-05-02T17:16:18.975758829Z INFO: Collecting host service logs for openvswitch
[must-gather-pcsww] POD 2024-05-02T17:16:18.976128475Z INFO: Collecting host service logs for ovs-configuration
[must-gather-pcsww] POD 2024-05-02T17:16:18.976518418Z INFO: Collecting host service logs for ovsdb-server
[must-gather-pcsww] POD 2024-05-02T17:16:18.976887768Z INFO: Collecting host service logs for ovs-vswitchd
[must-gather-pcsww] POD 2024-05-02T17:16:18.977333446Z INFO: Waiting for worker host service log collection to complete ...
[must-gather-pcsww] POD 2024-05-02T17:16:19.323209555Z INFO: Waiting for node performance related collection to complete ...
[must-gather-pcsww] POD 2024-05-02T17:16:19.747628728Z error: the server doesn't have a resource type "clustercsidriver"
[must-gather-pcsww] POD 2024-05-02T17:16:19.806296372Z error: the server doesn't have a resource type "clusterversion"
[must-gather-pcsww] POD 2024-05-02T17:16:20.067845016Z error: the server doesn't have a resource type "podnetworkconnectivitychecks"
[must-gather-pcsww] POD 2024-05-02T17:16:20.174215337Z error: a resource cannot be retrieved by name across all namespaces
[must-gather-pcsww] POD 2024-05-02T17:16:20.364051213Z error: the server doesn't have a resource type "routes"
[must-gather-pcsww] POD 2024-05-02T17:16:20.863092471Z error: the server doesn't have a resource type "performanceprofile"
[must-gather-pcsww] POD 2024-05-02T17:16:20.932718419Z INFO: "metallb-operator" not detected. Skipping.
[must-gather-pcsww] POD 2024-05-02T17:16:20.954723928Z INFO: Collecting Insights Archives from
[must-gather-pcsww] POD 2024-05-02T17:16:21.081034002Z error: the server doesn't have a resource type "ingresscontroller"
[must-gather-pcsww] POD 2024-05-02T17:16:21.084002336Z No resources found
[must-gather-pcsww] POD 2024-05-02T17:16:21.104097761Z No resources found in openshift-etcd namespace.
[must-gather-pcsww] POD 2024-05-02T17:16:21.114010831Z INFO: "sriov-network-operator" not detected. Skipping.
[must-gather-pcsww] POD 2024-05-02T17:16:21.119205041Z INFO: "kubernetes-nmstate-operator" not detected. Skipping.
[must-gather-pcsww] POD 2024-05-02T17:16:21.125265772Z INFO: Worker host service log collection to complete.
[must-gather-pcsww] POD 2024-05-02T17:16:21.126094344Z INFO: Waiting for HAProxy config collection to complete ...
[must-gather-pcsww] POD 2024-05-02T17:16:21.126117096Z INFO: HAProxy config collection complete.
[must-gather-pcsww] POD 2024-05-02T17:16:21.177920500Z INFO: Waiting for on-disk MachineConfig collection to complete ...
[must-gather-pcsww] POD 2024-05-02T17:16:21.177942283Z INFO: on-disk MachineConfig config collection complete.
[must-gather-pcsww] POD 2024-05-02T17:16:21.214639597Z Wrote inspect data to must-gather.
[must-gather-pcsww] POD 2024-05-02T17:16:21.272760034Z Wrote inspect data to must-gather.
[must-gather-pcsww] POD 2024-05-02T17:16:21.358301106Z error: resource name may not be empty
[must-gather-pcsww] POD 2024-05-02T17:16:21.365273232Z Wrote inspect data to must-gather.
[must-gather-pcsww] POD 2024-05-02T17:16:21.389881470Z error: the server doesn't have a resource type "network"
[must-gather-pcsww] POD 2024-05-02T17:16:21.649852626Z volume usage percentage 0
[must-gather-pcsww] POD 2024-05-02T17:16:21.706715719Z error: the server doesn't have a resource type "machineconfigs"
[must-gather-pcsww] POD 2024-05-02T17:16:21.869744850Z error: the server doesn't have a resource type "multi-networkpolicy"
[must-gather-pcsww] POD 2024-05-02T17:16:22.015923944Z error: the server doesn't have a resource type "machineconfigpools"
[must-gather-pcsww] POD 2024-05-02T17:16:22.078964549Z error: the server doesn't have a resource type "net-attach-def"
[must-gather-pcsww] POD 2024-05-02T17:16:22.226188970Z error: the server doesn't have a resource type "overlappingrangeipreservations"
[must-gather-pcsww] POD 2024-05-02T17:16:22.277823579Z error: the server doesn't have a resource type "ippools"
[must-gather-pcsww] POD 2024-05-02T17:16:22.346043447Z No resources found
[must-gather-pcsww] POD 2024-05-02T17:16:22.479528067Z INFO: Waiting for network log collection to complete ...
[must-gather-pcsww] POD 2024-05-02T17:16:22.506073546Z INFO: Network log collection complete.
[must-gather-pcsww] POD 2024-05-02T17:16:22.580723347Z error: the server doesn't have a resource type "featuregates"
[must-gather-pcsww] POD 2024-05-02T17:16:22.751836119Z error: the server doesn't have a resource type "kubeletconfigs"
[must-gather-pcsww] POD 2024-05-02T17:16:22.919859455Z error: the server doesn't have a resource type "tuneds"
[must-gather-pcsww] POD 2024-05-02T17:16:23.090284347Z Wrote inspect data to must-gather.
[must-gather-pcsww] POD 2024-05-02T17:16:23.425026194Z Error from server (NotFound): namespaces "openshift-cluster-node-tuning-operator" not found
[must-gather-pcsww] POD 2024-05-02T17:16:23.429920128Z ERROR: Failed to identify the container image with node tools.
[must-gather-pcsww] POD 2024-05-02T17:16:23.429940952Z INFO: Node performance data collection will not contain node level data.
[must-gather-pcsww] POD 2024-05-02T17:16:23.430826233Z INFO: Node performance data collection complete.
[must-gather-pcsww] OUT waiting for gather to complete
[must-gather-pcsww] OUT downloading gather output
WARNING: rsync command not found in path. Please use your package manager to install it.
[must-gather-pcsww] OUT ./timestamp
[must-gather-pcsww] OUT ./host_service_logs/masters/rpm-ostreed_service.log
[must-gather-pcsww] OUT ./host_service_logs/masters/machine-config-daemon-firstboot_service.log
[must-gather-pcsww] OUT ./host_service_logs/masters/ostree-finalize-staged_service.log
[must-gather-pcsww] OUT ./host_service_logs/masters/crio_service.log
[must-gather-pcsww] OUT ./host_service_logs/masters/NetworkManager_service.log
[must-gather-pcsww] OUT ./host_service_logs/masters/ovs-configuration_service.log
[must-gather-pcsww] OUT ./host_service_logs/masters/machine-config-daemon-host_service.log
[must-gather-pcsww] OUT ./host_service_logs/masters/ovs-vswitchd_service.log
[must-gather-pcsww] OUT ./host_service_logs/masters/openvswitch_service.log
[must-gather-pcsww] OUT ./host_service_logs/masters/ovsdb-server_service.log
[must-gather-pcsww] OUT ./host_service_logs/masters/kubelet_service.log
[must-gather-pcsww] OUT ./nodes/debug
[must-gather-pcsww] OUT ./event-filter.html
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1alpha1.ramendr.openshift.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1.submariner.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v2.operators.coreos.com.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1beta1.snapshot.storage.k8s.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1.velero.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1alpha2.operators.coreos.com.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1.events.k8s.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1.scheduling.k8s.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1..yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1.operators.coreos.com.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1.policy.open-cluster-management.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v2.autoscaling.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1.work.open-cluster-management.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1.node.k8s.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1alpha1.multicluster.x-k8s.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1alpha1.replication.storage.openshift.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1.ceph.rook.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1beta1.policy.open-cluster-management.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1.autoscaling.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1alpha1.csiaddons.openshift.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1.discovery.k8s.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1alpha1.cluster.open-cluster-management.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1alpha1.apps.open-cluster-management.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1.batch.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1beta2.flowcontrol.apiserver.k8s.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1beta3.flowcontrol.apiserver.k8s.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1.apps.open-cluster-management.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1.admissionregistration.k8s.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1alpha1.submariner.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1.packages.operators.coreos.com.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1.operator.open-cluster-management.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1alpha1.operators.coreos.com.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1alpha1.volsync.backube.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1.apiextensions.k8s.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1alpha1.objectbucket.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1.authentication.k8s.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1.policy.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v2alpha1.velero.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1.storage.k8s.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1.authorization.k8s.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1.snapshot.storage.k8s.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1.apps.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1.rbac.authorization.k8s.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1.certificates.k8s.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1.networking.k8s.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/apiregistration.k8s.io/apiservices/v1.coordination.k8s.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/operators.coreos.com/olmconfigs/cluster.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/operators.coreos.com/operators/ramen-dr-cluster-operator.ramen-system.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/core/nodes/dr1.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/core/persistentvolumes/pvc-23e31759-6ffe-4977-b70a-e3b2a5c103e0.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/core/persistentvolumes/pvc-02987ebf-1b0c-42ce-b3a2-7b5efe1de612.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/snapshot.storage.k8s.io/volumesnapshotclasses/csi-hostpath-snapclass.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/storage.k8s.io/volumeattachments/csi-b42acd6adc1cc5095f2c2e3b4e93fd531aa635c772cec28ed8165eb31e23cdd9.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/storage.k8s.io/csinodes/dr1.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/storage.k8s.io/csidrivers/rook-ceph.rbd.csi.ceph.com.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/storage.k8s.io/csidrivers/rook-ceph.cephfs.csi.ceph.com.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/storage.k8s.io/csidrivers/hostpath.csi.k8s.io.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/storage.k8s.io/storageclasses/standard.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/storage.k8s.io/storageclasses/rook-ceph-block.yaml
[must-gather-pcsww] OUT ./cluster-scoped-resources/storage.k8s.io/storageclasses/csi-hostpath-sc.yaml
[must-gather-pcsww] OUT ./version
[must-gather-pcsww] OUT ./namespaces/ramen-system/operators.coreos.com/subscriptions/ramen-dr-cluster-subscription.yaml
[must-gather-pcsww] OUT ./namespaces/ramen-system/operators.coreos.com/operatorgroups/ramen-operator-group.yaml
[must-gather-pcsww] OUT ./namespaces/operators/operators.coreos.com/operatorgroups/global-operators.yaml
[must-gather-pcsww] OUT ./namespaces/olm/operators.coreos.com/clusterserviceversions/packageserver.yaml
[must-gather-pcsww] OUT ./namespaces/olm/operators.coreos.com/operatorconditions/packageserver.yaml
[must-gather-pcsww] OUT ./namespaces/olm/operators.coreos.com/catalogsources/operatorhubio-catalog.yaml
[must-gather-pcsww] OUT ./namespaces/olm/operators.coreos.com/operatorgroups/olm-operators.yaml
[must-gather-pcsww] OUT ./pod_network_connectivity_check/podnetworkconnectivitychecks.yaml
[must-gather-pcsww] OUT ./network_logs/net-attach-def
[must-gather-pcsww] OUT ./network_logs/overlappingrangeipreservations.whereabouts.cni.cncf.io
[must-gather-pcsww] OUT ./network_logs/multi-networkpolicy
[must-gather-pcsww] OUT ./network_logs/cluster_scale
[must-gather-pcsww] OUT ./network_logs/ippools.whereabouts.cni.cncf.io
Ignoring the following flags because they only apply to rsync: -z
[must-gather ] OUT namespace/openshift-must-gather-4zkbd deleted
[must-gather ] OUT clusterrolebinding.rbac.authorization.k8s.io/must-gather-9r7n9 deleted
Reprinting Cluster State:
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information:
error getting cluster version: the server could not find the requested resource (get clusterversions.config.openshift.io version)
ClusterID:
ClientVersion: 4.15.0-202403061939.p0.gd6175eb.assembly.stream.el8-d6175eb
ClusterVersion: Installing "" for <unknown>: <unknown>
error getting cluster operators: the server could not find the requested resource (get clusteroperators.config.openshift.io)
ClusterOperators:
clusteroperators are missing
real 0m40.417s
user 0m0.174s
sys 0m0.075s
But it did not collect much data:
$ du -sh must-gather.local.890950841412942527
356K must-gather.local.890950841412942527
$ tree must-gather.local.890950841412942527 | tail -1
44 directories, 90 files
Gathered data: must-gather.local.890950841412942527.tar.gz
Gathered data structure:
├── quay-io-openshift-origin-must-gather-sha256-a9f3d2f463ef11da0debde26ef99766c391ba97dee4094405b75abc3a548c749
│ ├── cluster-scoped-resources
│ │ ├── apiregistration.k8s.io
│ │ │ └── apiservices
│ │ │ ├── v1.admissionregistration.k8s.io.yaml
...
│ │ ├── core
│ │ │ ├── nodes
│ │ │ │ └── dr1.yaml
│ │ │ └── persistentvolumes
│ │ │ ├── pvc-02987ebf-1b0c-42ce-b3a2-7b5efe1de612.yaml
│ │ │ └── pvc-23e31759-6ffe-4977-b70a-e3b2a5c103e0.yaml
│ │ ├── operators.coreos.com
│ │ │ ├── olmconfigs
│ │ │ │ └── cluster.yaml
│ │ │ └── operators
│ │ │ └── ramen-dr-cluster-operator.ramen-system.yaml
│ │ ├── snapshot.storage.k8s.io
│ │ │ └── volumesnapshotclasses
│ │ │ └── csi-hostpath-snapclass.yaml
│ │ └── storage.k8s.io
│ │ ├── csidrivers
│ │ │ ├── hostpath.csi.k8s.io.yaml
│ │ │ ├── rook-ceph.cephfs.csi.ceph.com.yaml
│ │ │ └── rook-ceph.rbd.csi.ceph.com.yaml
│ │ ├── csinodes
│ │ │ └── dr1.yaml
│ │ ├── storageclasses
│ │ │ ├── csi-hostpath-sc.yaml
│ │ │ ├── rook-ceph-block.yaml
│ │ │ └── standard.yaml
│ │ └── volumeattachments
│ │ └── csi-b42acd6adc1cc5095f2c2e3b4e93fd531aa635c772cec28ed8165eb31e23cdd9.yaml
...
│ ├── namespaces
│ │ ├── olm
│ │ │ └── operators.coreos.com
│ │ │ ├── catalogsources
│ │ │ │ └── operatorhubio-catalog.yaml
│ │ │ ├── clusterserviceversions
│ │ │ │ └── packageserver.yaml
│ │ │ ├── operatorconditions
│ │ │ │ └── packageserver.yaml
│ │ │ └── operatorgroups
│ │ │ └── olm-operators.yaml
│ │ ├── operators
│ │ │ └── operators.coreos.com
│ │ │ └── operatorgroups
│ │ │ └── global-operators.yaml
│ │ └── ramen-system
│ │ └── operators.coreos.com
│ │ ├── operatorgroups
│ │ │ └── ramen-operator-group.yaml
│ │ └── subscriptions
│ │ └── ramen-dr-cluster-subscription.yaml
Summary:
We need to try the ODF must-gather image; hopefully it is public.
When starting an environment fails in the CI environment, we don't have a good way to debug the issue. We need to collect data from the failed cluster so the failure can be understood later. The data can be published for a few days as a build artifact.
Use cases
CI unattended build
When an unattended build fails, we want to delete the environment quickly and reuse it to run the next job. Without collecting data from the failed system we cannot analyze the failure. We can try to reproduce the issue in a local test environment, but this will not help with random errors.
Debugging a system
When debugging a system we can inspect it manually, but this is very hard and time consuming. It is much easier to gather all the data and grep the local files.
Creating a snapshot of the system
When debugging an issue, you may want to take a snapshot of the system before an operation, perform the operation, and compare the state of the system with the state before the operation. This can be done manually for a few resources using kubectl, but in the time it takes to copy a few resources manually, you can gather all resources from the entire cluster.
Getting help from people in different time zone
When you have an issue with a system, you can wait a few hours for help until someone wakes up on the other side of the planet, or gather everything from the cluster and recreate your environment.
Helping upstream users
On OpenShift, users can use oc adm must-gather. There seems to be no similar tool for upstream Kubernetes.
Data to collect
Resources
It is hard to tell which resources are needed to debug an issue, and many resources are well hidden (there is no way to discover them without knowing about the kind). We will collect all resources from the entire system, including non-namespaced resources.
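A minimal sketch of collecting everything, using kubectl api-resources to discover both namespaced and non-namespaced kinds. This is an illustration, not drenv code; it assumes kubectl is in PATH and a valid context name (e.g. dr1):

```python
# Sketch: dump all resources, namespaced and non-namespaced, from one cluster.
import subprocess


def api_resources(context, namespaced):
    """Discover the names of all API resources that support "get"."""
    out = subprocess.check_output(
        [
            "kubectl", "api-resources",
            "--context", context,
            "--namespaced=" + ("true" if namespaced else "false"),
            "--verbs=get",
            "-o", "name",
        ],
        text=True,
    )
    return out.splitlines()


def gather_command(resource, context, namespaced):
    """Build the kubectl command dumping one resource kind as YAML."""
    cmd = ["kubectl", "get", resource, "--context", context, "-o", "yaml"]
    if namespaced:
        cmd.append("--all-namespaces")
    return cmd


def gather_resources(context):
    """Dump every discoverable resource, one kubectl call per kind."""
    for namespaced in (False, True):
        for resource in api_resources(context, namespaced):
            # Some kinds may fail (permissions, broken aggregated APIs);
            # ignore failures and keep going.
            subprocess.run(
                gather_command(resource, context, namespaced),
                capture_output=True,
                text=True,
            )
```

One call per kind is slow but robust: a single broken aggregated API does not abort the whole gather.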
Logs
It is hard to tell which logs are needed to debug an issue. We will collect all logs from all pods in the system.
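A sketch of collecting all pod logs, including logs of previous container instances. It assumes kubectl is in PATH; the local directory layout is my own choice, not an existing convention:

```python
# Sketch: collect current and previous logs of every container in the cluster.
import os
import subprocess


def pod_containers(context):
    """Yield (namespace, pod, container) for every container in the cluster."""
    out = subprocess.check_output(
        [
            "kubectl", "get", "pods",
            "--context", context,
            "--all-namespaces",
            "-o",
            'jsonpath={range .items[*]}{.metadata.namespace} {.metadata.name} '
            '{.spec.containers[*].name}{"\\n"}{end}',
        ],
        text=True,
    )
    for line in out.splitlines():
        namespace, pod, *containers = line.split()
        for container in containers:
            yield namespace, pod, container


def log_path(base, namespace, pod, container, which="current"):
    """Local file name for one container log."""
    return os.path.join(base, namespace, pod, f"{container}.{which}.log")


def gather_logs(context, base):
    for namespace, pod, container in pod_containers(context):
        for which, extra in (("current", []), ("previous", ["--previous"])):
            path = log_path(base, namespace, pod, container, which)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "w") as f:
                # --previous fails if the container never restarted; ignore.
                subprocess.run(
                    ["kubectl", "logs", f"--context={context}",
                     f"--namespace={namespace}", pod,
                     f"--container={container}", *extra],
                    stdout=f,
                    stderr=subprocess.DEVNULL,
                )
```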
Nodes
Commands
Not sure we can run sos on minikube nodes; it also creates huge reports and is very slow, but we can use some of the commands it uses to collect basic data about a failed system.
General info about the system. Can run via minikube ssh or kubectl debug:
top -o %CPU from all nodes (see #1282)
top -o %MEM from all nodes (see #1282)
minikube
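A sketch of collecting such node info via minikube ssh. The profile name and the exact command list are assumptions, not a final selection:

```python
# Sketch: run basic diagnostic commands on a minikube node via "minikube ssh".
import subprocess

# Example commands for general info about a node; -b runs top in batch mode.
NODE_COMMANDS = {
    "top-cpu": "top -b -n1 -o %CPU | head -30",
    "top-mem": "top -b -n1 -o %MEM | head -30",
    "df": "df -h",
    "dmesg": "dmesg | tail -100",
}


def run_on_node(profile, command):
    """Run a shell command on the minikube node, returning its output."""
    return subprocess.check_output(
        ["minikube", "ssh", "--profile", profile, command],
        text=True,
    )


def gather_node_info(profile):
    """Collect output of every command, tolerating individual failures."""
    results = {}
    for name, command in NODE_COMMANDS.items():
        try:
            results[name] = run_on_node(profile, command)
        except (subprocess.CalledProcessError, OSError) as e:
            results[name] = f"ERROR: {e}"
    return results
```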
submariner
We can use subctl commands to get info about the health of the cluster.
Submariner also includes a gather command, subctl gather all, but using it will probably collect the same info we already collect.
kubevirt
Maybe use https://github.com/kubevirt/must-gather?
rook
No gather tool.
Thread in rook slack: https://rook-io.slack.com/archives/C46Q5UC05/p1711399331728259
We can open an issue to add this to rook-ceph plugin.
We can use rbd and ceph commands via the rook-ceph-tools pod to get info about the health of the system:
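As a sketch, assuming the standard rook-ceph-tools deployment exists in the rook-ceph namespace (the command list, and the pool name in the rbd example, are my assumptions):

```python
# Sketch: run ceph/rbd health commands in the rook-ceph-tools pod.
import subprocess

# Example commands; "replicapool" is an assumed pool name.
CEPH_COMMANDS = [
    "ceph status",
    "ceph health detail",
    "ceph osd tree",
    "rbd ls replicapool",
]


def ceph_command(context, command):
    """Build the kubectl exec command running one command in the tools pod."""
    return [
        "kubectl", "exec",
        "--context", context,
        "--namespace", "rook-ceph",
        "deploy/rook-ceph-tools",
        "--",
        *command.split(),
    ]


def gather_ceph_info(context):
    """Run every command, ignoring individual failures."""
    for command in CEPH_COMMANDS:
        subprocess.run(ceph_command(context, command),
                       capture_output=True, text=True)
```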
ocm
No gather tool.
Output format
For the long term we want to be compatible with oc adm must-gather, so if we create tools for analyzing gathered data, we can use the same tools upstream and downstream.
We don't know if oc adm must-gather works upstream (e.g. on minikube), or if it is quick enough for development purposes.
It is not clear how oc adm must-gather can be used to collect custom data. It looks like we need to run the tool several times with different images.
We will start with a simple solution and check integrating with or using oc adm must-gather later.
Testing
I think testing the tool as part of e2e will be the most useful test.