[BUG] Operator image-registry is in a degraded state (ServiceUnavailable when get buildconfigs.build.openshift.io)

cameronkerrnz commented 3 years ago

General information

OS: Windows
Hypervisor: Hyper-V
Did you run crc setup before starting it (Yes/No)? Yes
Running CRC on: Laptop

CRC version

CodeReady Containers version: 1.24.0+5f06e84b
OpenShift version: 4.7.2 (embedded in executable)

CRC status

DEBU CodeReady Containers version: 1.24.0+5f06e84b
DEBU OpenShift version: 4.7.2 (embedded in executable)
DEBU Running 'crc status'
DEBU Checking file: C:\Users\ME\.crc\machines\crc\.crc-exist
DEBU Checking file: C:\Users\ME\.crc\machines\crc\.crc-exist
DEBU [executing ==>] : C:\WINDOWS\System32\WindowsPowerShell\v1.0\powershell.exe -NoProfile -NonInteractive ( Hyper-V\Get-VM crc ).state
DEBU [stdout =====>] : Running
DEBU [stderr =====>] :
DEBU Running SSH command: df -B1 --output=size,used,target /sysroot | tail -1
DEBU Using ssh private keys: [C:\Users\ME\.crc\machines\crc\id_ecdsa C:\Users\ME\.crc\cache\crc_hyperv_4.7.2\id_ecdsa_crc]
DEBU SSH command results: err: <nil>, output: 32737570816 20896116736 /sysroot
DEBU Running SSH command: <hidden>
DEBU SSH command succeeded
DEBU image-registry operator is degraded, Reason: ImagePrunerJobFailed
DEBU Unexpected operator status for network: ManagementStateDegraded
CRC VM:          Running
OpenShift:       Degraded (v4.7.2)
Disk Usage:      20.9GB of 32.74GB (Inside the CRC VM)
Cache Usage:     0B
Cache Directory: C:\Users\ME\.crc\cache

CRC config

- consent-telemetry                     : yes
- enable-cluster-monitoring             : true
- memory                                : 16384
- network-mode                          : vsock

Host Operating System

Host Name:                 REDACTED
OS Name:                   Microsoft Windows 10 Enterprise
OS Version:                10.0.19042 N/A Build 19042
OS Manufacturer:           Microsoft Corporation
OS Configuration:          Member Workstation
OS Build Type:             Multiprocessor Free
Registered Owner:          REDACTED
Registered Organization:   REDACTED
Product ID:                00329-00000-00003-AA300
Original Install Date:     21/01/2021, 5:13:35 pm
System Boot Time:          2/04/2021, 9:53:21 am
System Manufacturer:       Dell Inc.
System Model:              Precision 3551
System Type:               x64-based PC
Processor(s):              1 Processor(s) Installed.
                           [01]: Intel64 Family 6 Model 165 Stepping 2 GenuineIntel ~2400 Mhz
BIOS Version:              Dell Inc. 1.3.1, 29/09/2020
Windows Directory:         C:\WINDOWS
System Directory:          C:\WINDOWS\system32
Boot Device:               \Device\HarddiskVolume1
System Locale:             en-us;English (United States)
Input Locale:              00000481
Time Zone:                 (UTC+12:00) Auckland, Wellington
Total Physical Memory:     32,365 MB
Available Physical Memory: 5,281 MB
Virtual Memory: Max Size:  47,258 MB
Virtual Memory: Available: 5,021 MB
Virtual Memory: In Use:    42,237 MB
Page File Location(s):     C:\pagefile.sys
Domain:                    REDACTED
Logon Server:              REDACTED
Hotfix(s):                 7 Hotfix(s) Installed.
                           [01]: KB4601050
                           [02]: KB4562830
                           [03]: KB4570334
                           [04]: KB4577266
                           [05]: KB4580325
                           [06]: KB5000802
                           [07]: KB5000858
Network Card(s):           7 NIC(s) Installed.
                           [01]: Intel(R) Wi-Fi 6 AX201 160MHz
                                 Connection Name: Wi-Fi
                                 DHCP Enabled:    Yes
                                 DHCP Server:     192.168.1.1
                                 IP address(es)
                                 [01]: 192.168.1.236
                           [02]: Intel(R) Ethernet Connection (11) I219-LM
                                 Connection Name: Ethernet
                                 Status:          Media disconnected
                           [03]: Bluetooth Device (Personal Area Network)
                                 Connection Name: Bluetooth Network Connection
                                 Status:          Media disconnected
                           [04]: Hyper-V Virtual Ethernet Adapter
                                 Connection Name: vEthernet (Default Switch)
                                 DHCP Enabled:    No
                                 IP address(es)
                                 [01]: 172.28.128.1
                                 [02]: fe80::c1b:b9c9:b321:bfef
                           [05]: Realtek USB GbE Family Controller
                                 Connection Name: Ethernet 3
                                 Status:          Media disconnected
                           [06]: Hyper-V Virtual Ethernet Adapter
                                 Connection Name: vEthernet (WSL)
                                 DHCP Enabled:    No
                                 IP address(es)
                                 [01]: 172.31.32.1
                                 [02]: fe80::a000:83b:c629:dc8c
                           [07]: Cisco AnyConnect Secure Mobility Client Virtual Miniport Adapter for Windows x64
                                 Connection Name: Ethernet 4
                                 Status:          Hardware not present
Hyper-V Requirements:      A hypervisor has been detected. Features required for Hyper-V will not be displayed.

Steps to reproduce

crc start will take at least 10 minutes to complete
'INFO Operator image-registry is degraded' will be output 15 times
'ERRO Cluster is not ready: cluster operators are still not stable after 10m3.9363243s'

Expected

Working cluster with an image registry I can be confident is in a workable state.

Actual

I do appear to have a working cluster, but it took a long time to start, and the degraded state appears to be due to issues with the image registry (specifically the image-pruner CronJob).

Logs

Before gathering the logs, try the following to see if it fixes your issue:

$ crc delete -f
$ crc cleanup
$ crc setup
$ crc start --log-level debug

Yes, it still happens. Here are some diagnostics:

PS> oc get co
NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
…
image-registry                             4.7.2     True        False         True       4d
…

PS> oc get pods -n openshift-image-registry
NAME                                               READY     STATUS      RESTARTS   AGE
cluster-image-registry-operator-54778dff55-4mrlq   1/1       Running     0          17d
image-pruner-1617408000-jftmp                      0/1       Completed   0          3d 
image-pruner-1617667200-gr7q6                      0/1       Error       0          41m
image-registry-76dbb65978-pvvt6                    1/1       Running     0          4d 
node-ca-6csl8                                      1/1       Running     0          18d

PS> oc get pv
NAME      CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM                                                 STORAGECLASS   REASON    AGE
pv0001    100Gi      RWO,ROX,RWX    Recycle          Bound       openshift-image-registry/crc-image-registry-storage                            17d
…

PS> oc logs pod/image-pruner-1617667200-gr7q6 -n openshift-image-registry
Error from server (ServiceUnavailable): the server is currently unable to handle the request (get buildconfigs.build.openshift.io)

I suspect this happens because image-pruner is a CronJob that fires before the cluster is stable.

If this is a CronJob, do I just work around this issue by deleting the Job and Pod that it created? Would that be enough to get the cluster out of the 'degraded' state? I have not worked with CronJob objects in Kubernetes before.

praveenkumar commented 3 years ago

@cameronkerrnz Can you paste the output of oc describe pod/image-pruner-1617667200-gr7q6 -n openshift-image-registry? It should include an event that tells us why the error happened for this pruner pod. You can also try deleting that pod manually to see whether a new one is created and works as expected: oc delete pod/image-pruner-1617667200-gr7q6 -n openshift-image-registry

cameronkerrnz commented 3 years ago

PS> oc describe pod/image-pruner-1617667200-gr7q6 -n openshift-image-registry
Name:         image-pruner-1617667200-gr7q6
Namespace:    openshift-image-registry
Priority:     0
Node:         crc-rsppg-master-0/192.168.126.11
Start Time:   Tue, 06 Apr 2021 12:13:59 +1200
Labels:       controller-uid=ab13eb6d-9f08-4a29-8c01-d881378642dd
              job-name=image-pruner-1617667200
Annotations:  k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "10.217.0.63"
                    ],
                    "default": true,
                    "dns": {}
                }]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "",
                    "interface": "eth0",
                    "ips": [
                        "10.217.0.63"
                    ],
                    "default": true,
                    "dns": {}
                }]
              openshift.io/scc: restricted
Status:       Failed
IP:           10.217.0.63
IPs:
  IP:           10.217.0.63
Controlled By:  Job/image-pruner-1617667200
Containers:
  image-pruner:
    Container ID:  cri-o://597e972955d090f224e453218cd7550d0c0c7b463f187d5bb2b10d5f5fbce41f
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:50ca35f6d1b839a790d657d298db7f32844ece8da14f99f2b55e5b6d5642fd2a
    Image ID:      image-registry.openshift-image-registry.svc:5000/openshift/cli@sha256:50ca35f6d1b839a790d657d298db7f32844ece8da14f99f2b55e5b6d5642fd2a
    Port:          <none>
    Host Port:     <none>
    Command:
      oc
    Args:
      adm
      prune
      images
      --confirm=true
      --certificate-authority=/var/run/configmaps/serviceca/service-ca.crt
      --keep-tag-revisions=3
      --keep-younger-than=60m
      --ignore-invalid-refs=true
      --loglevel=1
      --prune-registry=true
      --registry-url=https://image-registry.openshift-image-registry.svc:5000
    State:      Terminated
      Reason:   Error
      Message:  Error from server (ServiceUnavailable): the server is currently unable to handle the request (get buildconfigs.build.openshift.io)

      Exit Code:    1
      Started:      Tue, 06 Apr 2021 12:14:17 +1200
      Finished:     Tue, 06 Apr 2021 12:14:18 +1200
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:        100m
      memory:     256Mi
    Environment:  <none>
    Mounts:
      /var/run/configmaps/serviceca from serviceca (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from pruner-token-n2x2s (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  serviceca:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      serviceca
    Optional:  false
  pruner-token-n2x2s:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  pruner-token-n2x2s
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:          <none>
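
A note on timing, offered as a rough sketch rather than a confirmed diagnosis: assuming the numeric suffix of the Job name is the Unix timestamp of the scheduled run (an assumption about how the CronJob controller names Jobs in this Kubernetes release, not something the output above states), the Job name can be compared with the pod's Start Time to see how late the run actually started. The function names below are hypothetical helpers, not part of oc or crc.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def scheduled_time(job_name: str) -> datetime:
    """Decode the trailing number of a Job name as a Unix timestamp.

    ASSUMPTION: the suffix is the scheduled time in seconds since the epoch.
    """
    suffix = job_name.rsplit("-", 1)[1]
    return datetime.fromtimestamp(int(suffix), tz=timezone.utc)


def start_delay_seconds(job_name: str, start_time: str) -> float:
    """Seconds between the assumed scheduled time and the pod's Start Time.

    start_time uses the format printed by 'oc describe',
    e.g. 'Tue, 06 Apr 2021 12:13:59 +1200'.
    """
    started = parsedate_to_datetime(start_time)
    return (started - scheduled_time(job_name)).total_seconds()


if __name__ == "__main__":
    job = "image-pruner-1617667200"
    print(scheduled_time(job).isoformat())
    print(start_delay_seconds(job, "Tue, 06 Apr 2021 12:13:59 +1200"))
```

With the values from this report, the suffix decodes to 2021-04-06 00:00 UTC and the pod started about 14 minutes after that. Under the stated assumption, this is consistent with the run being started late while the cluster was still coming up, but it does not prove it.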

After deleting the pod as requested, the cluster still shows as degraded, but I'm not sure how it determines that.

Deleting the Job (that the CronJob created) does seem to be sufficient to clear the degraded state of the operator:

PS> oc delete job image-pruner-1617667200 -n openshift-image-registry
job.batch "image-pruner-1617667200" deleted
PS> oc get co image-registry
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
image-registry   4.7.2     True        False         False      4d7h
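
For anyone who wants to script this workaround, here is a minimal sketch. It assumes the text layout of oc get pods shown earlier in this issue and that a pod name is the Job name plus a random suffix (as the job-name label in the describe output suggests). The function name failed_pruner_jobs is hypothetical; the script only prints the oc delete job commands and does not run them.

```python
from typing import List

# Sample taken from the 'oc get pods -n openshift-image-registry' output in this issue.
SAMPLE = """\
NAME                                               READY     STATUS      RESTARTS   AGE
cluster-image-registry-operator-54778dff55-4mrlq   1/1       Running     0          17d
image-pruner-1617408000-jftmp                      0/1       Completed   0          3d
image-pruner-1617667200-gr7q6                      0/1       Error       0          41m
image-registry-76dbb65978-pvvt6                    1/1       Running     0          4d
node-ca-6csl8                                      1/1       Running     0          18d
"""


def failed_pruner_jobs(pods_output: str) -> List[str]:
    """Return Job names for image-pruner pods whose STATUS column is 'Error'."""
    jobs: List[str] = []
    for line in pods_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue
        name, status = fields[0], fields[2]
        if name.startswith("image-pruner-") and status == "Error":
            # ASSUMPTION: pod name is '<job-name>-<random suffix>'; drop the suffix.
            job = name.rsplit("-", 1)[0]
            if job not in jobs:
                jobs.append(job)
    return jobs


if __name__ == "__main__":
    for job in failed_pruner_jobs(SAMPLE):
        # Print the same command used above instead of executing it.
        print(f"oc delete job {job} -n openshift-image-registry")
```

Printing rather than executing keeps the sketch safe to run anywhere; the printed command matches the one used manually above.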

Thanks, Cameron

praveenkumar commented 3 years ago

@cameronkerrnz Thanks, I will take a closer look into this as soon as I am able to reproduce it 👍🏼

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
