crc-org / crc

CRC is a tool to help you run containers. It manages a local OpenShift 4.x cluster, MicroShift, or a Podman VM optimized for testing and development purposes.
https://crc.dev
Apache License 2.0

[BUG] crc start error doesn't exit with non-zero exit code #4284

Status: Open · opened by bobbygryzynger 1 month ago

bobbygryzynger commented 1 month ago

General information

CRC version

CRC version: 2.22.1+e8068b4
OpenShift version: 4.13.3
Podman version: 4.4.4

CRC status

$ crc status --log-level debug
DEBU CRC version: 2.22.1+e8068b4                  
DEBU OpenShift version: 4.13.3                    
DEBU Podman version: 4.4.4                        
DEBU Running 'crc status'                         
CRC VM:          Running
OpenShift:       Starting (v4.13.3)
RAM Usage:       9.412GB of 16.77GB
Disk Usage:      20.8GB of 79.93GB (Inside the CRC VM)
Cache Usage:     202.3GB
Cache Directory: /home/bgryzyng/.crc/cache

CRC config

$ crc config view
- consent-telemetry                     : yes
- cpus                                  : 6
- disk-size                             : 75
- enable-cluster-monitoring             : true
- host-network-access                   : true
- memory                                : 16384
- network-mode                          : user

Host Operating System

$ cat /etc/os-release
NAME="Fedora Linux"
...

Steps to reproduce

  1. Run crc start
  2. If an error occurs, a non-zero exit code is not provided

Expected

If an error occurs, the exit code should be 1 or greater

Actual

The exit code is zero

Logs

...
INFO 3 operators are progressing: authentication, kube-apiserver, monitoring 
INFO 3 operators are progressing: authentication, kube-apiserver, monitoring 
INFO 3 operators are progressing: authentication, kube-apiserver, monitoring 
INFO 3 operators are progressing: authentication, kube-apiserver, monitoring 
INFO 3 operators are progressing: authentication, kube-apiserver, monitoring 
ERRO Cluster is not ready: cluster operators are still not stable after 10m20.6920432s 
INFO Adding crc-admin and crc-developer contexts to kubeconfig... 
Started the OpenShift cluster.
...
praveenkumar commented 1 month ago

ERRO Cluster is not ready: cluster operators are still not stable after 10m20.6920432s

@bobbygryzynger We display this as an error, but it is actually closer to a warning (we might change the messaging). Sometimes, due to slow I/O or certificate regeneration, operator reconciliation takes longer than expected, but that doesn't mean the cluster is completely unusable.

bobbygryzynger commented 1 month ago

Thanks @praveenkumar, understood. For this particular error, though, the cluster remained unstable afterward.

A suggestion: make anything logged at error level produce a non-zero exit code. This particular message could just be a warning, but my experience with it suggests it should still be an error. Maybe it could remain an error with a logged suggestion on how to increase the timeout (if that's possible).

praveenkumar commented 1 month ago

but my experience with it suggests it should still be an error.

@bobbygryzynger Because for you the operators never became stable?

bobbygryzynger commented 1 month ago

@praveenkumar, that's right. When I saw this, even after waiting a bit, the operators were still unstable.

praveenkumar commented 1 month ago

@praveenkumar, that's right. When I saw this, even after waiting a bit, the operators were still unstable.

In that case, if you are able to access the kube API, try using https://docs.openshift.com/container-platform/4.16/support/troubleshooting/troubleshooting-operator-issues.html to see why an operator is not stable.

bobbygryzynger commented 1 month ago

@praveenkumar the operators being unstable is really a separate issue from what I'm requesting here. All I'd really like to see is that when errors occur, a non-zero exit code is produced so that my scripts can pick up on it.
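Until the messaging or exit-code behavior changes, a script has to inspect the log output itself. A minimal sketch of such a workaround, assuming `crc start` logs failures on lines beginning with `ERRO` (as in the log excerpt above) while still exiting 0:

```shell
# Hypothetical wrapper: treat `crc start` as failed if it either exits
# non-zero or its log contains an ERRO-level line, since crc currently
# exits 0 even when cluster operators never stabilized.
crc_start_checked() {
  local log rc
  log=$(crc start 2>&1)
  rc=$?
  printf '%s\n' "$log"
  # Fall back to scraping the log for error-level lines.
  if [ "$rc" -ne 0 ] || printf '%s\n' "$log" | grep -q '^ERRO'; then
    return 1
  fi
  return 0
}
```

This is brittle (it depends on crc's log format), which is exactly why a reliable exit code would be preferable.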

praveenkumar commented 1 month ago

@bobbygryzynger Right, and as I said, what you observe as an error should be a warning; it is an issue with our messaging. Apart from that, we already return a non-zero exit code when an error happens.

bobbygryzynger commented 1 month ago

It seems to me that the cluster being unstable is worth an error, but I won't belabor the point.

praveenkumar commented 1 month ago

@bobbygryzynger I am reopening this issue since we haven't changed the messaging yet.

It seems to me that the cluster being unstable is worthy of an error, but I won't belabor the point.

At some point I do agree, but if we error out and skip the next steps, which actually update the kubeconfig with the user contexts, then there is no easy way to access the cluster API to debug which clusteroperator is misbehaving. That's why I consider this a warning and let the user use the API for debugging. The misbehaving operator may not even be required by the user, in which case this can be ignored.

cfergeau commented 2 days ago

At some point I do agree but if we error out and not execute next steps

I think the request is that when this "cluster not ready" message appears, echo $? should be non-zero after crc completes. I don't think this asks for crc to exit right away, so it could still try to update ~/.kube/config.
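In other words, the desired contract is: run all steps to completion (including the kubeconfig update), then report the degraded state via the exit status. A sketch of what calling scripts would then be able to rely on, using a generic command runner (the wrapper and its message are illustrative, not part of crc):

```shell
# Hypothetical pattern: run a command to completion, then branch on its
# exit status rather than parsing its log output.
run_and_report() {
  "$@"                         # e.g. crc start
  local rc=$?
  if [ "$rc" -ne 0 ]; then
    echo "cluster came up degraded (exit $rc); kubeconfig should still be updated"
  fi
  return "$rc"
}
```

With crc's current behavior the non-zero branch is never taken for the "cluster not ready" case, which is the gap this issue describes.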