crc-org / crc

CRC is a tool to help you run containers. It manages a local OpenShift 4.x cluster, Microshift, or a Podman VM optimized for testing and development purposes.
https://crc.dev
Apache License 2.0

[BUG] After stopping CRC the Kube context is left in inconsistent state causing timeouts #1569

Open deboer-tim opened 4 years ago

deboer-tim commented 4 years ago

General information

CRC version

CodeReady Containers version: 1.15.0+e317bed
OpenShift version: 4.5.7 (embedded in binary)

CRC status

DEBU CodeReady Containers version: 1.15.0+e317bed
DEBU OpenShift version: 4.5.7 (embedded in binary)
CRC VM:          Stopped
OpenShift:       Stopped
Disk Usage:      0B of 0B (Inside the CRC VM)
Cache Usage:     12.8GB
Cache Directory: /Users/deboer/.crc/cache

CRC config

no output

Host Operating System

ProductName:    Mac OS X
ProductVersion: 10.15.6
BuildVersion:   19G2021

Steps to reproduce

  1. crc start
  2. crc stop
  3. kubectl get pods, odo push, or basically anything that uses the kube context

Expected

If I connect to a remote OpenShift cluster or use other local Kube tools and then disconnect/stop, the Kube context is left pointing to a cluster that I can't connect to anymore, but it 'fails fast': tools that try to connect fail immediately.

e.g. after stopping minikube and running 'kubectl get pods', it immediately responds with: "The connection to the server localhost:8080 was refused - did you specify the right host or port?" I expect CRC to have the same behaviour.

Actual

After stopping CRC the Kube context is left pointing to a cluster (api-crc-testing or api.crc.testing) on a bridge network (192.168.*). For some reason clients can't tell this host doesn't exist anymore, so connections to it don't fail fast and eventually time out on the client side. This is bad enough with kubectl (20s timeout?), but odo has an even longer timeout (4min?), which makes it unusable and makes it appear to hang.

When stopping CRC please remove the kube context, remove the bridge network, remove the host resolution, or do something similar so that clients can tell it doesn't exist or will fail immediately trying to connect.
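For what it's worth, manually clearing the selected context already gives the fail-fast behaviour I'm asking for. A rough sketch (assumes the default ~/.kube/config and no other context selected; output may vary by kubectl version):

$ crc stop
...
$ kubectl config unset current-context
Property "current-context" unset.
$ kubectl get pods
The connection to the server localhost:8080 was refused - did you specify the right host or port?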

gbraad commented 4 years ago

When stopping CRC please remove the kube context, remove the bridge network, remove the host resolution, or do something similar so that clients can tell it doesn't exist or will fail immediately trying to connect.

@praveenkumar any idea what causes the response not to reply 'Host unreachable' or 'Connection refused'? Also, would removing the context be possible?

praveenkumar commented 4 years ago

I tested this on Linux (I will also check on the Mac), but I didn't see the long wait times described in the issue.

$ oc whoami
kube:admin

$ crc stop
INFO Stopping the OpenShift cluster, this may take a few minutes... 
Stopped the OpenShift cluster

$ time oc whoami -v=10
I1007 14:03:42.797261  693344 loader.go:375] Config loaded from file:  /home/prkumar/.kube/config
I1007 14:03:42.798023  693344 round_trippers.go:423] curl -k -v -XGET  -H "Accept: application/json, */*" -H "User-Agent: oc/openshift (linux/amd64) kubernetes/d7f3ccf" -H "Authorization: Bearer oUurQFo7e5xjPoz1h3QPFUGVBLL8tEaXBquoz9oaans" 'https://api.crc.testing:6443/apis/user.openshift.io/v1/users/~'
I1007 14:03:45.905233  693344 round_trippers.go:443] GET https://api.crc.testing:6443/apis/user.openshift.io/v1/users/~  in 3107 milliseconds
I1007 14:03:45.905329  693344 round_trippers.go:449] Response Headers:
I1007 14:03:45.905665  693344 helpers.go:234] Connection error: Get https://api.crc.testing:6443/apis/user.openshift.io/v1/users/~: dial tcp 192.168.130.11:6443: connect: no route to host
F1007 14:03:45.905769  693344 helpers.go:115] Unable to connect to the server: dial tcp 192.168.130.11:6443: connect: no route to host

real    0m3.233s
user    0m0.152s
sys 0m0.038s

$ time odo version -v=9
I1007 14:05:06.924601  693547 preference.go:165] The path for preference file is /home/prkumar/.odo/preference.yaml
I1007 14:05:06.924638  693547 occlient.go:448] Trying to connect to server api.crc.testing:6443
I1007 14:05:07.925073  693547 occlient.go:451] unable to connect to server: dial tcp 192.168.130.11:6443: i/o timeout
odo v1.1.3 (44440eeac)

real    0m1.106s
user    0m0.138s
sys 0m0.038s

deboer-tim commented 4 years ago

What I see is below: when the context points to a stopped docker-desktop (or any other context) it fails fast. The CRC context is fine while CRC is running, but times out after I stop CRC. Interestingly enough, if I switch the context to Minikube immediately after running CRC I see the same problem, but if I start Minikube and then stop it the problem goes away. This leads me to think there is some hyperkit/network cleanup that Minikube is doing but CRC is not.

deboer-mac:crc-macos-1.15.0-amd64 deboer$ kubectl config use-context docker-desktop
Switched to context "docker-desktop".
deboer-mac:crc-macos-1.15.0-amd64 deboer$ time kubectl get pods
The connection to the server kubernetes.docker.internal:6443 was refused - did you specify the right host or port?

real    0m0.062s
user    0m0.057s
sys 0m0.017s
deboer-mac:crc-macos-1.15.0-amd64 deboer$ ./crc start
...
Started the OpenShift cluster
WARN The cluster might report a degraded or error state. This is expected since several operators have been disabled to lower the resource usage. For more information, please consult the documentation
deboer-mac:crc-macos-1.15.0-amd64 deboer$ kubectl config use-context crc-admin
Switched to context "crc-admin".
deboer-mac:crc-macos-1.15.0-amd64 deboer$ time kubectl get pods
No resources found in default namespace.

real    0m2.165s
user    0m0.145s
sys 0m0.175s
deboer-mac:crc-macos-1.15.0-amd64 deboer$ ./crc stop
Stopping the OpenShift cluster, this may take a few minutes...
Stopped the OpenShift cluster
deboer-mac:crc-macos-1.15.0-amd64 deboer$ time kubectl get pods
Unable to connect to the server: dial tcp 192.168.64.2:6443: i/o timeout

real    0m30.209s
user    0m0.101s
sys 0m0.063s

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

cbolik commented 3 years ago

Found this bug entry after running into the same issue on my Mac with CRC 1.20.0. Running "kubectl get pods" failed with "Unable to connect to the server: dial tcp 192.168.64.2:6443: i/o timeout" after stopping CRC and logging into another k8s cluster. Thanks to @deboer-tim 's comment above I found I could fix the issue as follows:

  1. Determine the current context: `kubectl config current-context`. This was "sample-app/api-crc-testing:6443/kube:admin" for me.

  2. Get the list of available contexts and take note of the one you want to use: `kubectl config get-contexts`

  3. Switch to that context: `kubectl config use-context <context-name>`. Yes, `use-context`, not `set-context`, which does something different.

After this, kubectl get pods worked as expected again.
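Put together, the workaround looks roughly like this (the context names are the ones from this thread; substitute your own):

$ kubectl config current-context
sample-app/api-crc-testing:6443/kube:admin
$ kubectl config get-contexts
...
$ kubectl config use-context docker-desktop
Switched to context "docker-desktop".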

rohanKanojia commented 1 day ago

I would like to look into this issue. Could someone please assign it to me?

praveenkumar commented 23 hours ago

I would like to look into this issue. Could someone please assign it to me?

Done

rohanKanojia commented 22 hours ago

I can reproduce this issue. When I do crc stop and then try to access pods using kubectl get pods, I get these errors after some waiting:

E1010 21:50:00.401494  159438 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": net/http: TLS handshake timeout
E1010 21:50:32.402863  159438 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:35508->127.0.0.1:6443: read: connection reset by peer
E1010 21:51:04.403878  159438 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:54090->127.0.0.1:6443: read: connection reset by peer
E1010 21:51:36.405070  159438 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:34104->127.0.0.1:6443: read: connection reset by peer
E1010 21:52:08.406982  159438 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:58892->127.0.0.1:6443: read: connection reset by peer
error: Get "https://api.crc.testing:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:58892->127.0.0.1:6443: read: connection reset by peer

I think this issue is happening because crc is not cleaning up the current-context field in ~/.kube/config. Here are my observations on how the crc and minikube start/stop commands behave with respect to the kubeconfig:

CRC

Minikube

It seems crc does not perform any kubeconfig cleanup during the crc stop command. I do see code for cleaning up the kubeconfig: https://github.com/crc-org/crc/blob/5611baa4fc9614f838da088fe72f80a369a4fe9d/pkg/crc/machine/kubeconfig.go#L230

It gets invoked by the crc delete command here: https://github.com/crc-org/crc/blob/5611baa4fc9614f838da088fe72f80a369a4fe9d/pkg/crc/machine/delete.go#L38

When I compare this with minikube, minikube seems to clean up the kubeconfig for both the stop and delete commands.
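A quick way to see this difference by hand (a sketch; the context name is the one crc creates, and the minikube behaviour is as described above):

# After 'crc stop' the stale context is still selected in ~/.kube/config:
$ kubectl config current-context
crc-admin
$ grep current-context ~/.kube/config
current-context: crc-admin
# After 'minikube stop', the same check shows the kubeconfig has been
# cleaned up, so clients fail fast instead of timing out against the VM IP.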

I see these two ways to solve this issue:

* Make the behavior of `crc` consistent with `minikube`: also invoke the `cleanKubeconfig` method when stopping the cluster.

* While stopping the cluster, only set the `current-context` field in the kubeconfig to `""`, and keep `Clusters`, `AuthInfos` and `Contexts` in the kubeconfig.

cfergeau commented 6 hours ago

I see these two ways to solve this issue:

* Make the behavior of `crc` consistent with `minikube`: also invoke the `cleanKubeconfig` method when stopping the cluster.

* While stopping the cluster, only set the `current-context` field in the kubeconfig to `""`, and keep `Clusters`, `AuthInfos` and `Contexts` in the kubeconfig.

If it's easy to regenerate `Clusters`, `AuthInfos` and `Contexts` on cluster start, we can go with the first option and remove everything, especially if the code for that already exists.
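To make the comparison concrete, here is roughly what each option amounts to at the kubeconfig level, expressed as manual kubectl equivalents (a sketch only; the actual fix would live in crc's Go code around `cleanKubeconfig`, and the context/cluster/user names below are illustrative):

# Option 2: on 'crc stop', only clear the selected context; clusters,
# users and contexts stay in place for the next 'crc start'.
$ kubectl config unset current-context
Property "current-context" unset.

# Option 1: on 'crc stop', remove the crc entries entirely, roughly what
# 'crc delete' already does via cleanKubeconfig (delete-user needs a
# recent kubectl).
$ kubectl config delete-context crc-admin
$ kubectl config delete-cluster api-crc-testing:6443
$ kubectl config delete-user kube:admin/api-crc-testing:6443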