fix (crc/machine) : KubeContext left in invalid state after crc stop (#1569)

rohanKanojia commented 1 month ago

Description

Fix #1569

At the moment, the crc context is only removed from kubeconfig during crc delete. This is a problem if the user runs any cluster-related command after crc stop, because kubeconfig still points to a CRC cluster that is no longer running.

I checked minikube's behavior and noticed that it cleans up kubeconfig on both stop and delete. This change makes crc consistent with minikube by performing the kubeconfig cleanup in both subcommands.

Signed-off-by: Rohan Kumar <rohaan@redhat.com>

Type of change

[x] Bug fix (non-breaking change which fixes an issue)
[ ] Feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to change)
[ ] Chore (non-breaking change which doesn't affect codebase; test, version modification, documentation, etc.)

Checklist

[x] I have read the contributing guidelines
[x] My code follows the style guidelines of this project
[x] I Keep It Small and Simple: The smaller the PR is, the easier it is to review and have it merged
[x] I use conventional commits in my commit messages
[x] I have performed a self-review of my code
[x] I have added tests that prove my fix is effective or that my feature works
[x] New and existing unit tests pass locally with my changes
[x] I tested my code on specified platforms
- [x] Linux
- [ ] Windows
- [ ] MacOS

Fixes: Issue #1569

Relates to: Issue #1569

Solution/Idea

Clean up the .kube/config file during crc stop so that kubeconfig is not left in an inconsistent state.

Currently, after crc stop, the .kube/config file is left pointing to a stale kube context:

  current-context: default/api-crc-testing:6443/kubeadmin

This results in client-side timeouts when the user tries to access the cluster with any kube client (oc or kubectl):

crc : $ time oc get pods
E1015 15:46:38.452130   72163 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:35058->127.0.0.1:6443: read: connection reset by peer
E1015 15:47:10.615173   72163 memcache.go:265] couldn't get current server API group list: client rate limiter Wait returned an error: context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:38388->127.0.0.1:6443: read: connection reset by peer
E1015 15:47:43.548507   72163 memcache.go:265] couldn't get current server API group list: client rate limiter Wait returned an error: context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:55098->127.0.0.1:6443: read: connection reset by peer
E1015 15:48:15.549643   72163 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:35854->127.0.0.1:6443: read: connection reset by peer
E1015 15:48:47.550725   72163 memcache.go:265] couldn't get current server API group list: Get "https://api.crc.testing:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:44620->127.0.0.1:6443: read: connection reset by peer
error: Get "https://api.crc.testing:6443/api?timeout=32s": context deadline exceeded - error from a previous attempt: read tcp 127.0.0.1:44620->127.0.0.1:6443: read: connection reset by peer

real    2m41.162s
user    0m0.150s
sys     0m0.059s

This pull request cleans up .kube/config on crc stop, aligning crc with minikube, so that clients fail fast. Trying to access the cluster after crc stop now gives:

crc : $ time oc get pods
error: Missing or incomplete configuration info.  Please point to an existing, complete config file:

  1. Via the command-line flag --kubeconfig
  2. Via the KUBECONFIG environment variable
  3. In your home directory as ~/.kube/config

To view or setup config directly use the 'config' command.

real    0m0.126s
user    0m0.062s
sys     0m0.051s

Proposed changes

Add a call to cleanKubeconfig in stop.go to clean up kubeconfig while stopping cluster.
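
To illustrate what such a cleanup amounts to, here is a minimal, self-contained sketch. It is not the actual cleanKubeconfig implementation in crc: the Kubeconfig and Context types and the removeCluster helper are hypothetical, simplified stand-ins that only model clusters, contexts and current-context in memory. The idea is to drop the cluster served at the CRC API URL, drop every context that refers to it, and reset current-context if it pointed at a removed context, while leaving unrelated entries alone.

```go
package main

import "fmt"

// Context is a simplified kubeconfig context: a (cluster, user) pair.
type Context struct {
	Cluster string
	User    string
}

// Kubeconfig is a simplified, in-memory stand-in for ~/.kube/config.
// It is an assumption made for this sketch, not the type crc uses.
type Kubeconfig struct {
	Clusters       map[string]string  // cluster name -> API server URL
	Contexts       map[string]Context // context name -> context
	CurrentContext string
}

// removeCluster is a hypothetical helper. It drops every cluster served at
// serverURL, every context that refers to such a cluster, and clears
// current-context if it pointed at one of the removed contexts.
func removeCluster(cfg *Kubeconfig, serverURL string) {
	removed := map[string]bool{}
	for name, url := range cfg.Clusters {
		if url == serverURL {
			removed[name] = true
			delete(cfg.Clusters, name)
		}
	}
	for name, ctx := range cfg.Contexts {
		if removed[ctx.Cluster] {
			delete(cfg.Contexts, name)
			if cfg.CurrentContext == name {
				cfg.CurrentContext = ""
			}
		}
	}
}

func main() {
	cfg := &Kubeconfig{
		Clusters: map[string]string{
			"api-crc-testing:6443": "https://api.crc.testing:6443",
			"other":                "https://other.example.com:6443",
		},
		Contexts: map[string]Context{
			"default/api-crc-testing:6443/kubeadmin": {Cluster: "api-crc-testing:6443", User: "kubeadmin"},
			"other-ctx":                              {Cluster: "other", User: "dev"},
		},
		CurrentContext: "default/api-crc-testing:6443/kubeadmin",
	}

	removeCluster(cfg, "https://api.crc.testing:6443")

	fmt.Printf("clusters=%d contexts=%d current-context=%q\n",
		len(cfg.Clusters), len(cfg.Contexts), cfg.CurrentContext)
}
```

With an empty current-context, clients such as oc and kubectl have no server to contact and report a configuration error immediately instead of waiting for connection timeouts.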

Testing

To test this branch, follow these steps:

1. Run make cross to build the crc binary
2. Set up a new cluster with the built crc binary, then stop it
   - ./out/linux-amd64/crc setup
   - ./out/linux-amd64/crc start
   - ./out/linux-amd64/crc stop

Verify that .kube/config is cleaned up after crc stop:

crc : $ cat ~/.kube/config
apiVersion: v1
clusters: null
contexts: null
current-context: ""
kind: Config
preferences: {}
users: null
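
The manual check above could also be automated. The following is a rough sketch, assuming the kubeconfig keeps current-context as a plain top-level `key: value` line as in the output shown; currentContext is a hypothetical helper that does a simple line scan and is not a YAML parser.

```go
package main

import (
	"fmt"
	"strings"
)

// currentContext is a hypothetical helper that extracts the top-level
// current-context value from kubeconfig text using a plain line scan.
// It returns the value with surrounding quotes stripped, and false if
// no top-level current-context key is present.
func currentContext(kubeconfig string) (string, bool) {
	const key = "current-context:"
	for _, line := range strings.Split(kubeconfig, "\n") {
		if !strings.HasPrefix(line, key) {
			continue
		}
		value := strings.TrimSpace(strings.TrimPrefix(line, key))
		value = strings.Trim(value, `"'`)
		return value, true
	}
	return "", false
}

func main() {
	// Same content as the cleaned-up kubeconfig shown above.
	cleaned := `apiVersion: v1
clusters: null
contexts: null
current-context: ""
kind: Config
preferences: {}
users: null
`
	ctx, found := currentContext(cleaned)
	fmt.Printf("found=%v cleaned=%v\n", found, ctx == "")
}
```

A check like this treats an empty current-context as the sign that the cleanup ran; it does not verify that the clusters, contexts and users sections were emptied as well.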

Verify that accessing the stopped cluster with kubectl / oc fails fast:

crc : $ ./out/linux-amd64/crc stop
INFO Stopping the instance, this may take a few minutes... 
Stopped the instance
crc : $ kubectl get pods
E1015 14:29:44.984352   64932 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp [::1]:8080: connect: connection refused
E1015 14:29:44.984593   64932 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp [::1]:8080: connect: connection refused
E1015 14:29:44.985937   64932 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp [::1]:8080: connect: connection refused
E1015 14:29:44.986265   64932 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp [::1]:8080: connect: connection refused
E1015 14:29:44.987715   64932 memcache.go:265] couldn't get current server API group list: Get "http://localhost:8080/api?timeout=32s": dial tcp [::1]:8080: connect: connection refused
The connection to the server localhost:8080 was refused - did you specify the right host or port?

openshift-ci[bot] commented 1 month ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gbraad for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:
- **[OWNERS](https://github.com/crc-org/crc/blob/main/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment

openshift-ci[bot] commented 1 month ago

Hi @rohanKanojia. Thanks for your PR.

I'm waiting for a crc-org member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

gbraad commented 1 month ago

/ok-to-test

rohanKanojia commented 1 month ago

Could anyone please help me understand CI failures in the Windows-QE pipeline? Could it be a flaky failure? From the GitHub action logs, it seems that an action failed to generate a report. I'm not entirely sure whether these failures are related to changes made in this pull request.

redbeam commented 1 month ago

@rohanKanojia I would say they are related to something else, since these two pipelines also fail for me in #4343.

gbraad commented 1 month ago

@adrianriobo and @lilyLuLiu can help you with this

lilyLuLiu commented 1 month ago

The Windows-QE pipeline failed while copying test resources to the target machine. This is QE-related and not caused by this PR. @adrianriobo we need to improve the failure handling for deliverest.

rohanKanojia commented 1 month ago

@lilyLuLiu : Is there any open issue to track this?

lilyLuLiu commented 1 month ago

@rohanKanojia https://github.com/adrianriobo/deliverest/issues/50

openshift-ci[bot] commented 1 month ago

@rohanKanojia: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-crc | cdc863f2290d6d11ca57ea9711d2376c0465f1cd | link | true | `/test e2e-crc` |


