antrea-io / antrea

Kubernetes networking based on Open vSwitch
https://antrea.io
Apache License 2.0

Antrea controlplane server can muck w/ the k8s apiserver core operations #1354

Closed jayunit100 closed 2 years ago

jayunit100 commented 3 years ago

Describe the bug

It seems like basic operations like kubectl delete ns blah fail on antrea namespaces on the latest ubuntu image.

E1008 19:01:34.227303       1 resource_quota_controller.go:407] unable to retrieve the complete list of server APIs: controlplane.antrea.tanzu.vmware.com/v1beta1: the server is currently unable to handle the request, networking.antrea.tanzu.vmware.com/v1beta1: the server is currently unable to handle the request, system.antrea.tanzu.vmware.com/v1beta1: the server is currently unable to handle the request
E1008 19:01:36.311593       1 namespace_controller.go:162] deletion of namespace sonobuoy failed: unable to retrieve the complete list of server APIs: controlplane.antrea.tanzu.vmware.com/v1beta1: the server is currently unable to handle the request, networking.antrea.tanzu.vmware.com/v1beta1: the server is currently unable to handle the request, system.antrea.tanzu.vmware.com/v1beta1: the server is currently unable to handle the request
E1008 19:01:36.339492       1 namespace_controller.go:162] deletion of namespace vladdddddd failed: unable to retrieve the complete list of server APIs: controlplane.antrea.tanzu.vmware.com/v1beta1: the server is currently unable to handle the request, networking.antrea.tanzu.vmware.com/v1beta1: the server is currently unable to handle the request, system.antrea.tanzu.vmware.com/v1beta1: the server is currently unable to handle the request
E1008 19:01:36.342879       1 namespace_controller.go:162] deletion of namespace projected-5483 failed: unable to retrieve the complete list of server APIs: controlplane.antrea.tanzu.vmware.com/v1beta1: the server is currently unable to handle the request, networking.antrea.tanzu.vmware.com/v1beta1: the server is currently unable to handle the request, system.antrea.tanzu.vmware.com/v1beta1: the server is currently unable to handle the request
E1008 19:01:36.357224       1 namespace_controller.go:162] deletion of namespace dns-5566 failed: unable to retrieve the complete list of server APIs: controlplane.antrea.tanzu.vmware.com/v1beta1: the server is currently unable to handle the request, networking.antrea.tanzu.vmware.com/v1beta1: the server is currently unable to handle the request, system.antrea.tanzu.vmware.com/v1beta1: the server is currently unable to handle the request
E1008 19:01:38.156115       1 namespace_controller.go:162] deletion of namespace andrewwwww failed: unable to retrieve the complete list of server APIs: controlplane.antrea.tanzu.vmware.com/v1beta1: the server is currently unable to handle the request, networking.antrea.tanzu.vmware.com/v1beta1: the server is currently unable to handle the request, system.antrea.tanzu.vmware.com/v1beta1: the server is currently unable to handle the request
W1008 19:01:52.542716       1 garbagecollector.go:639] failed to discover some groups: map[controlplane.antrea.tanzu.vmware.com/v1beta1:the server is currently unable to handle the request networking.antrea.tanzu.vmware.com/v1beta1:the server is currently unable to handle the request system.antrea.tanzu.vmware.com/v1beta1:the server is currently unable to handle the request]

To Reproduce

Create an Antrea kind cluster, run a conformance test, and then delete the namespace.

Expected

The Antrea apiserver should never fail to look things up, or, if it does, it should swallow the error so as not to block the k8s apiserver from performing basic operations.

Actual behavior

The Antrea apiserver causes the k8s apiserver to fail because of a resource lookup operation.

Version

antrea-ubuntu:latest

jayunit100 commented 3 years ago
1008 19:07:47.451605       1 log.go:172] http: TLS handshake error from 172.17.0.2:61958: remote error: tls: bad certificate

So I guess this is related to certs.

Is there a way that, if the Antrea apiserver is down, it could avoid interfering with how the k8s apiserver behaves?

antoninbas commented 3 years ago

These APIServices are served by the Antrea Controller. Is there any issue with your Controller deployment or any connectivity issue between your K8s apiserver and the Controller Pod?

That's a property of APIServices, and there is not much that can be done. When a namespace is deleted, K8s contacts all APIServices to check if any resource needs to be deleted. If an APIService is not available, the namespace deletion gets stuck. There are countless reports all over the internet of this happening because of the metrics server.
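
To see which APIServices are currently blocking discovery, one option is to inspect the Available condition that Kubernetes reports on each APIService object. The following is only a minimal sketch, not part of Antrea: the function name unavailable_apiservices is hypothetical, and it assumes its input is the JSON printed by `kubectl get apiservices -o json`.

```python
import json


def unavailable_apiservices(apiservice_list_json):
    """Return (name, message) pairs for APIServices that are not Available.

    Assumes the input is the JSON document printed by
    `kubectl get apiservices -o json` (a List object with an "items" array).
    """
    doc = json.loads(apiservice_list_json)
    broken = []
    for item in doc.get("items", []):
        name = item["metadata"]["name"]
        conditions = item.get("status", {}).get("conditions", [])
        # Each APIService reports a condition of type "Available".
        available = next(
            (c for c in conditions if c.get("type") == "Available"), None
        )
        if available is None:
            broken.append((name, "no Available condition reported"))
        elif available.get("status") != "True":
            broken.append((name, available.get("message", "")))
    return broken
```

For example, saving the kubectl output to a file and passing its contents to the function would list every aggregated API that namespace deletion is likely to get stuck on, together with the message the API server recorded for it.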

jianjuns commented 3 years ago

BTW, one workaround (if you do not want to restore connectivity between kube-apiserver and the Antrea Controller first) is to delete the Antrea APIServices.
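
As a rough sketch of that workaround: APIService objects are named `<version>.<group>`, so the names can be derived from the group/versions that appear in the error log above. The helper names below (apiservice_name, delete_commands) are hypothetical, and the group/version list is an assumption taken only from this issue's logs; other Antrea versions may register different groups. The code only builds the kubectl commands and does not run them.

```python
# Group/versions copied from the error log in this issue (an assumption:
# other Antrea releases may register different API groups).
ANTREA_GROUP_VERSIONS = [
    "controlplane.antrea.tanzu.vmware.com/v1beta1",
    "networking.antrea.tanzu.vmware.com/v1beta1",
    "system.antrea.tanzu.vmware.com/v1beta1",
]


def apiservice_name(group_version):
    """Convert "group/version" into the APIService name "version.group"."""
    group, version = group_version.split("/")
    return f"{version}.{group}"


def delete_commands(group_versions):
    """Build one `kubectl delete apiservice` argv list per group/version."""
    return [
        ["kubectl", "delete", "apiservice", apiservice_name(gv)]
        for gv in group_versions
    ]


# Print the commands for review instead of executing them.
for argv in delete_commands(ANTREA_GROUP_VERSIONS):
    print(" ".join(argv))
```

Printing rather than executing keeps the step reviewable: deleting these objects removes the aggregated Antrea APIs from the cluster until they are created again, so it is worth checking the list before running anything.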

github-actions[bot] commented 3 years ago

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment, or this will be closed in 180 days

moshloop commented 3 years ago

@antoninbas - I think the use of API aggregation introduces fragility by increasing the blast radius of a failure or misconfiguration, especially in GitOps environments, where it could cause the very updates meant to fix a broken cluster to fail because API resource listing breaks in controllers like Flux.

Why not just convert the APIService aggregations to direct calls to the controller?

antoninbas commented 3 years ago

@moshloop Thanks for the feedback. I tend to agree with your statement. IIRC the main reason why we chose to use aggregation in the first place was to allow easy access to resource URLs from the antctl command-line tool, without having to worry about endpoint discovery / authentication. Recently we have added more commands to antctl which access non-resource URLs in the controller and for which API aggregation doesn't help. We are planning to refactor the antctl framework to better support such commands and it is probably a good time to consider removing the dependency on Antrea API aggregation altogether. Related discussion: https://github.com/vmware-tanzu/antrea/pull/2082#discussion_r617899358
