Open acuteaura opened 1 year ago
I should note that this is running with a single replica. The controller seems to simply quit here, believing another replica will take on leadership, though that never happens.
invalid response code
In fact, these responses come from the APISIX admin API. This means that the unhealthy APISIX instance itself responded to these admin API requests.
That is to be expected, since this happened during a helm upgrade and APISIX pods were being recreated at the same time.
Either way, the controller getting stuck doing nothing is the unexpected behavior. The controller simply cannot assume that the APISIX API will be reachable at startup. I'd suggest that a failed sync should cause the ingress controller to exit (or retry, if refactored to; at the moment it seems to cancel an in-flight retry attempt via the deferred cancel). A failed sync is not a good signal for giving up on leader election.
You can check the Helm chart. There is an init container that checks the connection to APISIX.
That only checks whether the TCP socket is up; the same goes for the readiness probe on APISIX itself (it's not an HTTP check). I'm not sure whether APISIX itself takes a second to boot up or whether etcd being replaced at the same time is the culprit (we run etcd with replicas=1 right now), but the server responding with a 502/503 is not a condition that's checked.
This issue has been marked as stale due to 90 days of inactivity. It will be closed in 30 days if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@apisix.apache.org list. Thank you for your contributions.
/reopen
still an edge case the ingress controller can't recover from
Also running into this issue here.
This is a critical issue, and checking whether the TCP port is available is absolutely not sufficient, @tao12345666333. We had yet another ingress controller that simply got stuck. Frankly, the entire leader election code seems broken (e.g. it reports "running as leader" in a callback where nothing else happens, and starts the controller loop once any controller is elected). I'll look into getting a PR in for this as soon as possible (after #2139); this is unfortunately making us look pretty bad as an infra team for adopting this ingress.
2024-01-19T21:29:00+08:00 info ingress/ingress.go:121 init apisix ingress controller
2024-01-19T21:29:00+08:00 info ingress/ingress.go:123 version:
Version: 1.7.0
Git SHA: no-git-module
Go Version: go1.20.8
Building OS/Arch: linux/amd64
Running OS/Arch: linux/amd64
2024-01-19T21:29:00+08:00 info ingress/ingress.go:133 use configuration
{
"cert_file": "/etc/webhook/certs/cert.pem",
"key_file": "/etc/webhook/certs/key.pem",
"log_level": "info",
"log_output": "stderr",
"log_rotate_output_path": "",
"log_rotation_max_size": 100,
"log_rotation_max_age": 0,
"log_rotation_max_backups": 0,
"http_listen": ":8080",
"https_listen": ":8443",
"ingress_publish_service": "apisix/apisix-internal-gateway",
"ingress_status_address": [],
"enable_profiling": true,
"kubernetes": {
"kubeconfig": "",
"resync_interval": "6h0m0s",
"namespace_selector": [],
"election_id": "ingress-apisix-internal-leader",
"ingress_class": "apisix-internal",
"ingress_version": "networking/v1",
"watch_endpoint_slices": false,
"api_version": "apisix.apache.org/v2",
"enable_gateway_api": false,
"disable_status_updates": false,
"enable_admission": false
},
"apisix": {
"admin_api_version": "v3",
"default_cluster_name": "default",
"default_cluster_base_url": "http://apisix-internal-admin.apisix.svc.cluster.local:9180/apisix/admin",
"default_cluster_admin_key": "******"
},
"apisix_resource_sync_interval": "1h0m0s",
"apisix_resource_sync_comparison": true,
"plugin_metadata_cm": "",
"etcdserver": {
"enabled": false,
"prefix": "/apisix",
"listen_address": ":12379",
"ssl_key_encrypt_salt": "edd1c9f0985e76a2"
}
}
2024-01-19T21:29:00+08:00 info ingress/ingress.go:145 start ingress controller
2024-01-19T21:29:00+08:00 info providers/controller.go:170 start leader election
2024-01-19T21:29:00+08:00 info providers/controller.go:155 start api server
I0119 21:29:00.286860 1 leaderelection.go:250] attempting to acquire leader lease apisix/ingress-apisix-internal-leader...
2024-01-19T21:29:00+08:00 warn providers/controller.go:220 found a new leader apisix-internal-ingress-controller-578bf88584-jvs59
2024-01-19T21:29:00+08:00 info providers/controller.go:222 controller now is running as a candidate {"namespace": "apisix", "pod": "apisix-internal-ingress-controller-9d96f45cb-4kg8l"}
2024-01-19T21:29:00+08:00 info providers/controller.go:387 controller tries to leading ... {"namespace": "apisix", "pod": "apisix-internal-ingress-controller-9d96f45cb-4kg8l"}
2024-01-19T21:29:00+08:00 info apisix/stream_route.go:38 resource stream_routes is disabled
2024-01-19T21:29:00+08:00 info apisix/cluster.go:248 syncing cache {"cluster": "default"}
2024-01-19T21:29:00+08:00 warn apisix/cluster.go:423 waiting cluster default to ready, it may takes a while
2024-01-19T21:29:00+08:00 info apisix/cluster.go:463 syncing schema {"cluster": "default"}
2024-01-19T21:29:00+08:00 error apisix/route.go:90 failed to list routes: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}
2024-01-19T21:29:00+08:00 error apisix/cluster.go:298 failed to list routes in APISIX: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}
2024-01-19T21:29:02+08:00 error apisix/route.go:90 failed to list routes: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}
2024-01-19T21:29:02+08:00 error apisix/cluster.go:298 failed to list routes in APISIX: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}
2024-01-19T21:29:03+08:00 info providers/controller.go:141 LeaderElection {"message": "apisix-internal-ingress-controller-9d96f45cb-4kg8l became leader", "event_type": "Normal"}
I0119 21:29:03.113259 1 leaderelection.go:260] successfully acquired lease apisix/ingress-apisix-internal-leader
2024-01-19T21:29:03+08:00 warn providers/controller.go:220 found a new leader apisix-internal-ingress-controller-9d96f45cb-4kg8l
2024-01-19T21:29:03+08:00 info providers/controller.go:212 controller now is running as leader {"namespace": "apisix", "pod": "apisix-internal-ingress-controller-9d96f45cb-4kg8l"}
2024-01-19T21:29:04+08:00 error apisix/route.go:90 failed to list routes: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}
2024-01-19T21:29:04+08:00 error apisix/cluster.go:298 failed to list routes in APISIX: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}
2024-01-19T21:29:06+08:00 error apisix/route.go:90 failed to list routes: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}
2024-01-19T21:29:06+08:00 error apisix/cluster.go:298 failed to list routes in APISIX: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}
2024-01-19T21:29:08+08:00 error apisix/route.go:90 failed to list routes: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}
2024-01-19T21:29:08+08:00 error apisix/cluster.go:298 failed to list routes in APISIX: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}
2024-01-19T21:29:08+08:00 error apisix/cluster.go:258 failed to sync cache {"cost_time": "8.016969049s", "cluster": "default"}
2024-01-19T21:29:08+08:00 error providers/controller.go:419 failed to wait the default cluster to be ready: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}
2024-01-19T21:29:08+08:00 info apisix/stream_route.go:38 resource stream_routes is disabled
2024-01-19T21:29:08+08:00 info apisix/cluster.go:248 syncing cache {"cluster": "default"}
2024-01-19T21:29:08+08:00 info apisix/cluster.go:463 syncing schema {"cluster": "default"}
2024-01-19T21:29:08+08:00 error apisix/route.go:90 failed to list routes: Get "http://apisix-internal-admin.apisix.svc.cluster.local:9180/apisix/admin/routes": context canceled
2024-01-19T21:29:08+08:00 error apisix/cluster.go:298 failed to list routes in APISIX: Get "http://apisix-internal-admin.apisix.svc.cluster.local:9180/apisix/admin/routes": context canceled
2024-01-19T21:29:08+08:00 error apisix/cluster.go:258 failed to sync cache {"cost_time": "60.702µs", "cluster": "default"}
2024-01-19T21:29:08+08:00 error apisix/plugin.go:46 failed to list plugins' names: Get "http://apisix-internal-admin.apisix.svc.cluster.local:9180/apisix/admin/plugins?all=true": context canceled
2024-01-19T21:29:08+08:00 error apisix/cluster.go:483 failed to list plugin names in APISIX: Get "http://apisix-internal-admin.apisix.svc.cluster.local:9180/apisix/admin/plugins?all=true": context canceled
2024-01-19T21:29:08+08:00 error apisix/cluster.go:446 failed to sync schema: Get "http://apisix-internal-admin.apisix.svc.cluster.local:9180/apisix/admin/plugins?all=true": context canceled
E0122 23:07:05.784720 1 leaderelection.go:369] Failed to update lock: Put "https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/apisix/leases/ingress-apisix-internal-leader": context deadline exceeded
I0122 23:07:05.784783 1 leaderelection.go:285] failed to renew lease apisix/ingress-apisix-internal-leader: timed out waiting for the condition
2024-01-22T23:07:09+08:00 info providers/controller.go:141 LeaderElection {"message": "apisix-internal-ingress-controller-9d96f45cb-4kg8l stopped leading", "event_type": "Normal"}
2024-01-22T23:07:09+08:00 info providers/controller.go:231 controller now is running as a candidate {"namespace": "apisix", "pod": "apisix-internal-ingress-controller-9d96f45cb-4kg8l"}
I0122 23:07:09.152585 1 leaderelection.go:250] attempting to acquire leader lease apisix/ingress-apisix-internal-leader...
2024-01-22T23:07:09+08:00 info providers/controller.go:141 LeaderElection {"message": "apisix-internal-ingress-controller-9d96f45cb-4kg8l became leader", "event_type": "Normal"}
I0122 23:07:09.984780 1 leaderelection.go:260] successfully acquired lease apisix/ingress-apisix-internal-leader
2024-01-22T23:07:09+08:00 info providers/controller.go:212 controller now is running as leader {"namespace": "apisix", "pod": "apisix-internal-ingress-controller-9d96f45cb-4kg8l"}
E0124 13:07:40.830633 1 leaderelection.go:369] Failed to update lock: Put "https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/apisix/leases/ingress-apisix-internal-leader": context deadline exceeded
I0124 13:07:40.830667 1 leaderelection.go:285] failed to renew lease apisix/ingress-apisix-internal-leader: timed out waiting for the condition
2024-01-24T13:07:40+08:00 info providers/controller.go:141 LeaderElection {"message": "apisix-internal-ingress-controller-9d96f45cb-4kg8l stopped leading", "event_type": "Normal"}
2024-01-24T13:07:40+08:00 info providers/controller.go:231 controller now is running as a candidate {"namespace": "apisix", "pod": "apisix-internal-ingress-controller-9d96f45cb-4kg8l"}
I0124 13:07:40.843705 1 leaderelection.go:250] attempting to acquire leader lease apisix/ingress-apisix-internal-leader...
2024-01-24T13:07:40+08:00 info providers/controller.go:141 LeaderElection {"message": "apisix-internal-ingress-controller-9d96f45cb-4kg8l became leader", "event_type": "Normal"}
I0124 13:07:40.864098 1 leaderelection.go:260] successfully acquired lease apisix/ingress-apisix-internal-leader
2024-01-24T13:07:40+08:00 info providers/controller.go:212 controller now is running as leader {"namespace": "apisix", "pod": "apisix-internal-ingress-controller-9d96f45cb-4kg8l"}
@acuteaura I will take up this item.
I'm still happy to submit a PR next week if you'd prefer. I've already dug around the code a fair bit, and my other open PR should address at least one of these cases (where returning nil to the command would not exit).
@acuteaura I approved that PR. You can create a PR for this issue.
For the record, we hit the same incident this week in a production environment because the ingress controller stopped working without failing or retrying.
I agree with @acuteaura that this is a critical bug.
A first mitigation could be to update the healthz endpoint to check whether the cluster has a leader and fail if it doesn't.
The problem isn't having no leader; it's quitting out of run when the APISIX API can't be reached, without stepping down and exiting, which before #2139 meant the controller was stuck waiting for an interrupt while still holding leadership.
A possible hotfix would be an init container that checks whether the API is actually available, plus fixing the health/readiness checks on APISIX itself. Right now they only check that a TCP socket is open, but that doesn't suffice when APISIX is still booting up and rejects API requests with a 503.
Specifically, the bug we're seeing (and that you are likely seeing too) is here
Hi,
First, I would like to thank the people who are currently working on this. I would also like to kindly ask whether there is any update here. :bow:
Keep an eye on #2152; I'm ready to pick work on it back up as soon as someone figures out why the E2E tests don't like that PR.
Even if APISIX has been started long before the controller, I'm getting the "context canceled" errors when relaunching the controller.
If I exec into the controller's pod and make the same requests manually with curl, I get the correct responses.
The bug isn't random for me; the controller can't start working at all.
Is there an update on this? We have recently started seeing this issue pop up more and more on our clusters that use spot instances, where pods get recreated pretty often. We are running the latest version (1.8.2). I am happy to help with whatever is necessary to move this forward.
You could patch in an os.Exit
on the TODO line in Controller.run inside pkg/providers/controller.go
for now, or even submit that as a PR. That'd certainly be better than the status quo of ignoring failed initialization and just not running the controller. I'd much prefer leader election not be "both controllers do everything, but one skips writes (hopefully everywhere)", because it's a nice footgun for contributors who might not be aware of the larger picture. But I don't have the resources to reproduce why the somewhat fragile E2E test suite (which unfortunately also somewhat relies on running inside Actions) fails on that PR.
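For what it's worth, a minimal sketch of the shape such a patch could take (all names here are illustrative, not the controller's actual code): instead of logging the failed sync and returning, run propagates the error so the process can exit non-zero and be restarted by Kubernetes.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// run mirrors the rough shape of Controller.run: today a failed cluster
// sync is logged and swallowed, leaving the process alive but idle.
// Returning the error instead lets the caller decide to exit.
func run() error {
	if err := waitClusterReady(); err != nil {
		return fmt.Errorf("failed to wait the default cluster to be ready: %w", err)
	}
	// ... start informers and the sync loop ...
	return nil
}

// waitClusterReady simulates the 503s from the logs above.
func waitClusterReady() error {
	return errors.New("unexpected status code 503")
}

func main() {
	if err := run(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		// In the real fix this would be os.Exit(1), so the orchestrator
		// restarts the pod instead of leaving it stuck as leader.
		return
	}
}
```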
I do agree that the non-leader should not do anything at all, instead of what it does now. I'll try to apply your os.Exit change
as a new PR here, and will try to look into the E2E suite, but I am not very proficient with this codebase, or even Go in general, so I'm not sure I'll be able to figure that one out. Thanks for the quick response.
Looking forward to the fix. We constantly run into this issue when our Kubernetes cluster (GKE) performs an update on the nodes where APISIX is running. More or less every time, we have to manually restart the ingress controller to get things working again.
We also ran into a similar issue in production on 1.7.1.
Current Behavior
Occasionally, our controller stops doing anything after failing to sync, but it also doesn't exit.
We have logs of this occurring immediately after an upgrade, but we've seen similar behavior elsewhere too.
Expected Behavior
The ingress controller should either retry running its logic or exit.
Error Logs
Do note that I pulled this log on 2023-09-26 - the ingress controller just hasn't logged anything in days.
Steps to Reproduce
Not fully known, but etcd/apisix being unavailable at critical moments seems to trigger it.
Environment
apisix-ingress-controller version --long
apache/apisix-ingress-controller:1.7.0
kubectl version
v1.26.6
uname -a
N/A