apache / apisix-ingress-controller

APISIX Ingress Controller for Kubernetes
https://apisix.apache.org/
Apache License 2.0

bug: ingress-controller doesn't recover from failed sync #1980

Open acuteaura opened 1 year ago

acuteaura commented 1 year ago

Current Behavior

Occasionally, our controller will stop doing anything after failing to sync, but it also doesn't exit.

We have logs of this occurring immediately after an upgrade, but we've seen similar behavior elsewhere too.

Expected Behavior

The ingress controller should either retry running its logic or exit.

Error Logs

2023-09-22T17:40:34+08:00   info    ingress/ingress.go:145  start ingress controller
2023-09-22T17:40:34+08:00   info    providers/controller.go:155 start api server
2023-09-22T17:40:34+08:00   info    providers/controller.go:170 start leader election
I0922 17:40:34.168741       1 leaderelection.go:250] attempting to acquire leader lease apisix/ingress-apisix-internal-leader...
2023-09-22T17:40:34+08:00   warn    providers/controller.go:220 found a new leader apisix-internal-ingress-controller-54d9c8d775-9qwkf
2023-09-22T17:40:34+08:00   info    providers/controller.go:222 controller now is running as a candidate    {"namespace": "apisix", "pod": "apisix-internal-ingress-controller-5948cfdc95-hq8cm"}
2023-09-22T17:40:34+08:00   info    providers/controller.go:387 controller tries to leading ... {"namespace": "apisix", "pod": "apisix-internal-ingress-controller-5948cfdc95-hq8cm"}
2023-09-22T17:40:34+08:00   warn    apisix/cluster.go:423   waiting cluster default to ready, it may takes a while
2023-09-22T17:40:34+08:00   info    apisix/cluster.go:248   syncing cache   {"cluster": "default"}
2023-09-22T17:40:34+08:00   info    apisix/cluster.go:463   syncing schema  {"cluster": "default"}
2023-09-22T17:40:34+08:00   error   apisix/route.go:90  failed to list routes: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2023-09-22T17:40:34+08:00   error   apisix/cluster.go:298   failed to list routes in APISIX: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2023-09-22T17:40:36+08:00   error   apisix/route.go:90  failed to list routes: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2023-09-22T17:40:36+08:00   error   apisix/cluster.go:298   failed to list routes in APISIX: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2023-09-22T17:40:38+08:00   error   apisix/cluster.go:298   failed to list routes in APISIX: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2023-09-22T17:40:40+08:00   error   apisix/route.go:90  failed to list routes: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2023-09-22T17:40:40+08:00   error   apisix/cluster.go:298   failed to list routes in APISIX: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2023-09-22T17:40:42+08:00   error   apisix/route.go:90  failed to list routes: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2023-09-22T17:40:42+08:00   error   apisix/cluster.go:298   failed to list routes in APISIX: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2023-09-22T17:40:42+08:00   error   apisix/cluster.go:258   failed to sync cache    {"cost_time": "8.011791225s", "cluster": "default"}
2023-09-22T17:40:42+08:00   error   providers/controller.go:419 failed to wait the default cluster to be ready: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2023-09-22T17:40:42+08:00   info    apisix/cluster.go:248   syncing cache   {"cluster": "default"}
2023-09-22T17:40:42+08:00   info    apisix/cluster.go:463   syncing schema  {"cluster": "default"}
2023-09-22T17:40:42+08:00   error   apisix/route.go:90  failed to list routes: Get "http://apisix-admin.apisix.svc.cluster.local:9180/apisix/admin/routes": context canceled
2023-09-22T17:40:42+08:00   error   apisix/plugin.go:46 failed to list plugins' names: Get "http://apisix-admin.apisix.svc.cluster.local:9180/apisix/admin/plugins?all=true": context canceled
2023-09-22T17:40:42+08:00   error   apisix/cluster.go:298   failed to list routes in APISIX: Get "http://apisix-admin.apisix.svc.cluster.local:9180/apisix/admin/routes": context canceled
2023-09-22T17:40:42+08:00   error   apisix/cluster.go:483   failed to list plugin names in APISIX: Get "http://apisix-admin.apisix.svc.cluster.local:9180/apisix/admin/plugins?all=true": context canceled
2023-09-22T17:40:42+08:00   error   apisix/cluster.go:446   failed to sync schema: Get "http://apisix-admin.apisix.svc.cluster.local:9180/apisix/admin/plugins?all=true": context canceled
2023-09-22T17:40:42+08:00   error   apisix/cluster.go:258   failed to sync cache    {"cost_time": "99.599µs", "cluster": "default"}

Note that I pulled this log on 2023-09-26; the ingress controller simply hasn't logged anything in days.

Steps to Reproduce

Not fully known, but etcd/APISIX being unavailable at critical moments seems to trigger it.

Environment

acuteaura commented 1 year ago

I should note that this is running with a single replica. The controller seems to have simply quit here, believing another replica would take over leadership, though that never happens.

tao12345666333 commented 1 year ago
invalid response code

In fact, these responses come from the APISIX Admin API. This means that an unhealthy APISIX instance responded to those admin API requests.

acuteaura commented 1 year ago

That is to be expected, since this happened during a Helm upgrade and the APISIX pods were being recreated at the same time.

Either way, the controller getting stuck doing nothing is the unexpected behavior. The controller simply cannot assume that the APISIX API will be reachable at startup. I'd suggest that a failed sync should cause the ingress controller to exit (or retry, if refactored to; right now it seems to cancel the in-flight retry attempt via the deferred cancel, as sketched below). A failed sync is not a good signal for giving up on leader election.
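Roughly, the suspected pattern looks like this; a minimal sketch with hypothetical names, not the actual controller source:

```go
package sketch

import (
	"context"
	"errors"
	"time"
)

// syncCache is a stub standing in for the controller's cache sync,
// which keeps failing while APISIX answers 502/503.
func syncCache(ctx context.Context) error { return errors.New("unexpected status code 503") }

// run illustrates the suspected bug: the deferred cancel fires the
// moment the ready-wait fails, tearing down the context the background
// retry is using (hence the immediate "context canceled" errors in the
// log), while the process itself keeps running and never exits.
func run(parent context.Context) error {
	ctx, cancel := context.WithCancel(parent)
	defer cancel()

	go func() {
		for ctx.Err() == nil { // background sync retry loop
			if syncCache(ctx) == nil {
				return
			}
			time.Sleep(2 * time.Second)
		}
	}()

	if err := waitReady(ctx, 8*time.Second); err != nil {
		return err // triggers the deferred cancel, killing the retry in flight
	}
	return nil
}

// waitReady stands in for waiting until the first sync succeeds.
func waitReady(ctx context.Context, d time.Duration) error {
	select {
	case <-time.After(d):
		return errors.New("failed to wait the default cluster to be ready")
	case <-ctx.Done():
		return ctx.Err()
	}
}
```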

tao12345666333 commented 1 year ago

You can check the Helm chart. It has an init container that checks the connection to APISIX.

acuteaura commented 1 year ago

That only checks whether the TCP socket is up, and the same goes for the readiness probe on APISIX itself (it's not an HTTP check). I'm not sure whether APISIX itself takes a moment to boot up or whether etcd being replaced at the same time is the culprit (we run etcd with replicas=1 right now), but the server answering with a 502/503 is not a condition that's checked.

github-actions[bot] commented 9 months ago

This issue has been marked as stale due to 90 days of inactivity. It will be closed in 30 days if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@apisix.apache.org list. Thank you for your contributions.

acuteaura commented 9 months ago

/reopen

Still an edge case the ingress controller can't recover from.

geoffreytran commented 8 months ago

Also running into this issue here.

acuteaura commented 8 months ago

This is a critical issue, and checking whether the TCP port is available is absolutely not sufficient, @tao12345666333. We had yet another ingress controller that simply got stuck. Frankly, the entire leader election code seems broken (e.g. it reports "running as leader" in a callback where nothing else happens, and it starts the controller loop once any controller is elected), and I'll look into getting a PR in for this as soon as possible (after #2139). This is unfortunately making us look pretty bad as an infra team for adopting this ingress.

2024-01-19T21:29:00+08:00   info   ingress/ingress.go:121  init apisix ingress controller
2024-01-19T21:29:00+08:00   info   ingress/ingress.go:123  version:
Version: 1.7.0
Git SHA: no-git-module
Go Version: go1.20.8
Building OS/Arch: linux/amd64
Running OS/Arch: linux/amd64

2024-01-19T21:29:00+08:00   info   ingress/ingress.go:133  use configuration
{
  "cert_file": "/etc/webhook/certs/cert.pem",
  "key_file": "/etc/webhook/certs/key.pem",
  "log_level": "info",
  "log_output": "stderr",
  "log_rotate_output_path": "",
  "log_rotation_max_size": 100,
  "log_rotation_max_age": 0,
  "log_rotation_max_backups": 0,
  "http_listen": ":8080",
  "https_listen": ":8443",
  "ingress_publish_service": "apisix/apisix-internal-gateway",
  "ingress_status_address": [],
  "enable_profiling": true,
  "kubernetes": {
    "kubeconfig": "",
    "resync_interval": "6h0m0s",
    "namespace_selector": [],
    "election_id": "ingress-apisix-internal-leader",
    "ingress_class": "apisix-internal",
    "ingress_version": "networking/v1",
    "watch_endpoint_slices": false,
    "api_version": "apisix.apache.org/v2",
    "enable_gateway_api": false,
    "disable_status_updates": false,
    "enable_admission": false
  },
  "apisix": {
    "admin_api_version": "v3",
    "default_cluster_name": "default",
    "default_cluster_base_url": "http://apisix-internal-admin.apisix.svc.cluster.local:9180/apisix/admin",
    "default_cluster_admin_key": "******"
  },
  "apisix_resource_sync_interval": "1h0m0s",
  "apisix_resource_sync_comparison": true,
  "plugin_metadata_cm": "",
  "etcdserver": {
    "enabled": false,
    "prefix": "/apisix",
    "listen_address": ":12379",
    "ssl_key_encrypt_salt": "edd1c9f0985e76a2"
  }
}
2024-01-19T21:29:00+08:00   info   ingress/ingress.go:145  start ingress controller
2024-01-19T21:29:00+08:00   info   providers/controller.go:170 start leader election
2024-01-19T21:29:00+08:00   info   providers/controller.go:155 start api server
I0119 21:29:00.286860       1 leaderelection.go:250] attempting to acquire leader lease apisix/ingress-apisix-internal-leader...
2024-01-19T21:29:00+08:00   warn   providers/controller.go:220 found a new leader apisix-internal-ingress-controller-578bf88584-jvs59
2024-01-19T21:29:00+08:00   info   providers/controller.go:222 controller now is running as a candidate    {"namespace": "apisix", "pod": "apisix-internal-ingress-controller-9d96f45cb-4kg8l"}
2024-01-19T21:29:00+08:00   info   providers/controller.go:387 controller tries to leading ... {"namespace": "apisix", "pod": "apisix-internal-ingress-controller-9d96f45cb-4kg8l"}
2024-01-19T21:29:00+08:00   info   apisix/stream_route.go:38   resource stream_routes is disabled
2024-01-19T21:29:00+08:00   info   apisix/cluster.go:248   syncing cache   {"cluster": "default"}
2024-01-19T21:29:00+08:00   warn   apisix/cluster.go:423   waiting cluster default to ready, it may takes a while
2024-01-19T21:29:00+08:00   info   apisix/cluster.go:463   syncing schema  {"cluster": "default"}
2024-01-19T21:29:00+08:00   error  apisix/route.go:90  failed to list routes: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2024-01-19T21:29:00+08:00   error  apisix/cluster.go:298   failed to list routes in APISIX: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2024-01-19T21:29:02+08:00   error  apisix/route.go:90  failed to list routes: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2024-01-19T21:29:02+08:00   error  apisix/cluster.go:298   failed to list routes in APISIX: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2024-01-19T21:29:03+08:00   info   providers/controller.go:141 LeaderElection  {"message": "apisix-internal-ingress-controller-9d96f45cb-4kg8l became leader", "event_type": "Normal"}
I0119 21:29:03.113259       1 leaderelection.go:260] successfully acquired lease apisix/ingress-apisix-internal-leader
2024-01-19T21:29:03+08:00   warn   providers/controller.go:220 found a new leader apisix-internal-ingress-controller-9d96f45cb-4kg8l
2024-01-19T21:29:03+08:00   info   providers/controller.go:212 controller now is running as leader {"namespace": "apisix", "pod": "apisix-internal-ingress-controller-9d96f45cb-4kg8l"}
2024-01-19T21:29:04+08:00   error  apisix/route.go:90  failed to list routes: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2024-01-19T21:29:04+08:00   error  apisix/cluster.go:298   failed to list routes in APISIX: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2024-01-19T21:29:06+08:00   error  apisix/route.go:90  failed to list routes: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2024-01-19T21:29:06+08:00   error  apisix/cluster.go:298   failed to list routes in APISIX: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2024-01-19T21:29:08+08:00   error  apisix/route.go:90  failed to list routes: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2024-01-19T21:29:08+08:00   error  apisix/cluster.go:298   failed to list routes in APISIX: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2024-01-19T21:29:08+08:00   error  apisix/cluster.go:258   failed to sync cache    {"cost_time": "8.016969049s", "cluster": "default"}
2024-01-19T21:29:08+08:00   error  providers/controller.go:419 failed to wait the default cluster to be ready: unexpected status code 503; error message: {"error_msg":"invalid response code: 502"}

2024-01-19T21:29:08+08:00   info   apisix/stream_route.go:38   resource stream_routes is disabled
2024-01-19T21:29:08+08:00   info   apisix/cluster.go:248   syncing cache   {"cluster": "default"}
2024-01-19T21:29:08+08:00   info   apisix/cluster.go:463   syncing schema  {"cluster": "default"}
2024-01-19T21:29:08+08:00   error  apisix/route.go:90  failed to list routes: Get "http://apisix-internal-admin.apisix.svc.cluster.local:9180/apisix/admin/routes": context canceled
2024-01-19T21:29:08+08:00   error  apisix/cluster.go:298   failed to list routes in APISIX: Get "http://apisix-internal-admin.apisix.svc.cluster.local:9180/apisix/admin/routes": context canceled
2024-01-19T21:29:08+08:00   error  apisix/cluster.go:258   failed to sync cache    {"cost_time": "60.702µs", "cluster": "default"}
2024-01-19T21:29:08+08:00   error  apisix/plugin.go:46 failed to list plugins' names: Get "http://apisix-internal-admin.apisix.svc.cluster.local:9180/apisix/admin/plugins?all=true": context canceled
2024-01-19T21:29:08+08:00   error  apisix/cluster.go:483   failed to list plugin names in APISIX: Get "http://apisix-internal-admin.apisix.svc.cluster.local:9180/apisix/admin/plugins?all=true": context canceled
2024-01-19T21:29:08+08:00   error  apisix/cluster.go:446   failed to sync schema: Get "http://apisix-internal-admin.apisix.svc.cluster.local:9180/apisix/admin/plugins?all=true": context canceled
E0122 23:07:05.784720       1 leaderelection.go:369] Failed to update lock: Put "https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/apisix/leases/ingress-apisix-internal-leader": context deadline exceeded
I0122 23:07:05.784783       1 leaderelection.go:285] failed to renew lease apisix/ingress-apisix-internal-leader: timed out waiting for the condition
2024-01-22T23:07:09+08:00   info   providers/controller.go:141 LeaderElection  {"message": "apisix-internal-ingress-controller-9d96f45cb-4kg8l stopped leading", "event_type": "Normal"}
2024-01-22T23:07:09+08:00   info   providers/controller.go:231 controller now is running as a candidate    {"namespace": "apisix", "pod": "apisix-internal-ingress-controller-9d96f45cb-4kg8l"}
I0122 23:07:09.152585       1 leaderelection.go:250] attempting to acquire leader lease apisix/ingress-apisix-internal-leader...
2024-01-22T23:07:09+08:00   info   providers/controller.go:141 LeaderElection  {"message": "apisix-internal-ingress-controller-9d96f45cb-4kg8l became leader", "event_type": "Normal"}
I0122 23:07:09.984780       1 leaderelection.go:260] successfully acquired lease apisix/ingress-apisix-internal-leader
2024-01-22T23:07:09+08:00   info   providers/controller.go:212 controller now is running as leader {"namespace": "apisix", "pod": "apisix-internal-ingress-controller-9d96f45cb-4kg8l"}
E0124 13:07:40.830633       1 leaderelection.go:369] Failed to update lock: Put "https://10.0.0.1:443/apis/coordination.k8s.io/v1/namespaces/apisix/leases/ingress-apisix-internal-leader": context deadline exceeded
I0124 13:07:40.830667       1 leaderelection.go:285] failed to renew lease apisix/ingress-apisix-internal-leader: timed out waiting for the condition
2024-01-24T13:07:40+08:00   info   providers/controller.go:141 LeaderElection  {"message": "apisix-internal-ingress-controller-9d96f45cb-4kg8l stopped leading", "event_type": "Normal"}
2024-01-24T13:07:40+08:00   info   providers/controller.go:231 controller now is running as a candidate    {"namespace": "apisix", "pod": "apisix-internal-ingress-controller-9d96f45cb-4kg8l"}
I0124 13:07:40.843705       1 leaderelection.go:250] attempting to acquire leader lease apisix/ingress-apisix-internal-leader...
2024-01-24T13:07:40+08:00   info   providers/controller.go:141 LeaderElection  {"message": "apisix-internal-ingress-controller-9d96f45cb-4kg8l became leader", "event_type": "Normal"}
I0124 13:07:40.864098       1 leaderelection.go:260] successfully acquired lease apisix/ingress-apisix-internal-leader
2024-01-24T13:07:40+08:00   info   providers/controller.go:212 controller now is running as leader {"namespace": "apisix", "pod": "apisix-internal-ingress-controller-9d96f45cb-4kg8l"}
Revolyssup commented 8 months ago

@acuteaura I will take up this item.

acuteaura commented 8 months ago

I'm still happy to submit a PR next week if you'd prefer. I already dug around the code a bunch and my other open PR should address at least one of these cases (where returning nil to the command would not exit).

Revolyssup commented 8 months ago

@acuteaura I approved that PR. You can create a PR for this issue.

flassagn commented 7 months ago

For the record, we hit the same incident this week in a production environment: the ingress controller stopped working without failing or retrying.

I agree with @acuteaura that this is a critical bug.

flassagn commented 7 months ago

A first mitigation could be to update the healthz endpoint to check whether the cluster has a leader, and to fail if it does not; see the sketch below.
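Something like this, where the handler and the state fields are assumptions for illustration, not the project's actual API:

```go
package sketch

import "net/http"

// controllerState is a stand-in for real controller state; both
// fields are hypothetical, not existing fields in this project.
var controllerState struct {
	lastSyncOK bool // last cache sync against APISIX succeeded
	hasLeader  bool // a leader currently holds the election lease
}

// healthz fails the probe when the last sync failed or no leader
// exists, so the kubelet restarts a stuck pod instead of letting it
// idle forever as a candidate that never resumes work.
func healthz(w http.ResponseWriter, r *http.Request) {
	if !controllerState.lastSyncOK || !controllerState.hasLeader {
		http.Error(w, "sync failed or no leader elected", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}
```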

acuteaura commented 7 months ago

#2139 and #2152 fix a bunch of these bugs, but I'm currently stuck trying to figure out why the E2E tests are failing on the second one.

The problem isn't having no leader; it's that run exits when the APISIX API can't be reached, without stepping down from leadership or terminating the process. Before #2139 this meant the controller was stuck waiting for an interrupt while still holding leadership.
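Sketched with client-go's leader election primitives (lock setup elided, and runControllerLoop is a stand-in, not the project's actual function), the behavior I'd expect is:

```go
package sketch

import (
	"context"
	"log"
	"os"
	"time"

	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runControllerLoop is a placeholder for the controller's real run
// logic; assume it returns an error when APISIX can't be reached.
func runControllerLoop(ctx context.Context) error { return nil }

// runWithElection shows the step-down-and-exit behavior: if the run
// loop fails, cancel the election context so ReleaseOnCancel gives up
// the lease cleanly, then exit so Kubernetes restarts the pod.
func runWithElection(lock resourcelock.Interface) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var runErr error
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		ReleaseOnCancel: true,
		LeaseDuration:   15 * time.Second,
		RenewDeadline:   10 * time.Second,
		RetryPeriod:     2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// On failure, cancel the election context so that
				// ReleaseOnCancel releases the lease (a clean step-down).
				if runErr = runControllerLoop(ctx); runErr != nil {
					cancel()
				}
			},
			OnStoppedLeading: func() {
				log.Println("stopped leading")
			},
		},
	})
	// RunOrDie returns once the lease is given up; exit non-zero so
	// Kubernetes restarts the pod instead of letting it idle.
	if runErr != nil {
		os.Exit(1)
	}
}
```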

A possible hotfix would be an init container that checks whether the API is actually available, plus fixed health/readiness checks on APISIX itself. Right now the check is only whether a TCP socket is open, but that doesn't suffice when APISIX is still booting up and rejecting API requests with 503; see the sketch after this paragraph.
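For illustration, an HTTP-based check along these lines would reject the 503-while-booting window that a plain TCP dial accepts (the URL is taken from the logs above; the rest is a sketch, and a real check would also send the X-API-KEY admin key header):

```go
package main

import (
	"log"
	"net/http"
	"os"
	"time"
)

// A sketch of an init-container-style readiness check: unlike a TCP
// dial, it only succeeds once the Admin API actually answers 200,
// not while APISIX is still booting and answering 502/503.
func main() {
	url := "http://apisix-admin.apisix.svc.cluster.local:9180/apisix/admin/routes"
	client := &http.Client{Timeout: 5 * time.Second}
	for i := 0; i < 60; i++ {
		resp, err := client.Get(url)
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return // the Admin API is genuinely ready
			}
			log.Printf("admin API not ready yet: status %d", resp.StatusCode)
		} else {
			log.Printf("admin API not reachable yet: %v", err)
		}
		time.Sleep(2 * time.Second)
	}
	os.Exit(1) // give up; Kubernetes will restart the init container
}
```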

acuteaura commented 7 months ago

Specifically, the bug we're seeing, and that you are likely seeing too, is here.

flassagn commented 6 months ago

Hi,

First, I would like to thank the people who are currently working on this. I would also like to kindly ask whether there is any update here. :bow:

acuteaura commented 6 months ago

Keep an eye on #2152, I'm ready to pick up work on it as soon as someone figures out why the E2E tests don't like that PR.

VelorumS commented 5 months ago

Even if APISIX is started long before the controller, I'm getting the "context canceled" errors when relaunching the controller.

If I get into the controller's Pod and make the same requests with curl manually, I get the correct responses.

The bug isn't random for me; the controller can't start working at all.

Ben10k commented 4 months ago

Is there an update on this? We have recently started seeing this issue pop up more and more on our clusters that use spot instances, so pods get recreated pretty often. We are running the latest version (1.8.2). I am happy to help with whatever is necessary to move this forward.

acuteaura commented 4 months ago

You could patch in an os.Exit on the TODO line in Controller.run inside pkg/providers/controller.go for now, or even submit that as a PR; see the sketch below. That would certainly be better than the status quo of ignoring failed initialization and just not running the controller. I'd much prefer that leader election weren't "both controllers do everything, but one skips writes (hopefully everywhere)", because it's a nice footgun for contributors who might not be aware of the bigger picture, but I don't have the resources to reproduce why the somewhat fragile E2E test suite (which unfortunately also somewhat relies on running inside Actions) fails on that one.
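The shape of that stopgap, with stand-in names (the real function and its surrounding code in pkg/providers/controller.go differ):

```go
package sketch

import (
	"context"
	"errors"
	"log"
	"os"
)

// waitDefaultClusterReady is a hypothetical stand-in for the
// controller's readiness wait, not the project's actual function.
func waitDefaultClusterReady(ctx context.Context) error {
	return errors.New("unexpected status code 503")
}

// run shows the stopgap: where the current code logs the failed wait
// on the TODO line and carries on doing nothing, exit instead so
// Kubernetes restarts the pod and leadership is re-acquired cleanly.
func run(ctx context.Context) {
	if err := waitDefaultClusterReady(ctx); err != nil {
		log.Printf("failed to wait the default cluster to be ready: %s", err)
		os.Exit(1)
	}
	// ... start the resource controllers only once the cluster is ready ...
}
```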

Ben10k commented 4 months ago

I do agree that it would be better for the non-leader to do nothing at all instead of what it does now. I'll try to submit your os.Exit change as a new PR here and will try to look into the E2E suite, but I am not very proficient with this codebase, or even Go in general, so I'm not sure I'll be able to figure that one out. Thanks for the quick response.

wofr commented 4 months ago

Looking forward to the fix. We constantly run into this issue whenever our Kubernetes cluster (GKE) performs an update on the cluster where APISIX is running. More or less every time, we have to manually restart the ingress controller to get things working again.

ssingh3339 commented 1 month ago

We also ran into a similar issue in prod on 1.7.1.