Open mflendrich opened 2 years ago
Do we indeed want this as an onboarding task? IIRC this was more to track that we're still using the sleep for lack of a great alternative and to collect any reports where the sleep is either insufficient or causes some unexpected issue.
https://github.com/Kong/kubernetes-ingress-controller/issues/2249 (review of the problem and non-viable options for fixing it) and https://github.com/Kong/kubernetes-ingress-controller/pull/2250 (a PoC implementation to demonstrate why the known viable option is not ideal) show prior work, but to be clear the proposal here is basically something we'd like to avoid in order to preserve separation between the controllers, even if it works in a technical sense. This issue says "do this" but this should probably still be a spike to try and come up with alternatives.
Indeed, I don't think we should use this as an onboarding task, we should mark this one as a nice to have.
> collect any reports where the sleep is either insufficient or causes some unexpected issue
Sleeping is not a correct synchronization mechanism and as such IMO does not require further justification for fixing (for example, changes to unrelated parts of the code may unexpectedly change how long the cache warmup takes). Whether it is likely to show visible outcomes in the predictable future is unknown (maybe it's unlikely, but we cannot know that for sure). Issues like this add up and result in surprising bugs across the codebase. I'm happy to be convinced otherwise, though.
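To illustrate the point about sleeps being fragile, here is a minimal, self-contained Go sketch (not KIC code; `warmCache` and the durations are invented for illustration). A fixed sleep only "works" while warmup happens to finish faster than the sleep; an explicit signal from the warmup itself is correct regardless of timing.

```go
package main

import (
	"fmt"
	"time"
)

// warmCache simulates a cache warmup whose duration varies with cluster size.
// When done, it closes the synced channel so waiters get an explicit signal.
func warmCache(d time.Duration, synced chan<- struct{}) {
	time.Sleep(d) // stand-in for listing/watching many resources
	close(synced)
}

func main() {
	synced := make(chan struct{})
	go warmCache(50*time.Millisecond, synced)

	// Fragile: a fixed sleep is only sufficient while warmup stays fast.
	time.Sleep(10 * time.Millisecond)
	select {
	case <-synced:
		fmt.Println("sleep: cache ready")
	default:
		fmt.Println("sleep: cache NOT ready, first sync would be incomplete")
	}

	// Correct: block until the warmup itself reports completion.
	<-synced
	fmt.Println("signal: cache ready")
}
```

Any change that slows warmup (more resources, slower API server) silently breaks the sleep-based variant, while the channel-based wait is unaffected.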
> This issue says "do this" but this should probably still be a spike to try and come up with alternatives.
Good point. Would making this a spike as @rainest proposed make it a decent onboarding exploratory (based on the prior art) task?
> we should mark this one as a nice to have
To me, personally, "nice to have" means something like "this is a feature that is not likely to draw our attention ever". We could think that fixing any technical debt is a "nice to have" (and I'm viewing this issue as pure technical debt). Do we want to apply "nice to have" to technical debt items that pose only a maintenance problem? I see this label more on the functional side.
To summarize, here's my thinking/alternative proposal:
WDYT?
As long as it's clear that this is a spike and the work that needs to be done first is to propose a solution, not immediately implement one, that's fine with me. Also, you can have it marked as both `area/debt` and `nice-to-have` at the same time.
I have encountered related issues when testing in our prod env.
Kong 2.8.0 db-less
Kong ingress controller 2.4.2
EKS 1.18
Our prod EKS env has 1000+ Services; I configured a single Ingress using a new IngressClass.
After starting a Kong pod, requests to it return 503 ("failure to get a peer from the ring-balancer"); after a while, maybe 1 or 2 minutes, responses recover to 200.
Debugging via the Kong admin API, I found there is 1 route and 1 service, but no target in the related upstream when the error occurred.
Ingress Controller log:

```
time="2022-07-28T06:14:52Z" level=warning msg="no targets found to create upstream" service_name=deploy.frontlogsrv.8080
time="2022-07-28T06:14:49Z" level=warning msg="no targets found to create upstream" service_name=deploy.frontlogsrv.8080
time="2022-07-28T06:14:46Z" level=warning msg="no targets found to create upstream" service_name=deploy.frontlogsrv.8080
time="2022-07-28T06:14:43Z" level=warning msg="no targets found to create upstream" service_name=deploy.frontlogsrv.8080
time="2022-07-28T06:14:40Z" level=warning msg="no targets found to create upstream" service_name=deploy.frontlogsrv.8080
time="2022-07-28T06:14:37Z" level=warning msg="no targets found to create upstream" service_name=deploy.frontlogsrv.8080
```
The Service is running normally in EKS, and the error occurs consistently in our scenario.
Is there any plan or release version to fix it?
@jiachinzhao that's not this issue; if it was you wouldn't have any service or route at all. That error generally means that your Service has no Endpoints because there are no Pods ready to serve it.
If you do still think there's an issue with the controller not properly seeing Endpoints it should, please open a separate issue.
OK, I will open a separate issue to explain.
Another related issue.
Steps to reproduce: start a Kong pod, then execute test-new.sh.
The script checks whether the ingress controller reports ready before or after Kong's routes sync.
```sh
ip=$(kubectl get pod -o wide -n kong | grep 'kong-new' | awk '{print $6}')
echo "$ip"
while true
do
  date
  echo -e "\ningress controller healthz"
  curl -v "$ip:10254/healthz"
  echo -e "\nkong routes"
  curl "$ip:8001/routes"
  echo -e "\nrequest pong"
  curl -H 'host: kong-new.test.io' "$ip:8000/v1/frontlogsrv/pong"
  sleep 1
done
```
The script output shows that the ingress controller's healthz reports ready before Kong's routes have synced.
This leads to 404 responses when traffic is proxied to this pod.
In ingress controller 1.3.x, however, healthz only reported ready after waiting for the cache sync to succeed.
I thought that we had run into a startup/sync bug caused by this sleep but it actually turned out that the chart was using the wrong readiness endpoint: https://github.com/Kong/charts/pull/716
Hi @dcarley 👋
The problem you're trying to solve is not related to this bug. What the readyz endpoint does in KIC as of now is return true once the first sync (i.e. sending Kong objects, parsed and translated from Kubernetes resources, to the Kong instance(s)) has already happened.
Sometimes this sync can send fewer objects than we'd intend because the parser https://github.com/Kong/kubernetes-ingress-controller/blob/91bd96dd47a7d15dcc340d6d2c032ddd1a8ab22d/internal/dataplane/parser/parser.go hasn't managed to fill its caches yet. This is what https://github.com/Kong/kubernetes-ingress-controller/pull/2250 tried to remediate, by using the manager's client to fetch those k8s resources directly, bypassing the parser cache(s).
Your fix in https://github.com/Kong/charts/pull/716 still makes sense, but it doesn't solve the issue described here.
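As a rough illustration of the distinction being made here, the following stdlib-only Go sketch gates a readiness endpoint on a "first full sync done" flag. This is not KIC's actual implementation; `firstSyncDone` and the handler are hypothetical names, and the sketch assumes the controller would set the flag only after caches are filled and the first push to Kong succeeded.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// firstSyncDone is a hypothetical flag the controller would set only after
// the parser caches are filled AND the first config push to Kong succeeded.
var firstSyncDone atomic.Bool

// readyz reports ready only once that full first sync has happened, rather
// than merely "a sync happened" (which may have been built from cold caches).
func readyz(w http.ResponseWriter, r *http.Request) {
	if !firstSyncDone.Load() {
		http.Error(w, "first sync not complete", http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintln(w, "ok")
}

func main() {
	srv := httptest.NewServer(http.HandlerFunc(readyz))
	defer srv.Close()

	resp, _ := http.Get(srv.URL)
	fmt.Println("before first sync:", resp.StatusCode)

	firstSyncDone.Store(true)
	resp, _ = http.Get(srv.URL)
	fmt.Println("after first sync:", resp.StatusCode)
}
```

The design point: readiness should be derived from the event that actually matters (a complete first sync), not from the passage of time or from an unrelated health signal.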
Is there an existing issue for this?
Problem Statement
In order to mitigate #2249, https://github.com/Kong/kubernetes-ingress-controller/pull/2255 has added a 5s sleep before the first sync.
Proposed Solution
Replace the sleep with the solution proposed in https://github.com/Kong/kubernetes-ingress-controller/issues/2249#issuecomment-1031987608
Additional information
No response
Acceptance Criteria