Open mflendrich opened 2 years ago
Do we indeed want this as an onboarding task? IIRC this was more to track that we're still using the sleep for lack of a great alternative and to collect any reports where the sleep is either insufficient or causes some unexpected issue.
https://github.com/Kong/kubernetes-ingress-controller/issues/2249 (review of the problem and non-viable options for fixing it) and https://github.com/Kong/kubernetes-ingress-controller/pull/2250 (a PoC implementation to demonstrate why the known viable option is not ideal) show prior work, but to be clear the proposal here is basically something we'd like to avoid in order to preserve separation between the controllers, even if it works in a technical sense. This issue says "do this" but this should probably still be a spike to try and come up with alternatives.
Indeed, I don't think we should use this as an onboarding task, we should mark this one as a nice to have.
> collect any reports where the sleep is either insufficient or causes some unexpected issue
Sleeping is not a correct synchronization mechanism and as such IMO does not require further justification for fixing (for example, changes to unrelated parts of the code may unexpectedly change how long the cache warmup takes). Whether it is likely to show visible outcomes in the predictable future is unknown (maybe it's unlikely, but we cannot know that for sure). Issues like this add up and result in surprising bugs across the codebase. I'm happy to be convinced otherwise, though.
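To illustrate the point about sleeps being fragile, here is a minimal, self-contained Go sketch (not KIC code; `warmCache` and the durations are invented for illustration). A fixed sleep only "works" while warmup happens to finish faster than the sleep; an explicit signal from the warmup itself is correct regardless of timing.

```go
package main

import (
	"fmt"
	"time"
)

// warmCache simulates a cache warmup whose duration varies with cluster size.
// When done, it closes the synced channel so waiters get an explicit signal.
func warmCache(d time.Duration, synced chan<- struct{}) {
	time.Sleep(d) // stand-in for listing/watching many resources
	close(synced)
}

func main() {
	synced := make(chan struct{})
	go warmCache(50*time.Millisecond, synced)

	// Fragile: a fixed sleep is only sufficient while warmup stays fast.
	time.Sleep(10 * time.Millisecond)
	select {
	case <-synced:
		fmt.Println("sleep: cache ready")
	default:
		fmt.Println("sleep: cache NOT ready, first sync would be incomplete")
	}

	// Correct: block until the warmup itself reports completion.
	<-synced
	fmt.Println("signal: cache ready")
}
```

Any change that slows warmup (more resources, slower API server) silently breaks the sleep-based variant, while the channel-based wait is unaffected.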
> This issue says "do this" but this should probably still be a spike to try and come up with alternatives.
Good point. Would making this a spike as @rainest proposed make it a decent onboarding exploratory (based on the prior art) task?
> we should mark this one as a nice to have
To me, personally, "nice to have" means something like "this is a feature that is not likely to draw our attention ever". We could think that fixing any technical debt is a "nice to have" (and I'm viewing this issue as pure technical debt). Do we want to apply "nice to have" to technical debt items that pose only a maintenance problem? I see this label more on the functional side.
To summarize, here's my thinking/alternative proposal:
WDYT?
As long as it's clear that this is a spike and the work that needs to be done first is to propose a solution, not immediately implement one, that's fine with me. Also, you can have it marked as both `area/debt` and `nice-to-have` at the same time.
I have encountered related issues when testing in our prod env.
Kong 2.8.0 db-less
Kong ingress controller 2.4.2
EKS 1.18
Our prod EKS env has 1000+ Services; I configured a single Ingress using a new IngressClass.
After starting a Kong pod, requests to it return 503 ("failure to get a peer from the ring-balancer"); after a while, maybe 1 or 2 minutes, responses recover to 200.
Debugging via the Kong admin API, I found there is 1 route and 1 service, but no target in the related upstream when the error occurred.
Ingress Controller log:

```
time="2022-07-28T06:14:52Z" level=warning msg="no targets found to create upstream" service_name=deploy.frontlogsrv.8080
time="2022-07-28T06:14:49Z" level=warning msg="no targets found to create upstream" service_name=deploy.frontlogsrv.8080
time="2022-07-28T06:14:46Z" level=warning msg="no targets found to create upstream" service_name=deploy.frontlogsrv.8080
time="2022-07-28T06:14:43Z" level=warning msg="no targets found to create upstream" service_name=deploy.frontlogsrv.8080
time="2022-07-28T06:14:40Z" level=warning msg="no targets found to create upstream" service_name=deploy.frontlogsrv.8080
time="2022-07-28T06:14:37Z" level=warning msg="no targets found to create upstream" service_name=deploy.frontlogsrv.8080
```
The Service is running normally in EKS, and the error occurs consistently in our scenario.
Is there any plan or release version to fix it?
@jiachinzhao that's not this issue; if it was you wouldn't have any service or route at all. That error generally means that your Service has no Endpoints because there are no Pods ready to serve it.
If you do still think there's an issue with the controller not properly seeing Endpoints it should, please open a separate issue.
OK, I will open a separate issue to explain.
Another related issue.
Steps to reproduce: start a Kong pod, then execute test-new.sh.
The script checks whether the ingress controller reports ready before or after Kong's routes sync.
```sh
ip=$(kubectl get pod -o wide -n kong | grep 'kong-new' | awk '{print $6}')
echo "$ip"
while true
do
  date
  echo -e "\ningress controller healthz"
  curl -v "$ip:10254/healthz"
  echo -e "\nkong routes"
  curl "$ip:8001/routes"
  echo -e "\nrequest pong"
  curl -H 'host: kong-new.test.io' "$ip:8000/v1/frontlogsrv/pong"
  sleep 1
done
```
The script output shows that the ingress controller's healthz reports ready before Kong's routes have synced.
This leads to 404 responses when traffic is proxied to this pod.
In ingress controller 1.3.x, however, healthz only reported ready after waiting for the cache sync to succeed.
I thought that we had run into a startup/sync bug caused by this sleep but it actually turned out that the chart was using the wrong readiness endpoint: https://github.com/Kong/charts/pull/716
Hi @dcarley 👋
The problem you're trying to solve is not related to this bug. What the readyz endpoint does in KIC as of now is return true once the first sync (i.e. sending Kong objects, parsed and translated from Kubernetes resources, to the Kong instance(s)) has already happened.
Sometimes this sync can send fewer objects than we'd intend because the parser https://github.com/Kong/kubernetes-ingress-controller/blob/91bd96dd47a7d15dcc340d6d2c032ddd1a8ab22d/internal/dataplane/parser/parser.go hasn't managed to fill its caches yet. This is what https://github.com/Kong/kubernetes-ingress-controller/pull/2250 tried to remediate, by using the manager's client to fetch those k8s resources directly, bypassing the parser cache(s).
Your fix in https://github.com/Kong/charts/pull/716 still makes sense, but it doesn't solve the issue described here.
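As a rough illustration of the distinction being made here, the following stdlib-only Go sketch gates a readiness endpoint on a "first full sync done" flag. This is not KIC's actual implementation; `firstSyncDone` and the handler are hypothetical names, and the sketch assumes the controller would set the flag only after caches are filled and the first push to Kong succeeded.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// firstSyncDone is a hypothetical flag the controller would set only after
// the parser caches are filled AND the first config push to Kong succeeded.
var firstSyncDone atomic.Bool

// readyz reports ready only once that full first sync has happened, rather
// than merely "a sync happened" (which may have been built from cold caches).
func readyz(w http.ResponseWriter, r *http.Request) {
	if !firstSyncDone.Load() {
		http.Error(w, "first sync not complete", http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintln(w, "ok")
}

func main() {
	srv := httptest.NewServer(http.HandlerFunc(readyz))
	defer srv.Close()

	resp, _ := http.Get(srv.URL)
	fmt.Println("before first sync:", resp.StatusCode)

	firstSyncDone.Store(true)
	resp, _ = http.Get(srv.URL)
	fmt.Println("after first sync:", resp.StatusCode)
}
```

The design point: readiness should be derived from the event that actually matters (a complete first sync), not from the passage of time or from an unrelated health signal.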
Is there an existing issue for this?
Problem Statement
In order to mitigate #2249, https://github.com/Kong/kubernetes-ingress-controller/pull/2255 has added a 5s sleep before the first sync.
Proposed Solution
Replace the sleep with the solution proposed in https://github.com/Kong/kubernetes-ingress-controller/issues/2249#issuecomment-1031987608
Additional information
No response
Acceptance Criteria