Open · gogiraghu1 opened this issue 5 years ago
Although undesirable, this isn't terribly surprising.
If we take ingress controllers out of the equation for a moment...
The Osiris endpoints controller is constantly watching the pods for every Osiris-enabled service and when no ready pods exist for a given service, it substitutes activator endpoints for application pod endpoints.
As the last pod terminates, some very small interval of time may elapse between that pod reaching an unready state and the Osiris endpoints controller reacting to that by substituting activator endpoints. If any traffic reaches the service in that moment, a 502 would occur.
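Purely to illustrate (this is not the actual Osiris code; `chooseAddresses` and its arguments are made-up names), the decision the endpoints controller is effectively making on every sync looks like this:

```go
package sketch

import corev1 "k8s.io/api/core/v1"

// chooseAddresses is a hypothetical distillation of the endpoints
// controller's per-sync decision: if any application pods are ready,
// expose them; otherwise substitute the activator's address(es) so the
// next request can trigger scale-from-zero. The 502 window is the gap
// between the last pod going unready and this substitution being
// reflected in the service's endpoints.
func chooseAddresses(readyAppPods, activator []corev1.EndpointAddress) []corev1.EndpointAddress {
	if len(readyAppPods) > 0 {
		return readyAppPods
	}
	return activator
}
```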
Now add your ingress controller to the mix. Ingress controllers (Nginx being no exception, if I recall correctly) do not generally forward traffic to services. Instead, they watch a service's endpoints and forward directly to those. This compounds the tiny delay described above: it may take the endpoints controller a brief moment to update the service's endpoints (to activator endpoints), and it may take the ingress controller another brief moment to sync and learn about those new endpoints. i.e. The ingress controller experiences some very small period of time wherein all the endpoints it knows about are invalid.
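Again, only to illustrate the mechanism (this is not nginx's actual code; `watchEndpoints` and `rebuildUpstreams` are invented for the example), an endpoint-watching ingress controller is essentially running a second watch/sync loop of its own against the Endpoints resource, which is where the extra delay comes from:

```go
package ingresswatch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// watchEndpoints sketches, in very simplified form, what endpoint-watching
// ingress controllers do: rather than forwarding to the Service VIP, they
// watch Endpoints and rebuild their upstream list on every change. Until
// that second sync happens, they keep forwarding to addresses that may no
// longer exist.
func watchEndpoints(client kubernetes.Interface, stop <-chan struct{}, rebuildUpstreams func(*corev1.Endpoints)) {
	factory := informers.NewSharedInformerFactory(client, 0)
	informer := factory.Core().V1().Endpoints().Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(_, newObj interface{}) {
			if ep, ok := newObj.(*corev1.Endpoints); ok {
				rebuildUpstreams(ep)
			}
		},
	})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
}
```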
What can be done about this?
You can mitigate it by not involving an ingress controller in the flow of traffic to Osiris-enabled services. It would probably reduce the window of opportunity for this to occur by half. I'm not saying you should do this. I'm just presenting it as an option.
What, if anything, can be done with Osiris itself to mitigate this? You asked about retry logic. Where would that be accomplished? The requests in question, which actually receive the 502 response, never flow through any Osiris component.
Off the top of my head, the only thing that comes to mind is that the zeroscaler could do a soft-delete of the last remaining pod(s) before setting the number of desired replicas to zero. This could be done, for instance, by labeling the pod(s) as defunct or "scheduled for termination." The endpoints controller could be made to disregard such pods. What this would achieve is that, if there is a small delay in the endpoints controller learning that a pod is scheduled for termination, that pod at least remains ready for a few seconds longer while its endpoints might still receive traffic. (But if those endpoints do receive traffic in that moment, shouldn't scale-to-zero be aborted? And how?)
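To make that a bit more concrete, here is a rough sketch of what the zeroscaler might do. Nothing here exists in Osiris today; the label key, the function names, and the fixed grace period are invented for the example:

```go
package zeroscaler

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// defunctLabelKey is a hypothetical label key; it is not part of Osiris today.
const defunctLabelKey = "osiris-scheduled-for-termination"

// softDeleteThenScaleToZero sketches the idea described above: label the
// last remaining pod(s) as defunct, give the endpoints controller (which
// would be taught to skip labeled pods) a grace period to substitute
// activator endpoints, and only then set desired replicas to zero. Error
// handling and the race with a concurrent scale-from-zero are
// deliberately ignored here.
func softDeleteThenScaleToZero(ctx context.Context, client kubernetes.Interface, namespace string, podNames []string, grace time.Duration, scaleToZero func() error) error {
	patch := []byte(`{"metadata":{"labels":{"` + defunctLabelKey + `":"true"}}}`)
	for _, name := range podNames {
		if _, err := client.CoreV1().Pods(namespace).Patch(ctx, name, types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
			return err
		}
	}
	// Wait for the endpoints controller to notice the label and swap in
	// activator endpoints while the pod(s) are still ready.
	time.Sleep(grace)
	return scaleToZero()
}
```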
This is a pretty substantial change, however, and I'm certain it will lead to some race conditions that will have to be accounted for. For instance, what happens when the zeroscaler has soft-deleted a pod and is waiting n seconds before setting desired replicas to zero, but in the meantime the endpoints controller has substituted activator endpoints, the activator has already received a request, and scale-from-zero has been initiated? We should end up with one pod (or the configured minimum), but we could end up with zero. There may be ways to mitigate this, but it's still a complex new scenario to deal with. I also wouldn't be surprised if there are more edge cases to look out for that haven't already come to mind.
Before choosing to do that, I have to ask whether the issue you've described is likely to be a problem under real conditions. I have been able to simulate this problem (before you even reported it), but only under very contrived conditions. Remember that scale-to-zero happens when a service has received no traffic for a long time. The likelihood that you receive no traffic for minutes and then receive a request in the split second where a 502 could occur seems small, unless you're deliberately trying to hit that window.
Basically, I am weighing the seriousness of the issue against the complexity of a possible solution and the new issues it is likely to create.
I'd welcome your input, of course.
Hi,
I think the root cause of the problem is in the endpoints controller, which keeps "Terminating" pods in the endpoints until they are actually deleted. This behaviour not only impacts ingress controllers listening to endpoint changes but potentially any service client. Even when clients reach the service VIP directly, the pod they end up hitting is still selected from the endpoints, which are synchronised through the k8s API.
To let the system react to those changes, the upstream endpoints controller removes any terminating pods from the endpoints by default. As soon as pods start terminating, they are removed from the endpoints even though they may still be able to accept and process requests (https://github.com/kubernetes/kubernetes/blob/25651408aeadf38c3df7ea8c760e7519fd37d625/pkg/controller/endpoint/endpoints_controller.go#L420, added here: https://github.com/kubernetes/kubernetes/commit/2aaf8bddc253737fefb77e4d0306872c553ddbde#diff-a1a9c0efe93384ed9a010e65e8ad4604R316).
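The check amounts to something like this (a simplified sketch in the spirit of the linked code, not the upstream implementation verbatim):

```go
package endpoints

import corev1 "k8s.io/api/core/v1"

// shouldIncludePod is a simplified sketch of the filter the upstream
// endpoints controller applies: pods without an IP are skipped, and pods
// that have been marked for deletion (non-nil DeletionTimestamp) are
// dropped from the endpoints even if they are still Ready and able to
// serve. The suggestion is that Osiris's custom endpoints controller
// apply the same filter.
func shouldIncludePod(pod *corev1.Pod) bool {
	if len(pod.Status.PodIP) == 0 {
		// No IP assigned yet; there is nothing to route to.
		return false
	}
	if pod.DeletionTimestamp != nil {
		// Terminating: drop it so clients stop sending it traffic.
		return false
	}
	return true
}
```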
IMHO the custom endpoint controller should have the same behaviour. Opening a PR for this
Environment:
- Kubernetes version (use `kubectl version`): v1.10.11
- Install tools (e.g. the `helm install` command used): helm

What happened?
When the zero scaler terminates pods after the metrics interval threshold is reached and a new request comes in at the same time, we run into 502 errors: either traffic is reaching the pods being terminated, or the pods being activated take time to bring their containers up.

What you expected to happen?
The request should never fail even while the zero scaler is terminating pods, i.e. there should be some retry or timeout mechanism that can recover from the 502 errors.

How to reproduce it (as minimally and precisely as possible):
This happens with the nginx ingress controller lbr endpoint when we configure a domain name using an annotation in the service yaml.
1. Set up an nginx-ingress-controller-based lbr with an ingress configured for the service endpoint.
2. Wait for the zero scaler's pod termination to kick off after the metrics interval threshold is reached.
3. While the pods are being terminated, submit a request through the lbr. You will see 502 errors. It is easily reproducible.

Anything else that we need to know?
Are there any kubernetes ingress-level annotations which can avoid the 502 errors and still be able to retry and time out for this scenario?