ping @krancour thoughts?
I can't promise I'll get to the bottom of this very quickly, but let's start here...
Can you confirm the external IP for the router is shown to be different from that of your flask app? (e.g. not "x.y.z.x")
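For reference, one quick way to compare the two (service names and namespaces are taken from the describe output later in this thread) is to look at the EXTERNAL-IP column of each service:

$ kubectl get svc backend --namespace=flask-example
$ kubectl get svc deis-router --namespace=deis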
@krancour Yeah it's a tricky one for sure. It took me about two days to figure out that the traffic was actually routed to Deis.
As for your question: yup, the external IP for the deis-router is different from that of the flask app.
Also, just as an FYI - I did test this on two different clusters on GCE and was able to reproduce it on both.
Ok... so my first two thoughts:
1. This could be a problem in GCP's networking. Personally, I find GCP networking difficult to understand in comparison to other clouds, and I'm not very eager to dig into it... but difficult though I may find it, I think it's also unlikely that it could be malfunctioning to this extent.
2. If we start with the premise that no. 1 isn't likely, the problem is in k8s, or more likely still, specific to configuration / resources in your cluster. Let's explore this further first...
Next time you're able to observe this happening, could you please do a kubectl describe on the flask app's service? Among other things, it should show you the endpoints that are in service. These would be private IPs in the cluster's overlay network, i.e. the IPs of individual pods. I'd be interested in knowing whether you can correlate one of the in-service endpoint IPs to the IP of a running router pod.
Let's start there. If there is indeed a router pod taking some of the traffic for that service, then we can turn our attention to figuring out how that pod got selected by the service.
@krancour sure thing:
$ kubectl describe svc backend --namespace=flask-example
Name: backend
Namespace: flask-example
Labels: <none>
Annotations: <none>
Selector: app=backend
Type: LoadBalancer
IP: 10.207.240.95
LoadBalancer Ingress: 104.155.150.163
Port: http 80/TCP
NodePort: http 32110/TCP
Endpoints: 10.204.0.35:5000,10.204.2.31:5000
Session Affinity: None
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
2m 2m 1 service-controller Normal CreatingLoadBalancer Creating load balancer
46s 46s 1 service-controller Normal CreatedLoadBalancer Created load balancer
@vpetersson can you get the IP(s) of the router pod(s) as well? Need to compare those to the 10.204.0.35 and 10.204.2.31 seen above.
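For reference, one way to list those (a sketch; the label is assumed from the deis-router service selector shown below):

$ kubectl get pods --namespace=deis -l app=deis-router -o wide

The -o wide output includes each pod's IP, which can then be compared against the service endpoints above.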
$ kubectl describe svc deis-router --namespace=deis
Name: deis-router
Namespace: deis
Labels: heritage=deis
Annotations: service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout=1200
Selector: app=deis-router
Type: LoadBalancer
IP: 10.207.244.45
LoadBalancer Ingress: 104.154.x.y
Port: http 80/TCP
NodePort: http 32071/TCP
Endpoints: 10.204.2.33:8080
Port: https 443/TCP
NodePort: https 30023/TCP
Endpoints: 10.204.2.33:6443
Port: builder 2222/TCP
NodePort: builder 32116/TCP
Endpoints: 10.204.2.33:2222
Port: healthz 9090/TCP
NodePort: healthz 31894/TCP
Endpoints: 10.204.2.33:9090
Session Affinity: None
Events: <none>
No mention of those IPs as far as I can tell, @krancour.
What happens if you shut the router down? (I'm guessing this isn't a production cluster and that you can afford to try that. The easiest would be to just scale the router's deployment down to 0 replicas.) I'm curious whether the periodic 404s will persist (i.e. it's not actually the router serving the 404s), go away entirely, or be replaced with some other issue like a 502 bad gateway or 503 service unavailable.
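For example, something like this (assuming the deployment is named deis-router, matching the pod name seen in the router logs):

$ kubectl scale deployment deis-router --replicas=0 --namespace=deis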
If I scale down the router deployment to 0, the issue goes away and I don't get any 404s.
@vpetersson this issue is perplexing, but let me see if you agree with me on this-- this isn't a router issue. Something between your agent (e.g. curl) and the router (e.g. a GCP load balancer) would appear to be mis-directing traffic to the router. That mis-direction is hardly within the router's control.
Hi @krancour,
It's perplexing indeed. To be honest, I'm not exactly sure where the error stems from. As far as I can tell, it could either be a bug in GCP LB, k8s internals or in the router configuration. I'm not enough of an expert in k8s to tell.
From what I can tell, however, the GCP LB configuration seems to be fine, and so does the k8s configuration. Also, the only way I've been able to reproduce this is with a Deis router. That, of course, doesn't prove that the bug isn't in either of the former two.
@vpetersson the router proxies traffic to apps. If the router receives requests it should not have, something else directed that traffic to the router. I can't see a scenario where the router itself could be responsible for this happening.
If this isn't a production cluster and you can afford to experiment, consider writing a trivial program that listens on port 8080 and responds with some recognizable response. Build that into a Docker image and (in-place) kubectl edit the router's deployment. Change the image to that custom one and remove the readiness and liveness probes. (Don't change any other details of the deployment.) You should then be able to observe (I think) that whatever is mis-directing traffic to the router is still going to be misdirecting traffic to this bespoke router replacement.
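As a rough sketch of such a stand-in (hypothetical code, not part of the router; any recognizable marker text will do), something like this listening on 8080 would be enough:

# stand_in.py: trivial HTTP server on port 8080 that returns a recognizable body
from http.server import BaseHTTPRequestHandler, HTTPServer

class MarkerHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"hello from the stand-in router\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Bind to all interfaces so the pod is reachable from inside the cluster
    HTTPServer(("0.0.0.0", 8080), MarkerHandler).serve_forever()

Build that into an image, swap it in with kubectl edit deployment deis-router --namespace=deis, and any intermittent responses carrying that marker body would confirm the traffic is being steered to the router's pods from outside the router itself.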
My goal here is just to exonerate the router.
@vpetersson have you developed any new insight into how this is happening? For now, I am closing this, because I am certain this issue lies outside the router.
tl;dr: Every Nth request to a Type: LoadBalancer service gets sent to the Deis router, even if the service is unrelated to Deis.

Environment

Steps to reproduce

Make curl requests to the external IP:

$ curl x.y.z.x
Hello World
$ curl x.y.z.x
Hello World
$ curl x.y.z.x
Hello World
$ curl x.y.z.x
Hello World
$ curl x.y.z.x
Hello World
$ curl x.y.z.x
Hello World
$ curl x.y.z.x
404 Not Found

$ kubectl logs deis-router-3574011047-7z6mr --namespace=deis | grep 404
[...]
[2017-07-20T09:41:02+00:00] - router-default-vhost - 93.x.x.x - - - 404 - "GET / HTTP/1.1" - 324 - "-" - "curl/7.51.0" - "" - - - 104.x.x.x - - - 0.000
[2017-07-20T09:41:18+00:00] - router-default-vhost - 93.x.x.x - - - 404 - "GET / HTTP/1.1" - 324 - "-" - "curl/7.51.0" - "" - - - 104.x.x.x - - - 0.000