deis / router

Edge router for Deis Workflow
https://deis.com
MIT License

Deis router picks up non-Deis traffic #351

Closed. vpetersson closed this issue 7 years ago.

vpetersson commented 7 years ago

tl;dr: Every Nth request to a service of Type: LoadBalancer gets sent to the Deis router, even when the service has nothing to do with Deis.

Environment

Steps to reproduce

$ curl x.y.z.x
Hello World
$ curl x.y.z.x
Hello World
$ curl x.y.z.x
Hello World
$ curl x.y.z.x
Hello World
$ curl x.y.z.x
Hello World
$ curl x.y.z.x
Hello World
$ curl x.y.z.x
404 Not Found
404 Not Found
nginx/1.13.0

As you can see above, every Nth request gets a 404. Digging into this, we can find the corresponding 404s in the Deis router's logs, despite the service having nothing to do with Deis:

$ kubectl logs deis-router-3574011047-7z6mr --namespace=deis | grep 404
[...]
[2017-07-20T09:41:02+00:00] - router-default-vhost - 93.x.x.x - - - 404 - "GET / HTTP/1.1" - 324 - "-" - "curl/7.51.0" - "" - - - 104.x.x.x - - - 0.000
[2017-07-20T09:41:18+00:00] - router-default-vhost - 93.x.x.x - - - 404 - "GET / HTTP/1.1" - 324 - "-" - "curl/7.51.0" - "" - - - 104.x.x.x - - - 0.000


Interestingly, however, if we delete the deis-router pod and let the deployment recreate the router, the 404s go away.

jchauncey commented 7 years ago

ping @krancour thoughts?

krancour commented 7 years ago

I can't promise I'll get to the bottom of this very quickly, but let's start here...

Can you confirm that the external IP shown for the router is different from that of your flask app? (e.g. not "x.y.z.x")
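
One quick way to compare the two (a generic sketch; substitute your flask app's actual service name and namespace) is to list both services and compare the EXTERNAL-IP column:

$ kubectl get svc deis-router --namespace=deis
$ kubectl get svc <your-flask-service> --namespace=<your-namespace>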

vpetersson commented 7 years ago

@krancour Yeah, it's a tricky one for sure. It took me about two days to figure out that the traffic was actually being routed to Deis.

In terms of your question: yup, the external IP for the deis-router is different from the flask app's.

vpetersson commented 7 years ago

Also, just as an FYI - I did test this on two different clusters on GCE and was able to reproduce it on both.

krancour commented 7 years ago

Ok... so my first two thoughts:

  1. This could be a problem in GCP's networking. Personally, I find GCP networking difficult to understand in comparison to other clouds, and I'm not very eager to dig into it... but difficult though I may find it, I think it's also unlikely that it could be malfunctioning to this extent.

  2. If we start with the premise that no. 1 isn't likely, then the problem is in k8s, or, more likely still, specific to configuration / resources in your cluster. Let's explore this further first...

Next time you're able to observe this happening, could you please do a kubectl describe on the flask app's service? Among other things, it should show you the endpoints that are in service. These would be private IPs on the cluster's overlay network, i.e. the IPs of individual pods. I'd be interested in knowing whether you can correlate one of the in-service endpoint IPs to the IP of a running router pod.

Let's start there. If there is indeed a router pod taking some of the traffic for that service, then we can turn our attention to figuring out how that pod got selected by the service.
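
If you want to pull the router pod IPs for that comparison, something like this should work (a sketch; it assumes the router pods carry the app=deis-router label, which is what the router service's selector uses):

$ kubectl get pods --namespace=deis -l app=deis-router -o wide

The IP column there is the pod's overlay-network IP, which is what would show up as an endpoint of a service.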

vpetersson commented 7 years ago

@krancour sure thing:

$ kubectl describe svc backend --namespace=flask-example
Name:                   backend
Namespace:              flask-example
Labels:                 <none>
Annotations:            <none>
Selector:               app=backend
Type:                   LoadBalancer
IP:                     10.207.240.95
LoadBalancer Ingress:   104.155.150.163
Port:                   http    80/TCP
NodePort:               http    32110/TCP
Endpoints:              10.204.0.35:5000,10.204.2.31:5000
Session Affinity:       None
Events:
  FirstSeen     LastSeen        Count   From                    SubObjectPath   Type            Reason                  Message
  ---------     --------        -----   ----                    -------------   --------        ------                  -------
  2m            2m              1       service-controller                      Normal          CreatingLoadBalancer    Creating load balancer
  46s           46s             1       service-controller                      Normal          CreatedLoadBalancer     Created load balancer
krancour commented 7 years ago

@vpetersson can you get the IP(s) of the router pod(s) as well? Need to compare those to the 10.204.0.35 and 10.204.2.31 seen above.

vpetersson commented 7 years ago

$ kubectl describe svc deis-router --namespace=deis
Name:                   deis-router
Namespace:              deis
Labels:                 heritage=deis
Annotations:            service.beta.kubernetes.io/aws-load-balancer-connection-idle-timeout=1200
Selector:               app=deis-router
Type:                   LoadBalancer
IP:                     10.207.244.45
LoadBalancer Ingress:   104.154.x.y
Port:                   http    80/TCP
NodePort:               http    32071/TCP
Endpoints:              10.204.2.33:8080
Port:                   https   443/TCP
NodePort:               https   30023/TCP
Endpoints:              10.204.2.33:6443
Port:                   builder 2222/TCP
NodePort:               builder 32116/TCP
Endpoints:              10.204.2.33:2222
Port:                   healthz 9090/TCP
NodePort:               healthz 31894/TCP
Endpoints:              10.204.2.33:9090
Session Affinity:       None
Events:                 <none>
vpetersson commented 7 years ago

No mention of those IPs as far as I can tell, @krancour.

krancour commented 7 years ago

What happens if you shut the router down? (I'm guessing this isn't a production cluster and that you can afford to try that. The easiest way would be to scale the router's deployment down to 0 replicas.) I'm curious whether the periodic 404s will persist (i.e. it's not actually the router serving the 404s), go away entirely, or be replaced with some other issue like a 502 Bad Gateway or 503 Service Unavailable.
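
Scaling down would be something like this (assuming the deployment is named deis-router in the deis namespace, as the output above suggests):

$ kubectl scale deployment deis-router --replicas=0 --namespace=deis

Scaling it back to its previous replica count afterwards restores the router.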

vpetersson commented 7 years ago

If I scale down the router deployment to 0, the issue goes away and I don't get any 404s.

krancour commented 7 years ago

@vpetersson this issue is perplexing, but let me see if you agree with me on this: this isn't a router issue. Something between your agent (e.g. curl) and the router (e.g. a GCP load balancer) appears to be misdirecting traffic to the router. That misdirection is hardly within the router's control.

vpetersson commented 7 years ago

Hi @krancour,

It's perplexing indeed. To be honest, I'm not exactly sure where the error stems from. As far as I can tell, it could be a bug in the GCP LB, in k8s internals, or in the router configuration. I'm not enough of an expert in k8s to tell.

From what I can tell, however, the GCP LB configuration seems to be fine, and so does the k8s configuration. Also, the only way I've been able to reproduce this is with the Deis router. That, of course, doesn't prove that the bug isn't in either of the former two.

krancour commented 7 years ago

@vpetersson the router proxies traffic to apps. If the router receives requests it should not have, something else directed that traffic to the router. I can't see a scenario where the router itself could be responsible for this happening.

If this isn't a production cluster and you can afford to experiment, consider writing a trivial program that listens on port 8080 and responds with some recognizable response. Build that into a Docker image and (in-place) kubectl edit the router's deployment. Change the image to that custom one and remove the readiness and liveness probes. (Don't change any other details of the deployment.) You should then be able to observe (I think) that whatever is misdirecting traffic to the router will still be misdirecting traffic to this bespoke router replacement; see the sketch below.
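
A minimal stand-in could look something like this (just a sketch in Go, not actual router code; the response string and whatever image name you build it into are entirely up to you):

// main.go: a throwaway replacement for the router container. It listens on
// port 8080 and answers every request with an easily recognizable string, so
// that any misdirected traffic shows up clearly in curl output.
package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Any request that reaches this pod was sent here by something
		// upstream (the service / load balancer), not by the router itself.
		fmt.Fprintln(w, "hello from the fake router")
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}

Build that into an image, push it somewhere your cluster can pull from, and point the router deployment's container image at it with kubectl edit deployment deis-router --namespace=deis.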

My goal here is just to exonerate the router.

krancour commented 7 years ago

@vpetersson have you developed any new insight into how this is happening? For now, I am closing this, because I am certain this issue lies outside the router.