deis / router

Edge router for Deis Workflow
https://deis.com
MIT License

Random 502 bad gateway #312

Closed robinmonjo closed 7 years ago

robinmonjo commented 7 years ago

Hello all,

Using deis v2.11.0, I am experiencing random 502 errors on the Deis router. I deployed an app and scaled it up to 5 web processes.

So I've got this service:

NAME     CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
my_app   100.65.135.200   <none>        80/TCP    1d

and these endpoints:

Name:       my_app
Namespace:  my_app
Labels:     app=my_app
            heritage=deis
            router.deis.io/routable=true
Subsets:
  Addresses:        10.42.0.12,10.42.0.2,10.43.128.11,10.43.128.12,10.43.128.13
  NotReadyAddresses:    <none>
  Ports:
    Name    Port    Protocol
    ----    ----    --------
    http    3000    TCP

No events.

Everything looks good: nginx on the deis router is properly configured to send requests to my kubernetes service:

[...]
proxy_pass http://100.65.135.200:80;
[...]

I can also curl each endpoint successfully from within the router pod.

However, I regularly get random 502 bad gateway responses when accessing my app (it works most of the time, but 20% of my requests get a 502). Here are some logs from the router:

2017/02/09 14:27:10 [error] 48#0: *8869 connect() failed (113: No route to host) while connecting to upstream, client: 10.42.0.0, server: ~^my_app\.(?<domain>.+)$, request: "GET /en/accounts/1/events HTTP/1.1", upstream: "http://100.65.135.200:80/en/accounts/1/events", host: "my_app.deis.my_host.com"
[2017-02-09T14:27:10+00:00] - my_app - 10.42.0.0 - - - 502 - "GET /en/accounts/1/events HTTP/1.1" - 732 - "-" - "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36" - "~^my_app\x5C.(?<domain>.+)$" - 100.65.135.200:80 - my_app.deis.my_host.com - 2.996 - 2.996
[2017-02-09T14:27:10+00:00] - my_app - 10.42.0.0 - - - 200 - "GET /favicon.ico HTTP/1.1" - 2664 - "http://my_app.deis.my_host.com/en/accounts/1/events" - "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36" - "~^my_app\x5C.(?<domain>.+)$" - 100.65.135.200:80 - my_app.deis.my_host.com - 0.002 - 0.002
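To put a number on how often this happens, the access-log lines can be tallied by status code. Below is a minimal Python sketch. The regular expression is inferred only from the two sample access-log lines above, and the field names (`time`, `app`, `client`, `user`, `status`, `request`) and function names are my own labels, not something taken from the router's configuration, so adjust them if your log format differs.

```python
import re
from collections import Counter

# Pattern inferred from the sample access-log lines in this thread (an assumption,
# not the router's documented format). Group names are illustrative labels.
ACCESS_LINE = re.compile(
    r'^\[(?P<time>[^\]]+)\] - (?P<app>\S+) - (?P<client>\S+) - (?P<user>\S+)'
    r' - (?P<status>\d{3}) - "(?P<request>[^"]*)"'
)


def status_counts(lines):
    """Count (app, status) pairs across router access-log lines."""
    counts = Counter()
    for line in lines:
        match = ACCESS_LINE.match(line)
        if match:  # nginx error-log lines and other text simply do not match
            counts[(match["app"], match["status"])] += 1
    return counts


def error_rate(counts, app, status="502"):
    """Fraction of an app's logged requests that returned the given status."""
    total = sum(n for (name, _), n in counts.items() if name == app)
    return counts[(app, status)] / total if total else 0.0
```

Feeding it the router's log lines and calling `error_rate(counts, "my_app")` gives the share of 502 responses for that app over the captured window.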

I have no idea how to debug this further ...

Regards, Robin

robinmonjo commented 7 years ago

Linking this issue: https://github.com/deis/router/issues/180. It is pretty hard to reproduce, though, as it happens randomly.

robinmonjo commented 7 years ago

I've got more information on this. I scaled the deis router to 2 pods. From within each of the 2 deis router pods, I started a simple loop that curls my service and outputs the status code every ten seconds.
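For anyone who wants to repeat this, the loop described above can be approximated with a short Python sketch. This is not the exact loop I ran (that was plain curl); the function names, the 5-second timeout, and the choice not to follow redirects (so that a 302 is reported as 302) are assumptions made for illustration.

```python
import time
import urllib.error
import urllib.request


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Refuse to follow redirects so the original status code is reported."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def fetch_status(url, timeout=5.0):
    """Return the HTTP status as a string, or an 'error: ...' description."""
    opener = urllib.request.build_opener(NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as resp:
            return str(resp.status)
    except urllib.error.HTTPError as err:  # includes the unfollowed 3xx case
        return str(err.code)
    except OSError as err:  # URLError, timeouts, "No route to host", ...
        return f"error: {getattr(err, 'reason', err)}"


def probe(url, attempts, interval=10.0, fetch=fetch_status, sleep=time.sleep):
    """Request url `attempts` times, `interval` seconds apart, logging each outcome."""
    results = []
    for i in range(attempts):
        outcome = fetch(url)
        print(time.strftime("%H:%M:%S"), url, outcome)
        results.append(outcome)
        if i < attempts - 1:
            sleep(interval)
    return results
```

Calling `probe("http://100.65.135.200/", attempts=360)` from inside a router pod would cover roughly an hour at the ten-second interval. The `fetch` and `sleep` parameters exist only so the loop can be exercised without a network.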

For about 1 hour, everything worked fine and only the expected 302 status code was output. Then I launched a deployment, started to see some failures, and managed to catch one:

$> curl 100.65.135.200 -vv
* Rebuilt URL to: 100.65.135.200/
*   Trying 100.65.135.200...
* connect to 100.65.135.200 port 80 failed: No route to host
* Failed to connect to 100.65.135.200 port 80: No route to host
* Closing connection 0
curl: (7) Failed to connect to 100.65.135.200 port 80: No route to host

So, good news: it doesn't seem to come from the deis router. It looks like my kubernetes service fails sometimes, and more frequently when the endpoints have changed recently. I have liveness and readiness probes set on my app's web processes.

Does that sound like a reasonable conclusion to you? My cluster is a k8s 1.5.2 cluster set up with kops, running on AWS and using the weave network plugin.

krancour commented 7 years ago

I don't know much about weave, but your conclusion seems reasonable. This sounds like a problem upstream from Nginx, so the overlay network, kube-proxy, or some bug in deployments are all possibilities.

robinmonjo commented 7 years ago

OK, thank you. What do you recommend using? I have had good success with flannel, not so much with weave (hence the issue :) ). I don't really want to use the "not software based" networking, as I don't want to have to worry about routing in my VPC route table...

krancour commented 7 years ago

I don't want to be too quick to pin the problem on weave, but since you asked... most of my clusters have used kube-aws from CoreOS. That uses flannel by default and I've never personally witnessed this problem.

krancour commented 7 years ago

Also, more recently, I have created clusters using kops and that uses kubenet by default.

robinmonjo commented 7 years ago

I used to use kube-aws as well but have had a terrible experience lately, since they introduced node pools. I tried kops and really loved it. I'll try another overlay network and see if my problem happens again.

tmedford commented 6 years ago

Robin,

Where did you end up with this? We noticed 502s within our cluster as well, running weave.

robinmonjo commented 6 years ago

We ended up using Flannel, and the problem was solved. The 502 issues at the time were because docker crashed a lot and the weave docker container was unable to restart properly.

I'm surprised to hear someone still has this issue. Is your k8s cluster up to date? This is 100% related to kubernetes services: the service reports that all endpoints are available, but this is not the case.
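One way to check this from inside a router pod is to attempt a TCP connection to the service IP and to each listed endpoint, which helps separate a failing service address from a failing pod address. Here is a minimal Python sketch, assuming the addresses and ports from the service and endpoints output earlier in this thread; the function and constant names are illustrative and not part of any Deis or Kubernetes API.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Try a plain TCP connect; return (True, "") or (False, error text)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, ""
    except OSError as err:  # refused, timed out, "No route to host", ...
        return False, str(err)


def check_targets(targets, timeout=3.0):
    """Return {(host, port): error text} for every target that could not be reached."""
    failures = {}
    for host, port in targets:
        ok, reason = can_connect(host, port, timeout)
        if not ok:
            failures[(host, port)] = reason
    return failures


# Values copied from the service and endpoints output earlier in this thread.
SERVICE = ("100.65.135.200", 80)
ENDPOINTS = [
    (ip, 3000)
    for ip in ("10.42.0.12", "10.42.0.2", "10.43.128.11", "10.43.128.12", "10.43.128.13")
]
```

Running `check_targets([SERVICE, *ENDPOINTS])` repeatedly during a deployment shows which addresses fail. If only the service IP fails while every endpoint answers, that would point at the service layer rather than the pods, although a TCP connect alone cannot say which component is at fault.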

tmedford commented 6 years ago

We are running kube 1.7.4 and weave 2.1.3. We notice times where things start dropping.

[image]

But it is not isolated to a single node.

[image]

What would you recommend as next steps? Would one weave pod crash really cause 502s originating from different deis pods on different nodes?