docker-archive / for-aws


Strange behaviour with routing mesh. #154

Open westfood opened 6 years ago

westfood commented 6 years ago

Expected behavior

The routing mesh always routes requests to a Task.

sslProtocol: TLSv1.2
traceId: "Root=1-5ade0220-2af7d7103105e78080019f18"
protocol: https
backendHost: 172.31.24.140:8012
sslCipher: ECDHE-RSA-AES128-GCM-SHA256
domainName: "developers.test.angelcam.com"
request: "GET https://developers.test.angelcam.com:443/angelcam-api/reference HTTP/1.1"
receivedBytes: 422
backendStatusCode: 200
albStatusCode: 200
backendProcessingTime: 0.005
targetGroupArn: arn:aws:elasticloadbalancing:us-west-2:137739810751:targetgroup/developers-test-swarm/b0cdcbf0020cc035
chosenCertArn: "arn:aws:acm:us-west-2:137739810751:certificate/98db27a0-a6f8-4bf8-9bab-21fe5793360c"
timestamp: 2018-04-23T15:56:16.654437Z
sentBytes: 21429
userAgent: "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/98 Safari/537.4 (StatusCake)"
responseProcessingTime: 0.0
albName: app/swarm-test/f476cf66e3605fc8
clientHost: 45.32.166.195:47226
requestProcessingTime: 0.0

Actual behavior

The ALB sometimes gets a 502 from the backend, but I cannot see anything for these requests in the Task/Service logs. Over the past 12 hours I have seen 94% uptime while monitoring the service via StatusCake.

sslProtocol: TLSv1.2
traceId: "Root=1-5ade03fd-7d3fd8c4465bb42039ee03bc"
protocol: https
backendHost: 172.31.24.140:8012
sslCipher: ECDHE-RSA-AES128-GCM-SHA256
domainName: "developers.test.angelcam.com"
timestamp: 2018-04-23T16:04:13.036665Z
receivedBytes: 422
request: "GET https://developers.test.angelcam.com:443/angelcam-api/reference HTTP/1.1"
albStatusCode: 502
backendProcessingTime: 0.001
targetGroupArn: arn:aws:elasticloadbalancing:us-west-2:137739810751:targetgroup/developers-test-swarm/b0cdcbf0020cc035
chosenCertArn: "arn:aws:acm:us-west-2:137739810751:certificate/98db27a0-a6f8-4bf8-9bab-21fe5793360c"
sentBytes: 695
userAgent: "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/98 Safari/537.4 (StatusCake)"
responseProcessingTime: -1.0
albName: app/swarm-test/f476cf66e3605fc8
clientHost: 37.235.55.205:54036
requestProcessingTime: 0.0

Information

I am not running Docker4AWS behind the Classic LB alone. I use the CLB to publish ports on the managers, but then I use the manager instances themselves as a target group for an ALB/NLB. This works for a dozen other services.
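
For reference, the shape of the setup is roughly this (the service name, image, target port and instance IDs below are placeholders, not my exact values; the published port matches the 8012 visible in backendHost above):

# publish the service through the routing mesh (ingress mode) on port 8012
~ $ docker service create --name dapperdox --replicas 2 \
      --publish published=8012,target=3123 <dapperdox-image>

# register the manager instances (IDs are placeholders) in the ALB target group on the same port
~ $ aws elbv2 register-targets \
      --target-group-arn arn:aws:elasticloadbalancing:us-west-2:137739810751:targetgroup/developers-test-swarm/b0cdcbf0020cc035 \
      --targets Id=<manager-1>,Port=8012 Id=<manager-2>,Port=8012

The ALB forwards (and health checks) to port 8012 on the manager instances, and the routing mesh is then expected to carry the request to whichever node actually runs the task.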

What I observe is that, from time to time, the ALB gets a 502 from my swarm manager instance, but there is no corresponding request in the Task logs. It seems the request never reaches Dapperdox; its logs show only 2xx/3xx responses. On a non-swarm host this works flawlessly.

I tried Dapperdox 1.1.1 and the latest 1.2.2. I am not sure how to properly debug this issue, as I did not find any docs on debugging the routing mesh, so I cannot tell where the trouble happens.
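
A rough sketch of the generic checks that can be run from a manager (the service name dapperdox is a placeholder; this is not an official routing-mesh debugging procedure, just a sketch):

# are all tasks running, and have any restarted recently?
~ $ docker service ps dapperdox

# is every node attached to the ingress overlay network?
~ $ docker network inspect ingress

# is port 8012 actually published in ingress mode?
~ $ docker service inspect dapperdox --format '{{json .Endpoint.Ports}}'

# does the published port answer locally, i.e. through the mesh on this node?
~ $ curl -sv -o /dev/null http://localhost:8012/

If the local curl also fails intermittently, that would point at the ingress/mesh path rather than at the ALB.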

Dockerfile: https://github.com/bircow/docker-dapperdox

~ $ docker-diagnose
OK hostname=ip-172-31-24-140-us-west-2-compute-internal session=1524562907-QQ2x1pDo0AlrWrmoMpBa8mCJpeNL9iOf
OK hostname=ip-172-31-46-228-us-west-2-compute-internal session=1524562907-QQ2x1pDo0AlrWrmoMpBa8mCJpeNL9iOf
OK hostname=ip-172-31-0-150-us-west-2-compute-internal session=1524562907-QQ2x1pDo0AlrWrmoMpBa8mCJpeNL9iOf
OK hostname=ip-172-31-33-125-us-west-2-compute-internal session=1524562907-QQ2x1pDo0AlrWrmoMpBa8mCJpeNL9iOf
OK hostname=ip-172-31-31-211-us-west-2-compute-internal session=1524562907-QQ2x1pDo0AlrWrmoMpBa8mCJpeNL9iOf
OK hostname=ip-172-31-26-168-us-west-2-compute-internal session=1524562907-QQ2x1pDo0AlrWrmoMpBa8mCJpeNL9iOf
OK hostname=ip-172-31-2-139-us-west-2-compute-internal session=1524562907-QQ2x1pDo0AlrWrmoMpBa8mCJpeNL9iOf
OK hostname=ip-172-31-42-15-us-west-2-compute-internal session=1524562907-QQ2x1pDo0AlrWrmoMpBa8mCJpeNL9iOf
Done requesting diagnostics.
Your diagnostics session ID is 1524562907-QQ2x1pDo0AlrWrmoMpBa8mCJpeNL9iOf
Please provide this session ID to the maintainer debugging your issue.

Steps to reproduce the behavior

  1. ...
  2. ...

FrenchBen commented 6 years ago

@westfood Thanks for the issue - this seems more related to SwarmKit/Docker than to the AWS/Azure setup. Could you open an issue in their repo? Additionally, having steps, at their most basic, to replicate this issue would really help.

westfood commented 6 years ago

I will try to prepare reproducible steps in a few days.

westfood commented 6 years ago

Sorry, I do not have time to provide reproducible steps for Dapperdox in the near future. A colleague moved our API docs to Swagger, so it's not a priority now. If no one else experiences this issue, I guess it could be closed. Good luck!