aldencolerain opened 7 years ago
I'm getting the same issue; this is in the eu-west-2 region (London). I've tried SSHing into one of the EC2 nodes, and `curl -v http://localhost:44554` fails to connect.
This was a new cluster created using an unmodified CloudFormation template for 17.03.x (stable) with 3 managers and 0 workers. The only change was to create an RDS instance and add it to the same VPC.
I've still got the 'broken' cluster hanging around if someone wants to do some debugging.
Further to this: I just tried binding nginx to 44554, but Docker complains the port is already in use, so something is bound to it but not responding? I tried finding what was using the port, but without much success:
```
~ $ lsof -i :44554
12 /bin/busybox /dev/pts/0
12 /bin/busybox /dev/pts/0
12 /bin/busybox /dev/pts/0
12 /bin/busybox /dev/tty
~ $ netstat -a
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 0.0.0.0:ssh 0.0.0.0:* LISTEN
tcp 0 0 localhost:52698 0.0.0.0:* LISTEN
tcp 0 0 a9764d3cac1a:ssh XXXX.dyn.plus.net:64901 ESTABLISHED
tcp 0 0 :::ssh :::* LISTEN
tcp 0 0 localhost:52698 :::* LISTEN
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags Type State I-Node Path
unix 3 [ ] STREAM CONNECTED 25900
unix 3 [ ] STREAM CONNECTED 25901
```
So to be honest I gave up for the time being and built the stack manually, but for me it seems broken out of the box. I'll dig in a bit later and see if I can get some traction.
@jebw thank you for letting us know.
BTW, 44554 is our diagnostic server; it runs directly on the host. If it stops responding, the host will reboot. So it might be an issue with that server.
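If you want to sanity-check it on a node, probing the port directly should show whether the server is answering; a rough example (the private IP below is just a placeholder for another node in the cluster):
```
# On the node itself
curl -v http://localhost:44554/

# From another node in the same VPC (replace with the target node's private IP)
curl -I http://172.31.0.10:44554/
```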
@aldencolerain Sorry for the trouble. We have had a similar report from someone else and are currently testing to see if we can reproduce it. We are trying to find a common denominator; can you let us know which region you were using?
@kencochrane Unfortunately I killed the broken cluster a few hours ago but I'm 95% sure there wasn't any diagnostics container visible from docker ps
I've had 2 clusters break like this though, so it seems fairly reproducible.
Should the diagnostics server be visible from docker ps?
The diagnostic server doesn't run as a container, it runs directly on the host, and it monitors docker and other items. Because of this, it isn't visible from docker ps.
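Because it runs straight on the host, you have to look at the host's network namespace to see what's bound to the port. One rough way to do that from the SSH shell, assuming the busybox netstat in the alpine image supports the -p flag:
```
# Host-networked, host-PID container so the host's sockets and process names are visible
docker run --rm --net=host --pid=host alpine netstat -tlnp | grep 44554
```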
@kencochrane Does that imply the diagnostic server is dying or has stopped responding, then?
The odd bit is that new nodes have a failing health check as well.
Here is an update:
We started 3 clusters on Friday and let them run over the weekend: one in us-east-1 and two in us-west-2. We were able to reproduce what you described in us-east-1, but it didn't happen in us-west-2.
To see whether the ELB was even reaching the nodes, I watched the health check port on one of the affected managers:
```
docker run --rm --net=host marsmensch/tcpdump -vv -i eth0 port 44554
```
So for some reason the ELB isn't able to connect to the nodes, or it stopped checking. Since the ELBs were not connecting to the nodes to ping the health check endpoint, and I knew the health check endpoints themselves were fine, I started trying out some different things.
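For reference, the equivalent check from the AWS CLI looks roughly like this (the load balancer name is a placeholder for the ELB the CloudFormation template creates):
```
# Ask the ELB how it currently classifies each registered instance
# (OutOfService corresponds to the failing health checks seen in the console)
aws elb describe-instance-health --load-balancer-name <elb-name>
```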
Taking what I learned above, I think the issue is related to the ELB, and since it is only happening in some of the regions, it might be a recent change that they rolled out to ELBs that is now affecting us. Eventually the problem will happen in all the regions, once they roll the change out everywhere.
I'm not 100% sure, but I think the issue is related to the fact that we have no listeners in the load balancer. If you create a load balancer in the web dashboard, it doesn't let you continue unless you add at least 1 listener. They must have rolled out a change that stops looking at nodes unless you have a listener configured for your ELB.
Why does it take so long to take effect? No idea. Maybe there are admin tasks on the ELBs that run every so many hours, and once one of those reloads the config the problem pops up. Or it is just a bug in the ELB that we are now hitting.
With the current ELB config in the CloudFormation template we add a listener on port 7, mostly so we can create the ELB without an error. When the LBController starts up, it resets the configuration to what it has from swarm. Since port 7 isn't part of swarm, that listener is removed, which leaves us with no listeners.
One of the things we can do is make sure the LBController never removes all of the listeners, so that in the worst case at least one listener (port 7) remains; that should help prevent this from happening in the first place.
Since we are not 100% sure this is the cause, it is hard to tell if this will fix the issue, but it is worth a try.
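In the meantime, if anyone hits this on a running cluster, manually re-adding a listener (for example the port 7 one the template starts with) should bring the health checks back; a rough sketch, with the ELB name as a placeholder:
```
# Re-create the TCP port 7 listener that the LBController removed
aws elb create-load-balancer-listeners \
    --load-balancer-name <elb-name> \
    --listeners "Protocol=TCP,LoadBalancerPort=7,InstanceProtocol=TCP,InstancePort=7"
```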
I'll report back if I notice anything else. Please let me know if you notice anything, or if I missed something.
/cc @cjyclaire Do you know if any ELB config changes were recently rolled out?
@FrenchBen As far as I know, these are the two latest changes for ELB and ELBv2 (ALB): ELB on May 11, ELBv2 26 days ago.
The latest change for CloudFormation was 2 months ago.
I didn't hear about any ELB config changes for CloudFormation templates, but feel free to cut a ticket with AWS support if anything surprising happens :)
Thanks for chiming in @cjyclaire - I'll try to recreate the above issue with a simple deployment and will open an issue if needed :)
@kencochrane Sorry about the late reply, and thank you for looking into the issue. Mine was breaking in the us-west-2 region; it's weird that it didn't happen for you there. One day it took about 16 hours to break, the next day it only took 1 hour. What size instances did you use? Is anyone having this issue with larger instances?
@aldencolerain We were able to reproduce the issue, and it seems to happen for us when there are no listeners listed in the ELB. As soon as you add a listener, the health checks on the nodes work again. We have put in a fix for 17.06, so hopefully it won't happen for you again.
@kencochrane I am running 17.06 in us-east-2 and I am facing this exact same issue. I've collected diagnostic information from the built-in tool if you'd like it.
@madmax88 Really? We haven't been able to reproduce this with 17.06, so yes, any info you can send along is appreciated. Can you check whether you have listeners in your ELB, and if so, which ones are listed? You can find them in the AWS ELB dashboard.
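If the CLI is handier than the dashboard, the same information should be available with something like this (the ELB name is a placeholder):
```
# List the listeners currently configured on the cluster's ELB
aws elb describe-load-balancers --load-balancer-names <elb-name> \
    --query 'LoadBalancerDescriptions[0].ListenerDescriptions'
```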
@kencochrane Listeners:
| Load Balancer Protocol | Load Balancer Port | Instance Protocol | Instance Port | Cipher | SSL Certificate |
| --- | --- | --- | --- | --- | --- |
| TCP | 7 | TCP | 7 | N/A | N/A |
| TCP | 443 | TCP | 443 | N/A | N/A |
| TCP | 5000 | TCP | 5000 | N/A | N/A |
| TCP | 5001 | TCP | 5001 | N/A | N/A |
| TCP | 9000 | TCP | 9000 | N/A | N/A |
@madmax88 Thanks. Under Instances, what status does it show for each instance?
@kencochrane A couple of managers have one of the health checks failing.
Here is part of the system log from one of the nodes with a failing health check:
```
Oops: 0000 [#1] SMP
Modules linked in:
CPU: 0 PID: 7206 Comm: dockerd Not tainted 4.9.31-moby #1
Hardware name: Xen HVM domU, BIOS 4.2.amazon 02/16/2017
task: ffff9fd07b03b180 task.stack: ffffb0024233c000
RIP: 0010:[<ffffffff946453db>] [<ffffffff946453db>] sk_filter_uncharge+0x5/0x31
RSP: 0018:ffffb0024233fe10 EFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff9fd07d95eea8 RCX: 0000000000000006
RDX: 00000000ffffffff RSI: 00000000ffffffe5 RDI: ffff9fd07d95ec00
RBP: ffff9fd07d95ec00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: ffff9fd07b03b180 R12: ffff9fd07d95ec00
R13: ffff9fd07b6e6ca8 R14: ffff9fd07d95ef40 R15: 0000000000000000
FS: 00007fce4cff9700(0000) GS:ffff9fd08fa00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000fffffffd CR3: 00000001fc5aa000 CR4: 00000000001406b0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Stack:
ffffffff9461ba01 ffff9fd07b6e6c00 ffffb0024233fe88 ffff9fd07d95ec00
ffffffff94760bef 010000087b82d380 ffff9fd0853b53a0 ffff9fd082faf180
ced5c46e9c728ced ffff9fd07fc7ec80 0000000000000000 ffff9fd07fc7ecb0
Call Trace:
[<ffffffff9461ba01>] ? __sk_destruct+0x35/0x133
[<ffffffff94760bef>] ? unix_release_sock+0x180/0x212
[<ffffffff94760c9a>] ? unix_release+0x19/0x25
[<ffffffff94616cf9>] ? sock_release+0x1a/0x6c
[<ffffffff94616d59>] ? sock_close+0xe/0x11
[<ffffffff941f7425>] ? __fput+0xdd/0x17b
[<ffffffff940f538a>] ? task_work_run+0x64/0x7a
[<ffffffff94003285>] ? prepare_exit_to_usermode+0x7d/0x96
[<ffffffff9482a184>] ? entry_SYSCALL_64_fastpath+0xa7/0xa9
Code: 08 4c 89 e7 e8 fb f8 ff ff 48 3d 00 f0 ff ff 77 06 48 89 45 00 31 c0 48 83 c4 10 5b 5d 41 5c 41 5d 41 5e 41 5f c3 0f 1f 44 00 00 <48> 8b 46 18 8b 40 04 48 8d 04 c5 28 00 00 00 f0 29 87 24 01 00
RIP [<ffffffff946453db>] sk_filter_uncharge+0x5/0x31
RSP <ffffb0024233fe10>
CR2: 00000000fffffffd
---[ end trace 77637a5620196472 ]---
Kernel panic - not syncing: Fatal exception
```
@madmax88 We've identified a similar issue and have implemented a fix for it; a patch will be available soon.
Hey all, this morning I had a stroke of good luck and witnessed what's been going on from start -> finish.
Here's the rundown of what I saw:
1. A manager experiences a kernel panic, like the one I posted earlier.
2. That instance is not marked as unhealthy in the EC2 console.
3. A new manager instance is provisioned.
4. Now there are more managers than desired in the swarm. Rather than terminating the unreachable manager, a different manager (randomly? I'm not sure) is terminated.
5. Managers continue to scale up/down.
If you all need any more information about this, please let me know.
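If anyone else wants to catch it in the act, watching the swarm's view of the managers alongside the ASG activity should show steps 1 and 4 happening; roughly (the ASG name is a placeholder):
```
# On a healthy manager: the panicked manager shows up as Down/Unreachable
docker node ls

# See which instances the manager ASG has been launching and terminating
aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name <manager-asg-name>
```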
@madmax88 awesome, thank you for letting us know, that is very helpful, and could explain a few things. Hopefully we will have a new version that fixes the kernel panic out there shortly.
@kencochrane Thanks for the fast replies!
By any chance do you have an approximate date for that fix?
@madmax88 trying to find that out now, it should be very soon.
Whilst solving the kernel panic removes the trigger, underlying this is the question of why a panicked manager isn't getting cleaned up. Isn't that also something which needs some kind of fix?
@jebw Yes, we are looking at that as well. Unfortunately a lot of the decision making is done by the ASG, so we are a little limited in what we can do. For example, when there are too many managers in a pool, we have no control over which manager gets removed.
We need to figure out why the node wasn't marked as down in the EC2 console. I think if we can solve that, then it will help the other issue.
And to be clear, we do have a little control over what happens when an ASG needs to scale down, but none of the available options work for what we need. If there were a way to decide programmatically, it would make it much easier for us. http://docs.aws.amazon.com/autoscaling/latest/userguide/as-instance-termination.html
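For completeness, the knobs the ASG does expose look roughly like this (the ASG name and instance ID are placeholders); neither lets us target the panicked manager specifically, which is the gap described above:
```
# Bias which instance the ASG removes when scaling the managers down
aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name <manager-asg-name> \
    --termination-policies "OldestInstance"

# Or protect specific healthy managers from scale-in
aws autoscaling set-instance-protection \
    --auto-scaling-group-name <manager-asg-name> \
    --instance-ids i-0123456789abcdef0 \
    --protected-from-scale-in
```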
@kencochrane That sounds great; I appreciate you're bounded by what AWS allows. I just wanted to check that the underlying issue was going to get considered. Thanks for the great product.
Just had this happen to me. What is the best way to recover from this, and when will an update to the CloudFormation template be released for this issue? So far I just have a test swarm, but I would like to go to production, which I can't do with a major bug like this.
@RehanSaeed Do you mind joining our community slack? https://dockr.ly/community
@FrenchBen I've signed up to the channel. Since my previous message, I managed to get my swarm stabilized by setting the health check type to 'EC2' as suggested above. It would be nice to get a more permanent fix for this issue.
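For anyone else doing the same, the change boils down to something like this on the manager and worker auto scaling groups (the ASG name is a placeholder):
```
# Switch the ASG from ELB health checks to plain EC2 status checks
aws autoscaling update-auto-scaling-group \
    --auto-scaling-group-name <manager-asg-name> \
    --health-check-type EC2
```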
This is still an issue on ECS (the container service) for me. I run 2 clusters, one for the front end and another for the back end. It works well with one instance in each cluster, but when I add one more instance to a cluster it works for a while and then fails (sometimes within an hour, sometimes after a few days) with a 502 Bad Gateway error. I noticed 2 things:
I launched a 1-master, 2-worker cluster, all micro instances (stable and edge had the same behavior), and after about a day both clusters were continually terminating and restarting EC2 instances due to failed ELB health checks. I tried increasing all of the health check numbers and wait periods. I'm letting CloudFormation create a new VPC, not using an existing one.
I haven't provisioned anything on the cluster yet, so I don't think it's a resource issue. I am able to curl the health check endpoint (`curl -I 172.31.18.132:44554/`) from a master to a worker, etc. It seems like an issue with the ELB or the VPC.
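If anyone wants me to check something specific, I can pull the ELB health check config and compare it against what the nodes answer on 44554; roughly (the ELB name is a placeholder):
```
# Show the target, interval, and thresholds the ELB health check is using
aws elb describe-load-balancers --load-balancer-names <elb-name> \
    --query 'LoadBalancerDescriptions[0].HealthCheck'
```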