Closed hapnermw closed 9 years ago
This six-instance cluster uses the default etcd config. It's in a VPC spread across the 10.0.6.0/24 and 10.0.7.0/24 subnets, with inbound and outbound access on ports 4001/7001 for 10.0.0.0/16 on all instances. This issue is intermittent but does occur often.
To be clear, this happens when an existing cluster member's EC2 instance is terminated and a new EC2 instance with the same cloud-config, security group, and IP is created to replace it. I've now done this a number of times, and each time this slow-to-join issue occurs.
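As a quick sanity check on the addressing described above, here is a minimal sketch that confirms both subnets fall inside the 10.0.0.0/16 range the security group rule allows. The helper name `covered_by` is made up for illustration; it only uses Python's standard `ipaddress` module and says nothing about what the security group actually enforces.

```python
import ipaddress


def covered_by(rule_cidr, subnets):
    """Map each subnet CIDR to True if it lies entirely inside rule_cidr.

    `covered_by` is a hypothetical helper for illustration only.
    """
    rule = ipaddress.ip_network(rule_cidr)
    return {s: ipaddress.ip_network(s).subnet_of(rule) for s in subnets}


# CIDRs taken from the description of this cluster.
coverage = covered_by("10.0.0.0/16", ["10.0.6.0/24", "10.0.7.0/24"])
for subnet, ok in coverage.items():
    print(subnet, "covered" if ok else "NOT covered")
```

This only rules out an addressing mismatch. It does not show that the rule is actually attached to every instance.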
I found that inbound access to port 7001 on a subset of the instances was blocked, although it wasn't the instances that the joining instance was reporting an i/o error on. The slow join issue was not repeatable after fixing access to 7001.
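For anyone hitting the same thing, a rough sketch of how the blocked port could be spotted: try a TCP connection to 4001 and 7001 on every peer and see which ones fail. The function name `probe_ports` and the example peer addresses in the comments are hypothetical, not taken from this cluster. Since the blocked instances here were not the ones named in the i/o error, it is probably worth running this from each instance against all the others rather than only from the joining instance.

```python
import socket


def probe_ports(hosts, ports=(4001, 7001), timeout=2.0):
    """Return {(host, port): bool} for plain TCP reachability.

    True means a TCP connection was accepted. False means it was refused,
    timed out, or otherwise failed (for example, dropped by a security group).
    `probe_ports` is a hypothetical helper, not part of etcd or fleet.
    """
    results = {}
    for host in hosts:
        for port in ports:
            try:
                with socket.create_connection((host, port), timeout=timeout):
                    results[(host, port)] = True
            except OSError:
                results[(host, port)] = False
    return results


# Example usage (peer addresses below are placeholders, not real members):
#   for (host, port), ok in sorted(probe_ports(["10.0.6.10", "10.0.7.10"]).items()):
#       print(host, port, "open" if ok else "blocked or closed")
```

A successful connect only shows that the port is reachable from the machine running the check. It does not prove etcd itself is healthy on the other end.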
Here is the log showing that this instance discovers its cluster entry and begins cycling through each instance, attempting to join. It appears that each join attempt fails due to an i/o error on one of the instances. It is always the same instance, regardless of which instance the join is initiated with. After some time this clears and the instance succeeds in joining the cluster.
After etcd is restarted for the eighth time, it joins the cluster.
The instance that is listed as the source of the i/o error is up and appears to be working. For some reason it gives this fleetctl version skew warning. It's not clear why, since this instance is just hours old and the other cluster instances are weeks old. They are all created from the EC2 stable AMI ami-8097d4e8. Just noting this in case it might be the cause of the join issue.
Here's the journal of the instance attempting to join the cluster: