Closed: dev-head closed this issue 6 years ago
Could you please help collect more information?
1) The full trace in the volume info log around 18:24:34.
2) The availability zones of the 3 EC2 instances.
3) The output of: firecamp-service-cli -op=list-members -region=us-east-1 -cluster=firecamp-stage -service-name=firecamp-stage-zookeeper
Thanks, @JuniusLuo, for taking a look at this...
This just repeats from the start of the error log (after the init log message):
E0315 18:24:34.022478 6 volume.go:829] service has no idle member &{931a5f81f9ce40ae5bc0ccde07a8747c ACTIVE 1521136846996011291 3 firecamp-stage firecamp-stage-zookeeper {/dev/xvdg {gp2 10 100 false} { 0 0 false}} true firecamp-stage-firecamp.com /hostedzone/Z1826MR4G8CQU6 false 0xc4202b5800 {0 256 0 4096} } requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138273
E0315 18:24:34.022496 6 volume.go:592] findIdleMember error InternalError requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138273 service &{931a5f81f9ce40ae5bc0ccde07a8747c ACTIVE 1521136846996011291 3 firecamp-stage firecamp-stage-zookeeper {/dev/xvdg {gp2 10 100 false} { 0 0 false}} true firecamp-stage-firecamp.com /hostedzone/Z1826MR4G8CQU6 false 0xc4202b5800 {0 256 0 4096} }
E0315 18:24:34.022513 6 volume.go:546] Mount failed, get service member error InternalError, serviceUUID 931a5f81f9ce40ae5bc0ccde07a8747c, requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138273
More from: firecamp-dockervolume.INFO
I0315 18:19:09.245270 6 dynamodb_servicemember.go:270] list serviceMembers succeeded, serviceUUID 931a5f81f9ce40ae5bc0ccde07a8747c limit 0 requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521137949 resp count 0xc420061978
I0315 18:19:09.245388 6 dynamodb_servicemember.go:297] list 3 serviceMembers, serviceUUID 931a5f81f9ce40ae5bc0ccde07a8747c LastEvaluatedKey map[] requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521137949
I0315 18:19:09.245400 6 volume.go:821] member &{931a5f81f9ce40ae5bc0ccde07a8747c 1 ACTIVE firecamp-stage-zookeeper-1 us-east-1b arn:aws:ecs:us-east-1:xxxxxxxxxxxx:task/7b438391-7188-4955-ae9d-1292cbe35ac0 arn:aws:ecs:us-east-1:xxxxxxxxxxxx:container-instance/79d0b066-abda-4944-b82d-597e9b137a16 i-0cc05e662e755c435 1521136853495041077 {vol-06fc1c5d2d0d6304b /dev/xvdg } 127.0.0.1 [0xc420124960 0xc420125050 0xc420125170 0xc4201252c0 0xc4201253e0]} in use, service &{931a5f81f9ce40ae5bc0ccde07a8747c ACTIVE 1521136846996011291 3 firecamp-stage firecamp-stage-zookeeper {/dev/xvdg {gp2 10 100 false} { 0 0 false}} true firecamp-stage-firecamp.com /hostedzone/Z1826MR4G8CQU6 false 0xc4202b4c00 {0 256 0 4096} } requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521137949
E0315 18:19:09.245423 6 volume.go:829] service has no idle member &{931a5f81f9ce40ae5bc0ccde07a8747c ACTIVE 1521136846996011291 3 firecamp-stage firecamp-stage-zookeeper {/dev/xvdg {gp2 10 100 false} { 0 0 false}} true firecamp-stage-firecamp.com /hostedzone/Z1826MR4G8CQU6 false 0xc4202b4c00 {0 256 0 4096} } requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521137949
E0315 18:19:09.245439 6 volume.go:592] findIdleMember error InternalError requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521137949 service &{931a5f81f9ce40ae5bc0ccde07a8747c ACTIVE 1521136846996011291 3 firecamp-stage firecamp-stage-zookeeper {/dev/xvdg {gp2 10 100 false} { 0 0 false}} true firecamp-stage-firecamp.com /hostedzone/Z1826MR4G8CQU6 false 0xc4202b4c00 {0 256 0 4096} }
E0315 18:19:09.245455 6 volume.go:546] Mount failed, get service member error InternalError, serviceUUID 931a5f81f9ce40ae5bc0ccde07a8747c, requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521137949
I0315 18:24:33.907839 6 volume.go:147] Get volume {931a5f81f9ce40ae5bc0ccde07a8747c map[]}
I0315 18:24:33.907862 6 volume.go:166] volume is not mounted for service 931a5f81f9ce40ae5bc0ccde07a8747c
I0315 18:24:33.951591 6 volume.go:224] handle Mount {931a5f81f9ce40ae5bc0ccde07a8747c 9c8018df65bf9e0e850e049c82837cb6e24f6907fba75958bab38a5079d96a14}
I0315 18:24:33.974019 6 dynamodb_serviceattr.go:310] get service attr &{931a5f81f9ce40ae5bc0ccde07a8747c ACTIVE 1521136846996011291 3 firecamp-stage firecamp-stage-zookeeper {/dev/xvdg {gp2 10 100 false} { 0 0 false}} true firecamp-stage-firecamp.com /hostedzone/Z1826MR4G8CQU6 false 0xc4202b5800 {0 256 0 4096} } requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138273
I0315 18:24:33.974041 6 volume.go:540] get service attr &{931a5f81f9ce40ae5bc0ccde07a8747c ACTIVE 1521136846996011291 3 firecamp-stage firecamp-stage-zookeeper {/dev/xvdg {gp2 10 100 false} { 0 0 false}} true firecamp-stage-firecamp.com /hostedzone/Z1826MR4G8CQU6 false 0xc4202b5800 {0 256 0 4096} } requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138273
I0315 18:24:34.018568 6 ecs.go:101] list service firecamp-stage-zookeeper cluster firecamp-stage resp { TaskArns: ["arn:aws:ecs:us-east-1:xxxxxxxxxxxx:task/536c253d-9ae7-460c-9bb4-f7d24cf53807","arn:aws:ecs:us-east-1:xxxxxxxxxxxx:task/7b438391-7188-4955-ae9d-1292cbe35ac0","arn:aws:ecs:us-east-1:xxxxxxxxxxxx:task/a2ab3e4e-558c-452c-9a78-2363c2d949d7"] }
I0315 18:24:34.018596 6 ecs.go:119] list task arn:aws:ecs:us-east-1:xxxxxxxxxxxx:task/536c253d-9ae7-460c-9bb4-f7d24cf53807
I0315 18:24:34.018602 6 ecs.go:119] list task arn:aws:ecs:us-east-1:xxxxxxxxxxxx:task/7b438391-7188-4955-ae9d-1292cbe35ac0
I0315 18:24:34.018606 6 ecs.go:119] list task arn:aws:ecs:us-east-1:xxxxxxxxxxxx:task/a2ab3e4e-558c-452c-9a78-2363c2d949d7
I0315 18:24:34.018628 6 ecs.go:122] list 3 tasks, service firecamp-stage-zookeeper cluster firecamp-stage
I0315 18:24:34.022317 6 dynamodb_servicemember.go:270] list serviceMembers succeeded, serviceUUID 931a5f81f9ce40ae5bc0ccde07a8747c limit 0 requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138273 resp count 0xc420526778
I0315 18:24:34.022434 6 dynamodb_servicemember.go:297] list 3 serviceMembers, serviceUUID 931a5f81f9ce40ae5bc0ccde07a8747c LastEvaluatedKey map[] requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138273
I0315 18:24:34.022448 6 volume.go:821] member &{931a5f81f9ce40ae5bc0ccde07a8747c 1 ACTIVE firecamp-stage-zookeeper-1 us-east-1b arn:aws:ecs:us-east-1:xxxxxxxxxxxx:task/7b438391-7188-4955-ae9d-1292cbe35ac0 arn:aws:ecs:us-east-1:xxxxxxxxxxxx:container-instance/79d0b066-abda-4944-b82d-597e9b137a16 i-0cc05e662e755c435 1521136853495041077 {vol-06fc1c5d2d0d6304b /dev/xvdg } 127.0.0.1 [0xc4201b6cf0 0xc4201b6d20 0xc4201b6d50 0xc4201b6d80 0xc4201b6de0]} in use, service &{931a5f81f9ce40ae5bc0ccde07a8747c ACTIVE 1521136846996011291 3 firecamp-stage firecamp-stage-zookeeper {/dev/xvdg {gp2 10 100 false} { 0 0 false}} true firecamp-stage-firecamp.com /hostedzone/Z1826MR4G8CQU6 false 0xc4202b5800 {0 256 0 4096} } requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138273
E0315 18:24:34.022478 6 volume.go:829] service has no idle member &{931a5f81f9ce40ae5bc0ccde07a8747c ACTIVE 1521136846996011291 3 firecamp-stage firecamp-stage-zookeeper {/dev/xvdg {gp2 10 100 false} { 0 0 false}} true firecamp-stage-firecamp.com /hostedzone/Z1826MR4G8CQU6 false 0xc4202b5800 {0 256 0 4096} } requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138273
E0315 18:24:34.022496 6 volume.go:592] findIdleMember error InternalError requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138273 service &{931a5f81f9ce40ae5bc0ccde07a8747c ACTIVE 1521136846996011291 3 firecamp-stage firecamp-stage-zookeeper {/dev/xvdg {gp2 10 100 false} { 0 0 false}} true firecamp-stage-firecamp.com /hostedzone/Z1826MR4G8CQU6 false 0xc4202b5800 {0 256 0 4096} }
E0315 18:24:34.022513 6 volume.go:546] Mount failed, get service member error InternalError, serviceUUID 931a5f81f9ce40ae5bc0ccde07a8747c, requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138273
I0315 18:29:55.427880 6 volume.go:147] Get volume {931a5f81f9ce40ae5bc0ccde07a8747c map[]}
I0315 18:29:55.427934 6 volume.go:166] volume is not mounted for service 931a5f81f9ce40ae5bc0ccde07a8747c
I0315 18:29:55.472599 6 volume.go:224] handle Mount {931a5f81f9ce40ae5bc0ccde07a8747c 5ab387ef0c7537b725e18ec68507f95b556d8f27a3f668766fa2fca958a52de5}
I0315 18:29:55.503645 6 dynamodb_serviceattr.go:310] get service attr &{931a5f81f9ce40ae5bc0ccde07a8747c ACTIVE 1521136846996011291 3 firecamp-stage firecamp-stage-zookeeper {/dev/xvdg {gp2 10 100 false} { 0 0 false}} true
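Reading the trace, the plugin's mount path appears to be: list the service's ECS tasks, list the serviceMembers from DynamoDB, mark every member whose recorded task is still running as "in use", and fail with InternalError once no idle member remains. A minimal Python sketch of that selection step (the names ServiceMember and find_idle_member are illustrative, not FireCamp's actual Go API):

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class ServiceMember:
    name: str        # e.g. firecamp-stage-zookeeper-1
    az: str          # AZ that owns the member's EBS volume
    task_arn: str    # ECS task last recorded as owning the member ("" if none)


def find_idle_member(members: List[ServiceMember],
                     running_tasks: Set[str]) -> Optional[ServiceMember]:
    """Return the first member not held by a currently running ECS task.

    Returning None here corresponds to the "service has no idle member"
    error at volume.go:829 in the trace above.
    """
    for m in members:
        if m.task_arn not in running_tasks:
            return m
    return None
```

In the trace, all 3 members come back "in use" against the 3 listed tasks, so the None branch fires and the Mount call fails, which is why the questions below turn to node placement.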
This looks weird: two of the EC2 nodes appear to be in us-east-1b. Could you please check the AZ of all 3 EC2 nodes?
Confirmed. The cluster spun up the replacement node in the same AZ as one of the others. I can't verify whether the original node was also in that AZ at the time. I am going to kill the stack and try again to see if that changes anything.
Out of curiosity, does it matter to the FireCamp services that each node is in its own availability zone? I mean, if I need to scale up to more nodes, it's going to double up at some point.
It is weird. Could you please share the detailed configuration of the ASG? The ASG should try to distribute the nodes evenly across the 3 AZs.
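Once you have the instance-to-AZ mapping (from the EC2 console or from describing the instances with the AWS CLI), spotting a doubled-up zone is mechanical. A small, hypothetical helper that flags any AZ hosting more than one cluster node:

```python
from collections import defaultdict
from typing import Dict, List


def az_collisions(instance_azs: Dict[str, str]) -> Dict[str, List[str]]:
    """Given {instance_id: az}, return each AZ that hosts more than
    one instance, mapped to the offending instance IDs."""
    by_az: Dict[str, List[str]] = defaultdict(list)
    for instance_id, az in instance_azs.items():
        by_az[az].append(instance_id)
    return {az: ids for az, ids in by_az.items() if len(ids) > 1}
```

For the situation described here, az_collisions({"i-0aaa": "us-east-1a", "i-0bbb": "us-east-1b", "i-0ccc": "us-east-1b"}) would report us-east-1b as the collided zone (the instance IDs are made up).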
Yes, each node should be in its own AZ. This is a limitation of EBS volumes. FireCamp creates an EBS volume for every service member (zookeeper in this case). An EBS volume belongs to one AZ and cannot be attached to an instance in another AZ. When you scale out to more nodes, it is best to add 3 nodes at a time, so the service members can stay distributed across all AZs to tolerate the possible failure of one AZ.
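The "add 3 nodes at a time" advice follows directly from round-robin placement: with 3 AZs, an even spread is only possible when the node count is a multiple of 3. A toy illustration (not FireCamp code; member names are made up):

```python
from typing import Dict, List


def assign_members(n_members: int, azs: List[str]) -> Dict[str, str]:
    """Round-robin service members across AZs. The spread is even only
    when n_members is a multiple of len(azs); otherwise some AZ hosts
    an extra member (and, per the EBS constraint above, its volume)."""
    return {f"member-{i}": azs[i % len(azs)] for i in range(n_members)}
```

With azs = ["us-east-1a", "us-east-1b", "us-east-1c"], assign_members(3, azs) places one member per AZ, while assign_members(4, azs) necessarily doubles up in us-east-1a.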
I spun up a fresh build and the ASG again placed two nodes in zone b, so it was having that issue earlier today too. I dug into the ASG and found the problem: it won't deploy to us-east-1c due to lack of support there for my EC2 instance type (m3.large in my case). So I'm going back to place it in a different availability zone and try again. Let's consider this closed; based on what you explained, having the services spread evenly across the AZs is a requirement, and the issue was on my end.
Thanks again, I really appreciate the time you've been taking to help me out.
Hi There,
I'm trying to spin up the zookeeper service with three replicas, but only two are deploying; the third keeps throwing errors about not being able to mount its volume. I've confirmed the EBS volume was created and is available. I deleted the service, terminated the bad node, tried again once the ASG spun up a new one, and redeployed the zookeeper service: the same error happened.
Please let me know if there's any more info I can provide to help identify where the issue is happening and whether it's something I need to change on my end. I'm using the standard CloudFormation template in AWS with three nodes, one in each of my three defined availability zones.
Thank you
FireCamp volume error log
ecs-agent error log