cloudstax / firecamp

Serverless Platform for the stateful services
https://www.cloudstax.io
Apache License 2.0

Mount Failed #44

Closed dev-head closed 6 years ago

dev-head commented 6 years ago

Hi There,

I'm trying to spin up the ZooKeeper service with three replicas, but only two are deployed; the third fails with an error because it cannot mount its volume. I've confirmed that the EBS volume was created and is available. I deleted the service, terminated the bad node, waited for the ASG to spin up a replacement, and redeployed the ZooKeeper service, but the same error occurred.

Please let me know if there's any more info I can provide to help identify where the issue is happening, and whether it's something I need to change on my end. I'm using the standard CloudFormation template in AWS with three nodes, one in each of my three defined availability zones.

Thank you

Firecamp volume error log

E0315 18:24:34.022496       6 volume.go:592] findIdleMember error InternalError requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138273 service &{931a5f81f9ce40ae5bc0ccde07a8747c ACTIVE 1521136846996011291 3 firecamp-stage firecamp-stage-zookeeper {/dev/xvdg {gp2 10 100 false}  { 0 0 false}} true firecamp-stage-firecamp.com /hostedzone/Z1826MR4G8CQU6 false 0xc4202b5800 {0 256 0 4096} }

E0315 18:24:34.022513       6 volume.go:546] Mount failed, get service member error InternalError, serviceUUID 931a5f81f9ce40ae5bc0ccde07a8747c, requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138273

ecs-agent error

2018-03-15T18:29:55Z [INFO] TaskHandler: batching container event: arn:aws:ecs:us-east-1:050772179124:task/4e24694b-02f9-46fe-9714-e4cce7f7a900 firecamp-stage-firecamp-stage-zookeeper-container -> STOPPED, Reason CannotStartContainerError: API error (500): error while mounting volume '/var/lib/docker/plugins/0bb436c154f10d5a0318180d992dfaf0f66dec1cbd8e1d83a8fb1888e8e3ccf1/rootfs': VolumeDriver.Mount: Mount failed, get service member error InternalError, serviceUUID 931a5f81f9ce40ae5bc0ccde07a8747c, requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138595
, Known Sent: NONE

2018-03-15T18:29:55Z [INFO] TaskHandler: Adding event: TaskChange: [arn:aws:ecs:us-east-1:050772179124:task/4e24694b-02f9-46fe-9714-e4cce7f7a900 -> STOPPED, Known Sent: NONE, PullStartedAt: 2018-03-15 18:29:55.284603066 +0000 UTC, PullStoppedAt: 2018-03-15 18:29:55.39755933 +0000 UTC, ExecutionStoppedAt: 2018-03-15 18:29:55.604614019 +0000 UTC, arn:aws:ecs:us-east-1:050772179124:task/4e24694b-02f9-46fe-9714-e4cce7f7a900 firecamp-stage-firecamp-stage-zookeeper-container -> STOPPED, Reason CannotStartContainerError: API error (500): error while mounting volume '/var/lib/docker/plugins/0bb436c154f10d5a0318180d992dfaf0f66dec1cbd8e1d83a8fb1888e8e3ccf1/rootfs': VolumeDriver.Mount: Mount failed, get service member error InternalError, serviceUUID 931a5f81f9ce40ae5bc0ccde07a8747c, requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138595
JuniusLuo commented 6 years ago

Could you please help collect more information? 1) The full trace in the volume INFO log around 18:24:34. 2) The availability zones of the 3 EC2 instances. 3) The output of firecamp-service-cli -op=list-members -region=us-east-1 -cluster=firecamp-stage -service-name=firecamp-stage-zookeeper.

dev-head commented 6 years ago

Thanks, @JuniusLuo, for taking a look at this...

More from: firecamp-dockervolume.ERROR

This just repeats from the start of the error log (after the init log message).

E0315 18:24:34.022478       6 volume.go:829] service has no idle member &{931a5f81f9ce40ae5bc0ccde07a8747c ACTIVE 1521136846996011291 3 firecamp-stage firecamp-stage-zookeeper {/dev/xvdg {gp2 10 100 false}  { 0 0 false}} true fir
ecamp-stage-firecamp.com /hostedzone/Z1826MR4G8CQU6 false 0xc4202b5800 {0 256 0 4096} } requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138273
E0315 18:24:34.022496       6 volume.go:592] findIdleMember error InternalError requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138273 service &{931a5f81f9ce40ae5bc0ccde07a8747c ACTIVE 1521136846996011291 3 firecamp-stag
e firecamp-stage-zookeeper {/dev/xvdg {gp2 10 100 false}  { 0 0 false}} true firecamp-stage-firecamp.com /hostedzone/Z1826MR4G8CQU6 false 0xc4202b5800 {0 256 0 4096} }
E0315 18:24:34.022513       6 volume.go:546] Mount failed, get service member error InternalError, serviceUUID 931a5f81f9ce40ae5bc0ccde07a8747c, requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138273

More from: firecamp-dockervolume.INFO

I0315 18:19:09.245270       6 dynamodb_servicemember.go:270] list serviceMembers succeeded, serviceUUID 931a5f81f9ce40ae5bc0ccde07a8747c limit 0 requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521137949 resp count 0xc420061
978
I0315 18:19:09.245388       6 dynamodb_servicemember.go:297] list 3 serviceMembers, serviceUUID 931a5f81f9ce40ae5bc0ccde07a8747c LastEvaluatedKey map[] requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521137949
I0315 18:19:09.245400       6 volume.go:821] member &{931a5f81f9ce40ae5bc0ccde07a8747c 1 ACTIVE firecamp-stage-zookeeper-1 us-east-1b arn:aws:ecs:us-east-1:xxxxxxxxxxxx:task/7b438391-7188-4955-ae9d-1292cbe35ac0 arn:aws:ecs:us-eas
t-1:xxxxxxxxxxxx:container-instance/79d0b066-abda-4944-b82d-597e9b137a16 i-0cc05e662e755c435 1521136853495041077 {vol-06fc1c5d2d0d6304b /dev/xvdg  } 127.0.0.1 [0xc420124960 0xc420125050 0xc420125170 0xc4201252c0 0xc4201253e0]} in
use, service &{931a5f81f9ce40ae5bc0ccde07a8747c ACTIVE 1521136846996011291 3 firecamp-stage firecamp-stage-zookeeper {/dev/xvdg {gp2 10 100 false}  { 0 0 false}} true firecamp-stage-firecamp.com /hostedzone/Z1826MR4G8CQU6 false 
0xc4202b4c00 {0 256 0 4096} } requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521137949
E0315 18:19:09.245423       6 volume.go:829] service has no idle member &{931a5f81f9ce40ae5bc0ccde07a8747c ACTIVE 1521136846996011291 3 firecamp-stage firecamp-stage-zookeeper {/dev/xvdg {gp2 10 100 false}  { 0 0 false}} true fir
ecamp-stage-firecamp.com /hostedzone/Z1826MR4G8CQU6 false 0xc4202b4c00 {0 256 0 4096} } requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521137949
E0315 18:19:09.245439       6 volume.go:592] findIdleMember error InternalError requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521137949 service &{931a5f81f9ce40ae5bc0ccde07a8747c ACTIVE 1521136846996011291 3 firecamp-stag
e firecamp-stage-zookeeper {/dev/xvdg {gp2 10 100 false}  { 0 0 false}} true firecamp-stage-firecamp.com /hostedzone/Z1826MR4G8CQU6 false 0xc4202b4c00 {0 256 0 4096} }
E0315 18:19:09.245455       6 volume.go:546] Mount failed, get service member error InternalError, serviceUUID 931a5f81f9ce40ae5bc0ccde07a8747c, requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521137949
I0315 18:24:33.907839       6 volume.go:147] Get volume {931a5f81f9ce40ae5bc0ccde07a8747c map[]}
I0315 18:24:33.907862       6 volume.go:166] volume is not mounted for service 931a5f81f9ce40ae5bc0ccde07a8747c
I0315 18:24:33.951591       6 volume.go:224] handle Mount  {931a5f81f9ce40ae5bc0ccde07a8747c 9c8018df65bf9e0e850e049c82837cb6e24f6907fba75958bab38a5079d96a14}
I0315 18:24:33.974019       6 dynamodb_serviceattr.go:310] get service attr &{931a5f81f9ce40ae5bc0ccde07a8747c ACTIVE 1521136846996011291 3 firecamp-stage firecamp-stage-zookeeper {/dev/xvdg {gp2 10 100 false}  { 0 0 false}} true
firecamp-stage-firecamp.com /hostedzone/Z1826MR4G8CQU6 false 0xc4202b5800 {0 256 0 4096} } requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138273
I0315 18:24:33.974041       6 volume.go:540] get service attr &{931a5f81f9ce40ae5bc0ccde07a8747c ACTIVE 1521136846996011291 3 firecamp-stage firecamp-stage-zookeeper {/dev/xvdg {gp2 10 100 false}  { 0 0 false}} true firecamp-stag
e-firecamp.com /hostedzone/Z1826MR4G8CQU6 false 0xc4202b5800 {0 256 0 4096} } requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138273
I0315 18:24:34.018568       6 ecs.go:101] list service firecamp-stage-zookeeper cluster firecamp-stage resp {
TaskArns: ["arn:aws:ecs:us-east-1:xxxxxxxxxxxx:task/536c253d-9ae7-460c-9bb4-f7d24cf53807","arn:aws:ecs:us-east-1:xxxxxxxxxxxx:task/7b438391-7188-4955-ae9d-1292cbe35ac0","arn:aws:ecs:us-east-1:xxxxxxxxxxxx:task/a2ab3e4e-558c-452
c-9a78-2363c2d949d7"]
}
I0315 18:24:34.018596       6 ecs.go:119] list task arn:aws:ecs:us-east-1:xxxxxxxxxxxx:task/536c253d-9ae7-460c-9bb4-f7d24cf53807
I0315 18:24:34.018602       6 ecs.go:119] list task arn:aws:ecs:us-east-1:xxxxxxxxxxxx:task/7b438391-7188-4955-ae9d-1292cbe35ac0
I0315 18:24:34.018606       6 ecs.go:119] list task arn:aws:ecs:us-east-1:xxxxxxxxxxxx:task/a2ab3e4e-558c-452c-9a78-2363c2d949d7
I0315 18:24:34.018628       6 ecs.go:122] list 3 tasks, service firecamp-stage-zookeeper cluster firecamp-stage
I0315 18:24:34.022317       6 dynamodb_servicemember.go:270] list serviceMembers succeeded, serviceUUID 931a5f81f9ce40ae5bc0ccde07a8747c limit 0 requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138273 resp count 0xc420526
778
I0315 18:24:34.022434       6 dynamodb_servicemember.go:297] list 3 serviceMembers, serviceUUID 931a5f81f9ce40ae5bc0ccde07a8747c LastEvaluatedKey map[] requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138273
I0315 18:24:34.022448       6 volume.go:821] member &{931a5f81f9ce40ae5bc0ccde07a8747c 1 ACTIVE firecamp-stage-zookeeper-1 us-east-1b arn:aws:ecs:us-east-1:xxxxxxxxxxxx:task/7b438391-7188-4955-ae9d-1292cbe35ac0 arn:aws:ecs:us-eas
t-1:xxxxxxxxxxxx:container-instance/79d0b066-abda-4944-b82d-597e9b137a16 i-0cc05e662e755c435 1521136853495041077 {vol-06fc1c5d2d0d6304b /dev/xvdg  } 127.0.0.1 [0xc4201b6cf0 0xc4201b6d20 0xc4201b6d50 0xc4201b6d80 0xc4201b6de0]} in
use, service &{931a5f81f9ce40ae5bc0ccde07a8747c ACTIVE 1521136846996011291 3 firecamp-stage firecamp-stage-zookeeper {/dev/xvdg {gp2 10 100 false}  { 0 0 false}} true firecamp-stage-firecamp.com /hostedzone/Z1826MR4G8CQU6 false 
0xc4202b5800 {0 256 0 4096} } requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138273
E0315 18:24:34.022478       6 volume.go:829] service has no idle member &{931a5f81f9ce40ae5bc0ccde07a8747c ACTIVE 1521136846996011291 3 firecamp-stage firecamp-stage-zookeeper {/dev/xvdg {gp2 10 100 false}  { 0 0 false}} true fir
ecamp-stage-firecamp.com /hostedzone/Z1826MR4G8CQU6 false 0xc4202b5800 {0 256 0 4096} } requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138273
E0315 18:24:34.022496       6 volume.go:592] findIdleMember error InternalError requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138273 service &{931a5f81f9ce40ae5bc0ccde07a8747c ACTIVE 1521136846996011291 3 firecamp-stag
e firecamp-stage-zookeeper {/dev/xvdg {gp2 10 100 false}  { 0 0 false}} true firecamp-stage-firecamp.com /hostedzone/Z1826MR4G8CQU6 false 0xc4202b5800 {0 256 0 4096} }
E0315 18:24:34.022513       6 volume.go:546] Mount failed, get service member error InternalError, serviceUUID 931a5f81f9ce40ae5bc0ccde07a8747c, requuid 10.0.43.217-931a5f81f9ce40ae5bc0ccde07a8747c-1521138273
I0315 18:29:55.427880       6 volume.go:147] Get volume {931a5f81f9ce40ae5bc0ccde07a8747c map[]}
I0315 18:29:55.427934       6 volume.go:166] volume is not mounted for service 931a5f81f9ce40ae5bc0ccde07a8747c
I0315 18:29:55.472599       6 volume.go:224] handle Mount  {931a5f81f9ce40ae5bc0ccde07a8747c 5ab387ef0c7537b725e18ec68507f95b556d8f27a3f668766fa2fca958a52de5}
I0315 18:29:55.503645       6 dynamodb_serviceattr.go:310] get service attr &{931a5f81f9ce40ae5bc0ccde07a8747c ACTIVE 1521136846996011291 3 firecamp-stage firecamp-stage-zookeeper {/dev/xvdg {gp2 10 100 false}  { 0 0 false}} true
JuniusLuo commented 6 years ago

This looks weird. It looks like 2 EC2 nodes are in us-east-1b. Could you please check the AZ of all 3 EC2 nodes?
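One way to check the AZ spread is to count instances per availability zone. The sketch below is only an illustration: it assumes the input is a dict shaped like the response of boto3's `ec2.describe_instances()`, and the instance IDs and zones in `sample` are hypothetical, not taken from this cluster.

```python
from collections import Counter


def count_nodes_per_az(describe_instances_response):
    """Count EC2 instances per availability zone.

    The argument is assumed to be shaped like a DescribeInstances
    response: Reservations -> Instances -> Placement.AvailabilityZone.
    """
    counts = Counter()
    for reservation in describe_instances_response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            counts[instance["Placement"]["AvailabilityZone"]] += 1
    return dict(counts)


def crowded_azs(counts):
    """Return the zones that hold more than one node, sorted by name."""
    return sorted(az for az, n in counts.items() if n > 1)


# Hypothetical sample data: three nodes, two of them in us-east-1b.
sample = {
    "Reservations": [
        {"Instances": [
            {"InstanceId": "i-aaa", "Placement": {"AvailabilityZone": "us-east-1a"}},
            {"InstanceId": "i-bbb", "Placement": {"AvailabilityZone": "us-east-1b"}},
        ]},
        {"Instances": [
            {"InstanceId": "i-ccc", "Placement": {"AvailabilityZone": "us-east-1b"}},
        ]},
    ]
}

node_counts = count_nodes_per_az(sample)
print(node_counts)
print(crowded_azs(node_counts))
```

Any zone reported by `crowded_azs` for a three-node cluster would suggest the ASG did not spread the nodes one per AZ.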

dev-head commented 6 years ago

Confirmed. The cluster spun up the replacement node in the same AZ as one of the others. I can't verify whether the original one was in that AZ too at the time. I'm going to kill the stack and try again to see if that changes anything.

Out of curiosity, does it matter to the FireCamp services that each node is in its own availability zone? I mean, if I need to scale up to more nodes, they're going to double up at some point.

JuniusLuo commented 6 years ago

It is weird. Could you please share the detailed configuration of the ASG? The ASG should try to distribute the nodes equally across the 3 AZs.

Yes, each node should be in its own AZ. This is a limitation of EBS volumes. FireCamp creates an EBS volume for every service member (ZooKeeper in this case). An EBS volume belongs to one AZ and cannot be attached to an instance in another AZ. When you scale out to more nodes, it is best to add 3 nodes at a time, so that the service members can be distributed across all AZs and tolerate the failure of one AZ.
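The AZ constraint above can be illustrated with a small model. This is a rough sketch, not FireCamp's actual placement code: it assumes each node runs at most one member of the service, and apart from firecamp-stage-zookeeper-1 in us-east-1b (which appears in the log), the member names and zones are hypothetical.

```python
from collections import Counter


def unplaceable_members(member_azs, node_azs):
    """Return members whose volume AZ has no free node left.

    member_azs maps a member name to the AZ of its EBS volume.
    node_azs lists the AZ of every node in the cluster.
    Assumption for this sketch: one member per node.
    """
    free_nodes = Counter(node_azs)
    stuck = []
    for member, az in member_azs.items():
        if free_nodes[az] > 0:
            free_nodes[az] -= 1  # this node now hosts the member
        else:
            stuck.append(member)  # the volume cannot follow a node to another AZ
    return stuck


# Hypothetical layout: one volume per AZ, but two nodes share us-east-1b.
members = {
    "firecamp-stage-zookeeper-0": "us-east-1a",
    "firecamp-stage-zookeeper-1": "us-east-1b",
    "firecamp-stage-zookeeper-2": "us-east-1c",
}
nodes = ["us-east-1a", "us-east-1b", "us-east-1b"]

print(unplaceable_members(members, nodes))
```

In this layout the member whose volume lives in us-east-1c has no node to run on, while the extra node in us-east-1b has no idle member to mount, which is consistent with the "service has no idle member" error in the log.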

dev-head commented 6 years ago

I spun up a fresh build and the ASG placed two nodes in zone b, so it was having that issue earlier today too. I dug into the ASG and found the issue: it won't deploy to us-east-1c because the EC2 instance type (m3.large in my case) isn't supported there. So I'm going back to place it in a different availability zone and try again. Let's consider this closed; based on what you explained, having services spread evenly across the AZs is a requirement, and the issue was on my end.

Thanks again, I really appreciate the time you've taken to help me out.
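For anyone hitting the same problem, a check like the following could be run before picking zones. It is a sketch under assumptions: `fetch_offerings` uses boto3's `describe_instance_type_offerings` call and needs AWS credentials, and the `sample` response is hypothetical data shaped like that call's output, since real offerings vary by account and region.

```python
def azs_offering(instance_type, offerings_response):
    """Return the sorted zones that offer the given instance type."""
    return sorted(
        offering["Location"]
        for offering in offerings_response.get("InstanceTypeOfferings", [])
        if offering["InstanceType"] == instance_type
    )


def missing_azs(wanted_azs, instance_type, offerings_response):
    """Return the wanted zones where the instance type is not offered."""
    offered = set(azs_offering(instance_type, offerings_response))
    return [az for az in wanted_azs if az not in offered]


def fetch_offerings(instance_type, region):
    """Query AWS for the zones offering an instance type (needs credentials)."""
    import boto3  # imported lazily so the helpers above work without it

    ec2 = boto3.client("ec2", region_name=region)
    return ec2.describe_instance_type_offerings(
        LocationType="availability-zone",
        Filters=[{"Name": "instance-type", "Values": [instance_type]}],
    )


# Hypothetical response: m3.large is not offered in us-east-1c.
sample = {
    "InstanceTypeOfferings": [
        {"InstanceType": "m3.large", "LocationType": "availability-zone", "Location": "us-east-1a"},
        {"InstanceType": "m3.large", "LocationType": "availability-zone", "Location": "us-east-1b"},
        {"InstanceType": "m3.large", "LocationType": "availability-zone", "Location": "us-east-1d"},
    ]
}

wanted = ["us-east-1a", "us-east-1b", "us-east-1c"]
print(missing_azs(wanted, "m3.large", sample))
```

If `missing_azs` returns any zone, either the instance type or the zone list in the template would need to change before the ASG can place one node per AZ.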