cloudstax / firecamp

Serverless platform for stateful services
https://www.cloudstax.io
Apache License 2.0

Long time to launch tasks after an EC2 instance failure #46

Closed jazzl0ver closed 6 years ago

jazzl0ver commented 6 years ago

An EC2 instance failed (an AWS issue), so FireCamp's ASG terminated it and fired up a replacement. After the new instance came up, no tasks were able to start. The error in the AWS ECS console is:

Status reason | CannotStartContainerError: API error (500): error while mounting volume '/var/lib/docker/plugins/2ec1ac405b2314e7a06c414ab0323a74187b49f1b9e9d7dcefb670bff13f599d/rootfs': VolumeDriver.Mount: Mount failed, get service member error Timeout, serviceUUID d5cc

Fortunately, after some time (~37 minutes) the tasks started without any intervention on my side.

Not sure, but the cause might be the long time the failed instance spent in the shutting-down state.

JuniusLuo commented 6 years ago

If the failed instance is not terminated, it probably still has the volume mounted. How long was the failed instance stuck in the shutting-down state? When was it actually terminated? Could you please upload the /var/log/cloud-init-output log and the /var/log/firecamp/ info log?

jazzl0ver commented 6 years ago

How long was the failed instance stuck in the shutting-down state? When was it actually terminated?

Can't tell the exact time, but it looks pretty close to 37 minutes. That's why I assumed the two are related.

Could you please upload the /var/log/cloud-init-output log and the /var/log/firecamp/ info log?

Which instance would you like me to take those files from?

JuniusLuo commented 6 years ago

The new instance. If you could get the logs from the shutdown instance, please upload those as well.

jazzl0ver commented 6 years ago

Unfortunately, the failed instance was automatically terminated. logs.zip

JuniusLuo commented 6 years ago

The error is consistent: "WaitVolumeDetached error Timeout volume vol-085bb2cc2c5c473de ServerInstanceID i-091f41e1ec3914e7c". The error repeated for more than 30 minutes, and finally the volume was successfully detached from the old instance and attached to the new instance.

The problem appears to be the shutdown instance: somehow it got stuck and did not release the volume. This would not be caused by FireCamp; it was likely caused by EC2 or EBS. If you hit this issue again, please check why the old instance got stuck.
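For illustration, the behavior in the logs (retrying WaitVolumeDetached until the old instance finally releases the EBS volume) follows a wait-with-timeout polling pattern. This is only a rough sketch of that pattern, not FireCamp's actual volume driver code; the names `wait_for_detach` and `get_state` are illustrative, and in practice `get_state` would wrap something like an EC2 DescribeVolumes call.

```python
import time

def wait_for_detach(get_state, timeout_s=300, poll_s=5,
                    clock=time.monotonic, sleep=time.sleep):
    """Poll a volume's attachment state until it reads "detached".

    get_state is any callable returning the current attachment state
    (hypothetical; e.g. a wrapper around an EC2 DescribeVolumes call).
    Returns True if the volume detached within timeout_s, False on
    timeout, in which case the caller can keep retrying, as FireCamp
    apparently did here, or escalate (e.g. a force detach).
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        if get_state() == "detached":
            return True
        sleep(poll_s)
    return False

# Illustrative run with a fake state source: the volume reports
# "detaching" twice, then "detached".
states = iter(["detaching", "detaching", "detached"])
print(wait_for_detach(lambda: next(states), timeout_s=60, poll_s=0))  # prints True
```

In the incident above, each such wait timed out for ~30 minutes because the stuck instance never released the volume, and the mount only succeeded once EC2 finally completed the termination.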

Still, the good thing is that the service remains available when one node is down.

jazzl0ver commented 6 years ago

Thank you for the explanation!