Closed jazzl0ver closed 6 years ago
If the failed instance is not terminated, it probably still has the volume mounted. How long was the failed instance stuck in the shutting-down state? When did it actually get terminated? Could you please upload the /var/log/cloud-init-output log and the /var/log/firecamp/ info log?
How long was the failed instance stuck in the shutting-down state? When did it actually get terminated?
Can't tell the exact time, but it looks pretty close to 37 minutes. That's why I decided the two are related.
Could you please upload the /var/log/cloud-init-output log and /var/log/firecamp/ info log?
Which instance would you like me to take those files from?
The new instance. If you can get the logs from the shutdown instance, please upload those as well.
Unfortunately, the failed instance was automatically terminated. logs.zip
The error is consistent: "WaitVolumeDetached error Timeout volume vol-085bb2cc2c5c473de ServerInstanceID i-091f41e1ec3914e7c". The error repeated for more than 30 minutes, and finally the volume was successfully detached from the old instance and attached to the new instance.
The problem would be the shutdown instance: somehow it got stuck and did not release the volume. This would not be caused by FireCamp; it is more likely caused by EC2 or EBS. If you hit this issue again, please check why the old instance got stuck.
Still, the good thing is: the service remains available when one node is down.
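For reference, the wait that FireCamp's "WaitVolumeDetached" performs can be reproduced manually to check whether the old instance has released the volume. A minimal sketch, assuming boto3 is configured with credentials and using the volume ID from the logs above; the helper names and timeout values are illustrative, not FireCamp's actual code:

```python
import time


def volume_attachment_state(describe_volumes_resp, volume_id):
    """Return the attachment state ('attached', 'detaching', ...) of a
    volume from an EC2 DescribeVolumes response, or 'detached' if the
    volume has no attachments left."""
    for vol in describe_volumes_resp["Volumes"]:
        if vol["VolumeId"] == volume_id:
            attachments = vol.get("Attachments", [])
            if not attachments:
                return "detached"
            return attachments[0]["State"]
    raise KeyError(volume_id)


def wait_volume_detached(ec2_client, volume_id, timeout_sec=300, poll_sec=5):
    """Poll DescribeVolumes until the volume reports no attachments."""
    deadline = time.time() + timeout_sec
    while time.time() < deadline:
        resp = ec2_client.describe_volumes(VolumeIds=[volume_id])
        if volume_attachment_state(resp, volume_id) == "detached":
            return True
        time.sleep(poll_sec)
    return False


# Usage (requires AWS credentials; IDs taken from the logs above):
# import boto3
# ec2 = boto3.client("ec2")
# if not wait_volume_detached(ec2, "vol-085bb2cc2c5c473de"):
#     # Last resort when the old instance hangs in shutting-down
#     # (can corrupt data on a live filesystem, use with care):
#     # ec2.detach_volume(VolumeId="vol-085bb2cc2c5c473de", Force=True)
#     pass
```

A force-detach is the manual escape hatch for exactly this situation, but it should only be used once it is clear the old instance is dead and nothing is writing to the volume.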
Thank you for the explanation!
An EC2 instance failed (an AWS issue), so FireCamp's ASG terminated it and fired up another one. After the new instance came up, no tasks were able to start. The error in the AWS ECS console is:
Fortunately, after some time (~37 minutes) the tasks started without any interaction on my side.
Not sure, but the reason might be connected with the long time the failed instance spent in the shutting-down state.
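If this happens again, the stuck instance's lifecycle state can be watched directly to confirm the theory. A hedged sketch, assuming boto3 is configured; the helper is illustrative and the instance ID is the one reported in the FireCamp logs:

```python
def instance_state(describe_instances_resp, instance_id):
    """Extract an instance's state name ('running', 'shutting-down',
    'terminated', ...) from an EC2 DescribeInstances response."""
    for reservation in describe_instances_resp["Reservations"]:
        for inst in reservation["Instances"]:
            if inst["InstanceId"] == instance_id:
                return inst["State"]["Name"]
    raise KeyError(instance_id)


# Usage (requires AWS credentials):
# import boto3
# ec2 = boto3.client("ec2")
# resp = ec2.describe_instances(InstanceIds=["i-091f41e1ec3914e7c"])
# print(instance_state(resp, "i-091f41e1ec3914e7c"))
```

An instance that stays in "shutting-down" keeps its EBS attachments until EC2 finishes terminating it, which matches the ~37-minute delay observed above.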