boto / boto3

AWS SDK for Python
https://aws.amazon.com/sdk-for-python/
Apache License 2.0

Waiter encountered a terminal failure state #176

Closed. wayne-luminal closed this issue 3 years ago.

wayne-luminal commented 9 years ago

When calling wait_until_running() on an instance, sometimes I receive this exception:

2015-07-13 11:44:42,583 INFO call Calling ec2:wait_until_running with {'InstanceIds': ['i-972ed75e']}
2015-07-13 11:45:43,687 ERROR decorated_function Waiter InstanceRunning failed: Waiter encountered a terminal failure state
Traceback (most recent call last):
  ...
  File ".../lib/python3.4/site-packages/boto3/resources/factory.py", line 227, in do_waiter
    waiter(self, *args, **kwargs)
  File ".../lib/python3.4/site-packages/boto3/resources/action.py", line 194, in __call__
    response = waiter.wait(**params)
  File ".../lib/python3.4/site-packages/botocore/waiter.py", line 284, in wait
    reason='Waiter encountered a terminal failure state')
botocore.exceptions.WaiterError: Waiter InstanceRunning failed: Waiter encountered a terminal failure state

In the console, the instance does come into the running state. I have turned on boto3 debug logging but haven't been able to reproduce the failure since this happened.

OS X Yosemite 10.10.3, Python 3.4.2, boto3 1.1.0

Edit: I extracted the methods in our custom code to a script that will (hopefully) recreate the issue.

import logging, boto3, time

boto3.set_stream_logger('boto3', logging.DEBUG)
ec2 = boto3.resource('ec2', region_name='us-east-1')
instance = ec2.create_instances(
    ImageId='ami-b0210ed8',
    InstanceType='t2.micro',
    MinCount=1,
    MaxCount=1,
)[0]
print('Created instance:', instance.id)
instance.wait_until_running()
time.sleep(5)
instance.terminate()
instance.wait_until_terminated()
print('Terminated instance:', instance.id)
jamesls commented 9 years ago

It would be really helpful if you were able to capture the debug logs from when it fails. That would show the response we get back from EC2 so we can see what caused the waiter to fail.

I'll look into improving the error message we surface. We should be able to add the specific failure state we received to give more context about why it failed.
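
In the meantime, here's a sketch of how to capture that context yourself. It assumes a recent botocore where WaiterError exposes last_response, and the region and instance ID are only placeholders:

import logging
import boto3
from botocore.exceptions import WaiterError

# Log the raw requests/responses so the failing DescribeInstances result is captured.
boto3.set_stream_logger('botocore', logging.DEBUG)

ec2 = boto3.resource('ec2', region_name='us-east-1')
instance = ec2.Instance('i-0123456789abcdef0')  # placeholder instance ID

try:
    instance.wait_until_running()
except WaiterError as e:
    # Recent botocore versions attach the last polled response to the error.
    print(e.last_response)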

KlemenzF commented 9 years ago

I was going to write almost exactly this ticket myself, except in my case it's instance.terminate() that is causing trouble. Are you sure this is not the case for you? When instances are being terminated they are first put into a stopped state; it appears this was not accounted for in instance.wait_until_terminated(), or it was assumed that users would fix it themselves by first using:

instance.stop()
instance.wait_until_stopped()

and then:

instance.terminate()
instance.wait_until_terminated()

I just think it is weird that instance.terminate() will work on its own on a running instance, but not when used in conjunction with instance.wait_until_terminated()

How to recreate using python:

import boto3

session = boto3.session.Session(aws_access_key_id="", aws_secret_access_key="", region_name='primary')
resource = session.resource('ec2', endpoint_url="")

instances = resource.create_instances(ImageId=image_id, MinCount=1, MaxCount=1)
instances[0].wait_until_running()
for i in instances:
    i.terminate()
    i.wait_until_terminated()

turtlemonvh commented 8 years ago

I'm seeing this now. In my case what happened is I ran a command to spin up a 15 node cluster. Only 4 of the nodes came up successfully while the rest are showing up as terminated in the dashboard.

When I try waiter.wait(InstanceIds=all_instance_ids) (using the ids of all the instances in the cluster, including the ones that are terminated), I get this stack trace (after setting boto3.set_stream_logger('boto3.resources', logging.DEBUG)).

is-mbp-timothy:deployment-management timothy$ python configure.py do sync_routes add camels
2015-12-29 20:16:48,842 boto3.resources.factory [DEBUG] Loading ec2:ec2
Preparing to UPSERT routes for nodes in cluster camels
2015-12-29 20:16:48,891 boto3.resources.collection [INFO] Calling paginated ec2:describe_instances with {'Filters': [{'Values': ['Analytics'], 'Name': 'tag:Team'}, {'Values': ['camels'], 'Name': 'tag:Cluster'}]}
2015-12-29 20:16:49,813 boto3.resources.factory [DEBUG] Loading ec2:Instance
2015-12-29 20:16:49,815 boto3.resources.model [DEBUG] Renaming Instance attribute network_interfaces
Waiting for all instances to get into 'running' state... Checks every 15 seconds (up to 40 checks)
Traceback (most recent call last):
  File "configure.py", line 338, in <module>
    aws_launcher.parse_command_line(options)
  File "/Users/timothy/Projects/deployment-management/lib/aws_launcher.py", line 523, in parse_command_line
    sync_routes(options.environment, 'UPSERT')
  File "/Users/timothy/Projects/deployment-management/lib/aws_launcher.py", line 413, in sync_routes
    waiter.wait(InstanceIds=all_instance_ids)
  File "/Users/timothy/anaconda/lib/python2.7/site-packages/botocore/waiter.py", line 53, in wait
    Waiter.wait(self, **kwargs)
  File "/Users/timothy/anaconda/lib/python2.7/site-packages/botocore/waiter.py", line 312, in wait
    reason='Waiter encountered a terminal failure state')
botocore.exceptions.WaiterError: Waiter InstanceRunning failed: Waiter encountered a terminal failure state

This still doesn't look like a useful level of detail. Is there another way to make the logs more verbose?

turtlemonvh commented 8 years ago

Btw - on the AWS web interface when I click on one of the terminated instances, I can see the State transition reason is Client.VolumeLimitExceeded: Volume limit exceeded. I'm not sure if this is relevant to the specific issue, but it may help in reproducing the error the client is seeing (and crafting a more helpful error message!).

lmr commented 8 years ago

@turtlemonvh You have to set the logging level of botocore and boto3 to logging.DEBUG:

    import logging
    logging.getLogger('botocore').setLevel(logging.DEBUG)
    logging.getLogger('boto3').setLevel(logging.DEBUG)

With this, you will get a ton of logging.
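
If nothing shows up, it's usually because no logging handler is attached; here is a minimal sketch that also configures one (note that basicConfig at DEBUG makes every library verbose):

import logging

# setLevel alone is not enough if no handler is attached; basicConfig adds one
# so the DEBUG records actually get printed.
logging.basicConfig(level=logging.DEBUG)
logging.getLogger('botocore').setLevel(logging.DEBUG)
logging.getLogger('boto3').setLevel(logging.DEBUG)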

lmr commented 8 years ago

Whoops, I ended up writing CRITICAL on my code sample instead of DEBUG. I just corrected it.

ba1dr commented 8 years ago

I managed to bypass this issue by adding time.sleep(5) before the waiters.

I suspect the issue might be caused by AWS-side caching or something like that: a just-created instance_id or spot_request_id is not found yet. Adding a small delay helps.
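
Roughly, a sketch of that workaround (the region and instance ID below are only placeholders):

import time
import boto3

ec2_client = boto3.client('ec2', region_name='us-east-1')  # placeholder region
instance_id = 'i-0123456789abcdef0'  # placeholder for the just-created instance

# Give the new instance a few seconds to become visible to DescribeInstances
# before the waiter starts polling.
time.sleep(5)
ec2_client.get_waiter('instance_running').wait(InstanceIds=[instance_id])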

eldondevcg commented 7 years ago

I am experiencing this now as well. It is in a script that does something like the following:

+++ aws ec2 run-instances --image-id ami-cafeface
+++ aws ec2 create-tags --resources i-deadbeef --tags Key=env,Value=prod
+++ aws ec2 wait instance-running --instance-id  i-deadbeef .

Waiter InstanceRunning failed: Waiter encountered a terminal failure state

It is very uncommon that this occurs, but a bummer when it does.

shawnpg commented 7 years ago

I was hitting this issue and observed the same thing @turtlemonvh saw:

State transition reason Client.VolumeLimitExceeded: Volume limit exceeded

Deleting some unnecessary volumes cleared things up.

It would be great if the Waiter exception could provide something a little more informative. Even if it can't detect whether a failure was because of a volume limit issue at runtime, a different string that recommends looking at the metadata for the failed instance would have pointed me in the right direction without as much internet searching.
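
As a workaround, here is a sketch of that kind of check (the instance IDs are placeholders); it prints each instance's state and StateReason, which is where messages like Client.VolumeLimitExceeded show up:

import boto3
from botocore.exceptions import WaiterError

ec2_client = boto3.client('ec2')
instance_ids = ['i-0123456789abcdef0']  # placeholder IDs

try:
    ec2_client.get_waiter('instance_running').wait(InstanceIds=instance_ids)
except WaiterError:
    # Show why each instance never reached 'running'.
    response = ec2_client.describe_instances(InstanceIds=instance_ids)
    for reservation in response['Reservations']:
        for inst in reservation['Instances']:
            reason = inst.get('StateReason', {}).get('Message', 'n/a')
            print(inst['InstanceId'], inst['State']['Name'], reason)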

ShashiDhungel commented 6 years ago

This seems like a volume issue to me too. Pointing to a different bucket resolved it.

digitalkaoz commented 6 years ago

@ShashiDhungel what do you mean by a volume issue and pointing to another bucket? I deleted some EC2 instances/volumes/snapshots and S3 buckets, but I still got this error all of a sudden.

Any hints on what I can do?

carolinux commented 5 years ago

FWIW, Google led me here. I was looking to debug the AWS CLI. In my case the error was that I was not including the cluster name; the correct command is:

aws ecs wait services-stable --cluster cluster_name --services "service_name"
pooja-choudhari commented 5 years ago

I am seeing similar issues with the wait command that checks whether my EC2 instance is running:

vagrant@ubuntu-bionic:~/cpooja/Week-08$ aws ec2 wait instance-running \
>     --instance-ids i-1234567890abcdef0

Waiter InstanceRunning failed: Invalid id: "i-1234567890abcdef0"
vagrant@ubuntu-bionic:~/cpooja/Week-08$ aws ec2 wait instance-running 

Waiter InstanceRunning failed: Waiter encountered a terminal failure state

Debug logs


... <instanceId>i-0bc98b8d86b432265</instanceId>
    <imageId>ami-072ba2e0afdd77177</imageId>
    <instanceState>
        <code>48</code>
        <name>terminated</name>
    </instanceState>
    <privateDnsName/>
    <dnsName/>
    <reason>User initiated (2019-10-12 20:35:03 GMT)</reason>
    <keyName>ubuntu-inclass-2019</keyName>
    <amiLaunchIndex>1</amiLaunchIndex>
    <productCodes/>
    <instanceType>t2.micro</instanceType>
    <launchTime>2019-10-12T20:26:43.000Z</launchTime>
    <placement>
        <availabilityZone>us-east-1d</availabilityZone>
        <groupName/>
        <tenancy>default</tenancy>
    </placement>
    <monitoring>
        <state>disabled</state>
    </monitoring>
    <groupSet/>
    <stateReason>
        <code>Client.UserInitiatedShutdown</code>
        <message>Client.UserInitiatedShutdown: User initiated shutdown</message>
    </stateReason>
    <architecture>x86_64</architecture>
    <rootDeviceType>ebs</rootDeviceType>
    <rootDeviceName>/dev/sda1</rootDeviceName>
    <blockDeviceMapping/>
    <virtualizationType>hvm</virtualizationType>
    <clientToken/>
    <hypervisor>xen</hypervisor>
    <networkInterfaceSet/>
    <ebsOptimized>false</ebsOptimized>
    <enaSupport>true</enaSupport>
    <cpuOptions>
        <coreCount>1</coreCount>
        <threadsPerCore>1</threadsPerCore>
    </cpuOptions>
    <capacityReservationSpecification>
        <capacityReservationPreference>open</capacityReservationPreference>
    </capacityReservationSpecification>
    <hibernationOptions>
        <configured>false</configured>
    </hibernationOptions>
    <enclaveOptions>
        <enabled>false</enabled>
    </enclaveOptions>
</item>
</instancesSet>
</item>
</reservationSet>
</DescribeInstancesResponse>
2019-10-12 20:49:42,293 - MainThread - botocore.hooks - DEBUG - Event needs-retry.ec2.DescribeInstances: calling handler <botocore.retryhandler.RetryHandler object at 0x7f8bd2796be0>
2019-10-12 20:49:42,294 - MainThread - botocore.retryhandler - DEBUG - No retry needed.
2019-10-12 20:49:42,294 - MainThread - awscli.clidriver - DEBUG - Exception caught in main()
Traceback (most recent call last):
  File "/home/vagrant/.local/lib/python3.6/site-packages/awscli/clidriver.py", line 217, in main
    return command_table[parsed_args.command](remaining, parsed_args)
  File "/home/vagrant/.local/lib/python3.6/site-packages/awscli/clidriver.py", line 358, in __call__
    return command_table[parsed_args.operation](remaining, parsed_globals)
  File "/home/vagrant/.local/lib/python3.6/site-packages/awscli/customizations/commands.py", line 190, in __call__
    parsed_globals)
  File "/home/vagrant/.local/lib/python3.6/site-packages/awscli/clidriver.py", line 530, in __call__
    call_parameters, parsed_globals)
  File "/home/vagrant/.local/lib/python3.6/site-packages/awscli/customizations/waiters.py", line 208, in invoke
    waiter.wait(**parameters)
  File "/home/vagrant/.local/lib/python3.6/site-packages/botocore/waiter.py", line 53, in wait
    Waiter.wait(self, **kwargs)
  File "/home/vagrant/.local/lib/python3.6/site-packages/botocore/waiter.py", line 323, in wait
    last_response=response,
botocore.exceptions.WaiterError: Waiter InstanceRunning failed: Waiter encountered a terminal failure state
2019-10-12 20:49:42,296 - MainThread - awscli.clidriver - DEBUG - Exiting with rc 255

Waiter InstanceRunning failed: Waiter encountered a terminal failure state

rirze commented 4 years ago

I also have a similar problem to @pooja-choudhari above: when using .wait_until_stopped on a resource-level EC2 instance, I get a terminal failure state error.

After testing some things, I found that:

  1. Adding a delay before calling .wait_until_stopped doesn't help.
  2. Adding a delay before retrying a second time does help. I'm using a delay of 30 seconds (see the sketch at the end of this comment).

Edit: I had a hunch that this was probably not a problem with boto3, but rather something on the AWS backend, and that any way boto3 could mitigate it would be helpful. After using the boto3 EC2 waiters directly, I no longer believe this is caused by the AWS backend, but rather by how boto3 implements this functionality at the resource level; I no longer get this error when using the EC2 waiters directly. See my comment below.
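
A sketch of the delay-and-retry approach from point 2 (the 30-second delay and the single retry are just what happened to work for me):

import time
from botocore.exceptions import WaiterError

def wait_until_stopped_with_retry(instance, delay=30, attempts=2):
    # Retry the resource-level waiter after a pause if it hits a terminal failure state.
    for attempt in range(attempts):
        try:
            instance.wait_until_stopped()
            return
        except WaiterError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay)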

SriAstitva commented 4 years ago

I am getting the same error. Is there a way to handle such errors? In my case, the stateCode is "Client.InstanceInitiatedShutdown".

rirze commented 4 years ago

After looking around, I've ended up using the waiters directly. Specifically, wherever I'd want to use:

Instance.wait_until_stopped()

I now use:

stopped_instance_waiter = ec2_client.get_waiter('instance_stopped')
stopped_instance_waiter.wait(InstanceIds=[Instance.id])

Yes, it's an annoying amount of boilerplate, but it doesn't produce the error above. Maybe the way boto3 implemented the resource-level method causes occasional errors.
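
For completeness, a self-contained sketch of that replacement (the region and instance ID are placeholders):

import boto3

ec2 = boto3.resource('ec2', region_name='us-east-1')  # placeholder region
ec2_client = ec2.meta.client  # reuse the resource's underlying client

instance = ec2.Instance('i-0123456789abcdef0')  # placeholder ID
instance.stop()

# Client-level waiter instead of Instance.wait_until_stopped()
stopped_instance_waiter = ec2_client.get_waiter('instance_stopped')
stopped_instance_waiter.wait(InstanceIds=[instance.id])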

tmccombs commented 4 years ago

I ran into this with an AMI creation script. My user_data automatically stops the instance once it is set up. I was trying to call wait_until_stopped immediately, but apparently that fails, because "pending" is a failure state for wait_until_stopped. I got it to work once by calling wait_until_running first. However, I'm afraid there is a race condition there: if the instance starts stopping before wait_until_running notices that the instance is up.

Is there a better way to handle this case?
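
For reference, a sketch of the two-step wait I described, using client-level waiters (the instance ID and WaiterConfig values are just placeholders); the race between the two calls is still there:

import boto3

ec2_client = boto3.client('ec2')
instance_id = 'i-0123456789abcdef0'  # placeholder for the AMI-builder instance

# 'pending' is a failure state for the instance_stopped waiter, so first wait for
# 'running', then wait for the user_data-driven shutdown to reach 'stopped'.
ec2_client.get_waiter('instance_running').wait(InstanceIds=[instance_id])
ec2_client.get_waiter('instance_stopped').wait(
    InstanceIds=[instance_id],
    WaiterConfig={'Delay': 15, 'MaxAttempts': 80},
)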

warrenronsiek commented 4 years ago

The problem occurred when I was trying to run a multithreaded program. It disappeared when I stopped using multithreading.

rirze commented 4 years ago

@warrenronsiek That could be it, I was experiencing errors when using this in a multithreading setting.

github-actions[bot] commented 3 years ago

Greetings! It looks like this issue hasn’t been active in longer than one year. We encourage you to check if this is still an issue in the latest release. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or upvote with a reaction on the initial post to prevent automatic closure. If the issue is already closed, please feel free to open a new one.

andrastaus commented 2 years ago

I'm having a similar issue to @pooja-choudhari:

import boto3
ec2_client = boto3.client("ec2")
waiter = ec2_client.get_waiter("instance_running")
waiter.wait(WaiterConfig={"Delay": 5}, Filters=[{"Name": "instance-type", "Values": ["t2.micro"]}])

This only occurs when there are instances in the cluster that are in either "terminated" or "shutting-down" state. After a while, the "terminated" instances disappear, and the waiter works fine.

Has anybody found a solution yet? I've tried the methods mentioned here, but nothing works for me.

EDIT: I'm not using multithreading.

tmccombs commented 2 years ago

For me it was not a multithreaded program.

andrastaus commented 2 years ago

In the meantime, I found a solution to my problem. Here's the proper way to use the waiter:

waiter.wait(
    WaiterConfig={"Delay": 5},
    Filters=[
        {"Name": "instance-type", "Values": [instance_type]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ],
)
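
A self-contained sketch of that, with the instance type from my earlier example; restricting the filter to running instances keeps terminated and shutting-down instances out of the DescribeInstances results, so the waiter never matches a terminal failure state:

import boto3

ec2_client = boto3.client("ec2")
waiter = ec2_client.get_waiter("instance_running")

# Filtering on instance-state-name excludes terminated and shutting-down
# instances from the results the waiter inspects.
waiter.wait(
    WaiterConfig={"Delay": 5},
    Filters=[
        {"Name": "instance-type", "Values": ["t2.micro"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ],
)
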
harshblog150 commented 2 years ago

Hi Guys,

I'm trying to launch an EKS cluster with the command below and it throws an error. I'd appreciate any suggestions!

eksctl create cluster --name regapp --region ap-south-1 --version 1.22 \
    --nodegroup-name linux-nodes --node-type t2.micro --nodes 2

2022-11-01 10:38:19 [!] 1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
2022-11-01 10:38:19 [ℹ] to cleanup resources, run 'eksctl delete cluster --region=ap-south-1 --name=regapp'
2022-11-01 10:38:19 [✖] waiter state transitioned to Failure
Error: failed to create cluster "regapp"

chris-kiick-sp commented 1 year ago

During a lambda function to restart an instance, I got the following error:

[ERROR] WaiterError: Waiter InstanceRunning failed: Waiter encountered a terminal failure state: For expression "Reservations[].Instances[].State.Name" we matched expected path: "stopping" at least once
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 41, in lambda_handler
    waiter.wait(InstanceIds=[instance_id])
  File "/var/runtime/botocore/waiter.py", line 55, in wait
    Waiter.wait(self, **kwargs)
  File "/var/runtime/botocore/waiter.py", line 375, in wait
    raise WaiterError(

# stop the instance
ec2_client.stop_instances(InstanceIds=[instance_id])

# wait for the instance to stop
waiter = ec2_client.get_waiter('instance_stopped')
waiter.wait(InstanceIds=[instance_id])

# start the instance
ec2_client.start_instances(InstanceIds=[instance_id])

# wait for the instance to start
waiter = ec2_client.get_waiter('instance_running')
waiter.wait(InstanceIds=[instance_id])

This has to be some kind of race condition that boto doesn't handle correctly.