It would be really helpful if you were able to capture the debug logs from when it fails. That would show the response we get back from EC2 so we can see what caused the waiter to fail.
I'll look into improving the error message we surface. We should be able to add the specific failure state we received to give more context about why it failed.
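For anyone who can reproduce this from a script, a minimal sketch of turning on that debug logging (my own example, not code from this issue):

import logging
import boto3

# Emit botocore's wire-level DEBUG output (including the DescribeInstances
# responses the waiter polls) to stderr.
boto3.set_stream_logger('botocore', logging.DEBUG)
boto3.set_stream_logger('boto3', logging.DEBUG)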
I was going to write almost exactly this ticket myself, except in my case it's instance.terminate() that is causing trouble; are you sure that is not the case for you? When instances are being terminated they are first put into a stopped state. It appears this was not accounted for in instance.wait_until_terminated(), or it was assumed that users would work around it themselves by first calling instance.stop() and instance.wait_until_stopped(), and only then instance.terminate() and instance.wait_until_terminated() (see the sketch below).
I just think it is weird that instance.terminate() will work on its own on a running instance, but not when used in conjunction with instance.wait_until_terminated()
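To make that two-step workaround explicit, here is a minimal sketch (my own example, assuming instance is an already-created boto3 ec2.Instance resource):

# Stop first and wait for the stop to complete...
instance.stop()
instance.wait_until_stopped()

# ...then terminate and wait for the termination.
instance.terminate()
instance.wait_until_terminated()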
How to recreate using python:
import boto3
session = boto3.session.Session(aws_access_key_id="",aws_secret_access_key="",region_name='primary')
resource = session.resource('ec2', endpoint_url="")
instances = resource.create_instances(ImageId=image_id, MinCount=1, MaxCount=1)
instances[0].wait_until_running()
for i in instances:
    i.terminate()
    i.wait_until_terminated()
I'm seeing this now. In my case I ran a command to spin up a 15-node cluster, and only 4 of the nodes came up successfully while the rest are showing up as terminated in the dashboard.
When I try waiter.wait(InstanceIds=all_instance_ids) (using the IDs of all the instances in the cluster, including the ones that are terminated), I get this stack trace (after setting boto3.set_stream_logger('boto3.resources', logging.DEBUG)).
is-mbp-timothy:deployment-management timothy$ python configure.py do sync_routes add camels
2015-12-29 20:16:48,842 boto3.resources.factory [DEBUG] Loading ec2:ec2
Preparing to UPSERT routes for nodes in cluster camels
2015-12-29 20:16:48,891 boto3.resources.collection [INFO] Calling paginated ec2:describe_instances with {'Filters': [{'Values': ['Analytics'], 'Name': 'tag:Team'}, {'Values': ['camels'], 'Name': 'tag:Cluster'}]}
2015-12-29 20:16:49,813 boto3.resources.factory [DEBUG] Loading ec2:Instance
2015-12-29 20:16:49,815 boto3.resources.model [DEBUG] Renaming Instance attribute network_interfaces
Waiting for all instances to get into 'running' state... Checks every 15 seconds (up to 40 checks)
Traceback (most recent call last):
File "configure.py", line 338, in <module>
aws_launcher.parse_command_line(options)
File "/Users/timothy/Projects/deployment-management/lib/aws_launcher.py", line 523, in parse_command_line
sync_routes(options.environment, 'UPSERT')
File "/Users/timothy/Projects/deployment-management/lib/aws_launcher.py", line 413, in sync_routes
waiter.wait(InstanceIds=all_instance_ids)
File "/Users/timothy/anaconda/lib/python2.7/site-packages/botocore/waiter.py", line 53, in wait
Waiter.wait(self, **kwargs)
File "/Users/timothy/anaconda/lib/python2.7/site-packages/botocore/waiter.py", line 312, in wait
reason='Waiter encountered a terminal failure state')
botocore.exceptions.WaiterError: Waiter InstanceRunning failed: Waiter encountered a terminal failure state
This still doesn't look like a useful level of detail. Is there another way to make the logs more verbose?
Btw, on the AWS web interface, when I click on one of the terminated instances I can see the State transition reason is Client.VolumeLimitExceeded: Volume limit exceeded. I'm not sure if this is relevant to the specific issue, but it may help in reproducing the error the client is seeing (and crafting a more helpful error message!).
@turtlemonvh You have to set the logging level of botocore and boto3 to logging.DEBUG:
import logging
logging.getLogger('botocore').setLevel(logging.DEBUG)
logging.getLogger('boto3').setLevel(logging.DEBUG)
With this, you will get a ton of logging.
Whoops, I ended up writing CRITICAL in my code sample instead of DEBUG. I just corrected it.
I managed to bypass this issue by adding time.sleep(5) before the waiters.
I suspect the issue might be caused by AWS-side caching or something like that: a just-created instance_id or spot_request_id was not found yet. Adding a small delay helps.
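A minimal sketch of that workaround (my own example; instance is assumed to be a freshly created ec2.Instance resource, and the 5-second delay is just the figure mentioned above):

import time

# Give the API a moment to learn about the new instance before the
# waiter's first poll.
time.sleep(5)
instance.wait_until_running()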
I am experiencing this now as well. It is in a script that does something like the following:
+++ aws ec2 run-instances --image-id ami-cafeface
+++ aws ec2 create-tags --resources i-deadbeef --tags Key=env,Value=prod
+++ aws ec2 wait instance-running --instance-ids i-deadbeef
Waiter InstanceRunning failed: Waiter encountered a terminal failure state
It is very uncommon that this occurs, but a bummer when it does.
I was hitting this issue and observed the same thing @turtlemonvh saw:
State transition reason Client.VolumeLimitExceeded: Volume limit exceeded
Deleting some unnecessary volumes cleared things up.
It would be great if the Waiter exception could provide something a little more informative. Even if it can't detect whether a failure was because of a volume limit issue at runtime, a different string that recommends looking at the metadata for the failed instance would have pointed me in the right direction without as much internet searching.
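Until the message improves, one way to get at that context is to inspect the response the waiter last saw. A minimal sketch (my own example, not from this thread; it assumes ec2_client and instance_ids already exist, and relies on the last_response attribute that newer botocore versions attach to WaiterError):

from botocore.exceptions import WaiterError

waiter = ec2_client.get_waiter('instance_running')
try:
    waiter.wait(InstanceIds=instance_ids)
except WaiterError as err:
    # last_response is the final DescribeInstances result, which includes each
    # instance's State and StateReason (e.g. Client.VolumeLimitExceeded).
    for reservation in err.last_response.get('Reservations', []):
        for inst in reservation['Instances']:
            print(inst['InstanceId'], inst['State']['Name'],
                  inst.get('StateReason', {}).get('Message'))
    raise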
This seems like a volume issue to me too. Pointing to a different bucket resolved it.
@ShashiDhungel what do you mean by a volume issue and pointing to another bucket? I deleted some EC2 instances/volumes/snapshots and S3 buckets, and still I got this error all of a sudden.
Any hints on what I can do?
FWIW, Google led me here. I was looking to debug the AWS CLI. In my case the error was that I was not including the cluster name; the correct command is:
aws ecs wait services-stable --cluster cluster_name --services "service_name"
I am seeing similar issues with the wait command to check if my EC2 instance is running or not
vagrant@ubuntu-bionic:~/cpooja/Week-08$ aws ec2 wait instance-running \
> --instance-ids i-1234567890abcdef0
Waiter InstanceRunning failed: Invalid id: "i-1234567890abcdef0"
vagrant@ubuntu-bionic:~/cpooja/Week-08$ aws ec2 wait instance-running
Waiter InstanceRunning failed: Waiter encountered a terminal failure state
Debug logs
instanceId>i-0bc98b8d86b432265</instanceId>\n <imageId>ami-072ba2e0afdd77177</imageId>\n <instanceState>\n <code>48</code>\n <name>terminated</name>\n </instanceState>\n <privateDnsName/>\n <dnsName/>\n <reason>User initiated (2019-10-12 20:35:03 GMT)</reason>\n <keyName>ubuntu-inclass-2019</keyName>\n <amiLaunchIndex>1</amiLaunchIndex>\n <productCodes/>\n <instanceType>t2.micro</instanceType>\n <launchTime>2019-10-12T20:26:43.000Z</launchTime>\n <placement>\n <availabilityZone>us-east-1d</availabilityZone>\n <groupName/>\n <tenancy>default</tenancy>\n </placement>\n <monitoring>\n <state>disabled</state>\n </monitoring>\n <groupSet/>\n <stateReason>\n <code>Client.UserInitiatedShutdown</code>\n <message>Client.UserInitiatedShutdown: User initiated shutdown</message>\n </stateReason>\n <architecture>x86_64</architecture>\n <rootDeviceType>ebs</rootDeviceType>\n <rootDeviceName>/dev/sda1</rootDeviceName>\n <blockDeviceMapping/>\n <virtualizationType>hvm</virtualizationType>\n <clientToken/>\n <hypervisor>xen</hypervisor>\n <networkInterfaceSet/>\n <ebsOptimized>false</ebsOptimized>\n <enaSupport>true</enaSupport>\n <cpuOptions>\n <coreCount>1</coreCount>\n <threadsPerCore>1</threadsPerCore>\n </cpuOptions>\n <capacityReservationSpecification>\n <capacityReservationPreference>open</capacityReservationPreference>\n </capacityReservationSpecification>\n <hibernationOptions>\n <configured>false</configured>\n </hibernationOptions>\n <enclaveOptions>\n <enabled>false</enabled>\n </enclaveOptions>\n </item>\n </instancesSet>\n </item>\n </reservationSet>\n</DescribeInstancesResponse>'
2019-10-12 20:49:42,293 - MainThread - botocore.hooks - DEBUG - Event needs-retry.ec2.DescribeInstances: calling handler <botocore.retryhandler.RetryHandler object at 0x7f8bd2796be0>
2019-10-12 20:49:42,294 - MainThread - botocore.retryhandler - DEBUG - No retry needed.
2019-10-12 20:49:42,294 - MainThread - awscli.clidriver - DEBUG - Exception caught in main()
Traceback (most recent call last):
File "/home/vagrant/.local/lib/python3.6/site-packages/awscli/clidriver.py", line 217, in main
return command_table[parsed_args.command](remaining, parsed_args)
File "/home/vagrant/.local/lib/python3.6/site-packages/awscli/clidriver.py", line 358, in __call__
return command_table[parsed_args.operation](remaining, parsed_globals)
File "/home/vagrant/.local/lib/python3.6/site-packages/awscli/customizations/commands.py", line 190, in __call__
parsed_globals)
File "/home/vagrant/.local/lib/python3.6/site-packages/awscli/clidriver.py", line 530, in __call__
call_parameters, parsed_globals)
File "/home/vagrant/.local/lib/python3.6/site-packages/awscli/customizations/waiters.py", line 208, in invoke
waiter.wait(**parameters)
File "/home/vagrant/.local/lib/python3.6/site-packages/botocore/waiter.py", line 53, in wait
Waiter.wait(self, **kwargs)
File "/home/vagrant/.local/lib/python3.6/site-packages/botocore/waiter.py", line 323, in wait
last_response=response,
botocore.exceptions.WaiterError: Waiter InstanceRunning failed: Waiter encountered a terminal failure state
2019-10-12 20:49:42,296 - MainThread - awscli.clidriver - DEBUG - Exiting with rc 255
Waiter InstanceRunning failed: Waiter encountered a terminal failure state
I also have a similar problem to @pooja-choudhari above: when using .wait_until_stopped on a resource-level EC2 instance, I get a terminal failure state error.
After testing some things, I found that .wait_until_stopped doesn't help.
Edit: I originally had a hunch that this was probably not a problem with boto3 but with something on the AWS backend (though any way boto3 could mitigate it would be helpful). After using the boto3 EC2 waiters directly, I no longer believe this is caused by the AWS backend, but rather by something in how boto3 implements this functionality at the resource level; I no longer get this error when using the EC2 waiters directly. See my comment below.
I am getting the same error. Is there a way to handle such an error? In my case, the stateCode is "Client.InstanceInitiatedShutdown".
After looking around, I've ended up using the waiters directly. Specifically, wherever I'd want to use:
Instance.wait_until_stopped()
I now use:
stopped_instance_waiter = ec2_client.get_waiter('instance_stopped')
stopped_instance_waiter.wait(InstanceIds=[Instance.id])
Yes, it's an annoying amount of boilerplate, but it doesn't produce the error above. Maybe the way boto3 implemented the resource-level method causes occasional errors.
I ran into this with an AMI creation script. My user_data automatically stops the instance once it is set up. I was trying to call wait_until_stopped immediately, but apparently that fails because "pending" is a failure state for wait_until_stopped. I got it to work once by calling wait_until_running first. However, I'm afraid there is a race condition there if the user_data starts stopping the instance before wait_until_running notices that the instance is up.
Is there a better way to handle this case?
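One way to sidestep the canned failure states is to poll the instance state yourself. A minimal sketch (my own, not an official pattern; it assumes instance is an ec2.Instance resource whose user_data will eventually stop it):

import time

# Poll until the instance reaches 'stopped', tolerating the intermediate
# 'pending', 'running', and 'stopping' states that trip the built-in waiters.
deadline = time.time() + 1800   # give up after 30 minutes (arbitrary)
while time.time() < deadline:
    instance.reload()           # refresh cached attributes via DescribeInstances
    state = instance.state['Name']
    if state == 'stopped':
        break
    if state in ('shutting-down', 'terminated'):
        raise RuntimeError('instance %s unexpectedly %s' % (instance.id, state))
    time.sleep(15)
else:
    raise RuntimeError('timed out waiting for %s to stop' % instance.id)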
Problem occurred when I was trying to run a multithreaded program. It disappeared when I stopped using multithreading.
@warrenronsiek That could be it; I was experiencing errors when using this in a multithreaded setting.
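If threads are involved, one thing worth checking (my note, not from this thread): boto3 Session and resource objects are not thread safe, so the usual advice is to give each thread its own session and client. A minimal sketch, assuming instance_ids is a list of IDs to wait on:

import threading
import boto3

def wait_for_running(instance_id):
    # Each thread builds its own session/client instead of sharing one.
    session = boto3.session.Session()
    ec2 = session.client('ec2')
    ec2.get_waiter('instance_running').wait(InstanceIds=[instance_id])

threads = [threading.Thread(target=wait_for_running, args=(iid,))
           for iid in instance_ids]
for t in threads:
    t.start()
for t in threads:
    t.join()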
Greetings! It looks like this issue hasn’t been active in longer than one year. We encourage you to check if this is still an issue in the latest release. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or upvote with a reaction on the initial post to prevent automatic closure. If the issue is already closed, please feel free to open a new one.
I'm having a similar issue to @pooja-choudhari:
import boto3
ec2_client = boto3.client("ec2")
waiter = ec2_client.get_waiter("instance_running")
waiter.wait(WaiterConfig={"Delay": 5}, Filters=[{"Name": "instance-type", "Values": ["t2.micro"]}])
This only occurs when there are instances in the cluster that are in either "terminated" or "shutting-down" state. After a while, the "terminated" instances disappear, and the waiter works fine.
Has anybody found a solution yet? I've tried the methods mentioned here, but nothing works for me.
EDIT: I'm not using multithreading.
For me, it was not a multithreaded program.
In the meantime, I found a solution to my problem. Here's the proper way to use the waiter:
waiter.wait(WaiterConfig={"Delay": 5},
Filters=[{"Name": "instance-type", "Values": [instance_type]},
{"Name": "instance-state-name", "Values": ["running"]}])
Hi Guys,
I'm trying to launch an EKS cluster with the command below and it throws an error. Appreciate any suggestions!
eksctl create cluster --name regapp --region ap-south-1 --version 1.22 \
  --nodegroup-name linux-nodes --node-type t2.micro --nodes 2
2022-11-01 10:38:19 [!] 1 error(s) occurred and cluster hasn't been created properly, you may wish to check CloudFormation console
2022-11-01 10:38:19 [ℹ] to cleanup resources, run 'eksctl delete cluster --region=ap-south-1 --name=regapp'
2022-11-01 10:38:19 [✖] waiter state transitioned to Failure
Error: failed to create cluster "regapp"
During a lambda function to restart an instance, I got the following error:
[ERROR] WaiterError: Waiter InstanceRunning failed: Waiter encountered a terminal failure state: For expression "Reservations[].Instances[].State.Name" we matched expected path: "stopping" at least once
Traceback (most recent call last):
  File "/var/task/lambda_function.py", line 41, in lambda_handler
    waiter.wait(InstanceIds=[instance_id])
  File "/var/runtime/botocore/waiter.py", line 55, in wait
    Waiter.wait(self, **kwargs)
  File "/var/runtime/botocore/waiter.py", line 375, in wait
    raise WaiterError(
# stop the instance
ec2_client.stop_instances(InstanceIds=[instance_id])

# wait for the instance to stop
waiter = ec2_client.get_waiter('instance_stopped')
waiter.wait(InstanceIds=[instance_id])

# start the instance
ec2_client.start_instances(InstanceIds=[instance_id])

# wait for the instance to start
waiter = ec2_client.get_waiter('instance_running')
waiter.wait(InstanceIds=[instance_id])
This has to be some kind of race condition that boto doesn't handle correctly.
When calling wait_until_running() on an instance, sometimes I receive this exception:
2015-07-13 11:44:42,583 INFO call Calling ec2:wait_until_running with {'InstanceIds': ['i-972ed75e']}
2015-07-13 11:45:43,687 ERROR decorated_function Waiter InstanceRunning failed: Waiter encountered a terminal failure state
Traceback (most recent call last):
  ...
  File ".../lib/python3.4/site-packages/boto3/resources/factory.py", line 227, in do_waiter
    waiter(self, *args, **kwargs)
  File ".../lib/python3.4/site-packages/boto3/resources/action.py", line 194, in __call__
    response = waiter.wait(**params)
  File ".../lib/python3.4/site-packages/botocore/waiter.py", line 284, in wait
    reason='Waiter encountered a terminal failure state')
botocore.exceptions.WaiterError: Waiter InstanceRunning failed: Waiter encountered a terminal failure state
In the console, the instance does end up in the running state. I have turned on boto3 debug logging, but I haven't been able to recreate the failure since this happened.
OS X Yosemite 10.10.3 Python 3.4.2 boto3 1.1.0
Edit: I extracted the methods in our custom code to a script that will (hopefully) recreate the issue.