Failure stopping EC2 instances with community.aws.ec2_instance

sjthespian commented 2 years ago

Summary

Using community.aws.ec2_instance to stop instances usually works, but tonight it threw the message "Unable to stop instances:" for two out of three of the instances I was attempting to stop. The playbook task is fairly simple:

- name: Stop gold masters
  community.aws.ec2_instance:
    state: stopped
    wait: true
    instance_ids: "{{ gold_master_instances }}"
    region: us-east-1
    profile: "{{ aws_profile }}"

It is possible that the instances that this failed on were in the process of being shut down by the operating system (Windows), but I would still expect the above task to wait for those instances, not fail. Checking the instances shortly after this failure in the AWS EC2 console showed that the instances were already stopped.

Below is the full text of the error with instance names and other identifying information redacted.

Issue Type

Bug Report

Component Name

ec2_instance

Ansible Version

$ ansible --version
ansible [core 2.12.6]
  config file = /etc/ansible/ansible.cfg
  configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.10/site-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/local/bin/ansible
  python version = 3.10.4 (main, Jun  5 2022, 03:28:37) [GCC 8.3.0]
  jinja version = 3.1.2
  libyaml = True

Collection Versions

$ ansible-galaxy collection list
Collection           Version
-------------------- -------
amazon.aws           3.3.0
ansible.netcommon    3.0.1
ansible.posix        1.4.0
ansible.utils        2.6.1
ansible.windows      1.10.0
awx.awx              21.1.0
cisco.ios            3.1.0
cisco.iosxr          3.1.0
cisco.nxos           3.0.0
cloud.common         2.1.1
community.aws        3.3.0
community.crypto     2.3.2
community.general    5.0.2
community.kubernetes 2.0.1
community.network    4.0.1
community.vmware     2.6.0
community.windows    1.10.0
community.yang       1.1.0
kubernetes.core      2.3.1
vmware.vmware_rest   2.1.5

# /usr/local/lib/python3.10/site-packages/ansible_collections
Collection                    Version
----------------------------- -------
amazon.aws                    2.2.0
ansible.netcommon             2.6.1
ansible.posix                 1.3.0
ansible.utils                 2.6.1
ansible.windows               1.10.0
arista.eos                    3.1.0
awx.awx                       19.4.0
azure.azcollection            1.12.0
check_point.mgmt              2.3.0
chocolatey.chocolatey         1.2.0
cisco.aci                     2.2.0
cisco.asa                     2.1.0
cisco.intersight              1.0.18
cisco.ios                     2.8.1
cisco.iosxr                   2.9.0
cisco.ise                     1.2.1
cisco.meraki                  2.6.2
cisco.mso                     1.4.0
cisco.nso                     1.0.3
cisco.nxos                    2.9.1
cisco.ucs                     1.8.0
cloud.common                  2.1.1
cloudscale_ch.cloud           2.2.1
community.aws                 2.4.0
community.azure               1.1.0
community.ciscosmb            1.0.5
community.crypto              2.3.1
community.digitalocean        1.19.0
community.dns                 2.1.1
community.docker              2.5.1
community.fortios             1.0.0
community.general             4.8.1
community.google              1.0.0
community.grafana             1.4.0
community.hashi_vault         2.5.0
community.hrobot              1.3.0
community.kubernetes          2.0.1
community.kubevirt            1.0.0
community.libvirt             1.1.0
community.mongodb             1.4.0
community.mysql               2.3.7
community.network             3.3.0
community.okd                 2.2.0
community.postgresql          1.7.4
community.proxysql            1.3.2
community.rabbitmq            1.2.1
community.routeros            2.0.0
community.sap                 1.0.0
community.skydive             1.0.0
community.sops                1.2.1
community.vmware              1.18.0
community.windows             1.10.0
community.zabbix              1.6.0
containers.podman             1.9.3
cyberark.conjur               1.1.0
cyberark.pas                  1.0.13
dellemc.enterprise_sonic      1.1.0
dellemc.openmanage            4.4.0
dellemc.os10                  1.1.1
dellemc.os6                   1.0.7
dellemc.os9                   1.0.4
f5networks.f5_modules         1.16.0
fortinet.fortimanager         2.1.5
fortinet.fortios              2.1.4
frr.frr                       1.0.4
gluster.gluster               1.0.2
google.cloud                  1.0.2
hetzner.hcloud                1.6.0
hpe.nimble                    1.1.4
ibm.qradar                    1.0.3
infinidat.infinibox           1.3.3
infoblox.nios_modules         1.2.1
inspur.sm                     1.3.0
junipernetworks.junos         2.10.0
kubernetes.core               2.3.1
mellanox.onyx                 1.0.0
netapp.aws                    21.7.0
netapp.azure                  21.10.0
netapp.cloudmanager           21.17.0
netapp.elementsw              21.7.0
netapp.ontap                  21.19.1
netapp.storagegrid            21.10.0
netapp.um_info                21.8.0
netapp_eseries.santricity     1.3.0
netbox.netbox                 3.7.1
ngine_io.cloudstack           2.2.3
ngine_io.exoscale             1.0.0
ngine_io.vultr                1.1.1
openstack.cloud               1.8.0
openvswitch.openvswitch       2.1.0
ovirt.ovirt                   1.6.6
purestorage.flasharray        1.13.0
purestorage.flashblade        1.9.0
sensu.sensu_go                1.13.1
servicenow.servicenow         1.0.6
splunk.es                     1.0.2
t_systems_mms.icinga_director 1.29.0
theforeman.foreman            2.2.0
vmware.vmware_rest            2.1.5
vyos.vyos                     2.8.0
wti.remote                    1.0.3

AWS SDK versions

$ pip show boto boto3 botocore
WARNING: Package(s) not found: boto
Name: boto3
Version: 1.24.3
Summary: The AWS SDK for Python
Home-page: https://github.com/boto/boto3
Author: Amazon Web Services
Author-email:
License: Apache License 2.0
Location: /usr/local/lib/python3.10/site-packages
Requires: botocore, jmespath, s3transfer
Required-by:
---
Name: botocore
Version: 1.27.3
Summary: Low-level, data-driven core of boto 3.
Home-page: https://github.com/boto/botocore
Author: Amazon Web Services
Author-email:
License: Apache License 2.0
Location: /usr/local/lib/python3.10/site-packages
Requires: jmespath, python-dateutil, urllib3
Required-by: boto3, s3transfer

Configuration

$ ansible-config dump --only-changed

OS / Environment

Running in a docker image built on Debian GNU/Linux 10 (buster)

Steps to Reproduce

- name: Stop gold masters
  community.aws.ec2_instance:
    state: stopped
    wait: true
    instance_ids: "{{ gold_master_instances }}"
    region: us-east-1
    profile: "{{ aws_profile }}"

Expected Results

I would expect the instances in the gold_master_instances list to be stopped after this task runs. If they are already stopped, I would expect it to exit quickly, otherwise I would expect it to wait for them to stop.

This has worked in the past; tonight is the first time I have seen this failure.

Actual Results

{
    "stop_success": [
        "i-0fdxxxxxxxxxxeef0"
    ],
    "stop_failed": [
        "i-0d9xxxxxxxxxx5a5b",
        "i-0ccxxxxxxxxxxdfd0"
    ],
    "msg": "Unable to stop instances: ",
    "invocation": {
        "module_args": {
            "state": "stopped",
            "wait": true,
            "instance_ids": [
                "i-0ccxxxxxxxxxxdfd0",
                "i-0fdxxxxxxxxxxeef0",
                "i-0d9xxxxxxxxxx5a5b"
            ],
            "region": "us-east-1",
            "profile": "dxxxxxxxxxxd",
            "debug_botocore_endpoint_logs": false,
            "validate_certs": true,
            "wait_timeout": 600,
            "security_groups": [],
            "purge_tags": false,
            "ec2_url": null,
            "aws_access_key": null,
            "aws_secret_key": null,
            "security_token": null,
            "aws_ca_bundle": null,
            "aws_config": null,
            "count": null,
            "exact_count": null,
            "image": null,
            "image_id": null,
            "instance_type": null,
            "user_data": null,
            "tower_callback": null,
            "ebs_optimized": null,
            "vpc_subnet_id": null,
            "availability_zone": null,
            "security_group": null,
            "instance_role": null,
            "name": null,
            "tags": null,
            "filters": {
                "instance-state-name": [
                    "pending",
                    "running",
                    "stopping",
                    "stopped"
                ],
                "instance-id": [
                    "i-0ccxxxxxxxxxxdfd0",
                    "i-0fdxxxxxxxxxxeef0",
                    "i-0d9xxxxxxxxxx5a5b"
                ]
            },
            "launch_template": null,
            "key_name": null,
            "cpu_credit_specification": null,
            "cpu_options": null,
            "tenancy": null,
            "placement_group": null,
            "instance_initiated_shutdown_behavior": null,
            "termination_protection": null,
            "detailed_monitoring": null,
            "network": null,
            "volumes": null,
            "metadata_options": null
        }
    },
    "deprecations": [
        {
            "msg": "Default value instance_type has been deprecated, in the future you must set an instance_type or a launch_template",
            "date": "2023-01-01",
            "collection_name": "amazon.aws"
        }
    ],
    "_ansible_no_log": false,
    "changed": false
}

Code of Conduct

[X] I agree to follow the Ansible Code of Conduct

ansibullbot commented 2 years ago

Files identified in the description: None

If these files are inaccurate, please update the component name section of the description or use the !component bot command.


ansibullbot commented 2 years ago

Files identified in the description:

plugins/modules/ec2_instance.py

If these files are inaccurate, please update the component name section of the description or use the !component bot command.


ansibullbot commented 2 years ago

cc @jillr @ryansb @s-hertel @tremble

tremble commented 2 years ago

@sjthespian Thanks for taking the time to open this issue.

ec2_instance was promoted to the "amazon.aws" collection (different support policies), so I've moved the issue over there. Version 3.x of this collection is nearing the end of its support life (we're starting to prepare for 5.0), and 2.x is no longer supported by the community.

Please could you try to reproduce this issue using a more recent release of this collection (4.2.0 is the latest release). There's been some significant work around handling state since 3.x which wasn't all backported and may fix your issue.
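For anyone following along, one way to pull in a newer release is to pin it in a collection requirements file. This is a minimal sketch, not something posted in the thread: the file name `requirements.yml` is just the conventional one, and the `>=4.2.0` constraint simply mirrors the release mentioned above.

```yaml
# requirements.yml (conventional name; an assumption, not from the thread)
# Install with: ansible-galaxy collection install -r requirements.yml
# If an older amazon.aws is already installed, adding --upgrade or --force
# may be needed for the newer release to replace it.
collections:
  - name: amazon.aws
    version: ">=4.2.0"
```

The collection listing in the report shows amazon.aws in two locations (3.3.0 and 2.2.0), so it is worth re-running `ansible-galaxy collection list` afterwards to confirm which version the playbook will actually load.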

sjthespian commented 2 years ago

Let me run some testing on the amazon.aws version next week. This is the first time I have seen this issue, and it never showed up in my testing of the module in our dev environment, so this could be tough to reproduce.

sjthespian commented 2 years ago

So far things are looking good using amazon.aws.ec2_instance. I don't know whether the original problem was a race condition in the cluster or whether switching actually fixed things, but in either case I'm going to go ahead and close this. Unfortunately, I don't have a cluster I can keep experimenting with to try to reproduce the issue.

Thanks for the help!

sjthespian commented 2 years ago

Reopening this -- I just had the same failure using amazon.aws.ec2_instance.

# Make sure gold masters are stopped
- name: Stop gold masters
  amazon.aws.ec2_instance:
    state: stopped
    wait: true
    instance_ids: "{{ gold_master_instances }}"
    region: us-east-1
    profile: "{{ aws_profile }}"
  tags:
    - sysprep

This sometimes fails with:

{
    "stop_success": [
        "i-08b7xxxxxxxx01e4",
        "i-0604xxxxxxxxf9f6"
    ],
    "stop_failed": [
        "i-074bxxxxxxxx3a79"
    ],
    "msg": "Unable to stop instances: ",
    "invocation": {
        "module_args": {
            "state": "stopped",
            "wait": true,
            "instance_ids": [
                "i-08b7xxxxxxxx01e4",
                "i-0604xxxxxxxxf9f6",
                "i-074bxxxxxxxx3a79"
            ],
            "region": "us-east-1",
            "profile": "xxx-xxx-xxx",
            "debug_botocore_endpoint_logs": false,
            "validate_certs": true,
            "wait_timeout": 600,
            "security_groups": [],
            "purge_tags": false,
            "ec2_url": null,
            "aws_access_key": null,
            "aws_secret_key": null,
            "security_token": null,
            "aws_ca_bundle": null,
            "aws_config": null,
            "count": null,
            "exact_count": null,
            "image": null,
            "image_id": null,
            "instance_type": null,
            "user_data": null,
            "tower_callback": null,
            "ebs_optimized": null,
            "vpc_subnet_id": null,
            "availability_zone": null,
            "security_group": null,
            "instance_role": null,
            "name": null,
            "tags": null,
            "filters": {
                "instance-state-name": [
                    "pending",
                    "running",
                    "stopping",
                    "stopped"
                ],
                "instance-id": [
                    "i-08b7xxxxxxxx01e4",
                    "i-0604xxxxxxxxf9f6",
                    "i-074bxxxxxxxx3a79"
                ]
            },
            "launch_template": null,
            "key_name": null,
            "cpu_credit_specification": null,
            "cpu_options": null,
            "tenancy": null,
            "placement_group": null,
            "instance_initiated_shutdown_behavior": null,
            "termination_protection": null,
            "detailed_monitoring": null,
            "network": null,
            "volumes": null,
            "metadata_options": null
        }
    },
    "deprecations": [
        {
            "msg": "Default value instance_type has been deprecated, in the future you must set an instance_type or a launch_template",
            "date": "2023-01-01",
            "collection_name": "amazon.aws"
        }
    ],
    "_ansible_no_log": false,
    "changed": false
}

I believe what is happening is that something earlier in the playbook tells the instances to shut down from within the OS. By the time the playbook reaches this task, the instances are either already down or in the process of shutting down. However, since I am using wait: true, I would expect it to wait for them all to stop rather than fail with the above error.
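A possible workaround, sketched under the assumption that the failure is only a timing issue with instances that are already mid-shutdown: let the stop task fail softly, then poll the instance state with amazon.aws.ec2_instance_info until every instance reports "stopped". The variable name `gold_master_state` and the retry count and delay are illustrative choices, not values from this report, and this has not been verified against the failing environment.

```yaml
# Sketch only: "gold_master_state" is a hypothetical variable name, and the
# retries/delay values are arbitrary, not taken from the report.
- name: Stop gold masters
  amazon.aws.ec2_instance:
    state: stopped
    wait: true
    instance_ids: "{{ gold_master_instances }}"
    region: us-east-1
    profile: "{{ aws_profile }}"
  # Assumption: a failure here may only mean an instance was mid-shutdown,
  # so the real pass/fail decision is deferred to the poll below.
  ignore_errors: true

- name: Wait until every gold master reports the "stopped" state
  amazon.aws.ec2_instance_info:
    instance_ids: "{{ gold_master_instances }}"
    region: us-east-1
    profile: "{{ aws_profile }}"
  register: gold_master_state
  # Retry until no instance is in any state other than "stopped".
  until: >-
    gold_master_state.instances
    | map(attribute='state.name')
    | reject('equalto', 'stopped')
    | list
    | length == 0
  retries: 20
  delay: 30
```

Note that `ignore_errors: true` also hides genuine failures of the stop call (bad credentials, for example), so the polling task is what actually guards the rest of the play. If it exhausts its retries, the play fails there.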
