equinix-labs / ansible-collection-equinix

Ansible content to help automate the management of Equinix resources
https://deploy.equinix.com/labs/ansible-collection-equinix/
GNU General Public License v3.0

metal_device creates duplicate servers when a server takes too long to become active #192

Open ctreatma opened 3 weeks ago

ctreatma commented 3 weeks ago
SUMMARY

When given a hostname and project_id, the metal_device module attempts to find a server with that hostname in the specified Equinix Metal project, creating a new server if it can't find an existing match. It appears that, if the server is early enough in the provisioning process, the module is unable to find the existing server and submits another request to create an identical server instead of monitoring the existing request.
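The find-or-create race described above can be illustrated with a minimal, self-contained sketch. The `FakeMetalAPI` class, its method names, and the assumption that the lookup misses devices still in `queued` state are all illustrative, not the collection's actual internals:

```python
# Illustrative sketch of the find-or-create race; names and filtering
# behavior are assumptions, not the collection's real code.

class FakeMetalAPI:
    def __init__(self):
        self.devices = []

    def create_device(self, project_id, hostname):
        device = {"project_id": project_id,
                  "hostname": hostname,
                  "state": "queued"}
        self.devices.append(device)
        return device

    def list_devices(self, project_id, include_queued=True):
        # Simulate a lookup that misses devices still early in provisioning.
        return [d for d in self.devices
                if d["project_id"] == project_id
                and (include_queued or d["state"] != "queued")]


def find_or_create_device(api, project_id, hostname):
    # The module's apparent logic: reuse a matching device, else create one.
    for device in api.list_devices(project_id, include_queued=False):
        if device["hostname"] == hostname:
            return device
    return api.create_device(project_id, hostname)


api = FakeMetalAPI()
find_or_create_device(api, "proj-1", "test-ch-dupe-test-0")
find_or_create_device(api, "proj-1", "test-ch-dupe-test-0")
print(len(api.devices))  # prints 2: a duplicate despite the identical hostname
```

Because the first device is invisible to the second lookup, the second call submits a fresh creation request instead of returning the queued server.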

I've observed this issue when a server is in queued state, but it is possible the issue exists for other states as well. I confirmed that the server is visible in the API response by running metal devices get -p <project_id>; since that CLI command hits the same endpoint that the metal_device module uses, this appears to be a problem in the Ansible collection and not in the Equinix Metal API.

ISSUE TYPE

Bug Report

COMPONENT NAME

equinix.cloud.metal_device

ANSIBLE VERSION
ansible [core 2.13.10]
  config file = None
  configured module search path = ['/Users/ctreatman/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /Users/ctreatman/Library/Python/3.9/lib/python/site-packages/ansible
  ansible collection location = /Users/ctreatman/.ansible/collections:/usr/share/ansible/collections
  executable location = /Users/ctreatman/Library/Python/3.9/bin/ansible
  python version = 3.9.6 (default, Feb  3 2024, 15:58:27) [Clang 15.0.0 (clang-1500.3.9.4)]
  jinja version = 3.0.1
  libyaml = False
CONFIGURATION
# No output
OS / ENVIRONMENT

N/A

STEPS TO REPRODUCE

The config below uses an extremely short timeout to guarantee that the module will fail before the device is provisioned. Run the config using ansible-playbook <path/to/file.yaml>. You may need to run the command multiple times; since the hostnames don't change between runs, you should end up with exactly 2 servers no matter how many times you run the config, but if you look in the Equinix Metal console you will see more than 2 servers.

---
- name: create Equinix Metal device
  hosts: localhost
  tasks:
    - equinix.cloud.metal_project:
        name: "Ansible Dupe Test"
      register: project

    - name: "Create {{ cluster_request.controllers.count }} Kube Controllers"
      equinix.cloud.metal_device:
        project_id: "{{ project.id }}"
        state: "present"
        hostname: "{{ cluster_request.env }}-{{ cluster_request.metro }}-{{ cluster_request.class }}{% if cluster_request.subclass %}-{{ cluster_request.subclass }}{% endif %}-{{ k8s_cluster_phase }}-{{ item }}"
        tags:
          - "{{ cluster_request.env }}-{{ cluster_request.metro }}-{{ cluster_request.class }}{% if cluster_request.subclass %}-{{ cluster_request.subclass }}{% endif %}-{{ k8s_cluster_phase }}"
          - kube_controllers
        operating_system: "{{ cluster_request.os }}"
        plan: "{{ cluster_request.plan }}"
        ipxe_script_url: "{{ cluster_request.ipxe_script_url }}"
        userdata: "{{ ignition_content.stdout | string }}" # the string filter is required here; to_json does not work
        metro: "{{ cluster_request.metro }}"
        provisioning_wait_seconds: 5
      when:
        - cluster_request.controllers.count is defined
        - cluster_request.controllers.count > 0
      loop: "{{ range(0, cluster_request.controllers.count) }}"
      retries: 5
      delay: 10
      register: controller_nodes

    - name: "Debug output"
      ansible.builtin.debug:
        var: controller_nodes

You can put the following contents in group_vars/all.yml to ensure the necessary variables are defined for the above config:

---
k8s_cluster_phase: test
ignition_content:
  stdout: ""
cluster_request:
  controllers:
    count: 2
  class: dupe
  subclass:
  env: test
  metro: ch
  os: ubuntu_22_04
  plan: m3.small.x86
  ipxe_script_url: ""
  userdata: ""
EXPECTED RESULTS

The config above should create exactly 2 servers.

ACTUAL RESULTS

More than 2 servers are created, and each hostname is duplicated multiple times.

I've also observed the following error in the Ansible output:

"msg": "Error in metal_device: (422)\nReason: Unprocessable Entity\nHTTP response headers: HTTPHeaderDict({'Content-Type': 'application/json; charset=utf-8', 'X-Frame-Options': 'SAMEORIGIN', 'X-XSS-Protection': '1; mode=block', 'X-Content-Type-Options': 'nosniff', 'X-Download-Options': 'noopen', 'X-Permitted-Cross-Domain-Policies': 'none', 'Referrer-Policy': 'strict-origin-when-cross-origin', 'Cache-Control': 'no-cache', 'X-Request-Id': '9e8560e24f8b12064da1d670d8098e53', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains', 'Content-Length': '99', 'Date': 'Mon, 10 Jun 2024 15:15:55 GMT', 'Connection': 'close'})\nHTTP response body: error=None errors=['No matches found. There are no servers in any metro that match your search criteria.'] href=None\n"
ctreatma commented 3 weeks ago

~I've hit a wall debugging the 422 error further. The metal-python SDK will print HTTP requests and responses to stdout when configuration.debug = True, but Ansible makes it difficult to get at the module's stdout and I've had no luck so far figuring out how to wire that up.~

The 422 error is due to a lack of platform capacity and is an outcome rather than a cause of the duplicate servers.

ctreatma commented 3 weeks ago

NOTE: This issue was mitigated in v0.6.2+ by increasing the wait timeout in metal_device to 30 minutes. The increased timeout makes it less likely to encounter this behavior in the wild, but there is still the potential for it to happen until we come up with a direct fix.

ctreatma commented 3 weeks ago

Upon further inspection, the issue is that the Ansible collection filters by metro when looking up devices; metal-cli does not appear to support a metro filter on devices, so it cannot reproduce this issue.
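The lookup discrepancy can be sketched as follows. The field names, and the assumption that a queued device has no metro assigned yet, are illustrative rather than confirmed API behavior:

```python
# Hedged sketch: a queued device is assumed to have no metro assigned yet.
devices = [
    {"hostname": "test-ch-dupe-test-0", "state": "queued", "metro": None},
    {"hostname": "test-ch-dupe-test-1", "state": "active", "metro": "ch"},
]

def lookup(devices, hostname, metro=None):
    # metal-cli style lookup: hostname only (metro=None)
    # collection style lookup: hostname AND metro
    return [d for d in devices
            if d["hostname"] == hostname
            and (metro is None or d["metro"] == metro)]

# Without the metro filter (metal-cli), the queued device is found;
# with it (the collection), the queued device is missed.
print(len(lookup(devices, "test-ch-dupe-test-0")))        # 1
print(len(lookup(devices, "test-ch-dupe-test-0", "ch")))  # 0
```

Under this assumption, the collection's metro filter excludes the queued server from the lookup, so the module concludes no match exists and creates a duplicate.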

I observed the following behavior when a server is in queued state:

Once the server comes out of queued state, both curl commands return the matching server.