ansible-collections / azure

Development area for Azure Collections
https://galaxy.ansible.com/azure/azcollection
GNU General Public License v3.0

azure_rm inventory returns no hosts, a subset or all #224

Open ljosten opened 4 years ago

ljosten commented 4 years ago
SUMMARY

When requesting an inventory using the dynamic azure_rm plugin, no hosts are returned, even though all included resource groups contain virtual machines that are up and running. With the same inventory YAML configuration in use, the behaviour changes over the course of the day from no hosts returned, to some hosts returned, to all hosts returned.

ISSUE TYPE
COMPONENT NAME

azure_rm dynamic inventory plugin

ANSIBLE VERSION
ansible 2.9.11
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/home/x/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python2.7/dist-packages/ansible
  executable location = /usr/local/bin/ansible
  python version = 2.7.17 (default, Apr 15 2020, 17:20:14) [GCC 7.5.0]
CONFIGURATION
no output
OS / ENVIRONMENT

Ubuntu 20.04 and RHEL 7; installed via pip using ansible[azure]

STEPS TO REPRODUCE

Create inventory definition, for example ansible_azure_rm.yml with following content

plugin: azure_rm
include_vm_resource_groups:
- s85df53d7-sbx-rg
- s85df53d7-sbx-rg
- s85df53d7-sbx-rg
auth_source: auto
keyed_groups:
- prefix: tag
  key: tags

Use ansible-inventory to list the hosts:

ansible-inventory --inventory ansible_azure_rm.yml --list
ANSIBLE_DEBUG="1" ansible-inventory --inventory ansible_azure_rm.yml --list -vvv
   652 1596737744.42542: starting run
ansible-inventory 2.9.11
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/home/x/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python2.7/dist-packages/ansible
  executable location = /usr/local/bin/ansible-inventory
  python version = 2.7.17 (default, Apr 15 2020, 17:20:14) [GCC 7.5.0]
Using /etc/ansible/ansible.cfg as config file
   652 1596737744.50482: Added group all to inventory
   652 1596737744.50488: Added group ungrouped to inventory
   652 1596737744.50495: Group all now contains ungrouped
   652 1596737744.50503: Examining possible inventory source: /x/ansible_azure_rm.yml
   652 1596737744.50684: trying /usr/local/lib/python2.7/dist-packages/ansible/plugins/cache
   652 1596737744.50750: Loading CacheModule 'memory' from /usr/local/lib/python2.7/dist-packages/ansible/plugins/cache/memory.py
   652 1596737744.50795: trying /usr/local/lib/python2.7/dist-packages/ansible/plugins/inventory
   652 1596737744.50876: Loading InventoryModule 'host_list' from /usr/local/lib/python2.7/dist-packages/ansible/plugins/inventory/host_list.py
   652 1596737744.50953: Loaded config def from plugin (inventory/script)
   652 1596737744.50960: Loading InventoryModule 'script' from /usr/local/lib/python2.7/dist-packages/ansible/plugins/inventory/script.py
   652 1596737744.50994: Loading InventoryModule 'auto' from /usr/local/lib/python2.7/dist-packages/ansible/plugins/inventory/auto.py
   652 1596737744.51055: Loaded config def from plugin (inventory/yaml)
   652 1596737744.51061: Loading InventoryModule 'yaml' from /usr/local/lib/python2.7/dist-packages/ansible/plugins/inventory/yaml.py
   652 1596737744.51112: Loading InventoryModule 'ini' from /usr/local/lib/python2.7/dist-packages/ansible/plugins/inventory/ini.py
   652 1596737744.51157: Loading InventoryModule 'toml' from /usr/local/lib/python2.7/dist-packages/ansible/plugins/inventory/toml.py
   652 1596737744.51165: Attempting to use plugin host_list (/usr/local/lib/python2.7/dist-packages/ansible/plugins/inventory/host_list.py)
host_list declined parsing /x/ansible_azure_rm.yml as it did not pass its verify_file() method
   652 1596737744.51252: Attempting to use plugin script (/usr/local/lib/python2.7/dist-packages/ansible/plugins/inventory/script.py)
   652 1596737744.52305: /x/ansible_azure_rm.yml was not parsable by script
   652 1596737744.52334: Attempting to use plugin auto (/usr/local/lib/python2.7/dist-packages/ansible/plugins/inventory/auto.py)
   652 1596737744.52462: Loading data from /x/ansible_azure_rm.yml
   652 1596737744.78182: trying /usr/local/lib/python2.7/dist-packages/ansible/plugins/doc_fragments
   652 1596737744.78368: Loading ModuleDocFragment 'azure' from /usr/local/lib/python2.7/dist-packages/ansible/plugins/doc_fragments/azure.py
   652 1596737744.78578: Loaded config def from plugin (inventory/azure_rm)
   652 1596737744.78590: Loading InventoryModule 'azure_rm' from /usr/local/lib/python2.7/dist-packages/ansible/plugins/inventory/azure_rm.py
   652 1596737744.78766: Loading data from /x/ansible_azure_rm.yml
Parsed /x/ansible_azure_rm.yml inventory source with auto plugin
   652 1596737745.44544: Reconcile groups and hosts in inventory.
   652 1596737745.44564: Loading CacheModule 'memory' from /usr/local/lib/python2.7/dist-packages/ansible/plugins/cache/memory.py (found_in_cache=True, class_only=False)
{
    "_meta": {
        "hostvars": {}
    },
    "all": {
        "children": [
            "ungrouped"
        ]
    }
}
EXPECTED RESULTS

Hosts are retrieved and the inventory is generated.

ACTUAL RESULTS
{
    "_meta": {
        "hostvars": {} 
    }, 
    "all": {
        "children": [  
            "ungrouped"
        ]
    }
}
Fred-sun commented 4 years ago

@ljosten Do you still have this problem? I was reviewing the azure_rm.py module recently and also tested it, and I haven't encountered the problem you mentioned. Can you try again locally and update us with the results of your test? Thank you!

ljosten commented 4 years ago

@Fred-sun the issue is occurring as of right now. It occurs in Azure Pipelines as well as from my local setup. Here is a sample output from a few minutes ago:

azure_rm.yml:

plugin: azure_rm
include_vm_resource_groups:
- s-example_40405977-sbx-rg
auth_source: auto
keyed_groups:
- prefix: tag
  key: tags

ansible-inventory -i azure_rm.yml:

{
    "_meta": {
        "hostvars": {}
    },
    "all": {
        "children": [
            "ungrouped"
        ]
    }
}

ansible-playbook -i azure_rm.yml site.yml:

[WARNING]: provided hosts list is empty, only localhost is available. Note that
the implicit localhost does not match 'all'

PLAY [Configure *****] ****************************************
skipping: no hosts matched

PLAY RECAP *********************************************************************

Finishing: Run Ansible rollout on *******

The exact same configuration worked without problems only minutes before and returned a valid inventory. As I stated above, the behaviour and rate of occurrence change over the course of the day, possibly due to heavier load on Azure API endpoints. A retry mechanism, inspection of API error return codes, or a validation mechanism needs to be present to check the inventory before the playbook is processed.
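
A minimal sketch of such a validation step, assuming a play prepended to the existing site.yml (the play name and message are illustrative, not part of the plugin):

# Hypothetical pre-flight play: abort early if the dynamic inventory came
# back empty instead of silently skipping every subsequent play.
- name: Validate dynamic inventory before rollout
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Fail if the azure_rm inventory returned no hosts
      assert:
        that:
          - groups['all'] | length > 0
        fail_msg: "azure_rm returned an empty inventory, aborting rollout"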

ljosten commented 3 years ago

Is there any progress here, or is further information required? The issue persists as of today, using Ansible 2.10.

kgorskowski commented 3 years ago

I can confirm this is happening periodically in our pipelines. After some retries the inventory is eventually generated, but there is no consistent behavior.

kgorskowski commented 3 years ago

We had some feedback from the Azure side that this may be related to replication times between regions. We run Ansible right after rolling out the instances with Terraform, so according to support it is possible that the state of the instance or resource group is not yet available during the first few Ansible runs. We use "westeurope" as the default location in our Terraform provider configuration and run the Azure inventory plugin with the default settings, which to my understanding resolves to cloud_environment = AzureCloud. Is there maybe a way to configure the Azure plugin to talk to the same region in which the instances were created?
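
For reference, a sketch of the inventory file with that default made explicit (the resource group name is taken from the earlier example). To my understanding, cloud_environment selects the Azure cloud (public, US Government, China, ...) rather than a region, so it would not pin the plugin to "westeurope":

plugin: azure_rm
include_vm_resource_groups:
  - s-example_40405977-sbx-rg
auth_source: auto
# Documented default: selects the Azure cloud, not a regional endpoint.
cloud_environment: AzureCloud
keyed_groups:
  - prefix: tag
    key: tags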

matt961 commented 3 years ago

I can report that I recently discovered this issue in a production deploy where one of my machines did not get picked up by the plugin, and the playbook skipped all tasks as a result. Since there are technically no failures, my rolling-release script happily puts un-updated instances back into the load balancer, resulting in two different versions of my code running. Without any changes, I run my pipeline again and the same instance is added to the inventory as normal.

For me it is not (I would hope) a replication delay, as these machines have been deployed with the same tags for close to half a year now. That still doesn't explain why the very next run, ~10 minutes later, picks up both machines successfully.

To avoid potentially missing updates on machines, I will be adding check tasks to my playbook that target localhost, make sure the relevant host groups have at least one host in them before all other tasks, and fail if not (a sketch follows below).

I'm on one of the later patches of ansible 2.9 using python 2.7.
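
A minimal sketch of that kind of check, assuming a hypothetical group name tag_role_web produced by the keyed_groups configuration shown earlier, placed before all other plays:

- name: Ensure the relevant host groups are populated
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Fail the run if an expected group is empty
      assert:
        that:
          - groups.get('tag_role_web', []) | length > 0
        fail_msg: "Group 'tag_role_web' has no hosts; the inventory may be incomplete"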

ghost commented 3 years ago

I'm also seeing this issue. The majority of the time all the hosts are picked up correctly, but occasionally only a subset of them is. When this happens, rerunning the playbook will usually pick up all the hosts again correctly (with no changes at all).

Originally I thought we were seeing this behaviour after Terraform had updated our VM instances, but I've noticed it also seems to happen even when there have been no recent updates to the VMs.

I'm using ansible 2.10.7 on Python 3.6.8

kgorskowski commented 3 years ago

Hey folks, just for completeness: in our case it was a mix of network topology and a short delay in the sync of the resources between regions. We created the resources in one region, but due to some background networking issues on the side of our self-hosted agents, the request from the Azure dynamic inventory went to a different region. In the end we were able to mitigate the network issue on our side and haven't seen this problem again so far.

ljosten commented 3 years ago

Also, in addition to Karsten's information: another problem with empty inventories being returned was solved by passing an empty host filter list instead of using the default:

plugin: azure_rm
include_vm_resource_groups:
- sb8f2294b-xx1-sbx-rg
- sb8f2294b-xx2-sbx-rg
- sb8f2294b-xx3-sbx-rg
- sb8f2294b-xx4-sbx-rg
auth_source: auto
keyed_groups:
- prefix: tag
  key: tags
default_host_filters: []

We think this had to do with policies enforced in Azure that idled nodes for a prolonged time and moved their deployment status into a state that the default host filters exclude.
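
For context, a sketch of what the empty list overrides: the plugin's documented default host filters drop VMs that are not running or not fully provisioned (values as documented around that time; they may differ between plugin versions):

# Documented defaults for the azure_rm inventory plugin; setting
# default_host_filters: [] disables both filters.
default_host_filters:
  - powerstate != 'running'
  - provisioning_state != 'succeeded'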