ansible / ansible

Ansible is a radically simple IT automation platform that makes your applications and systems easier to deploy and maintain. Automate everything from code deployment to network configuration to cloud management, in a language that approaches plain English, using SSH, with no agents to install on remote systems. https://docs.ansible.com.
https://www.ansible.com/
GNU General Public License v3.0

connection variables are not recursively templating under delegation #72776

Open rwagnergit opened 3 years ago

rwagnergit commented 3 years ago
SUMMARY

Beginning in ansible 2.9.10, tasks whose ansible_connection is evaluated dynamically (i.e., from a Jinja2 expression) fail.

ISSUE TYPE

Bug Report

COMPONENT NAME

ansible_connection

ANSIBLE VERSION
ansible --version
ansible 2.10.3
  config file = /home/rowagn/git/oracle-core/automation/ansible/db/ansible.cfg
  configured module search path = ['/home/rowagn/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /sso/sfw/virtualenv/ansible2915/lib/python3.8/site-packages/ansible_base-2.10.3-py3.8.egg/ansible
  executable location = /sso/sfw/virtualenv/ansible2915/bin/ansible
  python version = 3.8.6 (default, Dec  1 2020, 10:43:59) [GCC 5.4.0 20160609]
CONFIGURATION
ansible-config dump --only-changed
[DEPRECATION WARNING]: ALLOW_WORLD_READABLE_TMPFILES option, moved to a per plugin approach that is more flexible. , use mostly the same config will work, but now controlled from
 the plugin itself and not using the general constant. instead. This feature will be removed in version 2.14. Deprecation warnings can be disabled by setting 
deprecation_warnings=False in ansible.cfg.
ALLOW_WORLD_READABLE_TMPFILES(/home/rowagn/git/oracle-core/automation/ansible/db/ansible.cfg) = True
ANSIBLE_PIPELINING(/home/rowagn/git/oracle-core/automation/ansible/db/ansible.cfg) = True
ANSIBLE_SSH_ARGS(/home/rowagn/git/oracle-core/automation/ansible/db/ansible.cfg) = -C -o ControlMaster=no -o StrictHostKeyChecking=no -o GSSAPIAuthentication=no
DEFAULT_EXECUTABLE(/home/rowagn/git/oracle-core/automation/ansible/db/ansible.cfg) = /etc/ansible-wrapper
DEFAULT_ROLES_PATH(/home/rowagn/git/oracle-core/automation/ansible/db/ansible.cfg) = ['/usr/share/ansible/roles']
DEFAULT_TIMEOUT(/home/rowagn/git/oracle-core/automation/ansible/db/ansible.cfg) = 20
RETRY_FILES_ENABLED(/home/rowagn/git/oracle-core/automation/ansible/db/ansible.cfg) = False
OS / ENVIRONMENT

Ubuntu 16.04

uname -a
Linux localhost.localdomain.na.sas.com 4.4.0-193-generic #224-Ubuntu SMP Tue Oct 6 17:15:28 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

STEPS TO REPRODUCE

Create a hosts file containing a host that is NOT localhost

cat ~/r
rowagn-tower-test01.vsp.sas.com

Then run the following playbook:

ansible-playbook test.yml -i ~/r
---
- hosts: all
  gather_facts: no

  tasks:
  - name: determine control machine user
    shell: whoami
    register: whoami_control_machine_output
    delegate_to: localhost
    become: no

  - debug: var=whoami_control_machine_output
  - debug: msg="{{ (whoami_control_machine_output.stdout == 'awx') | ternary('ssh','local') }}"

  - name: create directory localhost
    file:
      path: /tmp/bogus
      state: directory
      mode: u=rwX,g=,o=
    delegate_to: localhost
    vars:
      ansible_connection: "{{ (whoami_control_machine_output.stdout == 'awx') | ternary('ssh','local') }}"
EXPECTED RESULTS

On ansible 2.9.9, the playbook runs without errors:

$ ansible-playbook test.yml -i ~/r 

PLAY [all] *********************************************************************

TASK [determine control machine user] ******************************************
changed: [rowagn-tower-test01.vsp.sas.com -> localhost]

TASK [debug] *******************************************************************
ok: [rowagn-tower-test01.vsp.sas.com] => {
    "whoami_control_machine_output": {
        "changed": true, 
        "cmd": "whoami", 
        "delta": "0:00:00.004553", 
        "end": "2020-12-01 13:42:11.349327", 
        "failed": false, 
        "rc": 0, 
        "start": "2020-12-01 13:42:11.344774", 
        "stderr": "", 
        "stderr_lines": [], 
        "stdout": "rowagn", 
        "stdout_lines": [
            "rowagn"
        ]
    }
}

TASK [debug] *******************************************************************
ok: [rowagn-tower-test01.vsp.sas.com] => {
    "msg": "local"
}

TASK [create directory localhost] **********************************************
ok: [rowagn-tower-test01.vsp.sas.com -> localhost]

PLAY RECAP *********************************************************************
rowagn-tower-test01.vsp.sas.com : ok=4    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0   
ACTUAL RESULTS

On Ansible > 2.9.9, it fails. Here is the output from 2.10.3:

$ ansible-playbook test.yml -i ~/r -vvvv
ansible-playbook 2.10.3
  config file = /home/rowagn/git/oracle-core/automation/ansible/db/ansible.cfg
  configured module search path = ['/home/rowagn/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /sso/sfw/virtualenv/ansible2915/lib/python3.8/site-packages/ansible_base-2.10.3-py3.8.egg/ansible
  executable location = /sso/sfw/virtualenv/ansible2915/bin/ansible-playbook
  python version = 3.8.6 (default, Dec  1 2020, 10:43:59) [GCC 5.4.0 20160609]
Using /home/rowagn/git/oracle-core/automation/ansible/db/ansible.cfg as config file
[DEPRECATION WARNING]: ALLOW_WORLD_READABLE_TMPFILES option, moved to a per 
plugin approach that is more flexible. , use mostly the same config will work, 
but now controlled from the plugin itself and not using the general constant. 
instead. This feature will be removed in version 2.14. Deprecation warnings can
 be disabled by setting deprecation_warnings=False in ansible.cfg.
setting up inventory plugins
host_list declined parsing /home/rowagn/r as it did not pass its verify_file() method
script declined parsing /home/rowagn/r as it did not pass its verify_file() method
auto declined parsing /home/rowagn/r as it did not pass its verify_file() method
Parsed /home/rowagn/r inventory source with ini plugin
Loading callback plugin default of type stdout, v2.0 from /sso/sfw/virtualenv/ansible2915/lib/python3.8/site-packages/ansible_base-2.10.3-py3.8.egg/ansible/plugins/callback/default.py
Skipping callback 'default', as we already have a stdout callback.
Skipping callback 'minimal', as we already have a stdout callback.
Skipping callback 'oneline', as we already have a stdout callback.

PLAYBOOK: test.yml *************************************************************
Positional arguments: test.yml
verbosity: 4
connection: smart
timeout: 20
become_method: sudo
tags: ('all',)
inventory: ('/home/rowagn/r',)
forks: 5
1 plays in test.yml

PLAY [all] *********************************************************************
META: ran handlers

TASK [determine control machine user] ******************************************
task path: /home/rowagn/git/oracle-core/automation/ansible/db/test.yml:6
Using module file /sso/sfw/virtualenv/ansible2915/lib/python3.8/site-packages/ansible_base-2.10.3-py3.8.egg/ansible/modules/command.py
Pipelining is enabled.
<localhost> ESTABLISH LOCAL CONNECTION FOR USER: rowagn
<localhost> EXEC /etc/ansible-wrapper -c '/sso/sfw/virtualenv/ansible2915/bin/python && sleep 0'
changed: [rowagn-tower-test01.vsp.sas.com] => {
    "changed": true,
    "cmd": "whoami",
    "delta": "0:00:00.005582",
    "end": "2020-12-01 13:43:49.748619",
    "invocation": {
        "module_args": {
            "_raw_params": "whoami",
            "_uses_shell": true,
            "argv": null,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "stdin": null,
            "stdin_add_newline": true,
            "strip_empty_ends": true,
            "warn": true
        }
    },
    "rc": 0,
    "start": "2020-12-01 13:43:49.743037",
    "stderr": "",
    "stderr_lines": [],
    "stdout": "rowagn",
    "stdout_lines": [
        "rowagn"
    ]
}

TASK [debug] *******************************************************************
task path: /home/rowagn/git/oracle-core/automation/ansible/db/test.yml:12
ok: [rowagn-tower-test01.vsp.sas.com] => {
    "whoami_control_machine_output": {
        "changed": true,
        "cmd": "whoami",
        "delta": "0:00:00.005582",
        "end": "2020-12-01 13:43:49.748619",
        "failed": false,
        "rc": 0,
        "start": "2020-12-01 13:43:49.743037",
        "stderr": "",
        "stderr_lines": [],
        "stdout": "rowagn",
        "stdout_lines": [
            "rowagn"
        ]
    }
}

TASK [debug] *******************************************************************
task path: /home/rowagn/git/oracle-core/automation/ansible/db/test.yml:13
ok: [rowagn-tower-test01.vsp.sas.com] => {
    "msg": "local"
}

TASK [create directory localhost] **********************************************
task path: /home/rowagn/git/oracle-core/automation/ansible/db/test.yml:15
fatal: [rowagn-tower-test01.vsp.sas.com]: FAILED! => {
    "msg": "'whoami_control_machine_output' is undefined"
}

PLAY RECAP *********************************************************************
rowagn-tower-test01.vsp.sas.com : ok=3    changed=1    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0   

Note that 2.9.10 and 2.9.11 fail with a different error:

TASK [create directory localhost] **********************************************
fatal: [rowagn-tower-test01.vsp.sas.com]: FAILED! => {"msg": "the connection plugin '{{ (whoami_control_machine_output.stdout == 'awx') | ternary('ssh','local') }}' was not found"}
ansibot commented 3 years ago

Files identified in the description: None

If these files are incorrect, please update the component name section of the description or use the !component bot command.

click here for bot help

rwagnergit commented 3 years ago

Possibly related to https://github.com/ansible/ansible/issues/70653.

rwagnergit commented 3 years ago

Issue also exists in 2.11 (devel branch)

sivel commented 3 years ago

I'm looking over https://github.com/ansible/ansible/commit/8c213c93345db5489c24458880ec3ff81b334dbd and I'm not sure what the right thing to do is here.

I'm pretty sure that commit was addressing a symptom and not the problem.

The task definition should be templated using the host vars of the host in the loop, and as such, when post_validate runs, I expect that the task is finalized.

However, in https://github.com/ansible/ansible/commit/1da47bfa8c6711e19902e4a1460d3276d33664e1 we made a change to not template vars during post_validate, and I'm questioning that change as well.

Because of the decision to not post_validate vars, we've delayed evaluating vars until after we have substituted the vars of the delegated host, which is now incorrect, since the task definition should reflect the current host, not the delegated host.

I think we need to revert both https://github.com/ansible/ansible/commit/8c213c93345db5489c24458880ec3ff81b334dbd and https://github.com/ansible/ansible/commit/1da47bfa8c6711e19902e4a1460d3276d33664e1 and then implement a different fix for the issue that https://github.com/ansible/ansible/commit/1da47bfa8c6711e19902e4a1460d3276d33664e1 was attempting to solve.

semora81 commented 3 years ago

Hi, as this is my first time tracking an issue with Ansible: how/when can we find out whether this has been reverted as sivel is proposing?

will this issue be updated with the fix details?

thanks

bcoca commented 3 years ago

might be fixed by #72419

rwagnergit commented 3 years ago

@bcoca and @sivel (and anyone running into this) - following #ansible-meeting this morning, you suggested using set_fact to create the variable I need in ansible_connection. I tried that, but I cannot set a fact for localhost using set_fact:

I tried:

  - set_fact:
      use_ansible_connection: "{{ (whoami_control_machine_output.stdout == 'awx') | ternary('ssh','local') }}"
  - set_fact:
      use_ansible_connection: "{{ (whoami_control_machine_output.stdout == 'awx') | ternary('ssh','local') }}"
    delegate_to: localhost

To examine hostvars['localhost'], I used:

debug: var=hostvars['localhost']['use_ansible_connection']

but that yields "VARIABLE IS NOT DEFINED!" in both cases.

However, setting the variable with add_host works:

  - add_host:
      name: 'localhost'
      use_ansible_connection: "{{ (whoami_control_machine_output.stdout == 'awx') | ternary('ssh','local') }}"

and then I can use:

      ansible_connection: "{{ hostvars['localhost']['use_ansible_connection'] }}"

This works in 2.10. It's a clunky workaround, though, so I'm still holding out hope for bcoca's PR :-)

Just adding some more info here in case anyone else is following.

bcoca commented 3 years ago

see delegate_facts: https://docs.ansible.com/ansible/latest/user_guide/playbooks_delegation.html#delegating-facts

rwagnergit commented 3 years ago

I'll be damned. Thanks @bcoca. Just needed:

  - set_fact:
      use_ansible_connection: "{{ (whoami_control_machine_output.stdout == 'awx') | ternary('ssh','local') }}"
    delegate_to: localhost
    delegate_facts: yes

And then use it as above:

ansible_connection: "{{ hostvars['localhost']['use_ansible_connection'] }}"

I tested this workaround successfully in 2.9.17 and 2.10.5.
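For anyone landing here later, here is a minimal sketch of the whole pattern applied to the reproducer above (same placeholder path and the same 'awx' check as the original playbook); treat it as illustrative rather than canonical:

---
- hosts: all
  gather_facts: no

  tasks:
  - name: determine control machine user
    shell: whoami
    register: whoami_control_machine_output
    delegate_to: localhost
    become: no

  # store the computed connection type as a fact on localhost via delegate_facts,
  # so later tasks can read it through hostvars instead of re-templating it
  - set_fact:
      use_ansible_connection: "{{ (whoami_control_machine_output.stdout == 'awx') | ternary('ssh','local') }}"
    delegate_to: localhost
    delegate_facts: yes

  - name: create directory localhost
    file:
      path: /tmp/bogus
      state: directory
      mode: u=rwX,g=,o=
    delegate_to: localhost
    vars:
      # the fact is already a plain string, so no recursive templating is needed here
      ansible_connection: "{{ hostvars['localhost']['use_ansible_connection'] }}"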

mutech commented 11 months ago

I think that since this is three years later, it's safe to assume that you cannot use delegation with an inventory in which variables such as ansible_host or ansible_become_pass are templated? At least not unless you know that the controller and target use the same dependent variables, which of course you won't know.

I'm a relatively fresh Ansible user, and it's incredibly difficult to predict what Ansible does with templated variable definitions at any given moment.

My personal best practices are becoming:

This is a rather frustrating experience.

bcoca commented 11 months ago

I think that since this is three years later, it's safe to assume that you cannot use delegation with an inventory in which variables such as ansible_host or ansible_become_pass are templated?

@mutech no, you should be able to do so and have them templated in the 'correct host context'. All connection/become/shell options are templated in the DELEGATED host context, not in the inventory_hostname one; the only variable that still refers to the original host is inventory_hostname, and even inventory_hostname_short refers to the delegated host.

All other variables and options are templated in the inventory_hostname context. I hope this clarifies templating for you.

I'll look at making the documentation clearer; templating should be widely used to avoid copy and paste.
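As a hedged sketch of that rule (the inventory hostnames and users below are made up for illustration):

# illustrative inventory:
#   web01  ansible_user=webuser
#   db01   ansible_user=dbuser

- hosts: web01
  gather_facts: no
  tasks:
  - name: connection options follow the delegated host
    command: whoami
    delegate_to: db01
    # this connects to db01 as dbuser (db01's ansible_user), not as webuser;
    # inside the task, inventory_hostname still resolves to web01, while
    # inventory_hostname_short follows the delegated host, as described above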

mutech commented 11 months ago

Of course inventory_hostname is what I use for ansible_hostname and querying things like become passwords from the keyring.

How do you know that? I did not even know that inventory_hostname_short exists, and I'm pretty sure I read everything about variables (and forgot it again because it's too much). That might be a life saver for my current problem; I was actually about to go all copy and paste in my fury (I spent a whole afternoon trying to delegate a single task "cleanly").

Just found the documentation, though I have no idea what the short version of the hostname is. I decided at one point to use unqualified hostnames in my inventory (to avoid having to quote everything). So I might be lucky and short hostnames might be the same as long ones.

But that would still be a hack in my view. I would expect a task running under delegate_to to have exactly the same scope it would have if it ran in a playbook on its own (maybe with the exception that there might be an outer scope providing the view of the playbook, e.g. as in ".inventory_hostname").

That is, in my opinion, the only way to get a reliable and understandable context for a delegated task.

bcoca commented 11 months ago

How do you know that?

I've either written and/or read and/or modified the code ... but that is cheating and I could not find docs, so I opened: https://github.com/ansible/ansible-documentation/pull/527

FYI: inventory_hostname_short = inventory_hostname.split('.')[0], so if there is no '.' they are basically the same. I'll update https://docs.ansible.com/ansible/latest/reference_appendices/special_variables.html#term-inventory_hostname_short to make it clearer.

".inventory_hostname").

That is what hostvars[inventory_hostname] should provide. Do you have examples in which this is insufficient?
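For example, a minimal sketch of explicitly reading the original host's view from inside a delegated task (the variable name and the delegation target are placeholders):

- name: read the original host's value while the task is delegated
  debug:
    msg: "{{ hostvars[inventory_hostname]['some_variable'] | default('not set') }}"
  delegate_to: localhost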

mutech commented 11 months ago

Correct me if I'm wrong (I didn't have the time to read the source): what happens when a task is executed with delegate_to is that some variables are adapted to the delegation target's context and some are not, e.g. inventory_hostname is still whatever it was before the delegation, while inventory_hostname_short is set to the delegation context. Did I get that right?

The problem with that is that, without reading the source (and I'm never certain I did that for the Ansible version I'm running at that moment in time), I don't really know which variables I can use in my setup.

That's why I'm suggesting a simple rule (simple for the user; for the Ansible implementation it's probably a big change, especially considering that it breaks code all over the place): all variables get set as if the task were executed in a playbook with the delegation target as host.

Using hostvars[some-inv-host] is not enough, because these variables are subject to templating and thus might use references from "the other" context; hell, I can imagine that even entries from hostvars[delegating-host] become invalid if some variables from hostvars[delegation-target] change because of connection variables updated for delegation.

I don't see how that can ever work reliably without scoping (making sure that templating always happens in a well defined scope/context). By well defined I mean both well documented by ansible and well understood by users.

That is what hostvars[inventory_hostname] should provide. Do you have examples in which this is insufficient?

My setup looks like this (inventory pseudo code):

all:
  vars:
    become_key: "ansible-become:{{ inventory_hostname }} {{ ansible_user }}"
    ansible_become_pass: >-
      {{ lookup('community.general.keyring',
                become_key, errors='ignore') | default(omit) }}
    # various configuration data that is shared between home and hosted

  children:
    home:
      vars:
        ansible_user: mu
        become_key: "ansible-become:home {{ ansible_user }}"

    hosted:
      vars:
        ansible_user: admin

The general problem is that I often need "the view of a host". For example, when I need to create a cloud-init fragment, I need the ansible_become_pass for some VPS and its ansible_user. I would like to use hostvars[vps-name].ansible_become_pass, but that does not work, because it's a template using inventory_hostname (and has to be, unless I copy&paste). delegate_to would not help either, because the lookup runs locally.

Strangely, delegate_to in conjunction with become: yes sometimes works with this or similar setups and sometimes doesn't. It works in the home group context, because I have LDAP and thus always the same password, but cross-group delegation also sometimes works (or it did in the past when I first tested this or a similar setup).

All this is not about the original problem with delegate_to, but as I understand it, it has the same root cause (lack of context/scope when using templated variables in different contexts: delegation, imported or included roles, playbook items, etc.).

mutech commented 11 months ago

I have to admit that I am probably using Ansible in an unintended way. I have a lot of configuration data in my inventory (the whole network configuration, how networks are connected, service specifications, service deployments; it's pretty much what you find in etcd in a k8s cluster). Most of my playbooks are named along the theme of 'make it so', and the 'it' is something like update, DNS, nextcloud, freeipa, etc. I also have some state in the inventory (host_vars/group_vars folders), which is really nice because --diff --check provides a good overview of what's going on.

bcoca commented 11 months ago

is that some variables are adapted to the delegation target's context, and some are not,

It is not the variables themselves that are changed; it is the configuration options for the connection-related plugins that are templated using a different set of variables. Any other option/parameter/variable/field will continue to use the variables in the inventory_hostname context.

For example, remote_user is commonly set via ansible.cfg, the command line, an environment variable or the ansible_user variable. For this setting we always consider the host we are connecting to, and all variables except inventory_hostname will be sourced from that host. 'That host' defaults to inventory_hostname, but can be the delegated host.

Also, you assume ansible_user and other variables will reflect the configuration; this is sometimes true, but not guaranteed, especially for become information. These variables are meant as a high-precedence way of setting it, not of reflecting the configuration. I would switch to the config lookup.

become_key: "ansible-become:home {{ q('config', 'remote_user', plugin_type='connection', plugin_name='ssh') }}"

^ this also takes into account all configuration sources, including higher-precedence variables like ansible_ssh_user, which for the ssh plugin would override ansible_user.

The only caveat is that it always uses the inventory_hostname context (I should add a host= parameter to be able to specify the context).
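For reference, a hedged usage sketch of the config lookup inside a task (using the same option names as the snippet above; the task itself is just an illustration):

- name: show the effective ssh remote_user as the configuration system resolves it
  debug:
    msg: "{{ lookup('config', 'remote_user', plugin_type='connection', plugin_name='ssh') }}"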

mutech commented 11 months ago

Is there a way to evaluate a template in the context of a host? F.e:

- hosts: [ localhost ]
  tasks:
  - debug:
      msg: "{{ lookup('???', 'some_variable', inventory_hostname='some_host') }}"

This is supposed to print exactly the same output (even if some_variable is defined as "{{ inventory_hostname }}-{{ ansible_user }}-...") as this:

- hosts: [ some_host ]
  tasks:
  - debug:
      msg: "{{ some_variable }}"

If there is a lookup '???' that does this, most of my issues go away. For my configuration, the only key that is relevant (as far as I can see) is inventory_hostname. I believe that ansible_user is (at least in my setup) deterministic, but you are right, the config lookup would be a better way to access the configured user. Does using the connection plugin type and ssh as the plugin ensure that variables are resolved so that delegate_to is taken into account?

(I should add host= paramter for being able to specify context).

I'm not quite sure I understand what the config lookup does. If it did what the example above shows, then yes, please pretty please add that parameter :-)

Btw., Thanks a lot for taking the time to go into so much detail. That's very much appreciated!

bcoca commented 11 months ago

There is not currently a lookup that can do this; lookups don't get the required info to reconstruct the variable view.

I'm looking into a way to update the config lookup to do this. This lookup is designed to access configuration the same way the plugins do, so the resolution would be the same; the one problem is 'host context', which we currently provide to the plugin indirectly, via the variables available.

mutech commented 11 months ago

I have to apologize for my initial snappy comment; looking at how you reacted to it, I feel a bit like a Neanderthal grunting in displeasure. Huge thanks for your efforts, especially in reaction to that kind of attitude!

mutech commented 11 months ago

Just for clarification, looking at this:

- hosts: [ localhost ]
  vars:
    some_extra_var_a: ...
    some_extra_var_b: value_to_be_overridden
  tasks:
  - debug:
      msg: "{{ lookup('???', 'some_variable', inventory_hostname='some_host', some_extra_var_b='...') }}"
- hosts: [ some_host ]
  vars:
    some_extra_var_a: ...
    some_extra_var_b: ...
  tasks:
  - debug:
      msg: "{{ some_variable }}"

Assuming that some_extra_var_a/b would be defined the same, would that too result in some_extra_var_b overriding whatever might be defined for b?

That's not a feature I need, but I think that if that's not working, similar problems might pop up in other use cases (once someone discovers that config has these semantics).

bcoca commented 11 months ago

I'm not sure what you are trying to do, but var precedence would have the vars you declare at the play level override host vars. So you would already be overriding them without needing a special facility on the lookup.

mutech commented 11 months ago

I'm still thinking of the delegation problem, where you might need the delegated-to host's view of things.

Having config accept inventory_hostname would take care of obtaining the correct configuration of the target from the inventory; being able to override variables set by the playbook would take care of changes made by the playbook for the sake of the controller (localhost or, more generally, the delegation source) and override them to suit the target.

This is a bit contrived, because if you keep the configuration data mostly in the inventory, everything should really depend only on the inventory_hostname and not on additional parameters set on the outside.

An example of why you might want something like this could be a parameter with the values test and production, where you might want the actual target configuration to be identical, but the controller would need to know the difference, e.g. to set up a virtual network environment for testing while the target uses the same network parameters.

If the config lookup only sees the inventory data (and not whatever variables have been set in the playbook), there should be no problem.

I find that a bit difficult to explain, I hope my point comes across.

bcoca commented 11 months ago

What I think you are missing is that playbook variables override host vars; this does not change for delegation.

The config lookup sees the 'current variable context'; it includes, but is not limited to, inventory data. This context normally belongs to the inventory_hostname; what I'm trying to do is to see the delegated/arbitrary host context, which would still include overrides from the play itself.

So it would present the data as it would behave in the context of the play, not ignore it. It should work as closely as possible to the actual engine resolution of the config values; otherwise it will be misinforming the user.

mutech commented 11 months ago

I'm not sure if I'm stretching your patience with this discussion or my inquiries, please tell me if so or when this gets too much off-topic.

I love Ansible's Inventory concept, because it's so expressive and you can beautifully sort configuration data into groups and override them for sub-groups and hosts. You can even put some external state (f.e. hosted DNS or VPS info) into group-vars files. Awesome.

Once you use this however, things get naturally complex. The author of configuration data has only the inventory available, they can't know what variables playbooks or roles use. The author of a role does not know whether their tasks are used in a delegated context and what the implications of that are. And - as this issue shows - the author of a playbook needing to delegate a task to another host often also does not know what the implications of that delegation are.

Now I understand (to some extent) how variable resolution works in Ansible. I can't really remember the process (it's too complex), but yes, the playbook can override variables, and late evaluation of templates will most often have the expected results.

When you told me about the config lookup, I hoped that this would deliver the inventory author's view on data, that is, the view that is "untainted" by whatever playbooks and/or roles define.

I have no idea what the config lookup is actually doing. The documentation says:

Lookup current Ansible configuration values

I would take that to mean that it parses ansible.cfg and other sources of Ansible configuration. I don't really understand why you would need to specify a plugin and plugin type for that, though. From what you wrote, I understand that it does that in the current context, meaning with all the variable definitions and other deviations created by that context. So it actually seems to be a variable lookup that evaluates templates!? You proposed to exclude or further override inventory_hostname so that it can be used in a delegation context to obtain the variables for that host. Again, that's great, but it solves only part of the problem that delegation poses, because I guess most people understand delegation to work as "execute that task as if it was executed in the target host's context".

The root of this whole thing is probably the expectation that, when you write a task as a playbook author, it mostly behaves like a method call: the execution depends on the parameters passed (and the external state). In the case of delegate_to, this is or was not true, because of "magic variables" (I remember them being called that) that do not all follow the delegation. In the case of roles, it's even less true, because they don't have parameters; they operate on a global state inherited from whatever calls them. They define an interface (argument_spec), but that interface is incomplete because of templates. So you never really know what happens when you import, and much less when you include, a role.

My original snappy comment about copy & paste was about controlling who gets to see which information, because I simply cannot manage the complexity involved with variable evaluation in various contexts. I found a way to code myself through this jungle in most cases, but every now and then I scream at my screen and don't get why this or that variable is wrong and why I can't make it work without analyzing the entirety of my (huge) inventory, roles, collections, tasks, playbooks, keyring or vault definitions. I even tried to debug code running on the target, only to find out that this is utterly impossible.

The inventory is a holy grail for me, because it ideally has a single dependency (if the author is doing the right thing), inventory_hostname. This is the basis on which every playbook operates before it overrides variables. It would be great to be able to access this particular information from anywhere.

bcoca commented 11 months ago

Understanding how configuration and variables work in Ansible is not easy, we have made an effort to document both:

Sadly, variables started with a basic design, but then grew organically to meet 'real life' needs and got very complicated. But if we try to remove this complexity, we also remove the ability of people to function in many of real life's complex scenarios.

We normally suggest people start with 1-3 ways of defining things and expand as they get more experience; there is a lot of nuance and no lack of undefined behavior (we have been reducing it for a long time, but some is still left), especially when you add roles into the mix.

The "magic variables" (I really hate the name) are mostly 'variables that give the user access to engine information'; sadly, we have overlap with 'variables that let the user SET the engine information', like ansible_user. But this is inconsistent and does not always correctly reflect the engine; it is also why I created the config lookup.

my current actions:

I understand your desire for a 'subsetted view' of Ansible variables and configuration items, but that is not currently possible and, I would even say, can be misleading to users, as 'that view' can then be modified/overridden elsewhere. But this is also why the config lookup has a show_origin option, to allow the user to track down the source of the 'current' resolution.
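For reference, a hedged sketch of checking that, assuming an ansible-core version whose config lookup supports the show_origin option mentioned above (the DEFAULT_REMOTE_USER key below is just an example setting):

- name: show the resolved value together with the source that 'won' the precedence
  debug:
    msg: "{{ lookup('config', 'DEFAULT_REMOTE_USER', show_origin=True) }}"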

mutech commented 10 months ago

You know that I do not have a premium support contract, right? This is how it must feel to get the VIP treatment :-) Thanks for your work!

[..] but that is not currently possible and I would even say, can be misleading to users [..]

I would not propose to change existing functionality for that purpose. Using an isolated mechanism such as a query or lookup, maybe a new one, should however be a stable change not affecting the rest of the ecosystem?

as 'that view' can then be modified/overridden elsewhere

I'm not sure I understand that part. A lookup would only provide a view on existing data, which can of course be manipulated by existing means. The difference from the current situation would be - as far as I understand - only that right now you don't know about such overrides because you can't see them; you only see the final values when templates are accessed (at the module/action level). If you meant that the view itself (the lookup) can be modified, I don't see that as a problem, as whoever does that takes on the responsibility for the change (knowing what they're doing, testing, ...).

But I guess that's academic if it's not currently feasible to provide that view anyway.

Ansible, despite the problems I have with it, is a great tool, probably even the best of breed. But I feel it's falling short of what it could become, which sadly is also what it is currently being used for. It's incredibly hard to create reusable components (trying to avoid terms like module or role) unless you're an expert with a strict catalog of best practices engraved in your mind.

I think it's time for a version 3 with loads of breaking changes, as much as I hated to see the python2/3 cataclysms in Ansible. If you're looking for a rookie user's point of view, I have quite a few more suggestions resulting from my experience. :-) I'm getting more and more comfortable with time, but it's a steep learning curve and a stony path.

mutech commented 10 months ago

updating config lookup docs from your feedback https://github.com/ansible/ansible/pull/81951, hope this clarifies some things.

Great, now it's clear to me what config does!

bcoca commented 10 months ago

You know that I do not have a premium support contract, right? This is how it must feel to get the VIP treatment :-) Thanks for your work!

People pay for support?!?! .. Kidding aside, I responded to your points because they were hitting something most devs who have worked on core are aware of but that we have not been good at explaining and distributing to users; from this ticket alone I think we have made some docs much better and clearer. This is not 'me' supporting 'you'; this is core devs working with community members to make things better for the Community (with a capital C, including both 'free' users and 'paid' support subscribers). IMHO, this is the great strength of OSS.

Also, once I post these explanations I can link to them when the topic comes back up!

as 'that view' can then be modified/overridden elsewhere

What I mean is that a 'view of configuration accounting only for the inventory' is misleading: not accounting for other config sources (environment, CLI parameters, extra vars, play vars, role vars, ...) will end up misleading users into thinking 'this is how it will work', while there are dozens of non-obvious ways to override it. Again, this is why config includes a show_origin option to return the 'winner' of the configuration precedence battle royale.

But I guess that's academic if it's not currently feasible to provide that view anyway.

Maybe I spoke too soon, we don't have anything that does this now .. but I have thought of ways to get this, it is just a lot of work and change in core systems that might not be worth the return. Still, don't get your hopes up, core dev time is one of the scarcest resources on the planet.

Ansible, despite the problems I have with it, is a great tool, probably even the best of breed. But I feel it's falling short of what it could become which sadly is also what it is currently being used for. It's incredibly hard to create reusable components (trying to evade terms like module or role) unless you're an expert with a strict catalog of best practices engraved into you mind.

That is something we are aware of; plays and roles are not reusable 'by default', but can be made so by following a set of rules and parameterizing certain things. We have taken several steps to help with this and with auto-documentation; things like the 'role argument spec' are a step in that direction.

I think it's time for a version 3 with loads of breaking changes, as much as I hated to see python2/3 cataclysms in Ansible. If you're looking for a rookie users point of view, I had quite a few more suggestions resulting from my experience. :-) I'm getting more and more comfortable with time, but it's a steep learning curve and a stony path.

The more our user base expands, the harder it is to make such changes; the 1.x-to-2.0 shift was hard enough, and even the 2.x-2.6 releases needed several 'adjustments' that came out of that shift. Even 2.9 to 2.10+/collections, which was 99.99999% backwards compatible, was seen as a bit steep by many. Too many people still use 2.9 because they think they need to change all plays to use FQCN .. they DO NOT!! It is just that the 'devtools' (ansible-lint) and docs do favor a 'collections world', but any big change has misunderstandings like this (I will say this one last time: with_items IS NOT DEPRECATED and you can still use yes/no as booleans!!!).

mutech commented 10 months ago

The more our user base expands the harder it is to make such changes

You only have one chance to get something right in software engineering, and that's before there are users.

I very much appreciate Linus' attitude towards breaking changes, but he is arguing based on a solid foundation of time tested concepts and established standards (UNIX/Posix/...).

I don't think that Ansible can continue to evolve in small increments fixing pain points. At some point there will be an alternative that copies the awesomeness of Ansible and combines it with a concept that supports engineering requirements that Ansible cannot fulfill with its current architecture. There are many things that can be improved incrementally, but I think there are some hard limits, and what you say about core-dev hours is exactly what I mean.

The breaking points that cannot be easily fixed without a remake are these:

Connections

One of the best features of Ansible is that it's agent-less. But with ansiballs and a distinct lack of marketing/library support for raw actions/modules, this feature is eroding. The reality of Ansible is that it actually uses agents, they are just not well defined: the agent is whatever Python code gets uploaded to the target. It's true that there is no service running on the target and there is no dedicated installation, which is both good, but the amount of logic required to make that work and the side effects of this strategy are just too complex. You can't debug code on the host. Permission issues as a result of cascading elevations, which are sometimes necessary, keep popping up. There is a never-ending pain resulting from version requirements for Python and nested dependencies. And Ansible is incredibly slow compared to doing stuff via ssh/shell, even considering all the advantages. SSH connections work well. Try using something like LXD connections: it works only if your setup is exactly the mainstream configuration; otherwise you will fail.

On top of that, there are mysterious hiccups. I keep seeing Ansible tasks hang forever with nothing going on at the target side. If something like this happens, will there be secrets in one of the files in the target's .ansible folder? Will they be cleaned up? I don't know. In our time of constant assault on anything facing the internet, the obscurity of a .ansible folder is probably not enough systemic protection.

The concept of a "connection" is also not the right abstraction. An SSH session is so much more than a shell; an LXD connection is just a shell and no choice of user, but you might get that shell from any node in a cluster. An API could well be a connection, just not to a shell. And there are different shells, no shells, Windows, file systems, and the list goes on. Ansible connections are the best choice if you had to choose only one, but why should there be only one type of communication endpoint? The answer of course is to keep it simple, and that works beautifully, until you try to work with something like an LXD connection and see that it's not working (and probably cannot really be made to work without incredible amounts of time nobody wants to invest).

Roles

Roles are a misnomer. Ansible roles are not roles, because they are not assigned and revoked; they are executed. They are also not roles because they have parameters, so they can at best be role assignments.

To my understanding, Ansible is (unlike terraform/docker images/etc.) a tool that supports management of live machines. Ansible is not primarily (only incidentally) a provisioning or deployment tool. If a user decides for whatever reason that they want to maintain a dedicated, long-lived server instead of provisioning an instance of a deployment, roles will over time almost certainly have to be revoked. I tried to use Ansible roles for that purpose, and it just does not work without a whole stack of support software/effort. I thought about creating meta deb packages to record role assignments, until I realized that if I go that far, I don't need the roles and can just implement those packages. OS packages have a bad reputation because they are so complex to maintain, but somebody has to face these complexities or somebody suffers the consequences.

In practice, roles are most often used as modules, for reuse. But they are not modules either. They lack any concept of information hiding. You can compensate for that by adhering to conventions, but you can also do that with 6502 assembler and I actually did that, but it was not a great experience, neither 6502 nor Ansible roles.

With all the code out there using roles, you of course cannot change the semantics of roles. It also didn't work to say that logic should not be implemented in tasks or roles, because the paywall for writing modules is just too high for all those users who chose Ansible because they appreciate the simplicity and (development) efficiency of composing tasks in yaml.

Modules

Actions should be as easy to write as roles, just in python and not yaml. I should have said "are as easy to write" but that would be a lie. That can be changed by providing a backward compatible abstraction, but there must be a reason why this is not part of Ansible 2.15.4 (my current version). I know, your time is limited. But isn't that the problem? If there are more important things to be done than to implement one of the core principles of Ansible (don't implement logic with tasks, use modules), that hints at a serious long standing issue.

I grew up with Eiffel, and "Design by contract" is one of my personal principles when developing stuff. I'm really suffering from the chaos that is interface specifications in Ansible. If I want to create a clean plugin, I have to document the specification in READMEs, code documents, sometimes in yaml files; then I have to pass Python typings or watered-down versions of them, and yet I still have to implement argument validation manually. I tried to use ansible-doc to consume this information, but that only works sometimes, maybe often, but not always. Very often, I have to look at the source code. If I were a core team member, I would know the details of why and how and where, because it would be part of my daily routine. I am not (though I would seriously consider an offer - just kidding).

Ansible has a wonderful ecosystem with loads of very competent module/collection developers, but like you, they only have so much time. I'm an Ansible rookie, but I spent decades developing software, so I'm really not afraid to expose myself to complexity. But the chaos and complexity of Ansible module development is quite a paywall to overcome (part of that is that I spent decades evading Python; you got me there though - Ansible motivated me to look at it, thanks for that).

Secrets

Ansible has some support for handling secrets, but this is not a first-class feature. In the end, secrets are just publicly visible variables that can be hidden behind fig leaves. There are plugins available that can be used to fix many of the problems, but I guess we all know that this is insufficient. I'm not aware that Ansible was the source of a major leak or security catastrophe, but I honestly find that surprising.

What I would expect would be plugin types for obtaining and deploying secrets that prevent clear-text secrets from ever appearing anywhere but at the deployment interface (e.g. when they are passed to a process or written to a final TPM/disk/... location). I don't know whether secrets are actually put into Ansiballs today, but I guess I really don't want to know.

The question I'm asking myself is whether such secret plugins would be at all possible to implement in finite time as long as there is no direct, secure communication channel between the source (a vault) and the destination (process/file/...) of a secret. To make such a setting auditable, it would have to be a first-class feature.

Conclusion

I'm going to stop ranting here, but I could continue pointing out flaws you probably know much better than I do.

Despite all of this, Ansible does many things so well that people are still more than happy with it. But these features are well known, just as Ansible's flaws are obvious, and for many of these flaws there are established solutions.

I think that the Ansible team is best qualified to create a reincarnation of Ansible that preserves its awesomeness and removes the worst flaws, so that you core team guys can spend time on awesome instead of having to stick fig leaves all over the place.

But that's of course just my 2 cents. I hope I'm not coming across as too patronizing. I just really love Ansible and at the same time I keep looking for alternatives because it's really torturing me.

mutech commented 10 months ago

On the flip side, if I were the product owner of Ansible and had a huge budget, this is what I would want to have:

A foundational abstraction of commands and queries, not unlike Ansible's lookup/query on one side and module/action on the other, but with a single, more powerful specification mechanism for contracts (that's a lot of Eiffelisms, but Meyer really did a great job there).

JSON Schema would be a good starting point for describing parameters, results and states; it should be extended with the most commonly seen data types and maybe some extensions specific to Ansible. That would create an overlap with things like OpenAPI, and that in turn would make it possible to look at APIs as one kind of "connection", which could make a whole lot of custom modules in use today obsolete and add new ones. It could also replace ansiballs by using temporary service APIs encapsulated in SSH (and other capable connections). All tunneled, jump-hosted and otherwise obscure channels would be available to reach obscure devices, and all existing (Open)APIs not requiring weird authentication would become valid targets.

Commands and queries would be applied to a "context" or "system". By default, the context would be the controller (actions), and while executing a playbook, there is a "current target" (modules). All contexts/systems are represented as a node in the inventory, but any such node can have multiple connections of different types. A system has an associated set of configurations or, more generally, information that is strictly associated with that system and not with how or from where it is accessed (this is additional information that is available in a "connection context").

The contract of a command knows the state that is or might be affected by the command. That is to automate the implementation of diff functionality. As a result, diff functionality can be implemented by core using a defined data structure, with the module/action providing the adapter. That of course would be done as a collection of queries that - in the case of actions - might be lazily evaluated. Preconditions can be failure conditions or wait conditions. No need for handlers if any task can handle events (~ wait conditions). If you know your contracts are complete, you know what can run in parallel.

In a check scenario, queries operate not (necessarily) on the real state of the affected systems, but on contracts of commands (looking at the state a command will produce if successful). This can be extended to implement flow analysis.

Ansible facts become well-known entities, so that commands can declare as part of their contract which facts are or might be affected by them and which facts affect commands (pre/post conditions). Facts are just cached query results and as such always associated with a system, not with an execution context (like the running playbook). You don't need to declare which facts you need: by using a fact, a task depends on it, and the gathering is automatic, as is cache invalidation, at least as long as commands properly declare the facts they might influence.

You create new commands and queries by specifying their contract in yaml, and then you either implement them using tasks or generate boilerplate Python plugin code (or just put code in yaml text blocks like you would for shell commands).

If "systems" are not primarily hosts but could also be things like, e.g., an internet service or a database instance, they could have their own connection types and be targets of commands and queries specific to their type. They would benefit from all the awesomeness that is the Ansible inventory.

Commands and queries can be published with a very small granularity, no need to assemble them into collections. The argument for that is the same as that for micro services vs. fat APIs. This requires high quality interface specifications, otherwise you just get a mess and are better served by collections.

Roles on the other hand could become inventory objects (if they are needed at all, groups basically do much of the same thing).

I think that most of what I suggest here is actually only a set of minor conceptual shifts, but they would have a huge (and, I think, positive) impact. A lot of existing code (collections) could probably be reused with little to no change; it just would not benefit much, seeing as it doesn't provide the metadata of full-blown contracts.

This is not a thought-out concept, just what I would do if I had the money and time to work on it. Well, if I had that money, I would not work on it; I would go sip margaritas or something. But that's another story :-)

Something like this would be a major overhaul, but do you think it would be too risky, considering the benefits and development perspectives it would provide? If I were Red Hat, this is what I would do immediately, rather than trying to quick-fix issues by encapsulating Ansible in a tower.

bcoca commented 10 months ago

I'm flagging these last things 'off topic' as they are unrelated to the issue (not because I don't think you make interesting points; some of them we have been debating in core for years now). They would be better hosted at:

https://github.com/ansible/proposals and/or https://forum.ansible.com/c/project/7

if you want to continue this conversation and have a wider audience.