community.routeros.facts fails for some devices

merlin-vrn commented 2 years ago

SUMMARY

community.routeros.facts module fails with cryptic Python error message on some managed targets. I don't know how these targets are different from other targets; a help is needed to help me to investigate this problem. The "reduced" facts collection (only hardware) doesn't seem to fail.

ISSUE TYPE

Bug Report

COMPONENT NAME

community.routeros.facts module

ANSIBLE VERSION

ansible [core 2.12.4]
  config file = /home/merlin/ansible/rterminal-ansible-default/trunk/routeros/ansible.cfg
  configured module search path = ['/home/merlin/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/merlin/ansible/lib/python3.9/site-packages/ansible
  ansible collection location = /home/merlin/.ansible/collections:/usr/share/ansible/collections
  executable location = /home/merlin/ansible/bin/ansible
  python version = 3.9.9 (main, Jan 29 2022, 08:46:33) [GCC 11.2.0]
  jinja version = 3.1.1
  libyaml = True

COLLECTION VERSION

Collection         Version
------------------ -------
community.routeros 2.0.0

I also tried the latest version from Github, got exactly same error message (up to apparently random strings).

CONFIGURATION

DEFAULT_HOST_LIST(/home/merlin/ansible/rterminal-ansible-default/trunk/routeros/ansible.cfg) = ['/home/merlin/ansible/rterminal-ansible-default/trunk/routeros/hosts']

OS / ENVIRONMENT

Controller node: Linux muon 5.15.23-gentoo-x86_64 #1 SMP PREEMPT Tue Feb 15 09:57:12 MSK 2022 x86_64 Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz GenuineIntel GNU/Linux Ansible is running in virtualenv, it was installed right as recommended in the official docs.

Managed target: 2011UiAS-2HnD, RouterOS version 6.47.4. It shows less than 20% of CPU load and 91.4MiB of free memory.

Another target where it appeared: 962UiGS-5HacT2HnT, 6.49.2, which has low CPU load and much free memory available too.

And another one, RB4011iGS+, 6.43.4.

STEPS TO REPRODUCE

ansible gw -m community.routeros.facts

EXPECTED RESULTS

finish without problems

ACTUAL RESULTS

ansible [core 2.12.4]
  config file = /home/merlin/ansible/rterminal-ansible-default/trunk/routeros/ansible.cfg
  configured module search path = ['/home/merlin/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /home/merlin/ansible/lib/python3.9/site-packages/ansible
  ansible collection location = /home/merlin/.ansible/collections:/usr/share/ansible/collections
  executable location = /home/merlin/ansible/bin/ansible
  python version = 3.9.9 (main, Jan 29 2022, 08:46:33) [GCC 11.2.0]
  jinja version = 3.1.1
  libyaml = True
Using /home/merlin/ansible/rterminal-ansible-default/trunk/routeros/ansible.cfg as config file
setting up inventory plugins
host_list declined parsing /home/merlin/ansible/rterminal-ansible-default/trunk/routeros/hosts as it did not pass its verify_file() method
script declined parsing /home/merlin/ansible/rterminal-ansible-default/trunk/routeros/hosts as it did not pass its verify_file() method
auto declined parsing /home/merlin/ansible/rterminal-ansible-default/trunk/routeros/hosts as it did not pass its verify_file() method
Parsed /home/merlin/ansible/rterminal-ansible-default/trunk/routeros/hosts inventory source with ini plugin
Loading callback plugin minimal of type stdout, v2.0 from /home/merlin/ansible/lib/python3.9/site-packages/ansible/plugins/callback/minimal.py
Skipping callback 'default', as we already have a stdout callback.
Skipping callback 'minimal', as we already have a stdout callback.
Skipping callback 'oneline', as we already have a stdout callback.
META: ran handlers
Loading collection community.routeros from /home/merlin/ansible/lib/python3.9/site-packages/ansible_collections/community/routeros
Loading collection ansible.netcommon from /home/merlin/ansible/lib/python3.9/site-packages/ansible_collections/ansible/netcommon
<gw> attempting to start connection
<gw> using connection plugin ansible.netcommon.network_cli
Found ansible-connection at path /home/merlin/ansible/bin/ansible-connection
<gw> local domain socket does not exist, starting it
<gw> control socket path is /home/merlin/.ansible/pc/06a0df8ba2
<gw> Loading collection ansible.netcommon from /home/merlin/ansible/lib/python3.9/site-packages/ansible_collections/ansible/netcommon
<gw> Loading collection community.routeros from /home/merlin/ansible/lib/python3.9/site-packages/ansible_collections/community/routeros
<gw> local domain socket listeners started successfully
<gw> loaded cliconf plugin ansible_collections.community.routeros.plugins.cliconf.routeros from path /home/merlin/ansible/lib/python3.9/site-packages/ansible_collections/community/routeros/plugins/cliconf/routeros.py for network_os community.routeros.routeros
<gw> ssh type is set to libssh
<gw> 
<gw> local domain socket path is /home/merlin/.ansible/pc/06a0df8ba2
<gw> ESTABLISH LOCAL CONNECTION FOR USER: merlin
<gw> EXEC /bin/sh -c '( umask 77 && mkdir -p "` echo /home/merlin/.ansible/tmp/ansible-local-10384lhnt7xc0 `"&& mkdir "` echo /home/merlin/.ansible/tmp/ansible-local-10384lhnt7xc0/ansible-tmp-1651149712.703595-10387-20298977077536 `" && echo ansible-tmp-1651149712.703595-10387-20298977077536="` echo /home/merlin/.ansible/tmp/ansible-local-10384lhnt7xc0/ansible-tmp-1651149712.703595-10387-20298977077536 `" ) && sleep 0'
Using module file /home/merlin/ansible/lib/python3.9/site-packages/ansible_collections/community/routeros/plugins/modules/facts.py
<gw> PUT /home/merlin/.ansible/tmp/ansible-local-10384lhnt7xc0/tmp0coc6sms TO /home/merlin/.ansible/tmp/ansible-local-10384lhnt7xc0/ansible-tmp-1651149712.703595-10387-20298977077536/AnsiballZ_facts.py
<gw> EXEC /bin/sh -c 'chmod u+x /home/merlin/.ansible/tmp/ansible-local-10384lhnt7xc0/ansible-tmp-1651149712.703595-10387-20298977077536/ /home/merlin/.ansible/tmp/ansible-local-10384lhnt7xc0/ansible-tmp-1651149712.703595-10387-20298977077536/AnsiballZ_facts.py && sleep 0'
<gw> EXEC /bin/sh -c '/home/merlin/ansible/bin/python /home/merlin/.ansible/tmp/ansible-local-10384lhnt7xc0/ansible-tmp-1651149712.703595-10387-20298977077536/AnsiballZ_facts.py && sleep 0'
<gw> EXEC /bin/sh -c 'rm -f -r /home/merlin/.ansible/tmp/ansible-local-10384lhnt7xc0/ansible-tmp-1651149712.703595-10387-20298977077536/ > /dev/null 2>&1 && sleep 0'
The full traceback is:
Traceback (most recent call last):
  File "/home/merlin/.ansible/tmp/ansible-local-10384lhnt7xc0/ansible-tmp-1651149712.703595-10387-20298977077536/AnsiballZ_facts.py", line 107, in <module>
    _ansiballz_main()
  File "/home/merlin/.ansible/tmp/ansible-local-10384lhnt7xc0/ansible-tmp-1651149712.703595-10387-20298977077536/AnsiballZ_facts.py", line 99, in _ansiballz_main
    invoke_module(zipped_mod, temp_path, ANSIBALLZ_PARAMS)
  File "/home/merlin/.ansible/tmp/ansible-local-10384lhnt7xc0/ansible-tmp-1651149712.703595-10387-20298977077536/AnsiballZ_facts.py", line 47, in invoke_module
    runpy.run_module(mod_name='ansible_collections.community.routeros.plugins.modules.facts', init_globals=dict(_module_fqn='ansible_collections.community.routeros.plugins.modules.facts', _modlib_path=modlib_path),
  File "/usr/lib/python3.9/runpy.py", line 210, in run_module
    return _run_module_code(code, init_globals, run_name, mod_spec)
  File "/usr/lib/python3.9/runpy.py", line 97, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/usr/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/tmp/ansible_community.routeros.facts_payload_ew3k5xxf/ansible_community.routeros.facts_payload.zip/ansible_collections/community/routeros/plugins/modules/facts.py", line 652, in <module>
  File "/tmp/ansible_community.routeros.facts_payload_ew3k5xxf/ansible_community.routeros.facts_payload.zip/ansible_collections/community/routeros/plugins/modules/facts.py", line 640, in main
  File "/tmp/ansible_community.routeros.facts_payload_ew3k5xxf/ansible_community.routeros.facts_payload.zip/ansible_collections/community/routeros/plugins/modules/facts.py", line 347, in populate
  File "/tmp/ansible_community.routeros.facts_payload_ew3k5xxf/ansible_community.routeros.facts_payload.zip/ansible_collections/community/routeros/plugins/modules/facts.py", line 365, in populate_addresses
KeyError: 'e7-kvant'
gw | FAILED! => {
    "changed": false,
    "module_stderr": "Traceback (most recent call last):\n  File \"/home/merlin/.ansible/tmp/ansible-local-10384lhnt7xc0/ansible-tmp-1651149712.703595-10387-20298977077536/AnsiballZ_facts.py\", line 107, in <module>\n    _ansiballz_main()\n  File \"/home/merlin/.ansible/tmp/ansible-local-10384lhnt7xc0/ansible-tmp-1651149712.703595-10387-20298977077536/AnsiballZ_facts.py\", line 99, in _ansiballz_main\n    invoke_module(zipped_mod, temp_path, ANSIBALLZ_PARAMS)\n  File \"/home/merlin/.ansible/tmp/ansible-local-10384lhnt7xc0/ansible-tmp-1651149712.703595-10387-20298977077536/AnsiballZ_facts.py\", line 47, in invoke_module\n    runpy.run_module(mod_name='ansible_collections.community.routeros.plugins.modules.facts', init_globals=dict(_module_fqn='ansible_collections.community.routeros.plugins.modules.facts', _modlib_path=modlib_path),\n  File \"/usr/lib/python3.9/runpy.py\", line 210, in run_module\n    return _run_module_code(code, init_globals, run_name, mod_spec)\n  File \"/usr/lib/python3.9/runpy.py\", line 97, in _run_module_code\n    _run_code(code, mod_globals, init_globals,\n  File \"/usr/lib/python3.9/runpy.py\", line 87, in _run_code\n    exec(code, run_globals)\n  File \"/tmp/ansible_community.routeros.facts_payload_ew3k5xxf/ansible_community.routeros.facts_payload.zip/ansible_collections/community/routeros/plugins/modules/facts.py\", line 652, in <module>\n  File \"/tmp/ansible_community.routeros.facts_payload_ew3k5xxf/ansible_community.routeros.facts_payload.zip/ansible_collections/community/routeros/plugins/modules/facts.py\", line 640, in main\n  File \"/tmp/ansible_community.routeros.facts_payload_ew3k5xxf/ansible_community.routeros.facts_payload.zip/ansible_collections/community/routeros/plugins/modules/facts.py\", line 347, in populate\n  File \"/tmp/ansible_community.routeros.facts_payload_ew3k5xxf/ansible_community.routeros.facts_payload.zip/ansible_collections/community/routeros/plugins/modules/facts.py\", line 365, in populate_addresses\nKeyError: 'e7-kvant'\n",
    "module_stdout": "",
    "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error",
    "rc": 1
}

e7-kvant is the name of one of the "public" interfaces, which has three public IP addresses configured.

An exception occurred during task execution. To see the full traceback, use -vvv. The error was: KeyError: 'address'
gw-wka-1.al5 | FAILED! => {
    "changed": false,
    "module_stderr": "Traceback (most recent call last):\n  File \"/home/merlin/.ansible/tmp/ansible-local-10430el3ik_nf/ansible-tmp-1651150294.4569864-10557-160384140379644/AnsiballZ_facts.py\", line 107, in <module>\n    _ansiballz_main()\n  File \"/home/merlin/.ansible/tmp/ansible-local-10430el3ik_nf/ansible-tmp-1651150294.4569864-10557-160384140379644/AnsiballZ_facts.py\", line 99, in _ansiballz_main\n    invoke_module(zipped_mod, temp_path, ANSIBALLZ_PARAMS)\n  File \"/home/merlin/.ansible/tmp/ansible-local-10430el3ik_nf/ansible-tmp-1651150294.4569864-10557-160384140379644/AnsiballZ_facts.py\", line 47, in invoke_module\n    runpy.run_module(mod_name='ansible_collections.community.routeros.plugins.modules.facts', init_globals=dict(_module_fqn='ansible_collections.community.routeros.plugins.modules.facts', _modlib_path=modlib_path),\n  File \"/usr/lib/python3.9/runpy.py\", line 210, in run_module\n    return _run_module_code(code, init_globals, run_name, mod_spec)\n  File \"/usr/lib/python3.9/runpy.py\", line 97, in _run_module_code\n    _run_code(code, mod_globals, init_globals,\n  File \"/usr/lib/python3.9/runpy.py\", line 87, in _run_code\n    exec(code, run_globals)\n  File \"/tmp/ansible_community.routeros.facts_payload_twxjy40a/ansible_community.routeros.facts_payload.zip/ansible_collections/community/routeros/plugins/modules/facts.py\", line 652, in <module>\n  File \"/tmp/ansible_community.routeros.facts_payload_twxjy40a/ansible_community.routeros.facts_payload.zip/ansible_collections/community/routeros/plugins/modules/facts.py\", line 640, in main\n  File \"/tmp/ansible_community.routeros.facts_payload_twxjy40a/ansible_community.routeros.facts_payload.zip/ansible_collections/community/routeros/plugins/modules/facts.py\", line 347, in populate\n  File \"/tmp/ansible_community.routeros.facts_payload_twxjy40a/ansible_community.routeros.facts_payload.zip/ansible_collections/community/routeros/plugins/modules/facts.py\", line 367, in populate_addresses\nKeyError: 'address'\n",
    "module_stdout": "",
    "msg": "MODULE FAILURE\nSee stdout/stderr for the exact error",
    "rc": 1
}

This is another kind of message, this time from hAP AC.

During the assembly of this report, I convinced myself this is a heisenbug. It happens sometimes, however with probability more than 50%. I can't see any apparent correlations. I see the error with interface name most often, but I've seen the other message about address at least once. The connection between control node and managed target is almost "direct" (through the managed switch), and I am sure there is no problems with it. Other failed nodes were accessed over the Internet.

felixfontein commented 2 years ago

Maybe this happens when an earlier command (/interface print detail without-paging) fails, prints incomplete information, or something goes wrong during parsing (that's possible at multiple levels). Probably adding some debug code that dumps the received information would be useful to debug this.

Adding something like

        import q
        q.q(self.responses[0])
        q.q(self.responses[1])
        q.q(self.responses[2])
        q.q(self.responses[3])

here: https://github.com/ansible-collections/community.routeros/blob/main/plugins/modules/facts.py#L338 (and installing https://pypi.org/project/q/) and then looking at /tmp/q after it failed could give some insight.

merlin-vrn commented 2 years ago

So I did that, and I have recorded two runs: one "good" and one "bad". So, I have two sets of files generated by q.

The only difference, beyond obvious timers or counters changes, is that these files have some ANSI codes embedded at arbitrary different places (appearance of these ANSI codes seems random). Can these ANSI codes break a parser in some cases?

Attached files are the ones I generated, I just renamed them, 0g = 0'th element of self.responses in a good run, 3b = 3'rd element of self.responses in a bad run. Hope this can help to shed the light.

0b-q71657705.txt 0g-q85449062.txt 1b-q56093662.txt 1g-q14033932.txt 2b-q67814371.txt 2g-q46262401.txt 3b-q05419377.txt 3g-q74509909.txt q.txt .

felixfontein commented 2 years ago

Yeah, the parser (in the collection) expects that the ANSI codes have already been stripped, which obviously isn't the case. Using the DETAIL_RE regex, the argument [32mname="e7-kvant" is interpreted as 32mname having the value e7-kvant. Since there is no key name, the whole record is ignored.

felixfontein commented 2 years ago

Out of curiosity, which version of ansible.netcommon are you using, what are your settings for ansible_network_cli_ssh_type and ansible_user (I'm interested in parts starting with + for the username, like I'm using +cte512w at the end, which configures how the terminal behaves). I'm asking because ansible.netcommon is handling the sanitization (including ANSI removal).

Anyway, it feels a bit more like using devices with SSH is mainly a PITA and we should have a facts module that uses the API insead :)

merlin-vrn commented 2 years ago

Doesn't ansible.netcommon have the same version as ansible itself? Anyway, it's recently installed on this node via pip install ansible (less than two weeks). How to view it's version?

ansible_network_cli_ssh_type=libssh

username is "merlin", so there is no terminal-specific characters like "+" and so on

The authentication is via RSA keys, and those keys are obtained from ssh-agent. I work remotely, connect to the control node via SSH and so the ssh-agent currently is forwarded. Also, only few hosts are directly accessible; most of them are accessed using a jumphost configured in the ~/.ssh/config. This is the main point of using libssh; paramiko doesn't support ssh-agent, and I am not sure it will use the openssh client configuration.

For the same reason it is extremly inconvenient to use anything other than SSH. API requires the whole burden of managing SSL keys for several dozens of hosts, also it would require the need to forward ports to each internal node, which is not very good scenario. Some hosts are configured in such a way so it is even impossible to reach their management other way than via the jumphost (e.g. even their API will be only available through ports forwarded with the help of SSH jumphost).

SSH is very versatile and, honestly, I wouldn't mind if they rather remove all other management methods (including the damn winbox) and only leave SSH available.

felixfontein commented 2 years ago

Doesn't ansible.netcommon have the same version as ansible itself?

No, they are totally unrelated. The only relation exists between the ansible.builtin collection and ansible-core (they have the same version). The best way to figure out its version is to look at the output of ansible-galaxy collection list.

SSH is very versatile and, honestly, I wouldn't mind if they rather remove all other management methods (including the damn winbox) and only leave SSH available.

I fully agree :) I usually only enable SSH and HTTPS (though that one is already annoying due to TLS certificate setup...), and so far I'm using API (with HTTPS) only for my home network's router.

Unfortunately the SSH interface is really hard to use programmatically, since RouterOS seems to equate SSH = real person looking at the output, which is true in some cases, but definitely not true for this collection :-( Right now only the API (or the new REST API, but that requires a rather new RouterOS version I think) is relatively well usable programmatically.

ansible-collections / community.routeros