Open merlin-vrn opened 2 years ago
Maybe this happens when an earlier command (/interface print detail without-paging) fails, prints incomplete information, or something goes wrong during parsing (that's possible at multiple levels). Probably adding some debug code that dumps the received information would be useful to debug this.
Adding something like
import q
q.q(self.responses[0])
q.q(self.responses[1])
q.q(self.responses[2])
q.q(self.responses[3])
here: https://github.com/ansible-collections/community.routeros/blob/main/plugins/modules/facts.py#L338 (and installing https://pypi.org/project/q/) and then looking at /tmp/q after it failed could give some insight.
So I did that, and I have recorded two runs: one "good" and one "bad". So, I have two sets of files generated by q.
The only difference, beyond obvious timer or counter changes, is that these files have some ANSI codes embedded at arbitrary, differing places (the appearance of these ANSI codes seems random). Can these ANSI codes break a parser in some cases?
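To check where the escape sequences land in a dump, something like the following could be run over the recorded text. This is only a sketch: the pattern covers CSI sequences (ESC followed by `[`), the names `ANSI_CSI_RE` and `find_ansi` are made up for illustration, and the sample strings are invented rather than taken from the attached files.

```python
import re

# Matches CSI escape sequences such as ESC[32m (colour) or ESC[m (reset).
# Other kinds of escape sequences are not covered by this pattern.
ANSI_CSI_RE = re.compile(r'\x1b\[[0-9;?]*[ -/]*[@-~]')


def find_ansi(text):
    """Return a list of (offset, sequence) for every CSI sequence in text."""
    return [(m.start(), m.group()) for m in ANSI_CSI_RE.finditer(text)]


# Invented sample lines, not real router output.
good = 'name="e7-kvant" mtu=1500\n'
bad = '\x1b[32mname="e7-kvant"\x1b[m mtu=1500\n'

for offset, seq in find_ansi(bad):
    # repr() makes the otherwise invisible ESC character visible.
    print(offset, repr(seq))
```

Running the same function over a "good" and a "bad" dump would show whether the sequences fall in front of a key name in the failing run only.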
The attached files are the ones I generated; I just renamed them: 0g = 0th element of self.responses in a good run, 3b = 3rd element of self.responses in a bad run. I hope this helps to shed some light.
0b-q71657705.txt 0g-q85449062.txt 1b-q56093662.txt 1g-q14033932.txt 2b-q67814371.txt 2g-q46262401.txt 3b-q05419377.txt 3g-q74509909.txt q.txt
Yeah, the parser (in the collection) expects that the ANSI codes have already been stripped, which obviously isn't the case. Using the DETAIL_RE regex, the argument [32mname="e7-kvant" is interpreted as 32mname having the value e7-kvant. Since there is no key name, the whole record is ignored.
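The effect can be reproduced in isolation. The sketch below is not the collection's code: `KEY_VALUE_RE` is a hypothetical stand-in for DETAIL_RE (the real pattern may differ), `strip_ansi` and `parse_record` are illustrative helpers, and the sample line assumes that the `[32m` seen in the dump is preceded by an ESC character.

```python
import re

# Hypothetical stand-in for the collection's DETAIL_RE; the real regex may differ.
KEY_VALUE_RE = re.compile(r'([\w\-]+)=(?:"([^"]*)"|(\S+))')

# CSI escape sequences such as ESC[32m or ESC[m.
ANSI_CSI_RE = re.compile(r'\x1b\[[0-9;?]*[ -/]*[@-~]')


def strip_ansi(text):
    """Remove CSI escape sequences from text."""
    return ANSI_CSI_RE.sub('', text)


def parse_record(line):
    """Collect key=value and key="value" pairs from one record line."""
    fields = {}
    for key, quoted, bare in KEY_VALUE_RE.findall(line):
        fields[key] = quoted if quoted else bare
    return fields


# Invented sample record with a colour code in front of the key.
raw = ' 0 R \x1b[32mname="e7-kvant"\x1b[m mtu=1500'

print(parse_record(raw))              # key comes out as '32mname'
print(parse_record(strip_ansi(raw)))  # key comes out as 'name'
```

ESC and `[` are not word characters, so the match starts at `32m` and the key absorbs the tail of the colour code. Stripping the sequences before parsing restores the expected `name` key.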
Out of curiosity, which version of ansible.netcommon are you using, and what are your settings for ansible_network_cli_ssh_type and ansible_user? (I'm interested in parts starting with + for the username, like the +cte512w I'm using at the end, which configures how the terminal behaves.) I'm asking because ansible.netcommon is handling the sanitization (including ANSI removal).
Anyway, it feels a bit more like using devices with SSH is mainly a PITA and we should have a facts module that uses the API instead :)
Doesn't ansible.netcommon have the same version as ansible itself? Anyway, it was installed on this node recently (less than two weeks ago) via pip install ansible. How do I view its version?
ansible_network_cli_ssh_type=libssh
The username is "merlin", so there are no terminal-specific characters like "+" and so on.
The authentication is via RSA keys, and those keys are obtained from ssh-agent. I work remotely and connect to the control node via SSH, so the ssh-agent is currently forwarded. Also, only a few hosts are directly accessible; most of them are accessed using a jumphost configured in ~/.ssh/config. This is the main point of using libssh; paramiko doesn't support ssh-agent, and I am not sure it will use the openssh client configuration.
For the same reason it is extremely inconvenient to use anything other than SSH. The API brings the whole burden of managing SSL keys for several dozen hosts, and it would also require forwarding ports to each internal node, which is not a very good scenario. Some hosts are configured in such a way that it is impossible to reach their management interface other than via the jumphost (e.g. even their API would only be available through ports forwarded with the help of the SSH jumphost).
SSH is very versatile and, honestly, I wouldn't mind if they rather remove all other management methods (including the damn winbox) and only leave SSH available.
> Doesn't ansible.netcommon have the same version as ansible itself?
No, they are totally unrelated. The only relation exists between the ansible.builtin collection and ansible-core (they have the same version). The best way to figure out its version is to look at the output of ansible-galaxy collection list.
> SSH is very versatile and, honestly, I wouldn't mind if they rather remove all other management methods (including the damn winbox) and only leave SSH available.
I fully agree :) I usually only enable SSH and HTTPS (though that one is already annoying due to TLS certificate setup...), and so far I'm using API (with HTTPS) only for my home network's router.
Unfortunately the SSH interface is really hard to use programmatically, since RouterOS seems to equate SSH = real person looking at the output, which is true in some cases, but definitely not true for this collection :-( Right now only the API (or the new REST API, but that requires a rather new RouterOS version I think) is relatively well usable programmatically.
SUMMARY
The community.routeros.facts module fails with a cryptic Python error message on some managed targets. I don't know how these targets differ from the others; I need help investigating this problem. The "reduced" facts collection (only hardware) doesn't seem to fail.
ISSUE TYPE
COMPONENT NAME
community.routeros.facts module
ANSIBLE VERSION
COLLECTION VERSION
I also tried the latest version from GitHub and got exactly the same error message (up to apparently random strings).
CONFIGURATION
OS / ENVIRONMENT
Controller node:
Linux muon 5.15.23-gentoo-x86_64 #1 SMP PREEMPT Tue Feb 15 09:57:12 MSK 2022 x86_64 Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz GenuineIntel GNU/Linux
Ansible is running in a virtualenv; it was installed as recommended in the official docs.
Managed target: 2011UiAS-2HnD, RouterOS version 6.47.4. It shows less than 20% CPU load and 91.4MiB of free memory.
Another target where it appeared: 962UiGS-5HacT2HnT, 6.49.2, which has low CPU load and much free memory available too.
And another one, RB4011iGS+, 6.43.4.
STEPS TO REPRODUCE
EXPECTED RESULTS
finish without problems
ACTUAL RESULTS
e7-kvant is the name of one of the "public" interfaces, which has three public IP addresses configured.
This is another kind of message, this time from hAP AC.
While assembling this report, I convinced myself this is a heisenbug. It happens intermittently, though with a probability of more than 50%. I can't see any apparent correlations. I see the error with the interface name most often, but I've seen the other message about the address at least once. The connection between the control node and the managed target is almost "direct" (through the managed switch), and I am sure there are no problems with it. Other failed nodes were accessed over the Internet.