Random failed tests on routeros devices

hart323 commented 1 week ago

Validity Version

3.0.3

NetBox Version

4.1.1

Python Version

3.12.3

Steps to Reproduce

As amount of tests for routeros and number of routeros devices started to increase (8 tests/33 devices as of now) I noticed that I get random failed tests. For example if I run 8 tests/1 device I get result with all tests passed, but if I run 8 tests/33 devices I end up with around 80% of tests passed but others failed. This is how particular test for particular device may look when scheduled 8/33 tests have been done:

It seems like failed tests are because of netmiko or validity collecting results from netmiko because command result file may contain the following lines:

POLLING ERROR
AssertionError:

or trimmed output like:

             enabled: yes
         primary-ntp: 10.10.10.10
       secondary-ntp: 10.10.10.11
    server-dns-names: 
                mode: unicast
       poll-interval: 15m
       active-server: 10.10.10.10
    last-update-from: 10.10.10.10
  las

or some other strange lines which looks like lines from mikrotik config but not related to command sent to device. I attach this example when I find one.

I added to routeros poller config line: session_log: /tmp/netmiko_session.log and this log file contains all command outputs from device however device result file does not contain data.

Note. I don't notice same behavior on cisco.ios devices.

Traceback

No tracebacks

amyasnikov commented 1 week ago

Let's break it down:

There are 2 separate processes inside Validity: tests execution and device polling (although they can be executed together for the sake of convenience, in general they have nothing to do with each other). So, the former (tests execution) seems to be completely unrelated to the issue.
There is some problem with netmiko (or the network itself, network_issue.jpg) as I can see. AssertionError likely means the output received from a device is empty. SSH/telnet are not so reliable in terms of network automation as you may know.
Presence of the answer (that seems correct) in the logs and absence of it in the polling result might mean netmiko could not find the trailing prompt ([MyMikrotik] >). I'm not sure here, just a guess.

Hence:

I can't make it fixed unless it becomes somehow reproducible (e.g. you'd describe the exact algorithm for reproducing it with clean netbox/validity setup and some generic mikrotik).
Try to play with timeouts (via poller params)
I'm not sure if netmiko disables colors. So you can try to do it by yourself (use myuser+ct as your username)
You can try to reproduce it without Validity: just using netmiko directly from python shell

hart323 commented 1 week ago

It seems like this is solely mikrotik-netmiko issue. I found some similar issues: https://github.com/ktbyers/netmiko/issues/2880 https://github.com/ktbyers/netmiko/issues/2512 Have to figure out what to do with this. Issue closed.

hart323 commented 4 days ago

I did a little investigation and got the following. Plain netmiko script with a list of commands pushed to send_command indeed somehow does not have time to handle the input/output of the ROS and I got empty responses on random commands. I found that netmiko has another delay-based mechanism to send commands with send_command_timing which is used primarily for show commands. (https://ktbyers.github.io/netmiko/docs/netmiko/index.html#netmiko.BaseConnection.send_command_timing).

So I replaced https://github.com/amyasnikov/validity/blob/e40aec1a1f24c2f3c858d478e79c50279142af53/validity/pollers/cli.py#L17

with return driver.send_command_timing(command.parameters["cli_command"], last_read=0.1, read_timeout=30, cmd_verify=True) and first time I got all my tests succeed.

Maybe you can extend validity poller to give possibility to specify which method/arguments to run or just change send_command with send_command_timing?

hart323 commented 4 days ago

Oops, Mikrotik tests succeed but now Cisco tests randomly failed :(

hart323 commented 4 days ago

Reverted code to original send_command(), but in the poller settings I added 2 options:

{
  "disable_lf_normalization": "True",
  "global_cmd_verify": "True"
}

Short tests on routeros and ios now return valid results. Let's see how things go further.

amyasnikov commented 4 days ago

Maybe you can extend validity poller to give possibility to specify which method/arguments to run or just change send_command with send_command_timing?

I can consider adding custom user-defined pollers via plugin settings

amyasnikov / validity