aristanetworks / sonic

Open source drivers and initialization library for Arista platforms running SONiC
GNU General Public License v2.0

[all linecards] platform_tests/test_reload_config.py::test_reload_configuration_checks failure #91

Closed wenyiz2021 closed 10 months ago

wenyiz2021 commented 1 year ago

Hi @patrickmacarthur @Staphylo @kenneth-arista

'sudo config reload -y' did not return the 'Retry later' message when it was run immediately after a reboot of the DUT; the reload went through instead. Failure point: https://github.com/sonic-net/sonic-mgmt/blob/0d6fedb76aa3b95ce5b6cbd44c528b0dd7ffbfcd/tests/platform_tests/test_reload_config.py#L102

Output of the shell command on the Arista card:

(Pdb) out
{'stderr_lines': [], u'cmd': u'sudo config reload -y', u'end': u'2023-06-16 23:21:19.811575', '_ansible_no_log': False, u'stdout': u'Disabling container monitoring ...\nStopping SONiC target ...\nRunning command: /usr/local/bin/sonic-cfggen  -j /etc/sonic/init_cfg.json  -j /etc/sonic/config_db.json  --write-to-db\nRunning command: /usr/local/bin/db_migrator.py -o migrate\nRunning command: /usr/local/bin/sonic-cfggen -d -y /etc/sonic/sonic_version.yml -t /usr/share/sonic/templates/sonic-environment.j2,/etc/sonic/sonic-environment\nRestarting SONiC target ...\nEnabling container monitoring ...\nReloading Monit configuration ...\nReinitializing monit daemon', u'changed': True, u'rc': 0, u'start': u'2023-06-16 23:20:45.459575', u'stderr': u'', u'delta': u'0:00:34.352000', u'invocation': {u'module_args': {u'creates': None, u'executable': u'/bin/bash', u'_uses_shell': True, u'strip_empty_ends': True, u'_raw_params': u'sudo config reload -y', u'removes': None, u'argv': None, u'warn': True, u'chdir': None, u'stdin_add_newline': True, u'stdin': None}}, 'stdout_lines': [u'Disabling container monitoring ...', u'Stopping SONiC target ...', u'Running command: /usr/local/bin/sonic-cfggen  -j /etc/sonic/init_cfg.json  -j /etc/sonic/config_db.json  --write-to-db', u'Running command: /usr/local/bin/db_migrator.py -o migrate', u'Running command: /usr/local/bin/sonic-cfggen -d -y /etc/sonic/sonic_version.yml -t /usr/share/sonic/templates/sonic-environment.j2,/etc/sonic/sonic-environment', u'Restarting SONiC target ...', u'Enabling container monitoring ...', u'Reloading Monit configuration ...', u'Reinitializing monit daemon'], u'warnings': [u"Consider using 'become', 'become_method', and 'become_user' rather than running sudo"], 'failed': False}

expected:

(Pdb) out
{'stderr_lines': [], u'changed': True, u'end': u'2023-06-16 22:39:27.768113', '_ansible_no_log': False, u'stdout': u'Relevant services are not up. Retry later or use -f to avoid system checks', u'cmd': u'sudo config reload -y', u'msg': u'non-zero return code', u'rc': 1, u'start': u'2023-06-16 22:39:26.168914', u'warnings': [u"Consider using 'become', 'become_method', and 'become_user' rather than running sudo"], u'delta': u'0:00:01.599199', u'invocation': {u'module_args': {u'creates': None, u'executable': u'/bin/bash', u'_uses_shell': True, u'strip_empty_ends': True, u'_raw_params': u'sudo config reload -y', u'removes': None, u'argv': None, u'warn': True, u'chdir': None, u'stdin_add_newline': True, u'stdin': None}}, 'stdout_lines': [u'Relevant services are not up. Retry later or use -f to avoid system checks'], u'stderr': u'', 'failed': True}
(Pdb) out['stdout']
u'Relevant services are not up. Retry later or use -f to avoid system checks'

This happens on all linecards (CL2 and Wolverine).

wenyiz2021 commented 1 year ago

I tried changing module_ignore_errors to False. The terminal then shows that the command failed, but the output of the shell command still says 'failed': False.

(Pdb) out = duthost.shell("sudo config reload -y", executable="/bin/bash", module_ignore_errors=False)
Friday 16 June 2023  23:48:14 +0000 (0:00:46.822)       0:09:26.849 *********** 
*** RunAnsibleModuleFail: run module shell failed, Ansible Results =>
{"changed": true, "cmd": "sudo config reload -y", "delta": "0:00:00.448957", "end": "2023-06-16 23:48:15.978370", "failed": true, "msg": "non-zero return code", "rc": 1, "start": "2023-06-16 23:48:15.529413", "stderr": "", "stderr_lines": [], "stdout": "SwSS container is not ready. Retry later or use -f to avoid system checks", "stdout_lines": ["SwSS container is not ready. Retry later or use -f to avoid system checks"], "warnings": ["Consider using 'become', 'become_method', and 'become_user' rather than running sudo"]}
(Pdb) out
{'stderr_lines': [], u'cmd': u'sudo config reload -y', u'end': u'2023-06-16 23:48:04.137881', '_ansible_no_log': False, u'stdout': u'Disabling container monitoring ...\nStopping SONiC target ...\nRunning command: /usr/local/bin/sonic-cfggen  -j /etc/sonic/init_cfg.json  -j /etc/sonic/config_db.json  --write-to-db\nRunning command: /usr/local/bin/db_migrator.py -o migrate\nRunning command: /usr/local/bin/sonic-cfggen -d -y /etc/sonic/sonic_version.yml -t /usr/share/sonic/templates/sonic-environment.j2,/etc/sonic/sonic-environment\nRestarting SONiC target ...\nEnabling container monitoring ...\nReloading Monit configuration ...\nReinitializing monit daemon', u'changed': True, u'rc': 0, u'start': u'2023-06-16 23:47:28.697315', u'stderr': u'', u'delta': u'0:00:35.440566', u'invocation': {u'module_args': {u'creates': None, u'executable': u'/bin/bash', u'_uses_shell': True, u'strip_empty_ends': True, u'_raw_params': u'sudo config reload -y', u'removes': None, u'argv': None, u'warn': True, u'chdir': None, u'stdin_add_newline': True, u'stdin': None}}, 'stdout_lines': [u'Disabling container monitoring ...', u'Stopping SONiC target ...', u'Running command: /usr/local/bin/sonic-cfggen  -j /etc/sonic/init_cfg.json  -j /etc/sonic/config_db.json  --write-to-db', u'Running command: /usr/local/bin/db_migrator.py -o migrate', u'Running command: /usr/local/bin/sonic-cfggen -d -y /etc/sonic/sonic_version.yml -t /usr/share/sonic/templates/sonic-environment.j2,/etc/sonic/sonic-environment', u'Restarting SONiC target ...', u'Enabling container monitoring ...', u'Reloading Monit configuration ...', u'Reinitializing monit daemon'], u'warnings': [u"Consider using 'become', 'become_method', and 'become_user' rather than running sudo"], 'failed': False}
(Pdb) out['stdout']
u'Disabling container monitoring ...\nStopping SONiC target ...\nRunning command: /usr/local/bin/sonic-cfggen  -j /etc/sonic/init_cfg.json  -j /etc/sonic/config_db.json  --write-to-db\nRunning command: /usr/local/bin/db_migrator.py -o migrate\nRunning command: /usr/local/bin/sonic-cfggen -d -y /etc/sonic/sonic_version.yml -t /usr/share/sonic/templates/sonic-environment.j2,/etc/sonic/sonic-environment\nRestarting SONiC target ...\nEnabling container monitoring ...\nReloading Monit configuration ...\nReinitializing monit daemon'

cc @arlakshm
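
One possible reading of the session above, offered as a sketch rather than a confirmed diagnosis: with module_ignore_errors=False the failing call raises RunAnsibleModuleFail before the assignment completes, so out still holds the result of the earlier, successful reload. The timestamps are consistent with that (out ends at 23:48:04, while the failing command starts at 23:48:15). The snippet below reproduces the effect in plain Python. RunModuleFail and fake_shell are hypothetical stand-ins written for this illustration; they are not part of sonic-mgmt or Ansible.

```python
class RunModuleFail(Exception):
    """Hypothetical stand-in for the exception raised on a failed module run."""


def fake_shell(rc, stdout, module_ignore_errors=False):
    """Hypothetical helper: return a result dict, or raise on a non-zero rc."""
    result = {"rc": rc, "stdout": stdout, "failed": rc != 0}
    if result["failed"] and not module_ignore_errors:
        raise RunModuleFail(result)
    return result


# The first call succeeds and binds `out`.
out = fake_shell(0, "Reinitializing monit daemon")

# The second call fails: the exception is raised before `out` is rebound,
# so `out` keeps pointing at the first (successful) result.
try:
    out = fake_shell(
        1, "SwSS container is not ready. Retry later or use -f to avoid system checks"
    )
except RunModuleFail as exc:
    failed_result = exc.args[0]

print(out["failed"])            # stale result from the first call
print(failed_result["failed"])  # result carried by the exception
```

If this reading applies, the failed=False value comes from the earlier call rather than from the command that printed the error.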

wenyiz2021 commented 1 year ago

The expectation is:

  1. we can recognize that this command failed to execute -> 'failed' = True
  2. the error message is logged in out['stdout']
  3. the results are consistent: with module_ignore_errors set to False we do catch the error message, yet the returned output still says 'failed': False.

I am unsure whether this is an Ansible issue or a hardware issue. @Staphylo @kenneth-arista, can you please help confirm?

patrickmacarthur commented 1 year ago

able to recognize this cmd failed to execute -> 'failed' = True

By default, Ansible considers only the return code when determining whether the command succeeded or failed. And from the perspective of the config command, returning success looks appropriate if it didn't detect that the system is still booting.

I'm looking into why the config command isn't detecting that the system is still booting.
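
To make the return-code point concrete, here is a minimal sketch of the rule described above together with the kind of check the test appears to rely on. The helper names (is_failed, reload_was_rejected) are hypothetical, and the two sample dicts are abbreviated from the outputs posted earlier in this thread.

```python
RETRY_HINT = "Retry later"


def is_failed(result):
    """Default rule as described above: a non-zero return code means failure."""
    return result.get("rc", 0) != 0


def reload_was_rejected(result):
    """True when the command failed and asked the caller to retry later."""
    return is_failed(result) and RETRY_HINT in result.get("stdout", "")


# Abbreviated from the output seen on the Arista linecard.
observed = {
    "rc": 0,
    "stdout": "Disabling container monitoring ...\nStopping SONiC target ...",
}

# Abbreviated from the expected output.
expected = {
    "rc": 1,
    "stdout": "Relevant services are not up. Retry later or use -f to avoid system checks",
}

print(reload_was_rejected(observed))
print(reload_was_rejected(expected))
```

With rc 0 and no retry message in stdout, the observed result cannot be classified as a rejected reload, which is why the open question is what the config command saw on the DUT rather than how Ansible reported it.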

patrickmacarthur commented 10 months ago

I haven't been able to reproduce this locally, but my theory is that this is caused by one of two things: (1) the switch boots up faster than the test can reach this check, so the system is already running, or (2) a service fails during startup, leaving the system in a degraded state, which may be overriding the started state that config reload is looking for.

If you encounter this issue again, it would be useful to run systemctl status on the DUT to rule out (2).
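
As a rough aid for telling the two theories apart, the sketch below maps the overall systemd state to the cases above. It assumes the relevant signal is the state word printed by systemctl is-system-running, which is an assumption on my part and not confirmed in this thread; the function names are hypothetical.

```python
import subprocess


def read_system_state():
    """Return the state word printed by `systemctl is-system-running`.

    The command exits non-zero for any state other than "running",
    so the exit code is deliberately not checked here.
    """
    proc = subprocess.run(
        ["systemctl", "is-system-running"],
        capture_output=True,
        text=True,
        check=False,
    )
    return proc.stdout.strip()


def explain_state(state):
    """Relate a systemd state word to the two theories (illustrative only)."""
    if state in ("initializing", "starting"):
        return "still booting: the reload would be expected to be rejected"
    if state == "running":
        return "boot finished: consistent with theory (1)"
    if state == "degraded":
        return "a unit failed: consistent with theory (2), see `systemctl --failed`"
    return "unexpected state: inspect `systemctl status`"


# Example with a fixed state word; on the DUT, use explain_state(read_system_state()).
print(explain_state("degraded"))
```

Running read_system_state() on the DUT right after the test's reboot step, before config reload is issued, would show which of the states the command is likely to see.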

wenyiz2021 commented 10 months ago

this is fixed in https://github.com/sonic-net/sonic-mgmt/pull/7953