aristanetworks / sonic

Open source drivers and initialization library for Arista platforms running SONiC
GNU General Public License v2.0

[chassis] [sup] cold reboot cause not shown #94

Closed wenyiz2021 closed 6 months ago

wenyiz2021 commented 12 months ago

AssertionError: got reboot-cause failed after rebooted by cold

@Staphylo, @patrickmacarthur @kenneth-arista

kenneth-arista commented 12 months ago

Did you see this with a manual reboot or during sonic-mgmt testing?

patrickmacarthur commented 11 months ago

Which sonic-mgmt test was this and do you know what the expected and actual reboot causes were?

wenyiz2021 commented 11 months ago

Hi @kenneth-arista @patrickmacarthur, I see this during sonic-mgmt testing. With a manual reboot I was able to see the reboot cause.

Is the cold reboot cause not being added?

admin@str2-7804-sup-1:~$ show reboot-cause history
Name                 Cause                                                                                         Time                             User    Comment
-------------------  --------------------------------------------------------------------------------------------  -------------------------------  ------  ---------
2023_07_12_00_23_34  Unknown                                                                                       N/A                              N/A     N/A
2023_07_12_00_16_01  Watchdog (watchdog, description: gpi 6 detailed fault - watchdog, time: 2023-07-12 00:14:03)  N/A                              N/A     N/A
2023_07_11_23_16_40  reboot                                                                                        Tue 11 Jul 2023 11:14:36 PM UTC  admin   N/A
2023_07_11_18_29_41  Unknown                                                                                       N/A                              N/A     N/A
2023_07_10_17_31_55  reboot                                                                                        Mon Jul 10 17:30:03 UTC 2023     admin   N/A
2023_07_10_16_24_40  reboot                                                                                        Mon Jul 10 16:22:48 UTC 2023     admin   N/A
2023_07_10_14_12_57  Unknown                                                                                       N/A                              N/A     N/A
2023_07_10_14_02_32  Watchdog (watchdog, description: gpi 6 detailed fault - watchdog, time: 2023-07-10 14:00:38)  N/A                              N/A     N/A
2023_07_10_09_16_11  Unknown                                                                                       N/A                              N/A     N/A
2023_07_10_09_09_25  Unknown

wenyiz2021 commented 11 months ago

> Which sonic-mgmt test was this and do you know what the expected and actual reboot causes were?

test_continuous_reboot[str2-7804-sup-1] AssertionError: got reboot-cause failed after rebooted by cold

wenyiz2021 commented 11 months ago

A manual reboot on the sup does show the reboot cause as reboot, with user admin:

admin@str2-7804-sup-1:~$ show reboot-cause history
Name                 Cause                                                                                         Time                             User    Comment
-------------------  --------------------------------------------------------------------------------------------  -------------------------------  ------  ---------
2023_07_12_19_33_57  reboot                                                                                        Wed Jul 12 19:32:01 UTC 2023     admin   N/A
2023_07_12_19_11_22  reboot                                                                                        Wed 12 Jul 2023 07:09:20 PM UTC  admin   N/A

But after a cold reboot from the pipeline, the sup shows 'Unknown' as the cause and N/A as the user:

2023_07_12_00_23_34  Unknown                                                                                       N/A                              N/A     N/A

The pipeline was running test_cold_reboot on the sup around this time (00:13:51).
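To compare runs like the two above, the `show reboot-cause history` output can be scanned for entries whose cause is 'Unknown'. The following is a minimal sketch, not part of sonic-mgmt or the Arista driver: the function names and column keys are hypothetical, and it assumes the columns are always separated by at least two spaces, as in the pasted output.

```python
import re

# Hypothetical keys for the five columns shown in the pasted output.
COLUMNS = ("name", "cause", "time", "user", "comment")


def parse_reboot_history(text):
    """Parse `show reboot-cause history` output into a list of dicts."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, the header row, and the dashed separator row.
        if not line or line.startswith("Name") or line.startswith("---"):
            continue
        # Assumption: cells never contain two consecutive spaces.
        cells = re.split(r"\s{2,}", line, maxsplit=len(COLUMNS) - 1)
        # Pad truncated rows so every entry has all five keys.
        cells += [""] * (len(COLUMNS) - len(cells))
        entries.append(dict(zip(COLUMNS, cells)))
    return entries


def unknown_entries(entries):
    """Return the entries whose recorded cause is exactly 'Unknown'."""
    return [e for e in entries if e["cause"] == "Unknown"]
```

Running `unknown_entries(parse_reboot_history(output))` on the captured CLI text lists the reboots for which no cause was recorded, which can then be lined up against the pipeline timestamps.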

patrickmacarthur commented 11 months ago

I am unable to reproduce this issue with platform_tests/test_reboot.py::test_continuous_reboot.

wenyiz2021 commented 11 months ago

I'm still seeing this as of today:

    if reboot_type is not None:
        logging.info("Check reboot cause")
        assert wait_until(MAX_WAIT_TIME_FOR_REBOOT_CAUSE, 20, 30, check_reboot_cause, dut, reboot_type), \
          "got reboot-cause failed after rebooted by %s" % reboot_type

E AssertionError: got reboot-cause failed after rebooted by cold

dut = MultiAsicSonicHost str2-7804-sup-1
interfaces = {}
interfaces_wait_time = 800
reboot_type = 'cold'
xcvr_skip_list = {'str2-7804-lc3-1': [], 'str2-7804-lc5-1': [], 'str2-7804-lc7-1': [], 'str2-7804-sup-1': []}

platform_tests/test_reboot.py:143: AssertionError
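The assertion above fails when the polled check never returns true within the allowed window. A simplified stand-in for that polling pattern is sketched below. It is not the sonic-mgmt implementation: the argument order (timeout, interval, delay, condition, then the condition's arguments) is inferred from the call in the traceback, and whether the initial delay counts against the timeout is an assumption.

```python
import time


def wait_until(timeout, interval, delay, condition, *args, **kwargs):
    """Poll condition(*args, **kwargs) until it is truthy or time runs out.

    Simplified sketch; all times are in seconds.
    """
    # Assumption: the initial delay is not counted against the timeout.
    if delay > 0:
        time.sleep(delay)
    deadline = time.monotonic() + timeout
    while True:
        if condition(*args, **kwargs):
            return True
        # Stop if the next poll would start after the deadline.
        if time.monotonic() + interval > deadline:
            return False
        time.sleep(interval)
```

Under this reading, a reboot cause that stays 'Unknown' makes every poll fail, so the helper returns False and the test raises "got reboot-cause failed after rebooted by cold".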

patrickmacarthur commented 10 months ago

Could you send the contents of /var/log/arista*.log on the DUT when you see this failure (you can e-mail to pmacarthur@arista.com)?

wenyiz2021 commented 10 months ago

> Could you send the contents of /var/log/arista*.log on the DUT when you see this failure (you can e-mail to pmacarthur@arista.com)?

I will send them by EOD. Since this is a sup reboot failure, reproducing it needs the whole chassis.

wenyiz2021 commented 10 months ago

This is not seen on the sup of another chassis with the same SKU:

22:22:37 reboot.check_reboot_cause_history L0419 INFO | index: 0, reboot cause: 'reboot'|Non-Hardware (reboot|^reboot, reboot cause from DUT: reboot PASSED
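The log line above appears to print a set of regex alternatives followed by the cause read from the DUT. A minimal sketch of that comparison follows. The pattern is an assumption reconstructed from the log text, with the literal parenthesis escaped so it compiles, and `cause_matches` is a hypothetical helper, not a sonic-mgmt function.

```python
import re

# Assumed pattern: the alternatives printed in the log line, with the
# literal "(" escaped so the expression compiles.
COLD_REBOOT_PATTERN = r"'reboot'|Non-Hardware \(reboot|^reboot"


def cause_matches(cause, pattern=COLD_REBOOT_PATTERN):
    """Return True if the reported reboot cause matches the expected pattern."""
    return re.search(pattern, cause) is not None
```

With this pattern a cause of `reboot` matches, while `Unknown` or a `Watchdog (...)` cause does not, which is consistent with the passing chassis and the failing one.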

rlhui commented 6 months ago

@wenyiz2021 , @kenneth-arista - is this issue still there?

wenyiz2021 commented 6 months ago

@rlhui this seems to be an issue only with vms26, the str2 chassis in our lab. Some reboot tests fail on this chassis but are not seen failing on the str3 chassis. I'm closing this issue.