aristanetworks / sonic

Open source drivers and initialization library for Arista platforms running SONiC
GNU General Public License v2.0

[chassis] [Arista-7800R3-48CQ2-C48] suspect kernel error when running watchdog reboot #96

Closed wenyiz2021 closed 1 year ago

wenyiz2021 commented 1 year ago

Platform: x86_64-arista_7800r3_48cq2_lc HwSKU: Arista-7800R3-48CQ2-C48

Both of our linecards with the above SKU fail test_watchdog_reboot. The failure occurs because, after the watchdog reboot, not all transceivers have come back after 800 sec.

error msg:

00:10:25 transceiver_utils.all_transceivers_detec L0051 INFO   | Interfaces not detected: [u'Ethernet8', u'Ethernet0', u'Ethernet4', u'Ethernet108', u'Ethernet100', u'Ethernet104', u'Ethernet68', u'Ethernet96', u'Ethernet124', u'Ethernet92', u'Ethernet120', u'Ethernet52', u'Ethernet56', u'Ethernet76', u'Ethernet72', u'Ethernet64', u'Ethernet16', u'Ethernet12', u'Ethernet88', u'Ethernet116', u'Ethernet80', u'Ethernet112', u'Ethernet84', u'Ethernet48', u'Ethernet28', u'Ethernet60', u'Ethernet20', u'Ethernet24']
00:10:25 interface_utils.check_all_interface_info L0122 INFO   | Not all transceivers are detected

on dut during test run:

admin@str2-7804-lc3-1:~$ redis-cli --raw -n 6 keys TRANSCEIVER_INFO*
TRANSCEIVER_INFO|Ethernet32
TRANSCEIVER_INFO|Ethernet44
TRANSCEIVER_INFO|Ethernet40
TRANSCEIVER_INFO|Ethernet36
admin@str2-7804-lc3-1:~$ 
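To compare the state DB against the ports the test expects, the `redis-cli` output above can be diffed against an expected port list. The following is a minimal sketch, not part of sonic-mgmt: the function names are hypothetical, and the expected list (Ethernet0 through Ethernet124 in steps of 4) is an assumption taken from the union of the interfaces named in this issue.

```python
# Sketch only: helper names are hypothetical, not sonic-mgmt APIs.
PREFIX = "TRANSCEIVER_INFO|"

# Output of `redis-cli --raw -n 6 keys TRANSCEIVER_INFO*` as pasted above.
SAMPLE_OUTPUT = """\
TRANSCEIVER_INFO|Ethernet32
TRANSCEIVER_INFO|Ethernet44
TRANSCEIVER_INFO|Ethernet40
TRANSCEIVER_INFO|Ethernet36
"""

# Assumption: the ports under test are the 32 interfaces named in this issue.
EXPECTED = ["Ethernet%d" % i for i in range(0, 128, 4)]


def detected_interfaces(redis_keys_output):
    """Return the set of interface names that have a TRANSCEIVER_INFO key."""
    return {
        line.strip()[len(PREFIX):]
        for line in redis_keys_output.splitlines()
        if line.strip().startswith(PREFIX)
    }


def missing_interfaces(expected, redis_keys_output):
    """Return expected interfaces with no TRANSCEIVER_INFO key, in port order."""
    found = detected_interfaces(redis_keys_output)
    return sorted(
        set(expected) - found,
        key=lambda name: int(name[len("Ethernet"):]),
    )


if __name__ == "__main__":
    missing = missing_interfaces(EXPECTED, SAMPLE_OUTPUT)
    print(len(missing), "missing:", ", ".join(missing))
```

With the four keys shown above, this reports the same 28 interfaces that the test log lists as not detected.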

During the watchdog reboot test, for every transceiver missing from the state DB there is a kernel error showing timeout_error=1:

Aug  9 23:42:03.260222 str2-7804-lc3-1 ERR pmon#xcvrd[29]: CMIS: Ethernet88: skipping CMIS state machine since no xcvr api!!!
Aug  9 23:42:03.370596 str2-7804-lc3-1 NOTICE pmon#psud: PSU supplied power warning cleared: supplied power is back to normal.
Aug  9 23:42:03.400054 str2-7804-lc3-1 WARNING kernel: [  135.819675] scd 0000:07:00.0: #6 rsp { .reg=0x00040200, .fe=0, .foe=0, .ss=04, .ti=00, .flushed=0, .ack_error=0, .timeout_error=1, .bus_conflict_error=0, .d=0x00 } bus=58 addr=0x50 ti=0 err=-5
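To correlate these kernel messages with the missing ports, the `scd` response lines can be parsed out of syslog. The following is a rough sketch: the regular expression is written against the single line quoted above, not against the driver source, so the format and the meaning of each field are assumptions. The only decoding it does is mapping `err=-5` to its errno name, which is `EIO` on Linux.

```python
import errno
import re

# Assumption: pattern derived from the one log line in this issue,
# not from the scd driver's format string.
SCD_RSP_RE = re.compile(
    r"scd (?P<pci>[0-9a-f:.]+): #(?P<seq>\d+) rsp \{(?P<fields>[^}]*)\} "
    r"bus=(?P<bus>\d+) addr=(?P<addr>0x[0-9a-fA-F]+) "
    r"ti=(?P<ti>\d+) err=(?P<err>-?\d+)"
)

SAMPLE_LINE = (
    "Aug  9 23:42:03.400054 str2-7804-lc3-1 WARNING kernel: [  135.819675] "
    "scd 0000:07:00.0: #6 rsp { .reg=0x00040200, .fe=0, .foe=0, .ss=04, "
    ".ti=00, .flushed=0, .ack_error=0, .timeout_error=1, "
    ".bus_conflict_error=0, .d=0x00 } bus=58 addr=0x50 ti=0 err=-5"
)


def parse_scd_rsp(line):
    """Parse one scd response log line; return None if it does not match."""
    match = SCD_RSP_RE.search(line)
    if match is None:
        return None
    # Split "{ .name=value, ... }" into a dict of raw string values.
    fields = dict(re.findall(r"\.(\w+)=(\w+)", match.group("fields")))
    err = int(match.group("err"))
    return {
        "pci": match.group("pci"),
        "bus": int(match.group("bus")),
        "addr": int(match.group("addr"), 16),
        "err": err,
        "err_name": errno.errorcode.get(-err, "unknown"),
        "timed_out": fields.get("timeout_error") == "1",
        "fields": fields,
    }


if __name__ == "__main__":
    print(parse_scd_rsp(SAMPLE_LINE))
```

Grouping the parsed entries by `bus` would show which I2C buses time out; mapping a bus number back to an interface name depends on the platform and is not covered here.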

@Staphylo @patrickmacarthur @kenneth-arista @arlashm for viz

wenyiz2021 commented 1 year ago

@arlakshm

patrickmacarthur commented 1 year ago

I am looking into this.

wenyiz2021 commented 1 year ago

test_watchdog_reboot is passing on another TB LC of ours, which has the same SKU, Arista-7800R3-48CQ2-C48.