aristanetworks / sonic

Open source drivers and initialization library for Arista platforms running SONiC
GNU General Public License v2.0

[Chassis] Wolverine LC crashes on bootup. #71

Closed: arlakshm closed 1 year ago

arlakshm commented 1 year ago

We are seeing the crash below on bootup:

Jan 25 19:49:12.468709 str3-7800-lc8-1 INFO kernel: [ 5457.441535] seq_file: buggy .next function bkn_seq_dma_next [linux_bcm_knet] did not update position index
Jan 25 19:49:57.788719 str3-7800-lc8-1 WARNING kernel: [ 5502.764052] linux-bcm-knet (243338): bkn_knet_dev_inst_set sinfo- evt_idx 0
Jan 25 19:49:57.788748 str3-7800-lc8-1 WARNING kernel: [ 5502.764054] linux-bcm-knet (243338): bkn_knet_dev_inst_set sinfo- evt_idx 0
Jan 25 19:49:57.788751 str3-7800-lc8-1 WARNING kernel: [ 5502.764060] linux-bcm-knet (243576):  bkn_get_next_dma_event skip dev(1)
Jan 25 19:49:57.788752 str3-7800-lc8-1 WARNING kernel: [ 5502.764063] linux-bcm-knet (243576): dev_no 0 dev_evt 0 wait queue index 0
Jan 25 19:49:57.788754 str3-7800-lc8-1 WARNING kernel: [ 5502.764518] linux-kernel-bde (243338): _interrupt_connect d 0
Jan 25 19:49:57.788755 str3-7800-lc8-1 WARNING kernel: [ 5502.764521] linux-kernel-bde (243338): _interrupt_connect:isr_active = 1
Jan 25 19:49:57.788756 str3-7800-lc8-1 WARNING kernel: [ 5502.764523] linux-kernel-bde (243338): connect secondary isr
Jan 25 19:50:00.296703 str3-7800-lc8-1 INFO kernel: [ 5505.269315] syncd[243338]: segfault at 0 ip 0000000000000000 sp 00007ffe10365f08 error 14 in syncd[55b97c82a000+53000]
Jan 25 19:50:00.296715 str3-7800-lc8-1 INFO kernel: [ 5505.269324] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
Jan 25 19:50:04.004724 str3-7800-lc8-1 INFO kernel: [ 5508.978698] device dummy left promiscuous mode
Jan 25 19:50:04.004747 str3-7800-lc8-1 INFO kernel: [ 5508.978828] Bridge: port 1(dummy) entered disabled state
Jan 25 19:50:04.196696 str3-7800-lc8-1 INFO kernel: [ 5509.172094] Bridge: port 1(dummy) entered blocking state
Jan 25 19:50:04.196714 str3-7800-lc8-1 INFO kernel: [ 5509.172097] Bridge: port 1(dummy) entered disabled state
Jan 25 19:50:04.196716 str3-7800-lc8-1 INFO kernel: [ 5509.172217] device dummy entered promiscuous mode
Jan 25 19:56:02.072716 str3-7800-lc8-1 WARNING kernel: [ 5867.053028] linux-bcm-knet (250341): bkn_get_next_dma_event dev 1 evt_idx 1
Jan 25 19:56:02.072750 str3-7800-lc8-1 WARNING kernel: [ 5867.053030] linux-bcm-knet (250341): dev_no 1 dev_evt 1 wait queue index 1
Jan 25 19:56:02.072752 str3-7800-lc8-1 WARNING kernel: [ 5867.053037] linux-bcm-knet (249755): bkn_knet_dev_inst_set sinfo->inst_id 1 d 1 inst 1
Jan 25 19:56:02.072754 str3-7800-lc8-1 WARNING kernel: [ 5867.053485] linux-kernel-bde (249755): _interrupt_connect d 1
Jan 25 19:56:02.072755 str3-7800-lc8-1 WARNING kernel: [ 5867.053488] linux-kernel-bde (249755): _interrupt_connect:isr_active = 1
Jan 25 19:56:02.072757 str3-7800-lc8-1 WARNING kernel: [ 5867.053489] linux-kernel-bde (249755): connect secondary isr
Jan 25 19:56:04.568730 str3-7800-lc8-1 INFO kernel: [ 5869.546485] syncd[249755]: segfault at 0 ip 0000000000000000 sp 00007ffe04cf0bc8 error 14 in syncd[55852a648000+53000]
Jan 25 19:56:04.568756 str3-7800-lc8-1 INFO kernel: [ 5869.546495] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.
Jan 25 19:56:07.152721 str3-7800-lc8-1 WARNING kernel: [ 5872.130431] linux-bcm-knet (251104): bkn_knet_dev_inst_set sinfo->inst_id 0 d 0 inst 0
Jan 25 19:56:07.152757 str3-7800-lc8-1 WARNING kernel: [ 5872.130811] linux-kernel-bde (251104): _interrupt_connect d 0
Jan 25 19:56:07.152759 str3-7800-lc8-1 WARNING kernel: [ 5872.130813] linux-kernel-bde (251104): _interrupt_connect:isr_active = 1
Jan 25 19:56:07.152760 str3-7800-lc8-1 WARNING kernel: [ 5872.130814] linux-kernel-bde (251104): connect secondary isr
Jan 25 19:56:07.152761 str3-7800-lc8-1 WARNING kernel: [ 5872.133807] linux-bcm-knet (251478): bkn_get_next_dma_event dev 0 evt_idx 0
Jan 25 19:56:07.152762 str3-7800-lc8-1 WARNING kernel: [ 5872.133815] linux-bcm-knet (251478):  bkn_get_next_dma_event skip dev(1)
Jan 25 19:56:07.152763 str3-7800-lc8-1 WARNING kernel: [ 5872.133826] linux-bcm-knet (251478): dev_no 0 dev_evt 0 wait queue index 0
Jan 25 19:56:07.404818 str3-7800-lc8-1 INFO kernel: [ 5872.384328] device dummy left promiscuous mode
Jan 25 19:56:07.404837 str3-7800-lc8-1 INFO kernel: [ 5872.384453] Bridge: port 1(dummy) entered disabled state
Jan 25 19:56:07.576873 str3-7800-lc8-1 INFO kernel: [ 5872.558003] Bridge: port 1(dummy) entered blocking state
Jan 25 19:56:07.576891 str3-7800-lc8-1 INFO kernel: [ 5872.558007] Bridge: port 1(dummy) entered disabled state
Jan 25 19:56:07.576893 str3-7800-lc8-1 INFO kernel: [ 5872.558419] device dummy entered promiscuous mode
Jan 25 19:56:09.644709 str3-7800-lc8-1 INFO kernel: [ 5874.622311] syncd[251104]: segfault at 0 ip 0000000000000000 sp 00007ffebc007488 error 14 in syncd[558730e5c000+53000]
Jan 25 19:56:09.644739 str3-7800-lc8-1 INFO kernel: [ 5874.622318] Code: Unable to access opcode bytes at RIP 0xffffffffffffffd6.

This is a new linecard which was recently converted from EOS. The crashes were seen when the links were configured as either 100G or 400G. The same configuration seems to work on other Wolverine linecards in different chassis.
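A note for reading the segfault lines in the log above: the kernel prints the x86 page-fault error code in hex, so `error 14` is 0x14 (user mode, instruction fetch, page not present). Together with `ip 0000000000000000`, that pattern is what a jump or call through a null function pointer looks like. The following is a minimal sketch for pulling those fields out of a log line; `decode_segfault` and `ERROR_BITS` are hypothetical names chosen for illustration, not part of any SONiC or Arista tooling.

```python
import re

# Matches the kernel's user-space segfault message as it appears in the log.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)

# x86 page-fault error code bits: (mask, meaning if set, meaning if clear).
ERROR_BITS = [
    (0x01, "protection violation", "page not present"),
    (0x02, "write access", "read access"),
    (0x04, "user mode", "kernel mode"),
    (0x08, "reserved bit set", None),
    (0x10, "instruction fetch", None),
]


def decode_segfault(line):
    """Return a dict describing a kernel segfault log line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)  # the kernel prints this field in hex
    ip = int(m.group("ip"), 16)
    flags = []
    for mask, if_set, if_clear in ERROR_BITS:
        if err & mask:
            flags.append(if_set)
        elif if_clear is not None:
            flags.append(if_clear)
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "addr": int(m.group("addr"), 16),
        "ip": ip,
        "error": err,
        "flags": flags,
        # Instruction fetch from address 0: consistent with a null function pointer call.
        "null_call": ip == 0 and bool(err & 0x10),
    }


sample = (
    "Jan 25 19:50:00.296703 str3-7800-lc8-1 INFO kernel: [ 5505.269315] "
    "syncd[243338]: segfault at 0 ip 0000000000000000 sp 00007ffe10365f08 "
    "error 14 in syncd[55b97c82a000+53000]"
)
print(decode_segfault(sample))
```

All three syncd crashes in the log have the same address, instruction pointer, and error code, so they decode identically apart from the PID.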

lspci output:

admin@str3-7800-lc8-1:~$ sudo lspci -vvvvs 07:00.0
07:00.0 Ethernet controller: Broadcom Inc. and subsidiaries Device 8852 (rev 02)
        Subsystem: Broadcom Inc. and subsidiaries Device 8852
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 66
        IOMMU group: 31
        Region 0: Memory at 10040800000 (64-bit, prefetchable) [size=32K]
        Region 2: Memory at 10040000000 (64-bit, prefetchable) [size=8M]
        Capabilities: [48] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [58] MSI: Enable+ Count=1/32 Maskable+ 64bit+
                Address: 00000000fee00000  Data: 0000
                Masking: fffffffe  Pending: 00000000
        Capabilities: [a0] MSI-X: Enable- Count=64 Masked-
                Vector table: BAR=2 offset=00300000
                PBA: BAR=2 offset=00301000
        Capabilities: [ac] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr+ NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM not supported
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s (ok), Width x1 (downgraded)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
                         10BitTagComp- 10BitTagReq- OBFF Via WAKE#, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [13c v1] Device Serial Number 00-00-00-00-00-00-00-00
        Capabilities: [150 v1] Power Budgeting <?>
        Capabilities: [160 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [180 v1] Vendor Specific Information: ID=0000 Rev=0 Len=028 <?>
        Capabilities: [1b0 v1] Latency Tolerance Reporting
                Max snoop latency: 0ns
                Max no snoop latency: 0ns
        Capabilities: [240 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=8us PortTPowerOnTime=10us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=10us
        Capabilities: [300 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Capabilities: [250 v1] Multicast
                McastCap: MaxGroups 1, WindowSz 0 (1 bytes)             McastCtl: NumGroups 1, Enable-
                McastBAR: IndexPos 0, BaseAddr 0000000000000000
                McastReceiveVec:      0000000000000000
                McastBlockAllVec:     0000000000000000
                McastBlockUntransVec: 0000000000000000
        Kernel driver in use: linux-kernel-bde
        Kernel modules: linux_kernel_bde
admin@str3-7800-lc8-1:~$ sudo lspci -vvvvs 06:00.0           
06:00.0 Ethernet controller: Broadcom Inc. and subsidiaries Device 8852 (rev 02)
        Subsystem: Broadcom Inc. and subsidiaries Device 8852
        Physical Slot: 3
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0
        Interrupt: pin A routed to IRQ 65
        IOMMU group: 30
        Region 0: Memory at 10000800000 (64-bit, prefetchable) [size=32K]
        Region 2: Memory at 10000000000 (64-bit, prefetchable) [size=8M]
        Capabilities: [48] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
        Capabilities: [58] MSI: Enable+ Count=1/32 Maskable+ 64bit+
                Address: 00000000fee00000  Data: 0000
                Masking: fffffffe  Pending: 00000000
        Capabilities: [a0] MSI-X: Enable- Count=64 Masked-
                Vector table: BAR=2 offset=00300000
                PBA: BAR=2 offset=00301000
        Capabilities: [ac] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <4us, L1 <64us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 25.000W
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr+ NoSnoop+
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
                LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM not supported
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 8GT/s (ok), Width x1 (downgraded)
                        TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
                         10BitTagComp- 10BitTagReq- OBFF Via WAKE#, ExtFmt- EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
                         AtomicOpsCtl: ReqEn-
                LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+ EqualizationPhase1+
                         EqualizationPhase2+ EqualizationPhase3+ LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [100 v1] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
                AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
                        MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
                HeaderLog: 00000000 00000000 00000000 00000000
        Capabilities: [13c v1] Device Serial Number 00-00-00-00-00-00-00-00
        Capabilities: [150 v1] Power Budgeting <?>
        Capabilities: [160 v1] Virtual Channel
                Caps:   LPEVC=0 RefClk=100ns PATEntryBits=1
                Arb:    Fixed- WRR32- WRR64- WRR128-
                Ctrl:   ArbSelect=Fixed
                Status: InProgress-
                VC0:    Caps:   PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
                        Arb:    Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
                        Ctrl:   Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
                        Status: NegoPending- InProgress-
        Capabilities: [180 v1] Vendor Specific Information: ID=0000 Rev=0 Len=028 <?>
        Capabilities: [1b0 v1] Latency Tolerance Reporting
                Max snoop latency: 0ns
                Max no snoop latency: 0ns
        Capabilities: [240 v1] L1 PM Substates
                L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
                          PortCommonModeRestoreTime=8us PortTPowerOnTime=10us
                L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
                           T_CommonMode=0us LTR1.2_Threshold=0ns
                L1SubCtl2: T_PwrOn=10us
        Capabilities: [300 v1] Secondary PCI Express
                LnkCtl3: LnkEquIntrruptEn- PerformEqu-
                LaneErrStat: 0
        Capabilities: [250 v1] Multicast
                McastCap: MaxGroups 1, WindowSz 0 (1 bytes)             McastCtl: NumGroups 1, Enable-
                McastBAR: IndexPos 0, BaseAddr 0000000000000000
                McastReceiveVec:      0000000000000000
                McastBlockAllVec:     0000000000000000
                McastBlockUntransVec: 0000000000000000
        Kernel driver in use: linux-kernel-bde
        Kernel modules: linux_kernel_bde
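Both devices above report `LnkCap` as 8GT/s x4 but `LnkSta` as 8GT/s x1 (downgraded). Nothing in this thread ties that to the crash, so treat it only as an observation. As a rough sketch, assuming the `lspci -vv` text is available as a string, the capability and status lines can be compared like this; `link_summary` and `is_degraded` are hypothetical helper names.

```python
import re

# Matches only "LnkCap:" and "LnkSta:" lines (not LnkCap2/LnkSta2/LnkCtl).
LNK_RE = re.compile(
    r"Lnk(?P<kind>Cap|Sta):.*?Speed (?P<speed>[\d.]+)GT/s.*?Width x(?P<width>\d+)"
)


def link_summary(lspci_text):
    """Map 'Cap' and 'Sta' to (speed in GT/s, lane width) from lspci -vv output."""
    result = {}
    for line in lspci_text.splitlines():
        m = LNK_RE.search(line)
        if m:
            result[m.group("kind")] = (float(m.group("speed")), int(m.group("width")))
    return result


def is_degraded(summary):
    """True if the negotiated link is slower or narrower than the advertised capability."""
    cap, sta = summary.get("Cap"), summary.get("Sta")
    if cap is None or sta is None:
        return False
    return sta[0] < cap[0] or sta[1] < cap[1]


excerpt = """\
                LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM not supported
                LnkSta: Speed 8GT/s (ok), Width x1 (downgraded)
                LnkCap2: Supported Link Speeds: 2.5-8GT/s, Crosslink- Retimer- 2Retimers- DRS-
"""
summary = link_summary(excerpt)
print(summary, is_degraded(summary))
```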

Other output collected:

admin@str3-7800-lc8-1:~$ cat /proc/linux-user-bde 
Broadcom Device Enumerator (linux-user-bde)
        0: Interrupt mode  CMICx Inst id 0x0
        1: Interrupt mode  CMICx Inst id 0x1
Instance resource 
        0: DMA offset 0 size 32 MB Dev mask 0x00000001 
        1: DMA offset 32 size 32 MB Dev mask 0x00000002 

arlakshm commented 1 year ago

syncd.1674618124.25.0.core.gz dmesg.log

arlakshm commented 1 year ago

@Staphylo for visibility.

Staphylo commented 1 year ago

We're looking into this

kenneth-arista commented 1 year ago

I believe we understand the RCA. This is related to MACsec licensing.

kenneth-arista commented 1 year ago

Closing, as we will follow up offline on the tooling specifics.