aristanetworks / sonic

Open source drivers and initialization library for Arista platforms running SONiC
GNU General Public License v2.0

[chassis] [Clearwater2MS] restart of teamd causes bgp sessions & links to come back after 10min #79

Closed wenyiz2021 closed 1 year ago

wenyiz2021 commented 1 year ago

Clearwater2MS fails test_container_autorestart for the teamd container. If teamd is restarted, or any critical process is killed so that teamd restarts, BGP sessions and links only come back after ~12 min:

admin@str2-7804-lc6-1:~$ sudo systemctl restart teamd
admin@str2-7804-lc6-1:~$ date
Thu 09 Feb 2023 06:07:18 PM UTC
Neighbhor      V     AS    MsgRcvd    MsgSent    TblVer    InQ    OutQ  Up/Down      State/PfxRcd  NeighborName
-----------  ---  -----  ---------  ---------  --------  -----  ------  ---------  --------------  --------------
10.0.0.65      4  65001         26         25         0      0       0  00:05:16              257  ARISTA01T1
10.0.0.69      4  65002         25         24         0      0       0  00:04:58              257  ARISTA03T1
10.0.0.73      4  65003         29         30         0      0       0  00:08:29              257  ARISTA05T1
10.0.0.77      4  65004         28         29         0      0       0  00:07:59              257  ARISTA07T1
10.0.0.81      4  65005         28         28         0      0       0  00:07:27              257  ARISTA09T1
10.0.0.85      4  65006         25         24         0      0       0  00:04:23              257  ARISTA11T1
10.0.0.89      4  65007         24         23         0      0       0  00:03:29              257  ARISTA15T1
10.0.0.93      4  65008         24         22         0      0       0  00:02:34              258  ARISTA18T1
10.0.0.97      4  65009         26         24         0      0       0  00:04:07              258  ARISTA13T1
10.0.0.99      4  65010         25         23         0      0       0  00:03:13              258  ARISTA17T1
10.0.0.101     4  65011         23         22         0      0       0  00:02:17              257  ARISTA19T1
10.0.0.103     4  65012         22         21         0      0       0  00:01:59              257  ARISTA20T1
10.0.0.105     4  65013         22         21         0      0       0  00:01:40              257  ARISTA21T1
10.0.0.107     4  65014         22         21         0      0       0  00:01:20              257  ARISTA22T1
10.0.0.109     4  65015         22         21         0      0       0  00:01:01              257  ARISTA23T1
10.0.0.111     4  65016         21         20         0      0       0  00:00:41              257  ARISTA24T1
**10.0.0.113     4  65017         21         20         0      0       0  00:00:22              257  ARISTA25T1**
10.0.0.115     4  65018         28         28         0      0       0  00:07:14              257  ARISTA26T1
10.0.0.117     4  65019         27         27         0      0       0  00:06:58              257  ARISTA27T1
10.0.0.119     4  65020         27         27         0      0       0  00:06:42              257  ARISTA28T1
10.0.0.121     4  65021         27         27         0      0       0  00:06:25              257  ARISTA29T1
10.0.0.123     4  65022         27         27         0      0       0  00:06:08              257  ARISTA30T1
10.0.0.125     4  65023         26         26         0      0       0  00:05:52              257  ARISTA31T1
10.0.0.127     4  65024         26         26         0      0       0  00:05:35              257  ARISTA32T1

Total number of neighbors 24
admin@str2-7804-lc6-1:~$ date
Thu 09 Feb 2023 06:19:38 PM UTC

By comparison, Clearwater2 takes ~3 min. The test allows a timeout of 6 min (360 sec).
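The two `date` outputs above bracket the recovery window. As a quick check of the arithmetic, here is a minimal Python sketch (standard library only; the function and constant names are illustrative, not part of any SONiC tooling) that parses both timestamps and compares the elapsed time with the 360-second test timeout.

```python
from datetime import datetime

# Format of the `date` output shown above. The trailing "UTC" is matched
# literally; %a, %b and %p assume an English/C locale.
DATE_FORMAT = "%a %d %b %Y %I:%M:%S %p UTC"

TEST_TIMEOUT_SEC = 360  # timeout quoted in this issue


def elapsed_seconds(start: str, end: str) -> int:
    """Return the number of seconds between two `date` output strings."""
    t0 = datetime.strptime(start, DATE_FORMAT)
    t1 = datetime.strptime(end, DATE_FORMAT)
    return int((t1 - t0).total_seconds())


recovery = elapsed_seconds(
    "Thu 09 Feb 2023 06:07:18 PM UTC",
    "Thu 09 Feb 2023 06:19:38 PM UTC",
)
print(recovery, recovery > TEST_TIMEOUT_SEC)  # 740 True
```

740 seconds is 12 min 20 sec, roughly twice the timeout the test allows.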

RCA:

Feb  9 02:49:41.015823 str2-7804-lc6-1 INFO bgp#bgpcfgd: DEVICE_NEIGHBOR_METADATA is not ready for neighbor '10.0.0.81' - 'ARISTA09T1'
Feb  9 02:49:41.019910 str2-7804-lc6-1 INFO bgp#bgpcfgd: DEVICE_NEIGHBOR_METADATA is not ready for neighbor '10.0.0.81' - 'ARISTA09T1'
Feb  9 02:49:41.024354 str2-7804-lc6-1 INFO bgp#bgpcfgd: DEVICE_NEIGHBOR_METADATA is not ready for neighbor '10.0.0.81' - 'ARISTA09T1'
Feb  9 02:49:41.028629 str2-7804-lc6-1 INFO bgp#bgpcfgd: DEVICE_NEIGHBOR_METADATA is not ready for neighbor '10.0.0.81' - 'ARISTA09T1'
Feb  9 02:49:41.032773 str2-7804-lc6-1 INFO bgp#bgpcfgd: DEVICE_NEIGHBOR_METADATA is not ready for neighbor '10.0.0.81' - 'ARISTA09T1'
Feb  9 02:49:41.036856 str2-7804-lc6-1 INFO bgp#bgpcfgd: DEVICE_NEIGHBOR_METADATA is not ready for neighbor '10.0.0.81' - 'ARISTA09T1'
Feb  9 02:49:41.041125 str2-7804-lc6-1 INFO bgp#bgpcfgd: DEVICE_NEIGHBOR_METADATA is not ready for neighbor '10.0.0.81' - 'ARISTA09T1'
Feb  9 02:49:41.043753 str2-7804-lc6-1 INFO bgp#bgpcfgd: DEVICE_NEIGHBOR_METADATA is not ready for neighbor '10.0.0.81' - 'ARISTA09T1'
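When triaging a longer syslog, it may help to tally these messages per neighbor. The sketch below is only an illustration: the regular expression is derived from the log lines above, and `count_not_ready` is a hypothetical helper, not part of bgpcfgd.

```python
import re
from collections import Counter

# Matches the bgpcfgd message shown above and captures the neighbor IP and name.
NOT_READY = re.compile(
    r"DEVICE_NEIGHBOR_METADATA is not ready for neighbor '([^']+)' - '([^']+)'"
)


def count_not_ready(log_lines):
    """Count 'not ready' messages per (neighbor IP, neighbor name)."""
    counts = Counter()
    for line in log_lines:
        match = NOT_READY.search(line)
        if match:
            counts[match.groups()] += 1
    return counts


# Two of the log lines from this issue, used as sample input.
sample = [
    "Feb  9 02:49:41.015823 str2-7804-lc6-1 INFO bgp#bgpcfgd: "
    "DEVICE_NEIGHBOR_METADATA is not ready for neighbor '10.0.0.81' - 'ARISTA09T1'",
    "Feb  9 02:49:41.019910 str2-7804-lc6-1 INFO bgp#bgpcfgd: "
    "DEVICE_NEIGHBOR_METADATA is not ready for neighbor '10.0.0.81' - 'ARISTA09T1'",
]

for (ip, name), n in count_not_ready(sample).items():
    print(ip, name, n)
```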

Steps to repro:

  1. Restart teamd using sudo systemctl restart teamd.
  2. On the linecard, check BGP sessions and links with show ip bgp summary and show interfaces status, and note the date output before and after.

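To time the recovery without watching the CLI by hand, the check can be scripted. The following is a rough sketch under stated assumptions: it relies on the `show ip bgp summary` column layout shown in this issue (11 whitespace-separated fields, with State/PfxRcd second to last), and `established_neighbors`, `wait_for_bgp` and the `get_summary` callable are hypothetical names introduced here, not part of the sonic-mgmt test.

```python
import time


def established_neighbors(summary_text):
    """Return neighbor IPs whose State/PfxRcd column holds a prefix count."""
    established = []
    for line in summary_text.splitlines():
        fields = line.split()
        # Data rows have 11 fields and start with a dotted IPv4 address;
        # this skips the header, the separator and the totals line.
        if len(fields) != 11 or fields[0].count(".") != 3:
            continue
        if fields[-2].isdigit():  # a number means the session is established
            established.append(fields[0])
    return established


def wait_for_bgp(get_summary, expected, timeout_sec=900, interval_sec=10):
    """Poll until `expected` sessions are up; return seconds waited or None."""
    start = time.monotonic()
    while time.monotonic() - start < timeout_sec:
        if len(established_neighbors(get_summary())) >= expected:
            return time.monotonic() - start
        time.sleep(interval_sec)
    return None


# First row is taken from this issue; the second row is a made-up example
# of a session that has not come up yet.
sample = """\
Neighbhor      V     AS    MsgRcvd    MsgSent    TblVer    InQ    OutQ  Up/Down      State/PfxRcd  NeighborName
10.0.0.65      4  65001         26         25         0      0       0  00:05:16              257  ARISTA01T1
10.0.0.69      4  65002          0          0         0      0       0  never              Active  ARISTA03T1
"""
print(established_neighbors(sample))  # ['10.0.0.65']
```

On a linecard, `get_summary` could wrap a `subprocess.run` call that captures the output of `show ip bgp summary`, with `expected` set to the 24 neighbors listed above.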
wenyiz2021 commented 1 year ago

@arlakshm @kenneth-arista @Staphylo for viz

wenyiz2021 commented 1 year ago

Arista-7800R3-48CQM2-C48

kenneth-arista commented 1 year ago

@wenyiz2021 in your chassis, what linecards do you have and which linecard position is the CL2MS at? The question is whether the CL2MS has the uplink ports.

wenyiz2021 commented 1 year ago

> @wenyiz2021 in your chassis, what linecards do you have and which linecard position is the CL2MS at? The question is whether the CL2MS has the uplink ports.

@kenneth-arista we converted the topology, so right now I don't have this card in the testbed. When this issue happened, the Wolverine linecard was the uplink card, and CL2 and CL2MS were the two downlink cards, for three linecards in total.

kenneth-arista commented 1 year ago

Will close this issue for now. Please reopen if it comes up again.