aristanetworks / sonic

Open source drivers and initialization library for Arista platforms running SONiC
GNU General Public License v2.0
22 stars 30 forks source link

[chassis] GBsyncd sai errors seen on syncd shutdown #42

Closed arlakshm closed 1 year ago

arlakshm commented 2 years ago

The following gbsyncd errors are seen on syncd shutdown. I think these are beign errors, please check

May  1 02:24:57.475181 str2-7804-lc6-1 NOTICE admin: User requested rebooting device ...
May  1 02:24:58.031759 str2-7804-lc6-1 NOTICE admin: Stopping syncd process...
May  1 02:24:58.186149 str2-7804-lc6-1 NOTICE syncd#syncd_request_shutdown: :- loadFromFile: no context config specified, will load default context config
May  1 02:24:58.186149 str2-7804-lc6-1 NOTICE syncd#syncd_request_shutdown: :- insert: added switch: idx 0, hwinfo ''
May  1 02:24:58.186679 str2-7804-lc6-1 NOTICE syncd#syncd_request_shutdown: :- send: requested COLD shutdown
May  1 02:24:58.186912 str2-7804-lc6-1 NOTICE syncd#syncd: :- run: is asic queue empty: 1
May  1 02:24:58.186912 str2-7804-lc6-1 NOTICE syncd#syncd: :- run: drained queue
May  1 02:24:58.186942 str2-7804-lc6-1 NOTICE syncd#syncd: :- handleRestartQuery: received COLD switch shutdown event
May  1 02:24:58.187043 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- run: is asic queue empty: 1
May  1 02:24:58.187078 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- run: drained queue
May  1 02:24:58.187078 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- handleRestartQuery: received COLD switch shutdown event
May  1 02:24:58.187100 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- removeAllSwitches: Removing all switches
May  1 02:24:58.187100 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- removeAllSwitches: Can't delete a switch RID oid:0x1: SAI_STATUS_OBJECT_IN_USE
May  1 02:24:58.187124 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- removeAllSwitches: removing switch RID oid:0x1 took 0.000027 sec
May  1 02:24:58.187124 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- removeAllSwitches: Can't delete a switch RID oid:0x2: SAI_STATUS_OBJECT_IN_USE
May  1 02:24:58.187141 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- removeAllSwitches: removing switch RID oid:0x2 took 0.000008 sec
May  1 02:24:58.187174 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- removeAllSwitches: Can't delete a switch RID oid:0x3: SAI_STATUS_OBJECT_IN_USE
May  1 02:24:58.187174 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- removeAllSwitches: removing switch RID oid:0x3 took 0.000020 sec
May  1 02:24:58.187186 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- removeAllSwitches: Can't delete a switch RID oid:0x4: SAI_STATUS_OBJECT_IN_USE
May  1 02:24:58.187186 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- removeAllSwitches: removing switch RID oid:0x4 took 0.000006 sec
May  1 02:24:58.187209 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- removeAllSwitches: Can't delete a switch RID oid:0x5: SAI_STATUS_OBJECT_IN_USE
May  1 02:24:58.187209 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- removeAllSwitches: removing switch RID oid:0x5 took 0.000007 sec
May  1 02:24:58.187227 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- removeAllSwitches: Can't delete a switch RID oid:0x6: SAI_STATUS_OBJECT_IN_USE
May  1 02:24:58.187258 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- removeAllSwitches: removing switch RID oid:0x6 took 0.000007 sec
May  1 02:24:58.187258 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- removeAllSwitches: Can't delete a switch RID oid:0x7: SAI_STATUS_OBJECT_IN_USE
May  1 02:24:58.187273 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- removeAllSwitches: removing switch RID oid:0x7 took 0.000005 sec
May  1 02:24:58.187273 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- removeAllSwitches: Can't delete a switch RID oid:0x8: SAI_STATUS_OBJECT_IN_USE
May  1 02:24:58.187293 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- removeAllSwitches: removing switch RID oid:0x8 took 0.000006 sec
May  1 02:24:58.187293 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- removeAllSwitches: Can't delete a switch RID oid:0x9: SAI_STATUS_OBJECT_IN_USE
May  1 02:24:58.187312 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- removeAllSwitches: removing switch RID oid:0x9 took 0.000007 sec
May  1 02:24:58.187312 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- removeAllSwitches: Can't delete a switch RID oid:0xa: SAI_STATUS_OBJECT_IN_USE
May  1 02:24:58.187349 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- removeAllSwitches: removing switch RID oid:0xa took 0.000021 sec
May  1 02:24:58.187349 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- removeAllSwitches: Can't delete a switch RID oid:0xb: SAI_STATUS_OBJECT_IN_USE
May  1 02:24:58.187368 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- removeAllSwitches: removing switch RID oid:0xb took 0.000005 sec
May  1 02:24:58.187368 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- removeAllSwitches: Can't delete a switch RID oid:0xc: SAI_STATUS_OBJECT_IN_USE
May  1 02:24:58.187386 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- removeAllSwitches: removing switch RID oid:0xc took 0.000006 sec
May  1 02:24:58.187386 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- run: calling api uninitialize
May  1 02:24:58.218165 str2-7804-lc6-1 NOTICE syncd#syncd: :- removeAllSwitches: Removing all switches
May  1 02:24:58.497013 str2-7804-lc6-1 NOTICE syncd#syncd: :- threadFunction: time span 309 ms for 'shutting down syncd'
May  1 02:24:58.779265 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- run: uninitialize finished
May  1 02:24:59.057069 str2-7804-lc6-1 NOTICE gbsyncd#syncd: :- threadFunction: ending timer watchdog thread
May  1 02:24:59.231222 str2-7804-lc6-1 INFO syncd#syncd: [none] SAI_API_SWITCH:_brcm_sai_dm_fini:28104 SAI Timing: DM fini time 0.007 seconds
May  1 02:24:59.232044 str2-7804-lc6-1 INFO syncd#/supervisord: syncd PRBSStat thread is not running.#015
May  1 02:24:59.497106 str2-7804-lc6-1 NOTICE syncd#syncd: :- threadFunction: time span 1310 ms for 'shutting down syncd'
May  1 02:24:59.909893 str2-7804-lc6-1 INFO gbsyncd#/supervisord: syncd PORT: SAI_LOG_LEVEL_WARN> set_port_attribute ignore error SAI_STATUS_OBJECT_IN_USE
May  1 02:24:59.909893 str2-7804-lc6-1 INFO gbsyncd#/supervisord: syncd <SAI_API_PORT: SAI_LOG_LEVEL_WARN> set_port_attribute ignore error SAI_STATUS_OBJECT_IN_USE
May  1 02:24:59.910223 str2-7804-lc6-1 INFO gbsyncd#/supervisord: message repeated 33 times: [ syncd <SAI_API_PORT: SAI_LOG_LEVEL_WARN> set_port_attribute ignore error SAI_STATUS_OBJECT_IN_USE]
May  1 02:24:59.910223 str2-7804-lc6-1 INFO gbsyncd#/supervisord: syncd run() thread exit
May  1 02:24:59.910223 str2-7804-lc6-1 INFO gbsyncd#/supervisord: syncd run() end
May  1 02:24:59.910241 str2-7804-lc6-1 INFO gbsyncd#/supervisord: syncd [OWL][phy_api_uninitialize]
May  1 02:24:59.911251 str2-7804-lc6-1 INFO gbsyncd#supervisord 2022-05-01 02:24:59,910 INFO exited: syncd (exit status 0; expected)
May  1 02:25:00.497205 str2-7804-lc6-1 NOTICE syncd#syncd: :- threadFunction: time span 2310 ms for 'shutting down syncd'
Staphylo commented 2 years ago

@byu343 could you comment on that?

byu343 commented 2 years ago

This is because the credo driver assumes that its children resources (like ports) should be removed first through SAI API before the removal of switches, so the remove_switch in the log trace returns SAI_STATUS_OBJECT_IN_USE. Although it fails, it doesn't really affect the following steps of shutdown and the restart, so the log items can be ignored for now. Meanwhile we will try to investigate the fix that the remove_switch (implemented by credo driver) can recursively clean its children.

byu343 commented 1 year ago

The error in removeAllSwitches() has been fixed on the latest source code. The version of the binary (in https://github.com/sonic-net/sonic-buildimage/blob/master/platform/components/docker-gbsyncd-credo.mk) needs to upgraded to absorb the fix.