aristanetworks / sonic

Open source drivers and initialization library for Arista platforms running SONiC
GNU General Public License v2.0

[Chassis]: Errors seen on the clearwater-2 card #76

Closed arlakshm closed 1 year ago

arlakshm commented 1 year ago

The following errors are seen continuously on the clearwater-2 card. This causes the log analyzers to fail. I don't know what caused this error. Let me know if you need any more logs.


Jan 30 01:09:50.688285 str2-7804-lc7-1 WARNING kernel: [87721.072616] scd 0000:07:00.0: #4 rsp { .reg=0x00040200, .fe=0, .foe=0, .ss=04, .ti=00, .flushed=0, .ack_error=0, .timeout_error=1, .bus_conflict_error=0, .d=0x00 } bus=47 addr=0x50 ti=0 err=-5
Jan 30 01:09:50.688299 str2-7804-lc7-1 WARNING kernel: [87721.072616]  (scd_smbus_master_xfer:398)
Jan 30 01:09:50.828289 str2-7804-lc7-1 WARNING kernel: [87721.214419] scd 0000:07:00.0: #4 rsp { .reg=0x00040200, .fe=0, .foe=0, .ss=04, .ti=00, .flushed=0, .ack_error=0, .timeout_error=1, .bus_conflict_error=0, .d=0x00 } bus=47 addr=0x50 ti=0 err=-5
Jan 30 01:09:50.828312 str2-7804-lc7-1 WARNING kernel: [87721.214419]  (scd_smbus_master_xfer:398)
Jan 30 01:09:50.972277 str2-7804-lc7-1 WARNING kernel: [87721.357075] scd 0000:07:00.0: #4 rsp { .reg=0x00040200, .fe=0, .foe=0, .ss=04, .ti=00, .flushed=0, .ack_error=0, .timeout_error=1, .bus_conflict_error=0, .d=0x00 } bus=46 addr=0x50 ti=0 err=-5
Jan 30 01:09:50.972289 str2-7804-lc7-1 WARNING kernel: [87721.357075]  (scd_smbus_master_xfer:398)
Jan 30 01:09:51.116293 str2-7804-lc7-1 WARNING kernel: [87721.499275] scd 0000:07:00.0: #4 rsp { .reg=0x00040200, .fe=0, .foe=0, .ss=04, .ti=00, .flushed=0, .ack_error=0, .timeout_error=1, .bus_conflict_error=0, .d=0x00 } bus=46 addr=0x50 ti=0 err=-5
Jan 30 01:09:51.116306 str2-7804-lc7-1 WARNING kernel: [87721.499275]  (scd_smbus_master_xfer:398)
Jan 30 01:09:51.256285 str2-7804-lc7-1 WARNING kernel: [87721.641958] scd 0000:07:00.0: #5 rsp { .reg=0x00040200, .fe=0, .foe=0, .ss=04, .ti=00, .flushed=0, .ack_error=0, .timeout_error=1, .bus_conflict_error=0, .d=0x00 } bus=52 addr=0x50 ti=0 err=-5
Jan 30 01:09:51.256297 str2-7804-lc7-1 WARNING kernel: [87721.641958]  (scd_smbus_master_xfer:398)
Jan 30 01:09:51.400276 str2-7804-lc7-1 WARNING kernel: [87721.783722] scd 0000:07:00.0: #5 rsp { .reg=0x00040200, .fe=0, .foe=0, .ss=04, .ti=00, .flushed=0, .ack_error=0, .timeout_error=1, .bus_conflict_error=0, .d=0x00 } bus=52 addr=0x50 ti=0 err=-5
Jan 30 01:09:51.400284 str2-7804-lc7-1 WARNING kernel: [87721.783722]  (scd_smbus_master_xfer:398)
Jan 30 01:09:51.540281 str2-7804-lc7-1 WARNING kernel: [87721.926406] scd 0000:07:00.0: #6 rsp { .reg=0x00040200, .fe=0, .foe=0, .ss=04, .ti=00, .flushed=0, .ack_error=0, .timeout_error=1, .bus_conflict_error=0, .d=0x00 } bus=60 addr=0x50 ti=0 err=-5
Jan 30 01:09:51.540297 str2-7804-lc7-1 WARNING kernel: [87721.926406]  (scd_smbus_master_xfer:398)
Jan 30 01:09:51.684296 str2-7804-lc7-1 WARNING kernel: [87722.068609] scd 0000:07:00.0: #6 rsp { .reg=0x00040200, .fe=0, .foe=0, .ss=04, .ti=00, .flushed=0, .ack_error=0, .timeout_error=1, .bus_conflict_error=0, .d=0x00 } bus=60 addr=0x50 ti=0 err=-5
Jan 30 01:09:51.684314 str2-7804-lc7-1 WARNING kernel: [87722.068609]  (scd_smbus_master_xfer:398)
Jan 30 01:09:51.828294 str2-7804-lc7-1 WARNING kernel: [87722.211264] scd 0000:07:00.0: #4 rsp { .reg=0x00040200, .fe=0, .foe=0, .ss=04, .ti=00, .flushed=0, .ack_error=0, .timeout_error=1, .bus_conflict_error=0, .d=0x00 } bus=48 addr=0x50 ti=0 err=-5
Jan 30 01:09:51.828309 str2-7804-lc7-1 WARNING kernel: [87722.211264]  (scd_smbus_master_xfer:398)
Jan 30 01:09:51.968277 str2-7804-lc7-1 WARNING kernel: [87722.353480] scd 0000:07:00.0: #4 rsp { .reg=0x00040200, .fe=0, .foe=0, .ss=04, .ti=00, .flushed=0, .ack_error=0, .timeout_error=1, .bus_conflict_error=0, .d=0x00 } bus=48 addr=0x50 ti=0 err=-5
Jan 30 01:09:51.968294 str2-7804-lc7-1 WARNING kernel: [87722.353480]  (scd_smbus_master_xfer:398)
Jan 30 01:09:52.112284 str2-7804-lc7-1 WARNING kernel: [87722.495712] scd 0000:07:00.0: #5 rsp { .reg=0x00040200, .fe=0, .foe=0, .ss=04, .ti=00, .flushed=0, .ack_error=0, .timeout_error=1, .bus_conflict_error=0, .d=0x00 } bus=53 addr=0x50 ti=0 err=-5
arlakshm commented 1 year ago

@Staphylo , @kenneth-arista @ysmanman for viz...

Staphylo commented 1 year ago

@arlakshm these are SMBus I/O errors caused by communication issues with the xcvrs. Do you know exactly when these are happening? Is it only happening during platform tests (e.g. xcvr reset)? Are your interfaces working?
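
To help narrow down when and where these occur, the pasted lines can be grouped per bus. Below is a minimal sketch, assuming the syslog format is exactly as quoted in this issue. `parse_line` and `summarize` are hypothetical helper names for illustration, not part of the Arista driver or SONiC. The field names (`.timeout_error`, `bus`, `addr`, `err`) come from the log itself; the meaning of the `#4`/`#5`/`#6` index is not stated in this thread, so it is only carried through as `idx`. On Linux, `err=-5` corresponds to `-EIO`.

```python
import re
from collections import Counter

# Matches the "scd ... rsp { ... } bus=.. addr=.. err=.." lines quoted in this issue.
# The layout is assumed from the pasted log; other driver versions may differ.
RSP_RE = re.compile(
    r"(?P<ts>\w{3}\s+\d+\s+[\d:.]+)\s+(?P<host>\S+)\s+WARNING kernel:.*?"
    r"scd (?P<pci>[0-9a-f:.]+): #(?P<idx>\d+) rsp \{(?P<fields>[^}]*)\}\s+"
    r"bus=(?P<bus>\d+) addr=(?P<addr>0x[0-9a-fA-F]+) ti=\d+ err=(?P<err>-?\d+)"
)


def parse_line(line):
    """Return a dict for an scd SMBus response line, or None for any other line."""
    m = RSP_RE.search(line)
    if m is None:
        return None
    # Split the "{ .name=value, ... }" body into name/value pairs.
    fields = dict(re.findall(r"\.(\w+)=(\w+)", m.group("fields")))
    return {
        "ts": m.group("ts"),
        "host": m.group("host"),
        "idx": int(m.group("idx")),
        "bus": int(m.group("bus")),
        "addr": m.group("addr"),
        "err": int(m.group("err")),
        # Names of the *_error fields that are set to 1 in this response.
        "flags": [k for k, v in fields.items() if k.endswith("_error") and v == "1"],
    }


def summarize(lines):
    """Count failing transfers per (bus, addr) pair."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec is not None:
            counts[(rec["bus"], rec["addr"])] += 1
    return counts
```

Calling `summarize()` on an open syslog file and printing `counts.most_common()` would show which buses fail most often; comparing the `ts` values against test start times may show whether the errors line up with platform tests.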

arlakshm commented 1 year ago

Hi @Staphylo, one of the ports is link down. I don't know if these logs are related to that.

admin@str2-7804-lc7-1:~$ show interface status | grep -i Ethernet96
     Ethernet96    94,95     100G   9100     rs  Ethernet25/1           routed    down       up     N/A         off                                                                                                                      
admin@str2-7804-lc7-1:~$

I do not know when this problem started.

arlakshm commented 1 year ago

These logs are causing about 200 cases to be marked as error.

rlhui commented 1 year ago

Hi Arista team, as this is a top issue causing low pass rate in our testbed, can it be fixed this week? Thanks.

kenneth-arista commented 1 year ago

Hi Rita,

Samuel asked some questions earlier to help understand if this is a hardware issue or not. Specifically,

Thanks, Kenneth

On Tue, Jan 31, 2023 at 9:48 PM Rita Hui @.***> wrote:

Hi Arista team, as this is a top issue causing low pass rate in our testbed, can it be fixed this week? Thanks.


arlakshm commented 1 year ago

Hi @kenneth-arista , Please see responses below.

kenneth-arista commented 1 year ago

Thanks for the debug session last week. We confirmed that the CL2 linecard is from an early production batch. We will follow up offline to determine if we can replace this hardware to avoid future issues.