aristanetworks / sonic

Open source drivers and initialization library for Arista platforms running SONiC
GNU General Public License v2.0

[chassis] Clearwater2 cards are going to read-only #90

Closed arlakshm closed 1 year ago

arlakshm commented 1 year ago

The Clearwater2 linecards are going into read-only mode during the sonic-mgmt nightly test.

Messages from dmesg:

[  650.871471] EXT4-fs (loop1): I/O error while writing superblock
[  650.871473] EXT4-fs (loop1): previous I/O error to superblock detected
[  650.871474] EXT4-fs error (device loop1): ext4_journal_check_start:83: Detected aborted journal
[  650.871475] EXT4-fs (loop1): Remounting filesystem read-only
[  650.871504] Buffer I/O error on dev loop1, logical block 0, lost sync page write
[  650.871510] EXT4-fs (loop1): I/O error while writing superblock
[  650.871511] EXT4-fs error (device loop1): ext4_journal_check_start:83: Detected aborted journal
[  650.871514] EXT4-fs (loop1): failed to convert unwritten extents to written extents -- potential data loss!  (inode 12, error -30)
[  650.871516] Buffer I/O error on device loop1, logical block 429626
[  650.871519] Buffer I/O error on device loop1, logical block 429627
[  650.871520] Buffer I/O error on device loop1, logical block 429628
[  650.871521] Buffer I/O error on device loop1, logical block 429629
[  650.871575] EXT4-fs (loop1): failed to convert unwritten extents to written extents -- potential data loss!  (inode 13, error -30)
[  650.963391] EXT4-fs (loop1): This should not happen!! Data will be lost

kenneth-arista commented 1 year ago

@arlakshm as discussed offline, mounting /tmp and /var as tmpfs will help minimize potential flash corruption due to power loss, etc.

kenneth-arista commented 1 year ago

In terms of logs/info to collect,

rlhui commented 1 year ago

@kenneth-arista is root cause confirmed/known?

kenneth-arista commented 1 year ago

The trigger is not specific to CL2. It is a known behavior of EXT4: when the file system has been corrupted by an unclean unmount (e.g. sudden power loss), it can abort the journal and remount read-only.
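
For anyone who wants to check whether a card has already hit this state, here is a minimal sketch that looks for ext4 file systems whose mount options include `ro`. It assumes the standard `/proc/mounts` line layout (device, mount point, type, options); the function name and the sample text are hypothetical and are not taken from the platform code.

```python
def read_only_ext4_mounts(mounts_text):
    """Return mount points of ext4 file systems mounted with the 'ro' option."""
    read_only = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        if fs_type == "ext4" and "ro" in options.split(","):
            read_only.append(mount_point)
    return read_only


# Hypothetical sample in /proc/mounts format; real devices will differ.
SAMPLE_MOUNTS = """\
/dev/loop1 /example/rw ext4 ro,relatime 0 0
/dev/sda1 /example/host ext4 rw,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev 0 0
"""

print(read_only_ext4_mounts(SAMPLE_MOUNTS))
```

On a linecard, the text of `/proc/mounts` would be passed in place of the sample. A non-empty result only says that a file system is currently read-only; the dmesg output is still needed to confirm that a journal abort caused it.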

kenneth-arista commented 1 year ago

Looks like other platforms are moving /var/log to tmpfs to minimize writes to flash. See https://github.com/sonic-net/sonic-buildimage/pull/15077
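
To verify that a path such as /var/log really ends up on tmpfs after a change like that, a rough sketch along these lines could be used. It picks the longest mount point that contains the path and reports that entry's file system type. The helper name and the sample mount table are assumptions for illustration, not the layout of any particular image.

```python
def fs_type_for(path, mounts_text):
    """Return the file system type backing `path`, or None if nothing matches.

    `mounts_text` is expected in /proc/mounts format. The longest matching
    mount point wins; on a tie, the later entry wins because later mounts
    shadow earlier ones.
    """
    best_len, best_type = -1, None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Compare with a trailing slash so /var/log does not match /var/logs.
        if (path.rstrip("/") + "/").startswith(prefix):
            if len(mount_point) >= best_len:
                best_len, best_type = len(mount_point), fs_type
    return best_type


# Hypothetical mount table; real devices will differ.
SAMPLE_MOUNTS = """\
overlay / overlay rw,relatime 0 0
tmpfs /tmp tmpfs rw,nosuid,nodev 0 0
tmpfs /var/log tmpfs rw,nosuid,nodev 0 0
"""

for p in ("/tmp", "/var/log", "/var/lib"):
    print(p, fs_type_for(p, SAMPLE_MOUNTS))
```

Paths that report anything other than `tmpfs` are still being written to the underlying storage.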

kenneth-arista commented 1 year ago

The problem is understood, so I'm closing this issue. We'll be pushing some changes to the platform code that should help reduce occurrences in sonic-mgmt testing.