ColinIanKing / stress-ng

This is the stress-ng upstream project git repository. stress-ng will stress test a computer system in various selectable ways. It was designed to exercise various physical subsystems of a computer as well as the various operating system kernel interfaces.
https://github.com/ColinIanKing/stress-ng
GNU General Public License v2.0
1.81k stars 289 forks source link

pci stressor regression #255

Closed Cypresslin closed 1 year ago

Cypresslin commented 1 year ago

Hi Colin, while testing the V0.15.02 update, I noticed that the pci stressor would fail with Bionic (4.15.0-202-generic), Focal (5.4.0-138.155), Jammy (5.15.0-59.65) and Kinetic (5.19.0-30.31) on ARM64 / PowerPC

Test log on a Cavium ThunderX2 ARM64 box with B-4.15:

$ sudo ./stress-ng -v -t 5 --pci 4 --pci-ops 3000 --ignite-cpu --syslog --verbose --verify --oomable
stress-ng: debug: [36741] invoked with './stress-ng -v -t 5 --pci 4 --pci-ops 3000 --ignite-cpu --syslog --verbose --verify --oomable' by user 0 'root'
stress-ng: debug: [36741] stress-ng 0.15.02 g91ec6bccd7e9
stress-ng: debug: [36741] system: Linux helo-kernel 4.15.0-202-generic #213-Ubuntu SMP Thu Jan 5 19:22:02 UTC 2023 aarch64
stress-ng: debug: [36741] RAM total: 251.8G, RAM free: 249.2G, swap free: 8.0G
stress-ng: debug: [36741] temporary file path: '.', filesystem type: ext2
stress-ng: debug: [36741] 224 processors online, 224 processors configured
stress-ng: info:  [36741] setting to a 5 second run per stressor
stress-ng: info:  [36741] dispatching hogs: 4 pci
stress-ng: debug: [36741] cache allocate: shared cache buffer size: 32768K
stress-ng: debug: [36741] starting stressors
stress-ng: debug: [36741] 4 stressors started
stress-ng: debug: [36742] pci: started [36742] (instance 0)
stress-ng: debug: [36744] pci: started [36744] (instance 2)
stress-ng: debug: [36745] pci: started [36745] (instance 3)
stress-ng: debug: [36743] pci: started [36743] (instance 1)
stress-ng: debug: [36741] process [36742] (pci) terminated on signal: 11 (Segmentation fault)
stress-ng: debug: [36741] process [36742] terminated
stress-ng: debug: [36744] pci: exited [36744] (instance 2)
stress-ng: debug: [36743] pci: exited [36743] (instance 1)
stress-ng: debug: [36741] process [36743] terminated
stress-ng: debug: [36741] process [36744] terminated
stress-ng: debug: [36741] process [36745] (pci) terminated on signal: 11 (Segmentation fault)
stress-ng: debug: [36741] process [36745] terminated
stress-ng: fail:  [36741] pci instance 0 corrupted bogo-ops counter, 102 vs 0
stress-ng: fail:  [36741] pci instance 0 hash error in bogo-ops counter and run flag, 1605818725 vs 0
stress-ng: fail:  [36741] pci instance 3 corrupted bogo-ops counter, 20 vs 0
stress-ng: fail:  [36741] pci instance 3 hash error in bogo-ops counter and run flag, 2612107102 vs 0
stress-ng: fail:  [36741] metrics-check: stressor metrics corrupted, data is compromised
info: 5 failures reached, aborting stress process
stress-ng: info:  [36741] unsuccessful run completed in 1.46s

dmesg output:

[ 2948.854546] [qed_dbg_parse_attn:7435(enp1s0f0-0)]grc (Interrupt) : timeout_event [address 0x0005018c, bit 1]
[ 2948.854552] [qed_dbg_parse_attn:7435(enp1s0f1-0)]grc (Interrupt) : trace_fifo_valid_data [address 0x0005018c, bit 4] [masked]
[ 2948.855363] [qed_dbg_parse_attn:7435(enp1s0f1-0)]grc (Interrupt) : timeout_event [address 0x0005018c, bit 1]
[ 2948.857141] Synchronous External Abort: synchronous external abort (0x96000610) at 0x0000fffff1a1bfc0
[ 2948.857146] Synchronous External Abort: synchronous external abort (0x96000610) at 0x0000fffff1a1bfc0
[ 2948.857150] Internal error: : 96000610 [#15] SMP
[ 2948.857156] Modules linked in: nls_iso8859_1 joydev input_leds binfmt_misc shpchp ipmi_ssif ipmi_devintf ipmi_msghandler cppc_cpufreq thunderx2_pmu sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear i2c_smbus ast i2c_algo_bit ttm qede drm_kms_helper hid_generic syscopyarea sysfillrect ptp sysimgblt usbhid pps_core fb_sys_fops uas hid usb_storage aes_ce_blk aes_ce_cipher crc32_ce crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce mpt3sas qed drm raid_class scsi_transport_sas ahci libahci gpio_xlp i2c_xlp9xx aes_neon_bs aes_neon_blk crypto_simd cryptd aes_arm64
[ 2948.857229] Process stress-ng-pci (pid: 36778, stack limit = 0x00000000658a47c2)
[ 2948.857235] CPU: 176 PID: 36778 Comm: stress-ng-pci Tainted: G      D          4.15.0-202-generic #213-Ubuntu
[ 2948.857238] Hardware name: To be filled by O.E.M. Saber/Saber, BIOS 0ACKL030 06/04/2020
[ 2948.857243] pstate: 80400009 (Nzcv daif +PAN -UAO)
[ 2948.857250] pc : __memcpy_fromio+0x68/0xb8
[ 2948.857257] lr : pci_read_rom+0xd4/0xf8
[ 2948.857260] sp : ffff00001e3d3d00
[ 2948.857263] x29: ffff00001e3d3d00 x28: ffff809e8ee63c00 
[ 2948.857267] x27: ffff000008b82000 x26: 000000000000003f 
[ 2948.857272] x25: ffff80bedf9bc000 x24: 0000000000001000 
[ 2948.857275] x23: ffff000047e40000 x22: 0000000000000002 
[ 2948.857279] x21: 0000000000000ffa x20: ffff000047e40008 
[ 2948.857283] x19: ffff80bedf9bc6fe x18: 0000ffffae2c9a70 
[ 2948.857287] x17: 0000ffffae2555c0 x16: ffff0000082fdd88 
[ 2948.857290] x15: 0140000000000000 x14: 000391214aca5833 
[ 2948.857294] x13: 0000000063c8c3d4 x12: ffff008047e40000 
[ 2948.857298] x11: 0000000003bd7277 x10: 0000000063c8c3d4 
[ 2948.857302] x9 : ffff000087e40000 x8 : 0000009ed6c151f8 
[ 2948.857306] x7 : 0000fffff99c0000 x6 : ffff000047e80000 
[ 2948.857309] x5 : ffff000047e80000 x4 : 0000000052494350 
[ 2948.857313] x3 : 0000000000000ff8 x2 : ffff80bedf9bcffe 
[ 2948.857316] x1 : f0c9c1fc24697e82 x0 : ffff000047e40700 
[ 2948.857320] Call trace:
[ 2948.857325]  __memcpy_fromio+0x68/0xb8
[ 2948.857330]  pci_read_rom+0xd4/0xf8
[ 2948.857338]  sysfs_kf_bin_read+0x6c/0xa8
[ 2948.857342]  kernfs_fop_read+0xa8/0x210
[ 2948.857346]  __vfs_read+0x48/0x80
[ 2948.857350]  vfs_read+0x94/0x150
[ 2948.857353]  SyS_read+0x74/0xf0
[ 2948.857358]  el0_svc_naked+0x30/0x34
[ 2948.857363] Code: 927df063 91002063 8b030262 d503201f (f9400001) 
[ 2948.857368] ---[ end trace f11e6e82b0f89869 ]---
[ 2948.857373] Internal error: : 96000610 [#16] SMP
[ 2948.857378] Modules linked in: nls_iso8859_1 joydev input_leds binfmt_misc shpchp ipmi_ssif ipmi_devintf ipmi_msghandler cppc_cpufreq thunderx2_pmu sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear i2c_smbus ast i2c_algo_bit ttm qede drm_kms_helper hid_generic syscopyarea sysfillrect ptp sysimgblt usbhid pps_core fb_sys_fops uas hid usb_storage aes_ce_blk aes_ce_cipher crc32_ce crct10dif_ce ghash_ce sha2_ce sha256_arm64 sha1_ce mpt3sas qed drm raid_class scsi_transport_sas ahci libahci gpio_xlp i2c_xlp9xx aes_neon_bs aes_neon_blk crypto_simd cryptd aes_arm64
[ 2948.857450] Process stress-ng-pci (pid: 36781, stack limit = 0x00000000bb1a59f0)
[ 2948.857456] CPU: 151 PID: 36781 Comm: stress-ng-pci Tainted: G      D          4.15.0-202-generic #213-Ubuntu
[ 2948.857458] Hardware name: To be filled by O.E.M. Saber/Saber, BIOS 0ACKL030 06/04/2020
[ 2948.857462] pstate: 80400009 (Nzcv daif +PAN -UAO)
[ 2948.857469] pc : __memcpy_fromio+0x68/0xb8
[ 2948.857476] lr : pci_read_rom+0xd4/0xf8
[ 2948.857478] sp : ffff00001d5d3d00
[ 2948.857481] x29: ffff00001d5d3d00 x28: ffff809e8ee62d00 
[ 2948.857485] x27: ffff000008b82000 x26: 000000000000003f 
[ 2948.857489] x25: ffff80bedf0c4000 x24: 0000000000001000 
[ 2948.857493] x23: ffff000048040000 x22: 0000000000000002 
[ 2948.857497] x21: 0000000000000ffa x20: ffff000048040008 
[ 2948.857500] x19: ffff80bedf0c4056 x18: 0000ffffae2c9a70 
[ 2948.857504] x17: 0000ffffae2555c0 x16: ffff0000082fdd88 
[ 2948.857508] x15: 0140000000000000 x14: 0003ce2bc5c7d433 
[ 2948.857512] x13: 0000000063c8c3d4 x12: ffff008048040000 
[ 2948.857515] x11: 0000000003e75bce x10: 0000000063c8c3d4 
[ 2948.857519] x9 : ffff000088040000 x8 : 0000009ed6c15200 
[ 2948.857523] x7 : 0000fffff97c0000 x6 : ffff000048080000 
[ 2948.857526] x5 : ffff000048080000 x4 : 0000000052494350 
[ 2948.857530] x3 : 0000000000000ff8 x2 : ffff80bedf0c4ffe 
[ 2948.857533] x1 : 024592cc44002a09 x0 : ffff000048040058 
[ 2948.857537] Call trace:
[ 2948.857543]  __memcpy_fromio+0x68/0xb8
[ 2948.857548]  pci_read_rom+0xd4/0xf8
[ 2948.857554]  sysfs_kf_bin_read+0x6c/0xa8
[ 2948.857559]  kernfs_fop_read+0xa8/0x210
[ 2948.857563]  __vfs_read+0x48/0x80
[ 2948.857566]  vfs_read+0x94/0x150
[ 2948.857569]  SyS_read+0x74/0xf0
[ 2948.857574]  el0_svc_naked+0x30/0x34
[ 2948.857579] Code: 927df063 91002063 8b030262 d503201f (f9400001) 
[ 2948.857584] ---[ end trace f11e6e82b0f8986a ]---
[ 2948.864381] [qed_dbg_parse_attn:7435(enp1s0f0-0)]grc (Interrupt) : trace_fifo_valid_data [address 0x0005018c, bit 4] [masked]
[ 2948.864582] [qed_dbg_parse_attn:7435(enp1s0f0-0)]grc (Interrupt) : timeout_event [address 0x0005018c, bit 1]
[ 2948.875712] [qed_dbg_parse_attn:7435(enp1s0f1-0)]grc (Interrupt) : trace_fifo_valid_data [address 0x0005018c, bit 4] [masked]
[ 2948.885535] [qed_dbg_parse_attn:7435(enp1s0f0-0)]grc (Interrupt) : trace_fifo_valid_data [address 0x0005018c, bit 4] [masked]

Bisect suggest this commit might be the cause:

e1ebe935ecd438ac0a04a61a8e761598544f341d is the first bad commit
commit e1ebe935ecd438ac0a04a61a8e761598544f341d
Author: Colin Ian King <colin.i.king@gmail.com>
Date:   Fri Jan 6 12:49:19 2023 +0000

stress-pci: print PCI config and resource space read rates

Signed-off-by: Colin Ian King <colin.i.king@gmail.com>

:100644 100644 2e54c0f5cae583c0eefce8d64c3458e81a833359 630b7c296dabffca06cd41dff1e3f31f58a5bf17 M  stress-pci.c

Bisect log

$ git bisect log
git bisect start
# bad: [4164f6842c712c2d9a13619c3c70fd35d8d02cdb] Debian: update changelog
git bisect bad 4164f6842c712c2d9a13619c3c70fd35d8d02cdb
# good: [91ec6bccd7e94fd04674b29e081e579710b97d71] Debian: update changelog
git bisect good 91ec6bccd7e94fd04674b29e081e579710b97d71
# good: [16c45ed56baf02a85b97f785f5dc81c154ccfb04] stress-mmap: make variable mask non-clobberable
git bisect good 16c45ed56baf02a85b97f785f5dc81c154ccfb04
# good: [e49b6cf6cea5f55f817ca1e41c758263bb1a52a2] stress-jpeg: add compression ration metrics
git bisect good e49b6cf6cea5f55f817ca1e41c758263bb1a52a2
# bad: [697782d4fe07ed8a030a2f965e074e47279b07a1] stress-ng.h: divide by size of make allocator bitmap elements instead of 8
git bisect bad 697782d4fe07ed8a030a2f965e074e47279b07a1
# good: [fb940afa9e682d2a00d6c3bf8d6997e2febbee8a] core-hash: add more comments to explain the nuances of memcpy
git bisect good fb940afa9e682d2a00d6c3bf8d6997e2febbee8a
# bad: [bcf1ec7d7c53373077f495e173af7abeafc7d9c3] stress-list: use builtin shim_ror64
git bisect bad bcf1ec7d7c53373077f495e173af7abeafc7d9c3
# good: [ec8283a4b2d0b5429e7a120564a71a8dc9d9b46a] stress-forkheavy: remove sleep and send SIGALRM to pids on termination
git bisect good ec8283a4b2d0b5429e7a120564a71a8dc9d9b46a
# bad: [45337e787786f9af8dc79eb3488bc8789e082357] core-helper: fix stack size for non-first calls (incorrect cached value)
git bisect bad 45337e787786f9af8dc79eb3488bc8789e082357
# bad: [e1ebe935ecd438ac0a04a61a8e761598544f341d] stress-pci: print PCI config and resource space read rates
git bisect bad e1ebe935ecd438ac0a04a61a8e761598544f341d
# first bad commit: [e1ebe935ecd438ac0a04a61a8e761598544f341d] stress-pci: print PCI config and resource space read rates

On another two Lenovo ThinkSystem HR330a ARM64 nodes and our Power8 node, the ssh session will hang, here is the dmesg output from console (sorry for the malformed output, copied-pasted as-is):

[  787.507576] EEH: Frozen PHB#21-PE#fc detected
[  787.507746] EEH: PE location: Backplane SATA, PHB location: N/A
[  787.507576] EE[  787.510400] BH: Frozen PHB#21uffer I/O error -on device sda2, PE#fc detectedlogical block 11
419
[  787.507746] E[  787.510806] BEuffer I/O error H: PE location: on device sda2, Blogical block 11ackplane SATA, P420
H[  787.510957] BB location: N/Auffer I/O error 
on device sda2, logical block 11421
[  787.507877] CP[  787.511060] BU: 41 PID: 5518 uffer I/O error Con device sda2, omm: stress-ng-plogical block 11c422
i Not tainted 4.[  787.511204] B1uffer I/O error 5.0-202-generic on device sda2, #logical block 11213-Ubuntu423

[  787.511335] B[  787.507879] Cuffer I/O error aon device sda2, ll Trace:logical block 11
424
[  787.511466] B[  787.507886] [uffer I/O error con device sda2, 000000714cd7b80]logical block 11 425
[c000000000d1d62[  787.511589] B8uffer I/O error ] dump_stack+0xbon device sda2, clogical block 11/0x104 (unreliab426
l[  787.511748] Be)uffer I/O error 
on device sda2, logical block 11427
[  787.507891] [[  787.511972] Bcuffer I/O error 000000714cd7bc0]on device sda2,  logical block 11[c000000000040c3428
c] eeh_dev_check_failure+0x57c/0x590
[  787.507894] [c000000714cd7c60] [c000000000040cdc] eeh_check_failure+0x8c/0xe0
[  787.507898] [c000000714cd7ca0] [c0000000007a03d4] pci_read_rom+0x134/0x160
[  787.507901] [c000000714cd7d00] [c0000000004beb90] sysfs_kf_bin_read+0x70/0xd0
[  787.507904] [c000000714cd7d20] [c0000000004bdab0] kernfs_fop_read+0xe0/0x290[  787.512867] A
borting journal on device sda2-8.
[  787.507907] [[  787.512890] EcXT4-fs (sda2): e000000714cd7d70]xt4_writepages:  jbd2_start: 9223[c0000000003ef0d372036854775806 cpages, ino 60820] __vfs_read+0x3251; err -30
c[  787.512978] J/0x70BD2: Error -5 de
tected when updating journal superblock for sda2[  787.507910] [-8.
c000000714cd7d90] [c0000000003ef1cc] vfs_read+0xbc/0x1b0
[  787.507913] [c000000714cd7de0] [c0000000003ef7d8] SyS_read+0x78/0x140
[  787.507917] [c000000714cd7e30] [c00000000000b388] system_call+0x5c/0x70
[  787.507929] EEH: Detected PCI bus error on PHB#21-PE#fc
[  787.507934] EEH: This PCI device has failed 1 times in the last hour
[  787.507935] EEH: Notify device drivers to shutdown
[  787.507957] EEH: Collect temporary log
[  787.508073] EEH: of node=0021:0[  787.514243] Ee:00.0XT4-fs (sda2): p
revious I/O error to superblock [  787.508079] Edetected
EH: PCI device/vendor: 92351b4b
[  787.514404] EXT4-fs error (de[  787.508092] Evice sda2): ext4E_journal_check_sH: PCI cmd/statutart:61: Detectesd aborted journa register: 00100l
5[  787.514414] E47XT4-fs error (de
vice sda2): ext4_journal_check_s[  787.508094] Etart:61: DetecteEd aborted journaH: PCI-E capabill
i[  787.514422] Eties and status XT4-fs (sda2): pfrevious I/O erroollow:r to superblock 
detected
[  787.514438] E[  787.508140] EXT4-fs error (deEvice sda2): ext4H: PCI-E 00: 001_journal_check_s2tart:61: Detectee010 00648702 00d aborted journa0l
0201e 00033c22 [  787.514440] E
XT4-fs (sda2): Remounting filesystem read-only
[  787.508174] E[  787.514473] EEXT4-fs (sda2): pH: PCI-E 10: 102revious I/O erro2r to superblock 0000 00000000 00detected
0[  787.514562] E00000 00000000 XT4-fs (sda2): e
xt4_writepages: jbd2_start: 9223[  787.508177] E372036854775807 Epages, ino 60817H: PCI-E 20: 000577; err -30
0[  787.514873] E0000 XT4-fs (sda2): e
xt4_writepages: jbd2_start: 9223372036854775749 [  787.508179] Epages, ino 60817E577; err -30
H: PCI-E AER capability register s[  787.515803] pet follows:rint_req_error: 
I/O error, dev sda, sector 0
[  787.508233] EEH: PCI-E AER 00: 00010001 00000000 00000000 00062030 
[  787.508284] EEH: PCI-E AER 10: 00000000 00002000 00000140 00000000 
[  787.508321] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000 
[  787.508335] EEH: PCI-E AER 30: 00000000 00000000 
[  787.508338] PHB3 PHB#21 Diag-data (Version: 1)
[  787.508340] brdgCtl:     00000002
[  787.508344] RootSts:     0000000f 00400000 b0830008 00100147 00002000
[  787.508347] PhbSts:      0000001c00000000 0000001c00000000
[  787.508350] Lem:         0000000000100000 42498e367f502eae 0000000000000000
[  787.508354] OutErr:      0200000000000000 0200000000000000 0201006000003fe8 81010f6900000000
[  787.508357] InAErr:      4000000000000000 4000000000000000 0200fe8000000000 0000000000000000
[  787.508363] PE[0fc] A/B: 8440002b00000000 8000000000000000
[  787.508365] EEH: Reset with hotplug activity
[  787.509520] ata1.00: disabled
[  787.510394] EXT4-fs warning (device sda2): ext4_end_bio:325: I/O error 10 writing to inode 60817577 (offset 0 size 0 starting block 13754)
[  787.510400] Buffer I/O error on device sda2, logical block 11419
[  787.510806] Buffer I/O error on device sda2, logical block 11420
[  787.510957] Buffer I/O error on device sda2, logical block 11421
[  787.511060] Buffer I/O error on device sda2, logical block 11422
[  787.511204] Buffer I/O error on device sda2, logical block 11423
[  787.511335] Buffer I/O error on device sda2, logical block 11424
[  787.511466] Buffer I/O error on device sda2, logical block 11425
[  787.511589] Buffer I/O error on device sda2, logical block 11426
[  787.511748] Buffer I/O error on device sda2, logical block 11427
[  787.511972] Buffer I/O error on device sda2, logical block 11428
[  787.512140] EXT4-fs warning (device sda2): ext4_end_bio:325: I/O [  787.552863] EXT4-fs error (device sda2): ext4_find_entry:1494: inode #13893633: comm (spawn): reading directory lblock 0
[  787.553408] EXT4-fs (sda2): previous I/O error to superblock detected
[  787.554433] EXT4-fs error (device sda2): ext4_find_entry:1494: inode #13893633: comm (spawn): reading directory lblock 0
[  787.556159] EXT4-fs error (device sda2): ext4_find_entry:1494: inode #38535169: comm systemd-journal: reading directory lblock 0
[  787.556167] EXT4-fs (sda2): previous I/O error to superblock detected
[  787.570338] EXT4-fs (sda2): previous I/O error to superblock detected
[  787.572100] EXT4-fs error (device sda2): ext4_find_entry:1494: inode #38535169: comm systemd-journal: reading directory lblock 0
[  787.574945] EXT4-fs error (device sda2): ext4_find_entry:1494: inode #38535169: comm systemd-journal: reading directory lblock 0
[  787.575259] EXT4-fs error (device sda2): ext4_find_entry:1494: inode #38535169: comm systemd-journal: reading directory lblock 0
[  787.576684] EEH: Frozen PHB#22-PE#0 detected
[  787.577469] tg3 0022:01:00.1: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  787.579055] tg3 0022:01:00.2: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  787.580643] tg3 0022:01:00.3: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  787.583838] EEH: PE location: Slot1, PHB location: N/A
[  787.589554] EXT4-fs error (device sda2): ext4_find_entry:1494: inode #38535169: comm systemd-journal: reading directory lblock 0
[  787.593237] EXT4-fs error (device sda2): ext4_find_entry:1494: inode #13893633: comm (spawn): reading directory lblock 0
[  787.593576] EXT4-fs error (device sda2): ext4_find_entry:1494: inode #38535169: comm systemd-journal: reading directory lblock 0
[  787.593952] tg3 0022:01:00.1: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  787.594333] EXT4-fs error (device sda2): ext4_find_entry:1494: inode #13893633: comm (spawn): reading directory lblock 0
[  787.597551] tg3 0022:01:00.2: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  787.599188] tg3 0022:01:00.3: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
error 10 writing to inode 60817577 (offset 0 size 0 starting block 13786)
[  787.512157] EXT4-fs warning (device sda2): ext4_end_bio:325: I/O error 10 writing to inode 60817577 (offset 0 size 0 starting block 13818)
[  787.512173] EXT4-fs warning (device sda2): ext4_end_bio:325: I/O error 10 writing to inode 60817577 (offset 0 size 0 starting block 13850)
[  787.512191] EXT4-fs warning (device sda2): ext4_end_bio:325: I/O error 10 writing to inode 60817577 (offset 0 size 0 starting block 13882)
[  787.512211] EXT4-fs warning (device sda2): ext4_end_bio:325: I/O error 10 writing to inode 60817577 (offset 0 size 0 starting block 14042)
[  787.512227] EXT4-fs warning (device sda2): ext4_end_bio:325: I/O error 10 writing to inode 60817577 (offset 0 size 0 starting block 14106)
[  787.512254] EXT4-fs warning (device sda2): ext4_end_bio:325: I/O error 10 writing to inode 60820254 (offset 0 size 0 starting block 243323137)
[  787.512270] EXT4-fs warning (device sda2): ext4_end_bio:325: I/O error 10 writing to inode 60817577 (offset 0 size 0 starting block 243342593)
[  787.512429] EXT4-fs warning (device sda2): ext4_end_bio:325: I/O error 10 writing to inode 60817577 (offset 0 size 0 starting block 14218)
[  787.512830] JBD2: Detected IO errors while flushing file data on sda2-8
[  787.512867] Aborting journal on device sda2-8.
[  787.512890] EXT4-fs (sda2): ext4_writepages: jbd2_start: 9223372036854775806 pages, ino 60820251; err -30
[  787.512978] JBD2: Error -5 detected when updating journal superblock for sda2-8.
[  787.513361] JBD2: Detected IO errors while flushing file data on sda2-8
[  787.514243] EXT4-fs (sda2): previous I/O error to superblock detected
[  787.514404] EXT4-fs error (device sda2): ext4_journal_check_start:61: Detected aborted journal
[  787.514414] EXT4-fs error (device sda2): ext4_journal_check_start:61: Detected aborted journal
[  787.514422] EXT4-fs (sda2): previous I/O error to superblock detected
[  787.514438] EXT4-fs error (device sda2): ext4_journal_check_start:61: Detected aborted journal
[  787.514440] EXT4-fs (sda2): Remounting filesystem read-only
[  787.514473] EXT4-fs (sda2): previous I/O error to superblock detected
[  787.514562] EXT4-fs (sda2): ext4_writepages: jbd2_start: 9223372036854775807 pages, ino 60817577; err -30
[  787.514873] EXT4-fs (sda2): ext4_writepages: jbd2_start: 9223372036854775749 pages, ino 60817577; err -30
[  787.515803] print_req_error: I/O error, dev sda, sector 0
[  787.552855] blk_partition_remap: fail for partition 2
[  787.552863] EXT4-fs error (device sda2): ext4_find_entry:1494: inode #13893633: comm (spawn): reading directory lblock 0
[  787.553408] EXT4-fs (sda2): previous I/O error to superblock detected
[  787.554417] blk_partition_remap: fail for partition 2
[  787.554430] blk_partition_remap: fail for partition 2
[  787.554433] EXT4-fs error (device sda2): ext4_find_entry:1494: inode #13893633: comm (spawn): reading directory lblock 0
[  787.554794] blk_partition_remap: fail for partition 2
[  787.556159] EXT4-fs error (device sda2): ext4_find_entry:1494: inode #38535169: comm systemd-journal: reading directory lblock 0
[  787.556167] EXT4-fs (sda2): previous I/O error to superblock detected
[  787.570335] blk_partition_remap: fail for partition 2
[  787.570338] EXT4-fs (sda2): previous I/O error to superblock detected
[  787.572087] blk_partition_remap: fail for partition 2
[  787.572097] blk_partition_remap: fail for partition 2
[  787.572100] EXT4-fs error (device sda2): ext4_find_entry:1494: inode #38535169: comm systemd-journal: reading directory lblock 0
[  787.574085] blk_partition_remap: fail for partition 2
[  787.574942] blk_partition_remap: fail for partition 2
[  787.574945] EXT4-fs error (device sda2): ext4_find_entry:1494: inode #38535169: comm systemd-journal: reading directory lblock 0
[  787.575139] blk_partition_remap: fail for partition 2
[  787.575256] blk_partition_remap: fail for partition 2
[  787.575259] EXT4-fs error (device sda2): ext4_find_entry:1494: inode #38535169: comm systemd-journal: reading directory lblock 0
[  787.576684] EEH: Frozen PHB#22-PE#0 detected
[  787.577469] tg3 0022:01:00.1: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  787.579055] tg3 0022:01:00.2: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  787.580643] tg3 0022:01:00.3: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  787.583814] blk_partition_remap: fail for partition 2
[  787.583838] EEH: PE location: Slot1, PHB location: N/A
[  787.586971] blk_partition_remap: fail for partition 2
[  787.588630] CPU: 33 PID: 5517 Comm: stress-ng-pci Not tainted 4.15.0-202-generic #213-Ubuntu
[  787.589552] Call Trace:
[  787.589554] EXT4-fs error (device sda2): ext4_find_entry:1494: inode #38535169: comm systemd-journal: reading directory lblock 0
[  787.589572] blk_partition_remap: fail for partition 2
[  787.591448] [c000000714cafb80] [c000000000d1d628] dump_stack+0xbc/0x104 (unreliable)
[  787.591452] [c000000714cafbc0] [c000000000040c3c] eeh_dev_check_failure+0x57c/0x590
[  787.591458] [c000000714cafc60] [c000000000040cdc] eeh_check_failure+0x8c/0xe0
[  787.591463] [c000000714cafca0] [c0000000007a03d4] pci_read_rom+0x134/0x160
[  787.591469] [c000000714cafd00] [c0000000004beb90] sysfs_kf_bin_read+0x70/0xd0
[  787.591475] [c000000714cafd20] [c0000000004bdab0] kernfs_fop_read+0xe0/0x290
[  787.591485] [c000000714cafd70] [c0000000003ef0dc] __vfs_read+0x3c/0x70
[  787.591489] [c000000714cafd90] [c0000000003ef1cc] vfs_read+0xbc/0x1b0
[  787.591492] [c000000714cafde0] [c0000000003ef7d8] SyS_read+0x78/0x140
[  787.591496] [c000000714cafe30] [c00000000000b388] system_call+0x5c/0x70
[  787.591784] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[  787.591864] sd 0:0:0:0: [sda] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  787.591866] sd 0:0:0:0: [sda] Stopping disk
[  787.591876] sd 0:0:0:0: [sda] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  787.593232] blk_partition_remap: fail for partition 2
[  787.593237] EXT4-fs error (device sda2): ext4_find_entry:1494: inode #13893633: comm (spawn): reading directory lblock 0
[  787.593572] blk_partition_remap: fail for partition 2
[  787.593576] EXT4-fs error (device sda2): ext4_find_entry:1494: inode #38535169: comm systemd-journal: reading directory lblock 0
[  787.593952] tg3 0022:01:00.1: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  787.594329] blk_partition_remap: fail for partition 2
[  787.594333] EXT4-fs error (device sda2): ext4_find_entry:1494: inode #13893633: comm (spawn): reading directory lblock 0
[  787.595884] blk_partition_remap: fail for partition 2
[  787.596024] blk_partition_remap: fail for partition 2
[  787.597551] tg3 0022:01:00.2: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  787.598768] blk_partition_remap: fail for partition 2
[  787.599188] tg3 0022:01:00.3: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  787.601213] blk_partition_remap: fail for partition 2
[  787.601350] blk_partition_remap: fail for partition 2
[  787.601419] blk_partition_remap: fail for partition 2
[  787.601593] blk_partition_remap: fail for partition 2
[  787.601836] blk_partition_remap: fail for partition 2
[  787.601929] blk_partition_remap: fail for partition 2
[  787.654670] ata2.00: disabled
[  787.656380] blk_partition_remap: fail for partition 2
[  787.656644] blk_partition_remap: fail for partition 2
[  787.656695] blk_partition_remap: fail for partition 2
[  787.656955] blk_partition_remap: fail for partition 2
[  787.657094] blk_partition_remap: fail for partition 2
[  787.657292] blk_partition_remap: fail for partition 2
[  787.657312] blk_partition_remap: fail for partition 2
[  787.657353] blk_partition_remap: fail for partition 2
[  787.657601] blk_partition_remap: fail for partition 2
[  787.657776] blk_partition_remap: fail for partition 2
[  787.658045] blk_partition_remap: fail for partition 2
[  787.658347] sd 1:0:0:0: [sdb] Synchronizing SCSI cache
[  787.658358] blk_partition_remap: fail for partition 2
[  787.658445] sd 1:0:0:0: [sdb] Synchronize Cache(10) failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  787.658449] sd 1:0:0:0: [sdb] Stopping disk
[  787.658459] sd 1:0:0:0: [sdb] Start/Stop Unit failed: Result: hostbyte=DID_BAD_TARGET driverbyte=DRIVER_OK
[  787.658716] blk_partition_remap: fail for partition 2
[  787.658960] blk_partition_remap: fail for partition 2
[  787.659153] blk_partition_remap: fail for partition 2
[  787.659364] blk_partition_remap: fail for partition 2
[  787.659427] blk_partition_remap: fail for partition 2
[  787.659491] blk_partition_remap: fail for partition 2
[  787.659617] blk_partition_remap: fail for partition 2
[  787.659710] blk_partition_remap: fail for partition 2
[  787.659810] blk_partition_remap: fail for partition 2
[  787.659890] blk_partition_remap: fail for partition 2
[  787.659970] blk_partition_remap: fail for partition 2
[  787.660277] blk_partition_remap: fail for partition 2
[  787.660406] blk_partition_remap: fail for partition 2
[  787.660519] blk_partition_remap: fail for partition 2
[  787.660532] blk_partition_remap: fail for partition 2
[  787.660607] blk_partition_remap: fail for partition 2
[  787.660734] blk_partition_remap: fail for partition 2
[  787.661010] blk_partition_remap: fail for partition 2
[  787.661088] blk_partition_remap: fail for partition 2
[  787.661215] blk_partition_remap: fail for partition 2
[  787.661476] blk_partition_remap: fail for partition 2
[  787.661666] blk_partition_remap: fail for partition 2
[  787.661908] blk_partition_remap: fail for partition 2
[  787.662043] blk_partition_remap: fail for partition 2
[  787.697900] blk_partition_remap: fail for partition 2
[  787.698198] blk_partition_remap: fail for partition 2
[  787.698605] blk_partition_remap: fail for partition 2
[  787.700374] blk_partition_remap: fail for partition 2
[  787.700612] blk_partition_remap: fail for partition 2
[  787.700945] blk_partition_remap: fail for partition 2
[  787.732248] blk_partition_remap: fail for partition 2
[  787.732588] blk_partition_remap: fail for partition 2
[  787.732706] blk_partition_remap: fail for partition 2
[  787.732921] blk_partition_remap: fail for partition 2
[  787.732975] blk_partition_remap: fail for partition 2
[  787.733023] blk_partition_remap: fail for partition 2
[  787.733121] blk_partition_remap: fail for partition 2
[  787.733316] blk_partition_remap: fail for partition 2
[  787.733630] blk_partition_remap: fail for partition 2
[  787.733639] blk_partition_remap: fail for partition 2
[  787.733801] blk_partition_remap: fail for partition 2
[  787.733914] blk_partition_remap: fail for partition 2
[  787.733966] blk_partition_remap: fail for partition 2
[  787.734104] blk_partition_remap: fail for partition 2
[  787.734444] blk_partition_remap: fail for partition 2
[  787.734560] blk_partition_remap: fail for partition 2
[  787.735036] blk_partition_remap: fail for partition 2
[  787.735161] blk_partition_remap: fail for partition 2
[  787.735171] blk_partition_remap: fail for partition 2
[  787.735256] blk_partition_remap: fail for partition 2
[  787.735428] blk_partition_remap: fail for partition 2
[  787.735434] blk_partition_remap: fail for partition 2
[  787.735446] blk_partition_remap: fail for partition 2
[  787.735553] blk_partition_remap: fail for partition 2
[  787.735673] blk_partition_remap: fail for partition 2
[  787.735743] blk_partition_remap: fail for partition 2
[  787.735990] blk_partition_remap: fail for partition 2
[  787.736000] blk_partition_remap: fail for partition 2
[  787.736124] blk_partition_remap: fail for partition 2
[  787.736381] blk_partition_remap: fail for partition 2
[  787.736397] blk_partition_remap: fail for partition 2
[  787.736504] blk_partition_remap: fail for partition 2
[  787.736605] blk_partition_remap: fail for partition 2
[  787.736625] blk_partition_remap: fail for partition 2
[  787.736695] blk_partition_remap: fail for partition 2
[  787.736816] blk_partition_remap: fail for partition 2
[  787.737071] blk_partition_remap: fail for partition 2
[  787.737200] blk_partition_remap: fail for partition 2
[  787.737353] blk_partition_remap: fail for partition 2
[  787.737456] blk_partition_remap: fail for partition 2
[  787.737566] blk_partition_remap: fail for partition 2
[  787.737844] blk_partition_remap: fail for partition 2
[  787.737995] blk_partition_remap: fail for partition 2
[  787.738073] blk_partition_remap: fail for partition 2
[  787.738194] blk_partition_remap: fail for partition 2
[  787.739084] blk_partition_remap: fail for partition 2
[  787.739208] blk_partition_remap: fail for partition 2
[  787.739324] blk_partition_remap: fail for partition 2
[  788.230709] ata1: failed to stop engine (-5)
[  788.730568] ata2: failed to stop engine (-5)
[  789.230524] ata3: failed to stop engine (-5)
[  789.730490] ata4: failed to stop engine (-5)
[  789.782466] iommu: Removing device 0021:0e:00.0 from group 1
[  789.785523] blk_partition_remap: fail for partition 2
[  789.786182] blk_partition_remap: fail for partition 2
[  789.787017] blk_partition_remap: fail for partition 2
[  789.787607] blk_partition_remap: fail for partition 2
[  789.788080] blk_partition_remap: fail for partition 2
[  789.788647] blk_partition_remap: fail for partition 2
[  789.807959] tg3 0022:01:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.807960] tg3 0022:01:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.808120] tg3 0022:01:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.808235] tg3 0022:01:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.809776] tg3 0022:01:00.1: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.810058] tg3 0022:01:00.1: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.807959] tg3 0022:01:00.0: [  789.810275] tIg3 0022:01:00.1:nvalid PCI ROM h Invalid PCI ROMe header signaturader signature: e: expecting 0xaea55, got 0xffff
xpecting 0xaa55,[  789.810549] t g3 0022:01:00.1:got 0xffff Invalid PCI ROM
 header signature: expecting 0xaa55, got 0xffff
[  789.807960] tg3 0022:01:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.808120] tg3 0022:01:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.808235] tg3 0022:01:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.811708] t[  789.809776] tg3 0022:01:00.2:g Invalid PCI ROM3 0022:01:00.1:  header signaturIe: expecting 0xanvalid PCI ROM ha55, got 0xffff
eader signature: expecting 0xaa55, got 0xffff
got 0xffff058] tg3 0022:01:00.1: Invalid PCI ROM h[  789.812044] teg3 0022:01:00.2:ader signature:  Invalid PCI ROMe header signaturxpecting 0xaa55,e: expecting 0xa a55, got 0xffff

[  789.810275] tg3 0022:01:00.1: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff[  789.812561] t
g3 0022:01:00.2: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.810549] tg3 0022:01:00.1: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.812784] tg3 0022:01:00.2: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.811708] tg3 0022:01:00.2: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.812044] tg3 0022:01:00.2: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.812561] tg3 0022:01:00.2: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.812784] tg3 0022:01:00.2: Invalid PCI ROM header signature: expecting 0xaa55, [  789.813754] tgot 0xffffg3 0022:01:00.3:
 Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.813754] tg3 0022:01:00.3: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.814224] tg3 0022:01:00.3: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.814551] tg3 0022:01:00.3: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.814876] tg3 0022:01:00.3: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.814224] tg3 0022:01:00.3: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.814551] tg3 0022:01:00.3: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.814876] tg3 0022:01:00.3: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.911304] tg3 0022:01:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.913479] tg3 0022:01:00.1: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.911304] tg3 0022:01:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.913479] tg3 0022:01:00.1: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.916042] tg3 0022:01:00.2: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.916752] tg3 0022:01:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.917605] tg3 0022:01:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.917994] tg3 0022:01:00.3: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.916042] tg3 0022:01:00.2: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
[  789.916752] tg3 0022:01:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff
(repeats)

After this, the system will keep complaining the following when you try to run any command over the console and requires a hard-reset:

[ 1543.670704] EXT4-fs error (device sda2): ext4_find_entry:1494: inode #44302337: comm bash: reading directory lblock 0
[ 1543.670792] EXT4-fs error (device sda2): ext4_find_entry:1494: inode #44302337: comm bash: reading directory lblock 0
[ 1543.670883] EXT4-fs error (device sda2): ext4_find_entry:1494: inode #2: comm bash: reading directory lblock 0
[ 1543.671154] systemd-journald[1543]: Failed to write entry (9 items, 256 bytes), ignoring: Read-only file system
[ 1543.671287] systemd-journald[1543]: Failed to write entry (9 items, 320 bytes), ignoring: Read-only file system
ColinIanKing commented 1 year ago

I think it may be related to this issue: https://bugs.launchpad.net/ubuntu-kernel-tests/+bug/1859764

ColinIanKing commented 1 year ago

I've disabled ROM PCI reads, I'm going to release a new version to day with the fix.

ColinIanKing commented 1 year ago

Fix released.