ColinIanKing / stress-ng

This is the stress-ng upstream project git repository. stress-ng will stress test a computer system in various selectable ways. It was designed to exercise various physical subsystems of a computer as well as the various operating system kernel interfaces.
https://github.com/ColinIanKing/stress-ng
GNU General Public License v2.0

unshare test in ubuntu_stress_smoke_tests triggers "BUG: unable to handle page fault for address" on 5.13/5.14 #179

Closed Cypresslin closed 2 years ago

Cypresslin commented 2 years ago

Issue found on Intel node "vought" with:

The test hangs at the unshare test in ubuntu_stress_smoke_tests:
12:39:39 DEBUG| [stdout] udp RETURNED 0
12:39:39 DEBUG| [stdout] udp PASSED
12:39:39 DEBUG| [stdout] udp-flood STARTING
12:39:41 DEBUG| [stdout] udp-flood RETURNED 0
12:39:41 DEBUG| [stdout] udp-flood PASSED
12:39:41 DEBUG| [stdout] unshare STARTING
(Test hangs here)

And eventually the test will be killed because of the timeout setting.

stress-ng Test suite HEAD SHA1: b81116c or 48be8ff

Error can be found in dmesg:
[ 2371.109961] BUG: unable to handle page fault for address: 0000000000001cc8
[ 2371.110074] #PF: supervisor read access in kernel mode
[ 2371.114323] #PF: error_code(0x0000) - not-present page
[ 2371.119931] PGD 0 P4D 0
[ 2371.125257] Oops: 0000 [#1] SMP NOPTI
[ 2371.129247] CPU: 51 PID: 207256 Comm: stress-ng Tainted: P O 5.13.0-27-generic #29-Ubuntu
[ 2371.133203] Hardware name: Intel Corporation S2600WFD/S2600WFD, BIOS SE5C620.86B.0D.01.0395.022720191340 02/27/2019
[ 2371.135887] RIP: 0010:__next_zones_zonelist+0x6/0x50
[ 2371.138525] Code: d0 0f 4e d0 3d ff 03 00 00 7f 0d 48 63 d2 5d 48 8b 04 d5 60 e5 35 af c3 31 c0 5d c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 <8b> 4f 08 48 89 f8 48 89 e5 48 85 d2 75 10 eb 1d 48 63 49 50 48 0f
[ 2371.143813] RSP: 0018:ffffa9c8b399fac0 EFLAGS: 00010282
[ 2371.146078] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 2371.148293] RDX: ffff9c98e894ea98 RSI: 0000000000000002 RDI: 0000000000001cc0
[ 2371.150477] RBP: ffffa9c8b399fb28 R08: 0000000000000000 R09: 0000000000000000
[ 2371.152650] R10: 0000000000000002 R11: ffffd9bfbfcc5600 R12: 0000000000052cc0
[ 2371.154778] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000152cc0
[ 2371.156876] FS: 00007fcbd141d740(0000) GS:ffff9cc14ccc0000(0000) knlGS:0000000000000000
[ 2371.158936] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2371.160958] CR2: 0000000000001cc8 CR3: 000000059f292001 CR4: 00000000007706e0
[ 2371.162950] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2371.164888] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2371.166811] PKRU: 55555554
[ 2371.168694] Call Trace:
[ 2371.170544] ? alloc_pages+0x2f1/0x330
[ 2371.172386] kmalloc_large_node+0x45/0xb0
[ 2371.174222] kmalloc_node+0x276/0x300
[ 2371.176036] ? queue_delayed_work_on+0x39/0x60
[ 2371.177853] kvmalloc_node+0x5a/0x90
[ 2371.179622] expand_one_shrinker_info+0x82/0x190
[ 2371.181382] prealloc_shrinker+0x175/0x1d0
[ 2371.183091] alloc_super+0x2bf/0x330
[ 2371.184764] ? fput_sync+0x30/0x30
[ 2371.186384] sget_fc+0x74/0x2e0
[ 2371.187951] ? set_anon_super+0x50/0x50
[ 2371.189473] ? mqueue_create+0x20/0x20
[ 2371.190944] get_tree_keyed+0x34/0xd0
[ 2371.192363] mqueue_get_tree+0x1c/0x20
[ 2371.193734] vfs_get_tree+0x2a/0xc0
[ 2371.195105] fc_mount+0x13/0x50
[ 2371.196409] mq_init_ns+0x10a/0x1b0
[ 2371.197667] copy_ipcs+0x130/0x220
[ 2371.198899] create_new_namespaces+0xa6/0x2e0
[ 2371.200113] unshare_nsproxy_namespaces+0x5a/0xb0
[ 2371.201303] ksys_unshare+0x1db/0x3c0
[ 2371.202480] x64_sys_unshare+0x12/0x20
[ 2371.203649] do_syscall_64+0x61/0xb0
[ 2371.204804] ? exit_to_user_mode_loop+0xec/0x160
[ 2371.205966] ? exit_to_user_mode_prepare+0x37/0xb0
[ 2371.207102] ? syscall_exit_to_user_mode+0x27/0x50
[ 2371.208222] ? x64_sys_close+0x11/0x40
[ 2371.209336] ? do_syscall_64+0x6e/0xb0
[ 2371.210438] ? asm_exc_page_fault+0x8/0x30
[ 2371.211545] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 2371.212641] RIP: 0033:0x7fcbd1562c4b
[ 2371.213698] Code: 73 01 c3 48 8b 0d e5 e1 0e 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa b8 10 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b5 e1 0e 00 f7 d8 64 89 01 48
[ 2371.215851] RSP: 002b:00007ffc5d8eb878 EFLAGS: 00000246 ORIG_RAX: 0000000000000110
[ 2371.216846] RAX: ffffffffffffffda RBX: 00007ffc5d8eba20 RCX: 00007fcbd1562c4b
[ 2371.217830] RDX: 0000560296049862 RSI: 0000000008000000 RDI: 0000000008000000
[ 2371.218886] RBP: 00007ffc5d8eb8d0 R08: 00005602960234a2 R09: 00007fcbd141d740
[ 2371.219908] R10: 0000000000000000 R11: 0000000000000246 R12: 0000560296049862
[ 2371.220904] R13: 00007ffc5d8eba20 R14: 0000000000032980 R15: 00005602960397d5
[ 2371.221896] Modules linked in: unix_diag binfmt_misc uhid userio hci_vhci bluetooth ecdh_generic ecc vhost_net tap vhost_vsock vmw_vsock_virtio_transport_common vhost vhost_iotlb vsock zfs(PO) zunicode(PO) zzstd(O) zlua(O) zavl(PO) icp(PO) zcommon(PO) znvpair(PO) spl(O) dccp_ipv4 dccp atm wp512 streebog_generic sm3_generic sha3_generic rmd160 poly1305_generic poly1305_x86_64 nhpoly1305_avx2 nhpoly1305_sse2 nhpoly1305 libpoly1305 michael_mic md4 cmac ccm algif_rng twofish_generic twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common sm4_generic serpent_avx2 serpent_avx_x86_64 serpent_sse2_x86_64 serpent_generic fcrypt des3_ede_x86_64 des_generic libdes cast6_avx_x86_64 cast6_generic cast5_avx_x86_64 cast5_generic cast_common camellia_generic camellia_aesni_avx2 camellia_aesni_avx_x86_64 camellia_x86_64 blowfish_generic blowfish_x86_64 blowfish_common algif_skcipher algif_hash aegis128 aegis128_aesni algif_aead af_alg cfg80211 nls_iso8859_1 dm_multipath scsi_dh_rdac
[ 2371.221970] scsi_dh_emc scsi_dh_alua intel_rapl_msr intel_rapl_common isst_if_common dax_pmem_compat nd_pmem device_dax nd_btt dax_pmem_core skx_edac ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp joydev kvm_intel input_leds kvm rapl intel_cstate efi_pstore mei_me intel_pch_thermal mei ioatdma dca acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler nfit mac_hid sch_fq_codel msr ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor hid_generic usbhid hid raid6_pq libcrc32c raid1 raid0 multipath linear ast drm_vram_helper i2c_algo_bit drm_ttm_helper ttm drm_kms_helper crct10dif_pclmul syscopyarea crc32_pclmul sysfillrect ghash_clmulni_intel sysimgblt fb_sys_fops aesni_intel cec crypto_simd rc_core i40e cryptd drm i2c_i801 ahci i2c_smbus lpc_ich xhci_pci libahci xhci_pci_renesas wmi
[ 2371.242432] CR2: 0000000000001cc8
[ 2371.244111] ---[ end trace 9f58bca9f2f22e80 ]---
[ 2371.341907] RIP: 0010:__next_zones_zonelist+0x6/0x50
[ 2371.343167] Code: d0 0f 4e d0 3d ff 03 00 00 7f 0d 48 63 d2 5d 48 8b 04 d5 60 e5 35 af c3 31 c0 5d c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 <8b> 4f 08 48 89 f8 48 89 e5 48 85 d2 75 10 eb 1d 48 63 49 50 48 0f
[ 2371.345193] RSP: 0018:ffffa9c8b399fac0 EFLAGS: 00010282
[ 2371.346243] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 2371.347429] RDX: ffff9c98e894ea98 RSI: 0000000000000002 RDI: 0000000000001cc0
[ 2371.348434] RBP: ffffa9c8b399fb28 R08: 0000000000000000 R09: 0000000000000000
[ 2371.349439] R10: 0000000000000002 R11: ffffd9bfbfcc5600 R12: 0000000000052cc0
[ 2371.350564] R13: 0000000000000002 R14: 0000000000000001 R15: 0000000000152cc0
[ 2371.351706] FS: 00007fcbd141d740(0000) GS:ffff9cc14ccc0000(0000) knlGS:0000000000000000
[ 2371.352731] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2371.353757] CR2: 0000000000001cc8 CR3: 000000059f292001 CR4: 00000000007706e0
[ 2371.354981] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2371.356073] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2371.357105] PKRU: 55555554

https://bugs.launchpad.net/stress-ng/+bug/1959215
https://bugs.launchpad.net/stress-ng/+bug/1962551
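
A note on reading the trace above: the user-space register dump shows ORIG_RAX 0x110 (syscall 272, which is unshare on x86_64) with RDI set to 0000000008000000, and the call trace passes through copy_ipcs and mq_init_ns. The sketch below decodes that RDI value against the namespace flag constants from the Linux UAPI header linux/sched.h. The helper name decode_unshare_flags is hypothetical and is not part of stress-ng or any kernel tool; it is only meant to illustrate which namespace the failing call was probably unsharing.

```python
# Namespace flag values as defined in the Linux UAPI header <linux/sched.h>.
CLONE_FLAGS = {
    0x00000080: "CLONE_NEWTIME",
    0x00020000: "CLONE_NEWNS",
    0x02000000: "CLONE_NEWCGROUP",
    0x04000000: "CLONE_NEWUTS",
    0x08000000: "CLONE_NEWIPC",
    0x10000000: "CLONE_NEWUSER",
    0x20000000: "CLONE_NEWPID",
    0x40000000: "CLONE_NEWNET",
}


def decode_unshare_flags(value):
    """Hypothetical helper: name the namespace flags set in an unshare() argument.

    Bits that are not in CLONE_FLAGS are reported as one trailing hex string.
    """
    names = []
    remaining = value
    for bit, name in sorted(CLONE_FLAGS.items()):
        if value & bit:
            names.append(name)
            remaining &= ~bit
    if remaining:
        names.append(hex(remaining))
    return names


# RDI from the user-space register dump in the oops (first syscall argument).
rdi = int("0000000008000000", 16)
print(decode_unshare_flags(rdi))  # prints ['CLONE_NEWIPC']
```

If that reading is right, the fault was hit while creating a new IPC namespace, which matches the mqueue mount (mq_init_ns, mqueue_get_tree) seen in the call trace.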

ColinIanKing commented 2 years ago

Isn't this a kernel bug and not a stress-ng issue?

Cypresslin commented 2 years ago

Oh, OK. I initially filed this against stress-ng because it only affects this hardware, but yes, this could be a hardware-specific kernel bug. Thanks for pointing this out, I will do more tests next.

ColinIanKing commented 2 years ago

I suggest seeing if this occurs on older kernels and doing a coarse bisect first.

Cypresslin commented 2 years ago

OK! Will arrange tests between SRU cycles.

ColinIanKing commented 2 years ago

Any luck in bisecting this?

Cypresslin commented 2 years ago

Hi Colin, I have tested this Intel node with 5.17.0-051700rc8 and the issue still exists. The next plan is to test it with stress-ng V0.14.0 plus a newer mainline kernel. Thanks for the reminder!

Cypresslin commented 2 years ago

Hi Colin, A coarse bisect with the mainline kernel shows this issue can be reproduced with 5.17.5-051705-generic, but not 5.18.0-051800rc1-generic. Tested with stress-ng V0.13.12.

I guess the next step is to do a git bisect?

ColinIanKing commented 2 years ago

Hi, yes, a bisect on the kernel is the way forward on that.
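
For readers unfamiliar with the procedure, the bisect discussed above is a binary search over an ordered list of builds. Because the older kernel is the broken one here and the newer one works, the search looks for the first build where the bug no longer reproduces. The sketch below is a minimal illustration of that idea only: the function first_fixed, the intermediate candidate names, and the reproduces predicate are all hypothetical, and a real run would boot each kernel and run the unshare stressor instead of consulting a set.

```python
def first_fixed(versions, reproduces):
    """Return the first entry of `versions` where the bug no longer reproduces.

    Assumes versions is ordered oldest to newest, versions[0] reproduces the
    bug, versions[-1] does not, and the bug stays fixed once it is fixed.
    """
    lo, hi = 0, len(versions) - 1  # lo: known broken, hi: known fixed
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if reproduces(versions[mid]):
            lo = mid  # still broken, so the fix is later
        else:
            hi = mid  # already fixed, so the fix is here or earlier
    return versions[hi]


# Endpoints are the two builds reported in this thread; the candidates in
# between are hypothetical placeholders for intermediate builds or commits.
candidates = [
    "5.17.5-051705-generic",
    "candidate-a",
    "candidate-b",
    "candidate-c",
    "5.18.0-051800rc1-generic",
]

# Hypothetical test outcome, standing in for booting and testing each build.
broken_builds = {"5.17.5-051705-generic", "candidate-a"}

print(first_fixed(candidates, lambda v: v in broken_builds))  # prints candidate-b
```

Each step halves the remaining range, so the number of kernels that need to be built and booted grows only logarithmically with the number of candidates.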

Cypresslin commented 2 years ago

OK! Since this issue is kernel-related, I will close this one here. Thank you!