dm-vdo / kvdo

A kernel module which provides a pool of deduplicated and/or compressed block storage.
GNU General Public License v2.0

vdo status command may trigger kernel panic #40

Closed hedongzhang closed 3 years ago

hedongzhang commented 3 years ago

Description

| Item   | Version                     |
|--------|-----------------------------|
| OS     | rhel-7.8                    |
| kernel | 3.10.0-1127.19.1.el7.x86_64 |
| kvdo   | 6.1.1.8                     |
| vdo    | 6.1.1.125                   |

While stress-testing VDO with FIO, I ran a monitoring script that collected information through "vdo status" every minute. After a period of time, the kernel panicked. I then found a commit record saying this bug was fixed in version 6.1.1.8 ("Fixed General Protection Fault unlocking the UDS callback mutex"), but the problem persisted after I upgraded to 6.1.1.8.

[84607.724514] BUG: unable to handle kernel paging request at 000000000001b8f0
[84607.726005] IP: [<ffffffff9d517fd0>] native_queued_spin_lock_slowpath+0x110/0x200
[84607.727588] PGD 1ecad06067 PUD 1c63a5c067 PMD 0
[84607.728867] Oops: 0002 [#1] SMP
[84607.729571] Modules linked in: mpt3sas mptctl mptbase nvmet_rdma(OE) nvmet(OE) dell_rbu scsi_transport_iscsi drbg ansi_cprng bonding usdm_drv(OE) rdma_ucm(OE) ib_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_fpga_tools(OE) kvdo(OE) sha512_ssse3 sha512_generic qat_api(OE) uds(E) mlx4_ib(OE) mlx4_en(OE) mlx4_core(OE) iTCO_wdt iTCO_vendor_support dell_smbios dell_wmi_descriptor dcdbas cas_cache(OE) cas_disk(OE) skx_edac intel_powerclamp coretemp intel_rapl iosf_mbi kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr sg i2c_i801 lpc_ich mei_me mei wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad knem(OE) binfmt_misc ip_tables xfs libcrc32c mlx5_ib(OE) ib_uverbs(OE) ib_core(OE) mgag200 drm_kms_helper syscopyarea
[84607.736604]  sysfillrect sysimgblt fb_sys_fops ttm drm crc32c_intel mlx5_core(OE) mlxfw(OE) vfio_mdev(OE) igb vfio_iommu_type1 qat_c62x(OE) vfio intel_qat(OE) authenc mdev(OE) uio devlink nvme(OE) nvme_core(OE) mlx_compat(OE) ptp pps_core dca i2c_algo_bit drm_panel_orientation_quirks nfit libnvdimm sd_mod crc_t10dif crct10dif_generic crct10dif_pclmul crct10dif_common ahci libahci libata mpt2sas raid_class scsi_transport_sas megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
[84607.743088] CPU: 1 PID: 60128 Comm: kvdo69:callback Kdump: loaded Tainted: G           OE  ------------   3.10.0-1127.19.1.el7.x86_64 #1
[84607.745664] Hardware name: Dell Inc. PowerEdge R740xd/06WXJT, BIOS 2.11.2 004/21/2021
[84607.747010] task: ffff9e174048c1c0 ti: ffff9e1703648000 task.ti: ffff9e1703648000
[84607.748373] RIP: 0010:[<ffffffff9d517fd0>]  [<ffffffff9d517fd0>] native_queued_spin_lock_slowpath+0x110/0x200
[84607.749576] RSP: 0018:ffff9e170364bd78  EFLAGS: 00010206
[84607.750785] RAX: 0000000000001fff RBX: ffff9e11e6cebba0 RCX: 0000000000090000
[84607.751774] RDX: 000000000001b8f0 RSI: 00000000ffff9e12 RDI: ffff9e11e6cebb9c
[84607.752726] RBP: ffff9e170364bd78 R08: ffff9e17eba5b8c0 R09: 0000000000000000
[84607.753713] R10: 000000000406c000 R11: 0000000000000800 R12: ffff9e11e6cebb9c
[84607.754717] R13: ffff9e170364bda8 R14: 0000000000000000 R15: 0000000000000000
[84607.755895] FS:  0000000000000000(0000) GS:ffff9e17eba40000(0000) knlGS:0000000000000000
[84607.757233] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[84607.758664] CR2: 000000000001b8f0 CR3: 0000001c8e3aa000 CR4: 00000000007607e0
[84607.759984] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[84607.761290] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[84607.762548] PKRU: 00000000
[84607.763859] Call Trace:
[84607.765226]  [<ffffffff9db7a024>] queued_spin_lock_slowpath+0xb/0xf
[84607.766552]  [<ffffffff9db886d0>] _raw_spin_lock+0x20/0x30
[84607.767547]  [<ffffffff9db84b4a>] __mutex_unlock_slowpath+0x4a/0x90
[84607.768943]  [<ffffffff9db83fcb>] mutex_unlock+0x1b/0x20
[84607.770367]  [<ffffffffc0cf360c>] enterCallbackStage+0x6c/0xc0 [uds]
[84607.771631]  [<ffffffffc0cea908>] handleCallbacks+0x108/0x120 [uds]
[84607.773151]  [<ffffffffc0cfb6a7>] ? eventCountWait+0xb7/0xe0 [uds]
[84607.774639]  [<ffffffffc0cde12a>] ? pollQueues+0x3a/0x50 [uds]
[84607.775717]  [<ffffffffc0cde29c>] requestQueueWorker+0x15c/0x1b0 [uds]
[84607.776639]  [<ffffffffc0cdbcb0>] ? lookupThread+0x60/0x60 [uds]
[84607.777539]  [<ffffffffc0cdbd4a>] threadStarter+0x9a/0xd0 [uds]
[84607.778465]  [<ffffffff9d4c6691>] kthread+0xd1/0xe0
[84607.779396]  [<ffffffff9d4c65c0>] ? insert_kthread_work+0x40/0x40
[84607.780312]  [<ffffffff9db92d1d>] ret_from_fork_nospec_begin+0x7/0x21
[84607.781457]  [<ffffffff9d4c65c0>] ? insert_kthread_work+0x40/0x40
[84607.782783] Code: 87 47 02 c1 e0 10 45 31 c9 85 c0 74 44 48 89 c2 c1 e8 13 48 c1 ea 0d 48 98 83 e2 30 48 81 c2 c0 b8 01 00 48 03 14 c5 a0 10 15 9e <4c> 89 02 41 8b 40 08 85 c0 75 0f 0f 1f 44 00 00 f3 90 41 8b 40
[84607.785653] RIP  [<ffffffff9d517fd0>] native_queued_spin_lock_slowpath+0x110/0x200
[84607.786743]  RSP <ffff9e170364bd78>
[84607.788095] CR2: 000000000001b8f0
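
For readers decoding the trace, the value after `Oops:` is the x86 page-fault error code. The sketch below decodes its low bits using the standard x86 bit meanings; the function name `decode_oops_code` and the dictionary keys are made up for illustration and do not come from kvdo or the kernel.

```python
# Standard x86 page-fault error code bits, as printed after "Oops:".
PF_PROT = 0x01   # 0: page not present, 1: protection violation
PF_WRITE = 0x02  # 0: read access, 1: write access
PF_USER = 0x04   # 0: fault in kernel mode, 1: fault in user mode
PF_RSVD = 0x08   # reserved bit was set in a page-table entry
PF_INSTR = 0x10  # fault was caused by an instruction fetch


def decode_oops_code(code_hex):
    """Decode the hex error code from an 'Oops: XXXX' line (hypothetical helper)."""
    code = int(code_hex, 16)
    return {
        "page_present": bool(code & PF_PROT),
        "write": bool(code & PF_WRITE),
        "user_mode": bool(code & PF_USER),
        "reserved_bit": bool(code & PF_RSVD),
        "instruction_fetch": bool(code & PF_INSTR),
    }


# The code reported in this trace.
print(decode_oops_code("0002"))
```

For `0002` this reports a write, in kernel mode, to a page that is not present, which is consistent with the "unable to handle kernel paging request" message and the low address in CR2.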

Steps to reproduce

  1. Use FIO (randwrite) to put load on the VDO volume.
  2. Run vdo status or cat /proc/vdo/$volname/kernel_stats once per second.
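
The polling half of the steps above can be sketched as a small Python loop. This is only an illustration, not the monitoring script used in this report: the function name `poll_stats` and its parameters are hypothetical, and the FIO load has to be started separately.

```python
import time


def poll_stats(path, iterations, interval_s=1.0):
    """Read the file at `path` `iterations` times, pausing between reads.

    Returns the number of bytes read on each poll (hypothetical helper).
    """
    sizes = []
    for i in range(iterations):
        # Each read of the stats file asks the driver for fresh statistics.
        with open(path, "rb") as stats_file:
            sizes.append(len(stats_file.read()))
        # Sleep between polls, but not after the last one.
        if i + 1 < iterations:
            time.sleep(interval_s)
    return sizes
```

With a volume name substituted for `$volname`, a call such as `poll_stats("/proc/vdo/$volname/kernel_stats", 3600)` would read the statistics once per second for about an hour while FIO runs.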
raeburn commented 3 years ago

6.1.1.8 is still a fairly old version of the driver, and it turned out that the problem persisted after that fix (see also https://bugzilla.redhat.com/show_bug.cgi?id=1602151). Please try updating to at least 6.1.1.117, where the second fix was installed, but ideally to the newest version available for RHEL 7, if you can. Hopefully that will fix the problem for good.
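
A quick way to check an installed kvdo version against that threshold is to compare the dotted components numerically, since a plain string comparison would wrongly rank "6.1.1.8" above "6.1.1.117". The sketch below is illustrative only; `parse_version` and `has_second_fix` are hypothetical names, and it assumes version strings made of dot-separated integers.

```python
def parse_version(text):
    """Turn a dotted version such as '6.1.1.117' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# Version named in this thread as containing the second fix.
SECOND_FIX = parse_version("6.1.1.117")


def has_second_fix(kvdo_version):
    """Return True if `kvdo_version` is at least 6.1.1.117 (hypothetical helper)."""
    return parse_version(kvdo_version) >= SECOND_FIX


print(has_second_fix("6.1.1.8"))    # the version from the original report
print(has_second_fix("6.1.1.117"))
```

Tuples compare element by element, so 8 is correctly treated as smaller than 117 in the last position.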

hedongzhang commented 3 years ago

@raeburn Thank you. I solved this problem by updating to the newest version available for RHEL 7.
