Open-CAS / open-cas-linux

Open CAS Linux
https://open-cas.com
BSD 3-Clause "New" or "Revised" License

Kernel crash when removing a core #1454

Status: Open. inspur-wyq opened this issue 1 year ago

inspur-wyq commented 1 year ago

Description

The kernel crashes when a core is removed from a cache.

open-cas-linux-21.06.5.0555.release

[ 6365.608652] [Open-CAS] Adding device /dev/disk/by-id/dm-name-ceph--8c95bf7f--b3a6--4457--89aa--0c351eabcf26-osd--block--446ee347--3361--4c97--9d8e--b5135487b9bb as core core3 to cache cache1
[ 6370.921584] VFS: Open an exclusive opened block device for write sdd. current [129139 sgdisk]. parent [5935 rook]
[ 6370.953674] VFS: Open an exclusive opened block device for write dm-2. current [129143 sgdisk]. parent [5935 rook]
[ 6371.009815] VFS: Open an exclusive opened block device for write sdb. current [129157 sgdisk]. parent [5935 rook]
[ 6371.043087] VFS: Open an exclusive opened block device for write dm-1. current [129163 sgdisk]. parent [5935 rook]
[ 6371.099131] VFS: Open an exclusive opened block device for write sde. current [129179 sgdisk]. parent [5935 rook]
[ 6371.164891] VFS: Open an exclusive opened block device for write sdc. current [129192 sgdisk]. parent [5935 rook]
[ 6371.198044] VFS: Open an exclusive opened block device for write dm-0. current [129201 sgdisk]. parent [5935 rook]
[ 6371.252286] VFS: Open an exclusive opened block device for write sda. current [129214 sgdisk]. parent [8134 rook]
[ 6449.304742] VFS: Open an exclusive opened block device for write sdd. current [130576 sgdisk]. parent [11689 rook]
[ 6449.333058] VFS: Open an exclusive opened block device for write dm-2. current [130580 sgdisk]. parent [11689 rook]
[ 6449.379080] VFS: Open an exclusive opened block device for write sdb. current [130593 sgdisk]. parent [11689 rook]
[ 6449.408518] VFS: Open an exclusive opened block device for write dm-1. current [130600 sgdisk]. parent [8132 rook]
[ 6449.461321] VFS: Open an exclusive opened block device for write sde. current [130614 sgdisk]. parent [8132 rook]
[ 6449.527263] VFS: Open an exclusive opened block device for write sdc. current [130631 sgdisk]. parent [5935 rook]
[ 6449.556086] VFS: Open an exclusive opened block device for write dm-0. current [130637 sgdisk]. parent [5935 rook]
[ 6449.576621] VFS: Open an exclusive opened block device for write sda. current [130641 sgdisk]. parent [5935 rook]
[ 6449.856675] VFS: Open an write opened block device exclusively cas1-3. current [130648 ceph-volume]. parent [5935 rook]
[ 6471.823245] watchdog: BUG: soft lockup - CPU#44 stuck for 22s! [cas_mngt_1:110582]
[ 6471.824117] Modules linked in: cas_cache(OE) cas_disk(OE) ipt_rpfilter xt_set xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net ip_set ipip tunnel4 ip_tunnel veth xt_statistic xt_nat xt_addrtype ip6table_nat ip6_tables iptable_mangle xt_physdev xt_conntrack xt_comment xt_mark iptable_filter nf_conntrack_netlink nfnetlink sch_ingress iptable_nat xt_MASQUERADE ip_tables rbd ceph libceph dns_resolver overlay openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c 8021q garp mrp bonding vfat fat dm_multipath ipmi_ssif aes_ce_blk aes_ce_cipher ghash_ce sha1_ce ses enclosure ngbe sg acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler sch_fq_codel br_netfilter bridge stp llc dm_mod fuse ext4 mbcache jbd2 sd_mod t10_pi ahci sha2_ce libahci sha256_arm64 igb libata ixgbe ast smartpqi drm_vram_helper drm_ttm_helper scsi_transport_sas ttm mdio aes_neon_bs aes_neon_blk crypto_simd cryptd [last unloaded: cas_disk]
[ 6471.824255] CPU: 44 PID: 110582 Comm: cas_mngt_1 Kdump: loaded Tainted: G           OE     5.10.0-136.37.0.113.oe2203sp1.aarch64 #1
[ 6471.824258] Hardware name: Inspur     CS5260F     /YZMB-02006-101   , BIOS 4.0.8 12/22/21 16:05:48
[ 6471.824263] pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
[ 6471.824278] pc : downgrade_write+0x2d0/0x350
[ 6471.824343] lr : ocf_hb_id_prot_unlock_wr+0x64/0x90 [cas_cache]
[ 6471.824346] sp : ffff8000268dbcc0
[ 6471.824349] x29: ffff8000268dbcc0 x28: 000000000dd3c376
[ 6471.824353] x27: ffff8000282b6c80 x26: ffffbcc7d75d5ab8
[ 6471.824358] x25: 0000000000000003 x24: ffff8000283f7040
[ 6471.824362] x23: 000000000535b974 x22: 0000000000000001
[ 6471.824366] x21: ffff800aff489d00 x20: ffff8000282b6c80
[ 6471.824370] x19: 00000000000000e8 x18: 0000000000000000
[ 6471.824373] x17: 0000000000000000 x16: ffffbcc84677f060
[ 6471.824377] x15: 0000000000000000 x14: 0000000000000000
[ 6471.824380] x13: 0000000000000000 x12: 0000000000000000
[ 6471.824384] x11: 0000001587515ebc x10: 0000000000000000
[ 6471.824387] x9 : ffffbcc7d75baf24 x8 : 0000000000000000
[ 6471.824391] x7 : ffff1e063ff82740 x6 : ffffbcc847ce7740
[ 6471.824394] x5 : ffff8000268dbc80 x4 : 0000000000000000
[ 6471.824398] x3 : 0000000000000000 x2 : ffffffffffffff00
[ 6471.824401] x1 : 0000000000000000 x0 : ffff8000282b6d40
[ 6471.824405] Call trace:
[ 6471.824410]  downgrade_write+0x2d0/0x350
[ 6471.824439]  cache_mngt_core_deinit_attached_meta+0x1f4/0x284 [cas_cache]
[ 6471.824464]  _ocf_mngt_cache_remove_core_attached+0x40/0x9c [cas_cache]
[ 6471.824486]  _ocf_pipeline_run_step+0x90/0x100 [cas_cache]
[ 6471.824507]  ocf_io_handle+0x30/0x54 [cas_cache]
[ 6471.824531]  ocf_queue_run_single+0x44/0x5c [cas_cache]
[ 6471.824552]  ocf_queue_run+0x3c/0x64 [cas_cache]
[ 6471.824576]  _cas_io_queue_thread+0x74/0x110 [cas_cache]
[ 6471.824582]  kthread+0x108/0x13c
[ 6471.824588]  ret_from_fork+0x10/0x18
[ 6471.824593] Kernel panic - not syncing: softlockup: hung tasks
[ 6471.825279] CPU: 44 PID: 110582 Comm: cas_mngt_1 Kdump: loaded Tainted: G           OEL    5.10.0-136.37.0.113.oe2203sp1.aarch64 #1
[ 6471.826392] Hardware name: Inspur     CS5260F     /YZMB-02006-101   , BIOS 4.0.8 12/22/21 16:05:48
[ 6471.827260] Call trace:
[ 6471.827541]  dump_backtrace+0x0/0x1e4
[ 6471.827912]  show_stack+0x20/0x2c
[ 6471.828288]  dump_stack+0xd8/0x140
[ 6471.828668]  panic+0x168/0x390
[ 6471.829001]  watchdog_timer_fn+0x230/0x290
[ 6471.829499]  __run_hrtimer+0x98/0x2a0
[ 6471.829942]  __hrtimer_run_queues+0xb0/0x134
[ 6471.830415]  hrtimer_interrupt+0x13c/0x3c0
[ 6471.830846]  arch_timer_handler_phys+0x3c/0x50
[ 6471.831307]  handle_percpu_devid_irq+0x90/0x1f4
[ 6471.831769]  __handle_domain_irq+0x84/0xf0
[ 6471.832365]  gic_handle_irq+0x90/0x2b4
[ 6471.832763]  el1_irq+0xb8/0x140
[ 6471.833107]  downgrade_write+0x2d0/0x350
[ 6471.833552]  cache_mngt_core_deinit_attached_meta+0x1f4/0x284 [cas_cache]
[ 6471.834325]  _ocf_mngt_cache_remove_core_attached+0x40/0x9c [cas_cache]
[ 6471.835080]  _ocf_pipeline_run_step+0x90/0x100 [cas_cache]
[ 6471.835715]  ocf_io_handle+0x30/0x54 [cas_cache]
[ 6471.836222]  ocf_queue_run_single+0x44/0x5c [cas_cache]
[ 6471.836770]  ocf_queue_run+0x3c/0x64 [cas_cache]
[ 6471.837282]  _cas_io_queue_thread+0x74/0x110 [cas_cache]
[ 6471.837916]  kthread+0x108/0x13c
[ 6471.838280]  ret_from_fork+0x10/0x18
[ 6471.838678] SMP: stopping secondary CPUs
[ 6471.839165] Kernel Offset: 0x3cc836650000 from 0xffff800010000000
[ 6471.840730] PHYS_OFFSET: 0xffffe355c0000000
[ 6471.841909] CPU features: 0x0000,00000002,61800008
[ 6471.843346] Memory Limit: none
[ 6471.848292] Starting crashdump kernel...
[ 6471.848823] Bye!

open-cas-linux-22.06.2.0723.release

[ 4860.012121] watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [cas_mngt_1:147513]
[ 4860.013055] Modules linked in: cas_cache(OE) cas_disk(OE) ipt_rpfilter xt_set xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net ip_set ipip tunnel4 ip_tunnel veth xt_statistic xt_nat xt_addrtype ip6table_nat ip6_tables iptable_mangle xt_physdev xt_conntrack xt_comment xt_mark iptable_filter nf_conntrack_netlink nfnetlink iptable_nat xt_MASQUERADE ip_tables rbd ceph libceph dns_resolver overlay sch_ingress openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c 8021q garp mrp bonding vfat fat dm_multipath ipmi_ssif aes_ce_blk aes_ce_cipher ghash_ce sha1_ce ses enclosure sg ngbe acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler sch_fq_codel br_netfilter bridge stp llc dm_mod fuse ext4 mbcache jbd2 sd_mod t10_pi ahci libahci sha2_ce sha256_arm64 igb libata ixgbe ast smartpqi drm_vram_helper drm_ttm_helper ttm scsi_transport_sas mdio aes_neon_bs aes_neon_blk crypto_simd cryptd [last unloaded: cas_disk]
[ 4860.013183] CPU: 1 PID: 147513 Comm: cas_mngt_1 Kdump: loaded Tainted: G        W  OE     5.10.0-136.37.0.113.oe2203sp1.aarch64 #1
[ 4860.013186] Hardware name: Inspur     CS5260F     /YZMB-02006-101   , BIOS 4.0.8 12/22/21 16:05:48
[ 4860.013189] pstate: 20000005 (nzCv daif -PAN -UAO -TCO BTYPE=--)
[ 4860.013198] pc : down_write_killable+0x198/0x270
[ 4860.013236] lr : ocf_hb_id_prot_lock_wr+0x2c/0x60 [cas_cache]
[ 4860.013238] sp : ffff800025643cb0
[ 4860.013240] x29: ffff800025643cb0 x28: 000000000dd3c376
[ 4860.013244] x27: ffff8000270b6c80 x26: ffffb8da72ba2c90
[ 4860.013248] x25: 0000000000000000 x24: ffff8000271f7080
[ 4860.013252] x23: 000000000d8b2c39 x22: 0000000000000003
[ 4860.013256] x21: 00000000374f0dd6 x20: 000000000d8b2c39
[ 4860.013259] x19: ffff8000270b6c80 x18: 0000000000000000
[ 4860.013263] x17: 0000000000000000 x16: ffffb8da9f4c8f60
[ 4860.013266] x15: 0000000000000000 x14: 0000000000000000
[ 4860.013270] x13: 0000000000000000 x12: ffff79d03fffa2c0
[ 4860.013273] x11: 00000188a35bc70e x10: 0000000000000eb0
[ 4860.013277] x9 : ffffb8da72b8987c x8 : ffff79c459cbc8d0
[ 4860.013280] x7 : 0000000000000024 x6 : ffff800010935478
[ 4860.013283] x5 : ffffffffffffffff x4 : 0000000000000100
[ 4860.013287] x3 : ffff8000270b6c80 x2 : 000000000d8b2c39
[ 4860.013290] x1 : 0000000000000000 x0 : 0000000000000000
[ 4860.013294] Call trace:
[ 4860.013298]  down_write_killable+0x198/0x270
[ 4860.013322]  cache_mngt_core_deinit_attached_meta+0x98/0x290 [cas_cache]
[ 4860.013345]  _ocf_mngt_cache_remove_core_mapping+0x40/0x90 [cas_cache]
[ 4860.013369]  _ocf_pipeline_run_step+0x11c/0x1b0 [cas_cache]
[ 4860.013392]  ocf_io_handle+0x30/0x54 [cas_cache]
[ 4860.013414]  ocf_queue_run_single+0x44/0x5c [cas_cache]
[ 4860.013435]  ocf_queue_run+0x3c/0x64 [cas_cache]
[ 4860.013459]  _cas_io_queue_thread+0x74/0x110 [cas_cache]
[ 4860.013464]  kthread+0x108/0x13c
[ 4860.013469]  ret_from_fork+0x10/0x18
[ 4860.013472] Kernel panic - not syncing: softlockup: hung tasks
[ 4860.014127] CPU: 1 PID: 147513 Comm: cas_mngt_1 Kdump: loaded Tainted: G        W  OEL    5.10.0-136.37.0.113.oe2203sp1.aarch64 #1
[ 4860.015225] Hardware name: Inspur     CS5260F     /YZMB-02006-101   , BIOS 4.0.8 12/22/21 16:05:48
[ 4860.016098] Call trace:
[ 4860.016384]  dump_backtrace+0x0/0x1e4
[ 4860.016775]  show_stack+0x20/0x2c
[ 4860.017129]  dump_stack+0xd8/0x140
[ 4860.017481]  panic+0x168/0x390
[ 4860.017838]  watchdog_timer_fn+0x230/0x290
[ 4860.018255]  __run_hrtimer+0x98/0x2a0
[ 4860.018632]  __hrtimer_run_queues+0xb0/0x134
[ 4860.019084]  hrtimer_interrupt+0x13c/0x3c0
[ 4860.019520]  arch_timer_handler_phys+0x3c/0x50
[ 4860.019975]  handle_percpu_devid_irq+0x90/0x1f4
[ 4860.020421]  __handle_domain_irq+0x84/0xf0
[ 4860.020835]  gic_handle_irq+0x90/0x2b4
[ 4860.021231]  el1_irq+0xb8/0x140
[ 4860.021578]  down_write_killable+0x198/0x270
[ 4860.022034]  cache_mngt_core_deinit_attached_meta+0x98/0x290 [cas_cache]
[ 4860.022788]  _ocf_mngt_cache_remove_core_mapping+0x40/0x90 [cas_cache]
[ 4860.023437]  _ocf_pipeline_run_step+0x11c/0x1b0 [cas_cache]
[ 4860.024051]  ocf_io_handle+0x30/0x54 [cas_cache]
[ 4860.024546]  ocf_queue_run_single+0x44/0x5c [cas_cache]
[ 4860.025098]  ocf_queue_run+0x3c/0x64 [cas_cache]
[ 4860.025615]  _cas_io_queue_thread+0x74/0x110 [cas_cache]
[ 4860.026136]  kthread+0x108/0x13c
[ 4860.026486]  ret_from_fork+0x10/0x18
[ 4860.026867] SMP: stopping secondary CPUs
[ 4860.027330] Kernel Offset: 0x38da8e7d0000 from 0xffff800010000000
[ 4860.029006] PHYS_OFFSET: 0xffff873bc0000000
[ 4860.030250] CPU features: 0x0000,00000002,61800008
[ 4860.031900] Memory Limit: none
[ 4860.036446] Starting crashdump kernel...
[ 4860.037002] Bye!

open-cas-linux-21.06.3.0551.release: this case is different. The cache metadata was old, so loading the cache failed, and the kernel then crashed in cache_mngt_core_remove_from_cache.

[12618.084423] cache1: Loading cache state...
[12618.125080] cache1: Loading Part config WARNING, invalid checksum
[12618.125298] cache1: Loading Core config WARNING, invalid checksum
[12618.125366] cache1: Loading Core UUID WARNING, invalid checksum
[12618.126362] cache1: ERROR: Cache device size mismatch!
[12618.150001] Unable to handle kernel paging request at virtual address 00000000001420ac
[12618.151013] Mem abort info:
[12618.151393]   ESR = 0x96000004
[12618.151769]   EC = 0x25: DABT (current EL), IL = 32 bits
[12618.152340]   SET = 0, FnV = 0
[12618.152728]   EA = 0, S1PTW = 0
[12618.153128] Data abort info:
[12618.153539]   ISV = 0, ISS = 0x00000004
[12618.154000]   CM = 0, WnR = 0
[12618.154377] user pgtable: 4k pages, 48-bit VAs, pgdp=0000012008c38000
[12618.155015] [00000000001420ac] pgd=0000000000000000, p4d=0000000000000000
[12618.155689] Internal error: Oops: 96000004 [#1] SMP
[12618.156189] Modules linked in: cas_cache(OE) cas_disk(OE) ipt_rpfilter xt_set xt_multiport iptable_raw ip_set_hash_ip ip_set_hash_net ip_set ipip tunnel4 ip_tunnel veth xt_statistic xt_nat xt_addrtype ip6table_nat ip6_tables iptable_mangle xt_physdev xt_conntrack xt_comment xt_mark iptable_filter nf_conntrack_netlink nfnetlink sch_ingress iptable_nat xt_MASQUERADE ip_tables rbd ceph libceph dns_resolver overlay openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c 8021q garp mrp bonding vfat fat dm_multipath ipmi_ssif aes_ce_blk aes_ce_cipher ghash_ce sha1_ce ses enclosure sg ngbe acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler sch_fq_codel br_netfilter bridge stp llc dm_mod fuse ext4 mbcache jbd2 sd_mod t10_pi ahci libahci sha2_ce sha256_arm64 igb libata ixgbe ast smartpqi drm_vram_helper drm_ttm_helper scsi_transport_sas ttm mdio aes_neon_bs aes_neon_blk crypto_simd cryptd [last unloaded: cas_disk]
[12618.164112] CPU: 21 PID: 202913 Comm: casadm Kdump: loaded Tainted: G           OE     5.10.0-136.37.0.113.oe2203sp1.aarch64 #1
[12618.165226] Hardware name: Inspur     CS5260F     /YZMB-02006-101   , BIOS 4.0.8 12/22/21 16:05:48
[12618.166121] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
[12618.166784] pc : cache_mngt_core_remove_from_cache+0x4c/0xa0 [cas_cache]
[12618.167509] lr : cache_mngt_core_remove_from_cache+0x34/0xa0 [cas_cache]
[12618.168199] sp : ffff8000262fbb60
[12618.168548] x29: ffff8000262fbb60 x28: ffffc060b6f9c7c0
[12618.169157] x27: ffff16efe06bdf88 x26: 0000000000000001
[12618.169817] x25: ffff8000262fbd10 x24: ffffc060c97c3798
[12618.170383] x23: ffff8000299a7140 x22: 0000000000000000
[12618.170931] x21: ffff8000299a5000 x20: 0000000000000000
[12618.171473] x19: ffff8000299a7140 x18: 0000000000000000
[12618.172017] x17: 0000000000000000 x16: ffffc060c8a3bc84
[12618.172571] x15: 0000000000000000 x14: 0000000000000000
[12618.173117] x13: 0000000000000000 x12: 0000000000000018
[12618.173664] x11: 00008250d3c1dd20 x10: 0000000000000eb0
[12618.174224] x9 : ffffc060b6f41f64 x8 : ffff16efcd003590
[12618.174774] x7 : 7fffffffffffffff x6 : 000000966e355c45
[12618.175324] x5 : 00000000700f6630 x4 : ffff16efe06bdfa0
[12618.175887] x3 : 0000000000000000 x2 : ffffc060c9d38340
[12618.176435] x1 : 0000000000142000 x0 : 0000000000000000
[12618.176989] Call trace:
[12618.177318]  cache_mngt_core_remove_from_cache+0x4c/0xa0 [cas_cache]
[12618.178003]  _ocf_mngt_cache_stop_remove_cores+0x80/0xc4 [cas_cache]
[12618.178661]  ocf_mngt_cache_stop_detached+0x38/0xf0 [cas_cache]
[12618.179411]  ocf_mngt_cache_stop+0x120/0x140 [cas_cache]
[12618.180006]  cache_mngt_init_instance+0x19c/0x370 [cas_cache]
[12618.180622]  cas_service_ioctl_ctrl+0x2b30/0x49d0 [cas_cache]
[12618.181253]  __arm64_sys_ioctl+0xb0/0xf4
[12618.181688]  invoke_syscall+0x50/0x11c
[12618.182094]  el0_svc_common.constprop.0+0x158/0x164
[12618.182606]  do_el0_svc+0x2c/0x9c
[12618.183001]  el0_svc+0x20/0x30
[12618.183356]  el0_sync_handler+0xb0/0xb4
[12618.183781]  el0_sync+0x160/0x180
[12618.184137] Code: 121d7801 39047261 37080180 91450a81 (79415820)
[12618.184765] ---[ end trace c4448c973eb33765 ]---
[12618.185262] Kernel panic - not syncing: Oops: Fatal exception
[12618.185844] SMP: stopping secondary CPUs
[12618.186331] Kernel Offset: 0x4060b86a0000 from 0xffff800010000000
[12618.187788] PHYS_OFFSET: 0xffffea3040000000
[12618.189136] CPU features: 0x0000,00000002,61800008
[12618.190602] Memory Limit: none
[12618.195577] Starting crashdump kernel...
[12618.196063] Bye!
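
The three traces above are easier to compare once the stack frames are pulled out of the raw dmesg text. The following is a minimal Python sketch, not part of Open CAS: the function name `first_call_trace` and the regular expression are illustrative assumptions, and the sample lines are copied from the 21.06.5 log above.

```python
import re

# Matches a frame line such as
# "[ 6471.824552]  ocf_queue_run+0x3c/0x64 [cas_cache]".
# The trailing "[module]" part is optional (core kernel symbols have none).
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(\w[\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?\s*$"
)

# Sample lines copied from the 21.06.5 log in this report.
SAMPLE = """\
[ 6471.824405] Call trace:
[ 6471.824410]  downgrade_write+0x2d0/0x350
[ 6471.824439]  cache_mngt_core_deinit_attached_meta+0x1f4/0x284 [cas_cache]
[ 6471.824464]  _ocf_mngt_cache_remove_core_attached+0x40/0x9c [cas_cache]
[ 6471.824593] Kernel panic - not syncing: softlockup: hung tasks
"""


def first_call_trace(dmesg_text):
    """Return (function, module) pairs from the first 'Call trace:' block.

    module is None for frames that carry no "[module]" suffix.
    """
    frames = []
    in_trace = False
    for line in dmesg_text.splitlines():
        if not in_trace:
            # Skip everything until the first "Call trace:" header.
            in_trace = "Call trace:" in line
            continue
        match = FRAME_RE.match(line)
        if match is None:
            # First non-frame line ends the trace block.
            break
        frames.append((match.group(1), match.group(2)))
    return frames


if __name__ == "__main__":
    for func, module in first_call_trace(SAMPLE):
        print(func, module or "(kernel)")
```

Filtering the result on `module == "cas_cache"` leaves only the Open CAS frames, which makes it simpler to see that the 21.06.5 and 22.06.2 lockups both pass through `cache_mngt_core_deinit_attached_meta`.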

Expected Behavior

The core is removed from the cache.

Actual Behavior

The kernel crashes.

Steps to Reproduce

  1. Start a cache.
  2. Add LVs (logical volumes) as cores.
  3. Remove the cores.
  4. The kernel crashes.
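
The steps above can be sketched as a short Python helper that only builds the `casadm` command lines and prints them (a dry run, nothing is executed). This is an illustration under assumptions, not the reporter's actual commands: the device paths are hypothetical placeholders, the helper name `build_repro_commands` is made up, and the long-form `casadm` options should be checked against `casadm --help` for the installed release.

```python
import shlex


def build_repro_commands(cache_dev, core_devs, cache_id=1):
    """Return casadm argv lists for: start cache, add cores, remove cores."""
    cmds = [[
        "casadm", "--start-cache",
        "--cache-device", cache_dev,
        "--cache-id", str(cache_id),
    ]]
    # Core ids are assigned explicitly so the remove step can refer to them.
    for core_id, dev in enumerate(core_devs, start=1):
        cmds.append([
            "casadm", "--add-core",
            "--cache-id", str(cache_id),
            "--core-device", dev,
            "--core-id", str(core_id),
        ])
    # Removing the cores is the step that triggers the crash in this report.
    for core_id in range(1, len(core_devs) + 1):
        cmds.append([
            "casadm", "--remove-core",
            "--cache-id", str(cache_id),
            "--core-id", str(core_id),
        ])
    return cmds


if __name__ == "__main__":
    # Hypothetical devices: substitute the real cache device and LV paths.
    for cmd in build_repro_commands("/dev/nvme0n1", ["/dev/vg0/lv0", "/dev/vg0/lv1"]):
        print(shlex.join(cmd))
```

Printing the commands instead of running them keeps the sketch safe to try on any machine; on a test system the printed lines can be run by hand, ideally with kdump configured so a crash leaves a usable vmcore.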

Context

Possible Fix

Logs

Your Environment

hwmanager commented 1 day ago

Did you solve it?

I am also seeing similar symptoms.

It occurred with Open CAS 22.6.3 on Rocky Linux 8.10.

Open CAS 22.6.1 works on Rocky Linux 8.10.