LINBIT / drbd

LINBIT DRBD kernel module
https://docs.linbit.com/docs/users-guide-9.0/
GNU General Public License v2.0

Kernel Panic with 9.1.15 #79

Open qiyuanzhi opened 10 months ago

qiyuanzhi commented 10 months ago

Hello!

I got a kernel panic with 9.1.15.

It is a 3-node cluster, and two nodes hit this panic at the same time.

Here is the call trace:

[16259.065253] drbd pvc-9a4e7d3f-ac7c-4ab1-af10-1209b7c6c13d node-3: Preparing remote state change 1878057944
[16259.067843] drbd pvc-9a4e7d3f-ac7c-4ab1-af10-1209b7c6c13d node-3: Committing remote state change 1878057944 (primary_nodes=0)
[16259.067859] drbd pvc-9a4e7d3f-ac7c-4ab1-af10-1209b7c6c13d/0 drbd1020 node-3: pdsk( UpToDate -> Detaching )
[16259.069646] drbd pvc-9a4e7d3f-ac7c-4ab1-af10-1209b7c6c13d/0 drbd1020 node-3: pdsk( Detaching -> Diskless )
[16259.076779] drbd pvc-9a4e7d3f-ac7c-4ab1-af10-1209b7c6c13d: Preparing cluster-wide state change 277524540 (1->-1 7680/1024)
[16259.077210] drbd pvc-9a4e7d3f-ac7c-4ab1-af10-1209b7c6c13d: State change 277524540: primary_nodes=0, weak_nodes=0
[16259.077214] drbd pvc-9a4e7d3f-ac7c-4ab1-af10-1209b7c6c13d: Committing cluster-wide state change 277524540 (1ms)
[16259.077251] drbd pvc-9a4e7d3f-ac7c-4ab1-af10-1209b7c6c13d/0 drbd1020: disk( UpToDate -> Detaching )
[16259.077489] drbd pvc-9a4e7d3f-ac7c-4ab1-af10-1209b7c6c13d/0 drbd1020: disk( Detaching -> Diskless )
[16259.077951] drbd pvc-9a4e7d3f-ac7c-4ab1-af10-1209b7c6c13d/0 drbd1020: drbd_bm_resize called with capacity == 0
[16259.181762] drbd pvc-9a4e7d3f-ac7c-4ab1-af10-1209b7c6c13d: ASSERTION context->flags & CS_SERIALIZE FAILED in change_cluster_wide_state
[16259.184705] drbd pvc-9a4e7d3f-ac7c-4ab1-af10-1209b7c6c13d: State change failed: State change was refused by peer node
[16259.186149] drbd pvc-9a4e7d3f-ac7c-4ab1-af10-1209b7c6c13d/0 drbd1020 node-3: Failed: pdsk( Diskless -> DUnknown ) repl( Established -> Off )
[16259.186204] drbd pvc-9a4e7d3f-ac7c-4ab1-af10-1209b7c6c13d: ASSERTION context->flags & CS_SERIALIZE FAILED in change_cluster_wide_state
[16259.188355] drbd pvc-9a4e7d3f-ac7c-4ab1-af10-1209b7c6c13d: State change failed: State change was refused by peer node
[16259.189554] drbd pvc-9a4e7d3f-ac7c-4ab1-af10-1209b7c6c13d/0 drbd1020: Failed: quorum( yes -> no )
[16259.189558] drbd pvc-9a4e7d3f-ac7c-4ab1-af10-1209b7c6c13d/0 drbd1020 node-1: Failed: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
[16259.702141] BUG: kernel NULL pointer dereference, address: 0000000000000010
[16259.703513] #PF: supervisor read access in kernel mode
[16259.704930] #PF: error_code(0x0000) - not-present page
[16259.706101] PGD b120ca067 P4D b120ca067 PUD 8566dc067 PMD 0
[16259.707261] Oops: 0000 [#1] SMP NOPTI
[16259.708401] CPU: 19 PID: 2259963 Comm: drbd_r_pvc-9a4e Kdump: loaded Tainted: G           OE     5.15.67-6.cl9.x86_64 #1
[16259.709532] Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.18227214.B64.2106252220 06/25/2021
[16259.711804] RIP: 0010:process_twopc+0x8fc/0x1060 [drbd]
[16259.713069] Code: db 48 89 44 24 10 e9 8b fc ff ff 48 c7 44 24 10 00 00 00 00 e9 7d fc ff ff 48 8b 7c 24 10 48 89 f8 48 85 ff 0f 84 32 04 00 00 <48> 8b 78 10 8b 87 38 01 00 00 85 c0 0f 84 d5 01 00 00 f0 ff 87 70
[16259.715535] RSP: 0018:ffff98b519743d40 EFLAGS: 00010046
[16259.716711] RAX: 0000000000000000 RBX: ffff88b43c120800 RCX: 0000000000000000
[16259.717814] RDX: 0000000000000000 RSI: ffff88bd916ec068 RDI: 0000000000000000
[16259.718860] RBP: ffff88bd916ec000 R08: 0000000000000000 R09: ffff88bd916ec060
[16259.720007] R10: 0000000000000000 R11: 0000000000000000 R12: ffff98b519743e70
[16259.721127] R13: 0000000000000000 R14: ffff98b519743da0 R15: ffff88b77c1b5000
[16259.722183] FS:  0000000000000000(0000) GS:ffff88cb3c2c0000(0000) knlGS:0000000000000000
[16259.723186] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16259.724153] CR2: 0000000000000010 CR3: 00000011537cc005 CR4: 0000000000770ee0
[16259.725140] PKRU: 55555554
[16259.726164] Call Trace:
[16259.727087]  <TASK>
[16259.728015]  ? dtt_recv+0xbb/0x180 [drbd_transport_tcp]
[16259.728932]  receive_twopc+0x97/0x100 [drbd]
[16259.729851]  ? process_twopc+0x1060/0x1060 [drbd]
[16259.730750]  drbdd+0x145/0x290 [drbd]
[16259.731742]  drbd_receiver+0x41/0x60 [drbd]
[16259.732918]  drbd_thread_setup+0x74/0x1e0 [drbd]
[16259.733821]  ? __drbd_next_peer_device_ref+0x120/0x120 [drbd]
[16259.734701]  kthread+0x124/0x150
[16259.735609]  ? set_kthread_struct+0x50/0x50
[16259.736459]  ret_from_fork+0x1f/0x30
[16259.737274]  </TASK>

It shows a NULL pointer dereference in process_twopc.
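
For triage, the key facts in the oops above (faulting function, offset, module, fault address, and which registers were zero) can be pulled out mechanically. The following is a minimal sketch, assuming the oops text is available as a string; `summarize_oops` and its regular expressions are hypothetical helpers written for this illustration, not part of DRBD or of any kernel tooling.

```python
import re

# Patterns for x86-64 oops lines as they appear in the log above.
# The "[timestamp]" prefix is simply skipped by using search().
RIP_RE = re.compile(
    r"RIP: [0-9a-f]{4}:(?P<func>[\w.]+)\+(?P<off>0x[0-9a-f]+)"
    r"/(?P<size>0x[0-9a-f]+)(?: \[(?P<module>\w+)\])?"
)
CR2_RE = re.compile(r"CR2: (?P<addr>[0-9a-f]{16})")
# General-purpose registers only (RAX..R15); \b keeps CR2/CR3 from matching.
REG_RE = re.compile(r"\b(R[A-Z0-9]{2}): ([0-9a-f]{16})")


def summarize_oops(text):
    """Return faulting function, offset, module, fault address and zero registers."""
    summary = {"zero_regs": []}
    for line in text.splitlines():
        m = RIP_RE.search(line)
        if m and "func" not in summary:
            summary.update(
                func=m["func"],
                offset=int(m["off"], 16),
                size=int(m["size"], 16),
                module=m["module"],
            )
        m = CR2_RE.search(line)
        if m:
            summary["fault_addr"] = int(m["addr"], 16)
        for name, value in REG_RE.findall(line):
            if int(value, 16) == 0:
                summary["zero_regs"].append(name)
    return summary


# Three lines copied from the trace above.
sample = """\
[16259.711804] RIP: 0010:process_twopc+0x8fc/0x1060 [drbd]
[16259.716711] RAX: 0000000000000000 RBX: ffff88b43c120800 RCX: 0000000000000000
[16259.724153] CR2: 0000000000000010 CR3: 00000011537cc005 CR4: 0000000000770ee0
"""
summary = summarize_oops(sample)
print(summary["func"], hex(summary["fault_addr"]))  # prints: process_twopc 0x10
```

The marked bytes in the Code line, `<48> 8b 78 10`, decode to `mov 0x10(%rax),%rdi`. With RAX equal to 0, that load targets address 0x10, which matches CR2 and the reported fault address.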

Thanks!

rck commented 10 months ago

Does it reproduce with 9.1.17 (which would be the current version)? https://pkg.linbit.com//downloads/drbd/9/drbd-9.1.17.tar.gz
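
To see which nodes still run a release older than 9.1.17, a comparison along these lines may help. It is a minimal sketch: `parse_version` and `is_older` are hypothetical helper names, and the code assumes the version is already available as a plain dotted string such as `9.1.15`, with no suffix like `-1` or `rc1`.

```python
def parse_version(version):
    """Turn a dotted version such as '9.1.15' into a tuple of ints."""
    return tuple(int(part) for part in version.strip().split("."))


def is_older(installed, reference="9.1.17"):
    """True if `installed` sorts before `reference` (numeric, not string, order)."""
    return parse_version(installed) < parse_version(reference)


# The three versions here are illustrative inputs, one per node.
for node_version in ("9.1.15", "9.1.17", "9.2.0"):
    print(node_version, "older than 9.1.17:", is_older(node_version))
```

Comparing tuples of integers avoids the string-comparison pitfall where `"9.1.9"` would sort after `"9.1.17"`.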

qiyuanzhi commented 10 months ago

I'm not sure. This is the first time I have seen this panic; creating and deleting volumes worked fine in the past.

I tried to reproduce the bug, but it did not occur again.

Is it resolved in 9.1.17?

rck commented 10 months ago

I'm sure there have been bugs that got fixed, but I'm not sure whether this particular issue rings a bell with any of the devs. It is just that most of us don't really bother to spend time on issues for outdated versions. Trying to find a reproducer is one thing, but spending that time on an old version only to find out the code already got changed or fixed is something else. Sorry, just stating the facts; maybe you are lucky :)