LINBIT / drbd

LINBIT DRBD kernel module
https://docs.linbit.com/docs/users-guide-9.0/
GNU General Public License v2.0

kernel panic list_add corruption #72

Closed: ajschorr closed this issue 1 year ago

ajschorr commented 1 year ago

I just set up a 3-way mesh test config on CentOS Stream 9 running kernel 5.14.0-325.el9.x86_64. At some point, the secondary host "ti140" crashed with this error:

[Wed Oct 18 12:28:40 2023][2399614.284617] list_add corruption. prev->next should be next (ffff8c02bc209168), but was ffff8c01e769b760. (prev=ffff8c01e769b760).

Here's what I did:

On all 3 hosts:

    lvcreate -n pool0 -L 30GiB vg_sys
    lvconvert -y --type thin-pool vg_sys/pool0
    lvcreate -n drbd_main -V 10GiB --thinpool pool0 vg_sys
    lvcreate -n drbd_archive -V 10GiB --thinpool pool0 vg_sys
    drbdadm create-md test
    drbdadm up test

And on the primary host "ti128":

    drbdadm new-current-uuid --clear-bitmap test/0
    drbdadm new-current-uuid --clear-bitmap test/1
    drbdadm primary test

I'm attaching the test.res config file from /etc/drbd.d and a log of console messages captured by conserver.

After reboot, "ti140" resynced and seems to be working OK.

Regards, Andy

Attachments: ti140.log, test.res.txt

ajschorr commented 1 year ago

FYI, I just got the same panic on the other secondary node "ti126".

[Wed Oct 18 12:58:09 2023][1548628.601583] list_add corruption. prev->next should be next (ffff9969389d8968), but was ffff996841b261a0. (prev=ffff996841b261a0).

I'm attaching another console log file.

Regards, Andy

Attachment: ti126.log

dvance commented 1 year ago

Should be fixed with this commit: https://github.com/LINBIT/drbd/commit/f72b60c5ff8af5ee8cd8f1d87257afa86a6e0eb3

JoelColledge commented 9 months ago

For reference: This looks more like a case that is fixed by bc9e239cb4c3e7d898fcdd403555e78ee8d76378. Note the call trace involving w_resync_timer and the presence of csums-alg in the config.

ajschorr commented 9 months ago

Ah, thanks. Good to know. So would disabling csums-alg eliminate this issue? The issue has thankfully not recurred...

JoelColledge commented 9 months ago

> So would disabling csums-alg eliminate this issue?

Yes, it should prevent the issue.
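For anyone who wants to check whether a resource file still sets csums-alg before relying on this workaround, here is a minimal shell sketch. The helper name `uses_csums_alg` is hypothetical, and the default path `/etc/drbd.d/test.res` is an assumption based on the report above (the attached test.res is not reproduced in this thread). It only scans the file text for an uncommented csums-alg option; it does not query the running kernel module.

```shell
#!/bin/sh
# Hypothetical helper (not part of drbd-utils): exit 0 if the given DRBD
# resource file contains an uncommented "csums-alg" option, non-zero otherwise.
uses_csums_alg() {
    # Drop "#" comments first, then look for the option name as a whole word.
    sed 's/#.*//' "$1" | grep -Eq '(^|[[:space:]{;])csums-alg([[:space:]]|$)'
}

# Example use. The default path is an assumption taken from the report above.
res_file="${1:-/etc/drbd.d/test.res}"
if [ -r "$res_file" ]; then
    if uses_csums_alg "$res_file"; then
        echo "csums-alg is set in $res_file"
    else
        echo "csums-alg is not set in $res_file"
    fi
fi
```

If the option is present and you remove it, `drbdadm adjust test` should apply the changed configuration to the running resource. Upgrading to a release that contains the fix commit mentioned above is the other way to avoid the panic.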