dm-vdo / kvdo

A kernel module which provides a pool of deduplicated and/or compressed block storage.
GNU General Public License v2.0

vdo 6.2.6.3 crash on kernel 4.19 #48

Open qqshow opened 2 years ago

qqshow commented 2 years ago

Hi team, VDO crashed the system on kernel 4.19. Here is the crash info:

[135468.438992] CPU: 28 PID: 0 Comm: swapper/28 Kdump: loaded Tainted: G W OE 4.19.36
[135468.438996] x25: 0000000000000000
[135468.438998] pstate: 40400009 (nZcv daif +PAN -UAO)
[135468.439000] pc : list_del_entry_valid+0x60/0xc8
[135468.439002] x24: 0000000000000105
[135468.439004] lr : __list_del_entry_valid+0x60/0xc8
[135468.439006] sp : ffff0000815dbd40
[135468.439009] x29: ffff0000815dbd40 x28: ffffa02f6be1eb80
[135468.439012] x23: ffff0000811d9000
[135468.439017] x22: ffff00000126ad68
[135468.439018] x27: dead000000000100 x26: ffff0000811d9000
[135468.439022] x21: ffffa02fab0cf400
[135468.439025] x25: dead000000000200
[135468.439028] x20: ffff0000811d9000
[135468.439030] x24: ffffa02f69eb6000
[135468.439036] x19: ffffa02f69c4a380
[135468.439038] x23: dead000000000100 x22: ffffa02fab0cf400
[135468.439041] x18: ffffffffffffffff
[135468.439047] x17: 0000000000000000
[135468.439049] x21: dead000000000200 x20: ffff0000815dbdc8
[135468.439054] x16: ffff0000804e0508
[135468.439061] x19: ffffa02f69eb6498 x18: ffffffffffffffff
[135468.439066] x15: ffff0000811d9708 x14: 6461656420736177
[135468.439069] x17: 0000000000000000 x16: ffff0000804e0508
[135468.439076] x13: 20747562202c3839
[135468.439077] x15: ffff0000811d9708 x14: 3030303030303064
[135468.439081] x12: 3461346339366632
[135468.439086] x13: 6165642820314e4f x12: 53494f505f545349
[135468.439089] x11: 3061666666662065
[135468.439093] x10: 0000000000000000
[135468.439095] x11: 4c20736920747865 x10: 6e3e2d3839343662
[135468.439102] x9 : ffff0000811dbaf0
[135468.439104] x9 : 6539366632306166 x8 : ffff0000813d9a54
[135468.439111] x8 : 00000000001a91e0
[135468.439113] x7 : 0000000000000000
[135468.439116] x7 : ffff0000813b6000
[135468.439118] x6 : 00000c066ce23796
[135468.439121] x6 : 0000000000000001
[135468.439123] x5 : 0000000000000001 x4 : ffff802fbfb9a400
[135468.439131] x5 : 0000000000000001
[135468.439133] x3 : ffff802fbfb9a400 x2 : 0000000000000007
[135468.439139] x4 : ffffa02fbf9e0400
[135468.439144] x1 : 45680c97a7e6cd00 x0 : 0000000000000000
[135468.439147] x3 : ffffa02fbf9e0400
[135468.439150] Call trace:
[135468.439152] x2 : 0000000000000007
[135468.439155] list_del_entry_valid+0x60/0xc8
[135468.439176] timeoutIndexOperations+0x14c/0x268 [kvdo]
[135468.439180] call_timer_fn+0x34/0x178
[135468.439183] x1 : 45680c97a7e6cd00
[135468.439185] expire_timers+0xec/0x158
[135468.439189] run_timer_softirq+0xc0/0x1f8
[135468.439193] do_softirq+0x120/0x324
[135468.439194] x0 : 0000000000000000
[135468.439199] irq_exit+0x11c/0x140
[135468.439202] handle_domain_irq+0x6c/0xc0
[135468.439205] Call trace:
[135468.439207] gic_handle_irq+0x6c/0x150
[135468.439211] __list_del_entry_valid+0xa4/0xc8
[135468.439213] el1_irq+0xb8/0x140
[135468.439217] arch_cpu_idle+0x38/0x1c0
[135468.439221] default_idle_call+0x30/0x4c
[135468.439238] finishIndexOperation+0x150/0x208 [kvdo]
[135468.439241] do_idle+0x22c/0x300
[135468.439250] handleCallbacks+0x48/0x80 [uds]
[135468.439260] requestQueueWorker+0x74/0x400 [uds]
[135468.439263] cpu_startup_entry+0x28/0x30
[135468.439273] threadStarter+0xa0/0xd8 [uds]
[135468.439277] secondary_start_kernel+0x17c/0x1c8
[135468.439282] ---[ end trace 488bf3b169748bff ]---
[135468.439285] kthread+0x134/0x138

Below is the vmcore-dmesg file: 20220314vdolog.zip

qqshow commented 2 years ago

The vdo config info:

config: !Configuration
  vdos:
    dataDB: !VDOService
      _operationState: finished
      ackThreads: 2
      activated: enabled
      bioRotationInterval: 64
      bioThreads: 6
      blockMapCacheSize: 128M
      blockMapPeriod: 16380
      compression: enabled
      cpuThreads: 6
      deduplication: enabled
      device: /dev/disk/by-id/dm-uuid-LVM-HHcI02NQf6VP6OHSd84FypcdUkND0fTVZ3qdcw5qKO0M61ruXLRxAHyBLec453x6
      hashZoneThreads: 4
      indexCfreq: 0
      indexMemory: 1
      indexSparse: enabled
      indexThreads: 0
      logicalBlockSize: 4096
      logicalSize: 10T
      logicalThreads: 4
      maxDiscardSize: 4K
      name: dataDB
      physicalSize: 10T
      physicalThreads: 4
      slabSize: 32G
      writePolicy: async
    dataFILE: !VDOService
      _operationState: finished
      ackThreads: 2
      activated: enabled
      bioRotationInterval: 64
      bioThreads: 6
      blockMapCacheSize: 128M
      blockMapPeriod: 16380
      compression: enabled
      cpuThreads: 6
      deduplication: enabled
      device: /dev/disk/by-id/dm-uuid-LVM-HHcI02NQf6VP6OHSd84FypcdUkND0fTV53ChuOvn3e8cDoGoOdHMwgfdjYvePt6x
      hashZoneThreads: 4
      indexCfreq: 0
      indexMemory: 1
      indexSparse: enabled
      indexThreads: 0
      logicalBlockSize: 4096
      logicalSize: 91575992M
      logicalThreads: 4
      maxDiscardSize: 4K
      name: dataFILE
      physicalSize: 91575992M
      physicalThreads: 4
      slabSize: 32G
      writePolicy: async
    dataVM: !VDOService
      _operationState: finished
      ackThreads: 2
      activated: enabled
      bioRotationInterval: 64
      bioThreads: 6
      blockMapCacheSize: 128M
      blockMapPeriod: 16380
      compression: enabled
      cpuThreads: 6
      deduplication: enabled
      device: /dev/disk/by-id/dm-uuid-LVM-HHcI02NQf6VP6OHSd84FypcdUkND0fTV5OiUTrBefDxxNTdmCfn9rK0SIN3XtVRu
      hashZoneThreads: 4
      indexCfreq: 0
      indexMemory: 1
      indexSparse: enabled
      indexThreads: 0
      logicalBlockSize: 4096
      logicalSize: 30T
      logicalThreads: 4
      maxDiscardSize: 4K
      name: dataVM
      physicalSize: 30T
      physicalThreads: 4
      slabSize: 32G
      writePolicy: async
  version: 538380551

The vdostats info:

vdostats

sweettea commented 2 years ago

That seems exciting. Apologies.

The message "[135468.440272] list_del corruption. prev->next should be ffffa02f6baefc98, but was dead000000000100" from your vmcore-dmesg indicates that prev->next's value is LIST_POISON1, which implies the previous member of the list had list_del() called on it but was still linked to this entry.
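For anyone following along, here is a rough sketch of where that message comes from. It is a simplification of the upstream list-debug behavior (roughly include/linux/list.h plus lib/list_debug.c), not kvdo code and not the exact kernel source:

```c
/* Simplified userspace sketch of kernel list_del() poisoning and the
 * CONFIG_DEBUG_LIST check; not the exact upstream source. */
#include <stdbool.h>
#include <stdio.h>

#define LIST_POISON1 ((struct list_head *) 0xdead000000000100UL)
#define LIST_POISON2 ((struct list_head *) 0xdead000000000200UL)

struct list_head {
	struct list_head *next, *prev;
};

static bool __list_del_entry_valid(struct list_head *entry)
{
	if (entry->prev->next != entry) {
		/* This is the check that fired: entry->prev (the neighbor,
		 * "A") was already deleted and poisoned, so A->next holds
		 * LIST_POISON1 instead of pointing back at entry. */
		printf("list_del corruption. prev->next should be %p, but was %p\n",
		       (void *) entry, (void *) entry->prev->next);
		return false;
	}
	if (entry->next->prev != entry)
		return false;
	return true;
}

static void list_del(struct list_head *entry)
{
	if (!__list_del_entry_valid(entry))
		return;
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	/* Poison the removed entry so any stale user trips the check above. */
	entry->next = LIST_POISON1;
	entry->prev = LIST_POISON2;
}
```

So a prev->next value of dead000000000100 means some other path already ran list_del() on the previous entry while this entry still pointed back at it.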

https://github.com/dm-vdo/kvdo/blob/1ea1bde8da0b27d675fc34ceef35d375434055b6/vdo/kernel/udsIndex.c#L297 and https://github.com/dm-vdo/kvdo/blob/1ea1bde8da0b27d675fc34ceef35d375434055b6/vdo/kernel/udsIndex.c#L206 seem likely to be the relevant pieces of code. It seems to me like there's a window where the timer could put a timed-out request on the expired list; then, as the timer processes that list, if a request on the expired list gets its completion callback at just the wrong moment on a different thread, that request could see a previous entry which the timer thread has just list_del'd without holding a lock.

Let's call them requests A and B on the expired list. Thread 1 is expiring A and has just deleted A; thread 2 gets the callback for B and list_del's it inside a lock, but views B's old pointer to A and A's freshly deleted state, and therefore explodes.

I'm not sure whether this is feasible though...

@raeburn you might be able to tell me if this is impossible? If it's not, though, this race would still exist on the unstable branch and everything in between.

raeburn commented 2 years ago

I think the isPending flag ought to prevent that, but it's possible there's a bug in that somewhere. I'll take a closer look soon…

sweettea commented 2 years ago

@raeburn you're right, isPending should indeed be protected by the lock, so the scenario I laid out is impossible...

What if thread 2 processes the callback (not deleting it from any lists since it's not pending); reuses it for a new request to the index (thus setting isPending again); and then gets a response to that (therefore doing the list_del this time), all before thread 1 gets around to deleting the item off of expiredList? Theoretically, I think thread 1 could then see a mix of pointer values...

lorelei-sakai commented 2 years ago

@sweettea, that reuse scenario is prevented by dedupeContext->requestState. Even when a data_vio is reused for a subsequent write, its dedupe_context is not (and cannot be) reused until the index request actually completes and calls finishIndexOperation.
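To illustrate the idea (the names and states below are made up for illustration, not the actual dedupeContext fields), the guard behaves like an atomic state machine on the context: only the real completion returns it to a reusable state, so a new index request cannot be issued against it in the meantime.

```c
/* Illustrative sketch only: names, states, and functions here are
 * hypothetical, not the real dedupeContext/requestState code. */
#include <stdatomic.h>
#include <stdbool.h>

enum request_state {
	REQ_IDLE,      /* context is free and may be (re)submitted */
	REQ_PENDING,   /* an index request is outstanding */
	REQ_TIMED_OUT, /* the timer gave up waiting; request still outstanding */
};

struct dedupe_context {
	_Atomic enum request_state state;
};

/* A new index request may only be issued from the idle state. */
static bool try_submit(struct dedupe_context *ctx)
{
	enum request_state expected = REQ_IDLE;
	return atomic_compare_exchange_strong(&ctx->state, &expected,
					      REQ_PENDING);
}

/* The timeout path may stop waiting, but it does NOT make the context
 * reusable. */
static void timeout_index_operation(struct dedupe_context *ctx)
{
	enum request_state expected = REQ_PENDING;
	atomic_compare_exchange_strong(&ctx->state, &expected, REQ_TIMED_OUT);
}

/* Only the real completion from the index returns the context to idle,
 * so try_submit() cannot succeed while a request is still in flight. */
static void finish_index_operation(struct dedupe_context *ctx)
{
	atomic_store(&ctx->state, REQ_IDLE);
}
```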

@qqshow, we did see similar kinds of list_del corruption in testing last year, between about March 2021 and June 2021. The failures seemed to suggest some problem in the implementation of the list macros themselves. (The problem appeared without code changes in this project, and similarly disappeared without code changes.) Unfortunately I don't think we ever identified a specific kernel bug that might have caused it. It may be worth making sure your kernel is up-to-date, though.

qqshow commented 2 years ago

@lorelei-sakai I have seen similar crash stacks several times, on different kernel and vdo versions: kernels 3.10 and 4.19, vdo versions 6.1.1.125, 6.1.3.23, and 6.2.6.

qqshow commented 2 years ago

@raeburn @sweettea Issue #46 reports a similar list_del corruption with vdo version 6.1 on kernel 3.10. But no luck there: no vmcore was generated because the system hung.

qqshow commented 2 years ago

@raeburn @lorelei-sakai @sweettea Please let me know if you need any other information.

raeburn commented 2 years ago

I'm not so sure that requestState provides sufficient protection there. It will guard against both timeoutIndexOperations and finishIndexOperation trying to make the same request structure usable again, but timeoutIndexOperations is using the pendingList field to manage its expired-entry list (after having already removed them from the actual pending list), and will unconditionally remove all entries from it, regardless of requestState. Meanwhile finishIndexOperation will move the VIO toward completion and reuse, causing the list structure to be reinitialized, without regard for whether timeoutIndexOperations has yet removed it from its local expired-entry list.

There are other variations that could cause faults too. Simply having added the entry back to the pending list could cause the list_del in timeoutIndexOperations to detect an inconsistency in the prev/next pointers.

Once timeoutIndexOperations has grabbed an entry for its expired list, we can't let finishIndexOperation by itself free that entry up for reuse, so long as timeoutIndexOperations is using it in its list structure. (Of course timeoutIndexOperations alone isn't enough either; as now, we still need to wait for the dedupe request to complete before we can reuse the request structure, lest that response scribble over some later request.)
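To make the hazard concrete, here is a simplified sketch of the pattern I am describing. The names are invented and this is not the actual udsIndex.c code; the point is that the timer walks its local expired list outside the lock while the completion path is free to recycle the very list_head nodes it is walking.

```c
/* Hypothetical sketch of the described race; not the real kvdo code. */
#include <linux/jiffies.h>
#include <linux/list.h>
#include <linux/spinlock.h>
#include <linux/types.h>

struct index_request {
	struct list_head pending_list;  /* reused for the local expired list */
	unsigned long deadline;
	bool is_pending;
};

static DEFINE_SPINLOCK(pending_lock);
static LIST_HEAD(pending_requests);

/* Timer path: move timed-out entries onto a local list, then process
 * them after dropping the lock. */
static void timeout_index_operations(void)
{
	LIST_HEAD(expired);
	struct index_request *req, *tmp;

	spin_lock(&pending_lock);
	list_for_each_entry_safe(req, tmp, &pending_requests, pending_list) {
		if (time_after(jiffies, req->deadline)) {
			req->is_pending = false;
			list_move_tail(&req->pending_list, &expired);
		}
	}
	spin_unlock(&pending_lock);

	list_for_each_entry_safe(req, tmp, &expired, pending_list) {
		/* HAZARD: if finish_index_operation() has meanwhile pushed
		 * this request toward completion and reuse, pending_list may
		 * already have been reinitialized or relinked, and this
		 * list_del trips the DEBUG_LIST prev/next checks. */
		list_del_init(&req->pending_list);
		/* ... report the timeout for this request ... */
	}
}

/* Completion path: skips the list_del if the timer already claimed the
 * request, but nothing makes it wait for the timer to finish with the
 * entry before the request becomes reusable. */
static void finish_index_operation(struct index_request *req)
{
	spin_lock(&pending_lock);
	if (req->is_pending) {
		req->is_pending = false;
		list_del_init(&req->pending_list);
	}
	spin_unlock(&pending_lock);
	/* ... continue toward completion; the request (including
	 * pending_list) may be reinitialized and reused here while
	 * timeout_index_operations() is still walking 'expired' ... */
}
```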

sweettea commented 2 years ago

@qqshow I don't think so personally. If you happened to have a vmcore, and a kernel-debuginfo package (or just a version that's publicly available), that might be nice, but it's not strictly necessary and it can contain sensitive information. (My hope would be that we could look through the vios and see whether the bad list_entry is in fact part of a VIO in progress.)

@raeburn @lorelei-sakai might have an idea of something useful though?

qqshow commented 2 years ago

vmcore-demsg.zip Here is the vmcore-dmesg from the machine running VDO 6.1 on kernel 3.10.

Hope it's useful!

qqshow commented 2 years ago

@raeburn @sweettea How can we avoid this problem? Can you give some advice? I tried to reduce the vdo threads, but the problem still occurred: vdo modify --all --vdoAckThreads=1 --vdoBioThreads=2 --vdoCpuThreads=1 --vdoHashZoneThreads=1 --vdoLogicalThreads=1 --vdoPhysicalThreads=1

raeburn commented 2 years ago

This problem came up because of a race condition between two paths for handling deduplication advice and incorrect handling of a data structure on one of those paths. (One path for receiving advice, one for timing out waiting to get the advice.)

I’ve got a fix for the problem for older branches of our code, but the “unstable” branch was in the middle of getting a significant rewrite of some of that code, which has been checked into our internal repository in the weeks since the last push to github at the end of April. The changes to the code should have fixed this problem along the way, though I haven’t had a chance to look through the code with an eye toward this particular issue and confirm it.

I don’t know how long it’ll be until the newest code is put up on github.

In the meantime, as a workaround, you may be able to reduce the likelihood of hitting the bug by increasing the deduplication timeout via the module parameter “deduplication_timeout_interval”, which is a number of milliseconds and is set to 5000 (5 seconds) by default. It can be set to values up to 120000 (120 seconds).
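(For example, assuming the parameter is registered on the kvdo module in the usual way, a modprobe.d entry along the lines of options kvdo deduplication_timeout_interval=60000 should set it at load time; whether a given build also exposes it writable at runtime under sysfs depends on the installed version, so please check your own system rather than taking this as gospel.)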

Raising the timeout can reduce throughput if VDO (specifically the UDS index portion) is being slow for some reason. On the other hand, if it’s taking much more than 5 seconds to get deduplication advice back, your setup is already having problems…

raeburn commented 2 years ago

Ah, I mis-"spoke" ... the part of the rewrite with the bug fix isn't in yet, even internally; it's still in the works.

Porting the fix applied to older versions forward would be possible but difficult, due to other changes in the code.

qqshow commented 2 years ago

Thank you very much for your reply!

qqshow commented 2 years ago

Raising the timeout can reduce throughput if VDO (specifically the UDS index portion) is being slow for some reason. On the other hand, if it’s taking much more than 5 seconds to get deduplication advice back, your setup is already having problems…

@raeburn As you said, I tried setting "deduplication_timeout_interval", but vdo still hung my system. Do you have any schedule to push the fix to github?
Could you give me the fix to test on my system?

raeburn commented 2 years ago

The fix for the 6.2 code should be getting pushed to github within the next day or so, actually. The rewrite going into "unstable" will be a while longer in coming, but if you're still on the 4.19 kernel the 6.2 version is the one that matters.

qqshow commented 2 years ago

@raeburn Thank you very much! I will test the fix. https://github.com/dm-vdo/kvdo/commit/72f94c2d858e4baabeba72bc3551287a2dd45d87

qqshow commented 1 year ago

@raeburn The kvdo 6.2 version hangs my system occasionally. My system runs the 4.19 kernel. The problem always occurs when reading a lot.

raeburn commented 1 year ago


I don't know of anything that would cause a hang during reads, but there's a bug (fixed in 6.2.8.1) that can cause a hang after a suspend-resume cycle followed by a lot of writes. Of course, if you're reading files from a file system mounted read-write, there may be access-time updates for each file, so depending on your usage, a huge number of file reads can still translate to a lot of writes. A suspend-resume cycle can be generated by adding physical storage, extending the logical address range, or some other configuration tweaks; I believe there should be kernel log messages generated when that happens. Unfortunately, if this is the bug you trip over, there are no clear signs to indicate it short of groveling over VDO's internal data structures.

If you're not using 6.2.8.1 or later, could you try updating and see if the problem persists?

If you have updated already, then it must be something else... what kernel messages has VDO generated up to that point? Any other interesting kernel messages generated before the hang?

Ken


qqshow commented 1 year ago

I had already updated to 6.2.8.6. The kernel messages have no related information around the time of the hang.

raeburn commented 1 year ago

Hmm, this sounds like something new we haven’t seen.

Is it hanging everything, or can you still issue commands to parts of the system not hung up waiting on I/O to VDO? It would be useful to know if some of the VDO threads are still running (“top”), and if I/O is still happening to the backing storage device (“sar -d 1”) or if it’s sitting idle.

If you can run commands, these should dump some possibly-useful info into the kernel log:

dmsetup message <vdo device name> 0 dump pools
dmsetup message <vdo device name> 0 dump queues
dmsetup message <vdo device name> 0 dump vdo

The “dmsetup message” command should be able to run on a VDO device even if I/O is hung up somehow.