dolohow / uksm

Ultra Kernel Samepage Merging

BUG ? Hit a possible endless loop in uksm_do_scan() #22

Closed colo-ft closed 7 years ago

colo-ft commented 7 years ago

Hi @naixia & @dolohow,

Sorry to bother both of you again. :)

We seem to have caught a new bug in uksm: there appears to be an endless loop in uksm that triggers a soft lockup. Here is the backtrace we extracted from the core with crash:

crash> bt -a
PID: 8      TASK: ffff881029455080  CPU: 0   COMMAND: "migration/0"

0 [ffff88103fe03f70] stop_this_cpu at ffffffff8101e70d

1 [ffff88103fe03f80] smp_reboot_interrupt at ffffffff81046e40

2 [ffff88103fe03fb0] reboot_interrupt at ffffffff81655d1d

--- ---

3 [ffff8810294f3cd8] reboot_interrupt at ffffffff81655d1d

[exception RIP: multi_cpu_stop+74]
RIP: ffffffff81105dba  RSP: ffff8810294f3d80  RFLAGS: 00000246
RAX: 0000000000000001  RBX: ffffffff810b4c20  RCX: 0000000000ffffff
RDX: 0000000000000000  RSI: 0000000000ffffff  RDI: 0000000000000018
RBP: ffff8810294f3da8   R8: ffffffff81a7dee0   R9: 0000000000000000
R10: 000000000000b7bf  R11: 0000000000000001  R12: ffff8810294f3e78
R13: ffffffff81105d70  R14: ffff8818788f7df8  R15: ffff8818788f7e80
ORIG_RAX: ffffffffffffff07  CS: 0010  SS: 0018

4 [ffff8810294f3db0] cpu_stopper_thread at ffffffff81106086

5 [ffff8810294f3e80] smpboot_thread_fn at ffffffff810aedff

6 [ffff8810294f3ec8] kthread at ffffffff810a61ff

7 [ffff8810294f3f50] ret_from_fork at ffffffff81655118

PID: 61 TASK: ffff88102979c500 CPU: 1 COMMAND: "migration/1"

0 [ffff88103fe43f70] stop_this_cpu at ffffffff8101e70d

1 [ffff88103fe43f80] smp_reboot_interrupt at ffffffff81046e40

2 [ffff88103fe43fb0] reboot_interrupt at ffffffff81655d1d

--- ---

3 [ffff8810297c3cd8] reboot_interrupt at ffffffff81655d1d

[exception RIP: multi_cpu_stop+142]
RIP: ffffffff81105dfe  RSP: ffff8810297c3d80  RFLAGS: 00000246
RAX: 0000000000000001  RBX: ffffffff810b4c20  RCX: 0000000000ffffff
RDX: 0000000000000000  RSI: 0000000000ffffff  RDI: 0000000000000018
RBP: ffff8810297c3da8   R8: ffffffff81a7dee0   R9: 0000000000000000
R10: 000000000000000f  R11: 0000000000000001  R12: ffff8810297c3e78
R13: ffffffff81105d70  R14: ffff8818788f7df8  R15: ffff8818788f7e80
ORIG_RAX: ffffffffffffff07  CS: 0010  SS: 0018

4 [ffff8810297c3db0] cpu_stopper_thread at ffffffff81106086

5 [ffff8810297c3e80] smpboot_thread_fn at ffffffff810aedff

6 [ffff8810297c3ec8] kthread at ffffffff810a61ff

7 [ffff8810297c3f50] ret_from_fork at ffffffff81655118

... ...

PID: 236 TASK: ffff88102730ae00 CPU: 9 COMMAND: "uksmd"

0 [ffff88203e2c3f70] stop_this_cpu at ffffffff8101e70d

1 [ffff88203e2c3f80] smp_reboot_interrupt at ffffffff81046e40

2 [ffff88203e2c3fb0] reboot_interrupt at ffffffff81655d1d

--- ---

3 [ffff881026ce3ce8] reboot_interrupt at ffffffff81655d1d

[exception RIP: _raw_spin_lock+18]
RIP: ffffffff8164c492  RSP: ffff881026ce3d98  RFLAGS: 00000282
RAX: 00000000e220e220  RBX: 0000000000000000  RCX: 0000000000000100
RDX: ffff881ff509b290  RSI: 0000000000000001  RDI: ffffffff81efe8f8
RBP: ffff881026ce3d98   R8: 00000000e6200000   R9: 0000000000000001
R10: 0000000000000001  R11: ffff880e661fa088  R12: ffffffff81656df1
R13: ffff881026ce3e60  R14: ffff882007254cf0  R15: ffffffff81656df1
ORIG_RAX: ffffffffffffff07  CS: 0010  SS: 0018

4 [ffff881026ce3da0] try_down_read_slot_mmap_sem at ffffffff811bef59

5 [ffff881026ce3db8] uksm_do_scan at ffffffff811c6590

6 [ffff881026ce3e68] uksm_scan_thread at ffffffff811c7e75

7 [ffff881026ce3ec8] kthread at ffffffff810a61ff

8 [ffff881026ce3f50] ret_from_fork at ffffffff81655118

PID: 107 TASK: ffff882027fe0000 CPU: 10 COMMAND: "migration/10"

0 [ffff88203e303f70] stop_this_cpu at ffffffff8101e70d

1 [ffff88203e303f80] smp_reboot_interrupt at ffffffff81046e40

2 [ffff88203e303fb0] reboot_interrupt at ffffffff81655d1d

--- ---

3 [ffff882027febcd8] reboot_interrupt at ffffffff81655d1d

[exception RIP: multi_cpu_stop+74]
RIP: ffffffff81105dba  RSP: ffff882027febd80  RFLAGS: 00000246
RAX: 0000000000000001  RBX: ffffffff810b4c20  RCX: 0000000000ffffff
RDX: 0000000000000000  RSI: 0000000000ffffff  RDI: 0000000000000018
RBP: ffff882027febda8   R8: ffffffff81a7dee0   R9: 0000000000000001
R10: 000000000000b64b  R11: 0000000000000001  R12: ffff882027febe78
R13: ffffffff81105d70  R14: ffff8818788f7df8  R15: ffff8818788f7e80
ORIG_RAX: ffffffffffffff07  CS: 0010  SS: 0018

4 [ffff882027febdb0] cpu_stopper_thread at ffffffff81106086

5 [ffff882027febe80] smpboot_thread_fn at ffffffff810aedff

6 [ffff882027febec8] kthread at ffffffff810a61ff

7 [ffff882027febf50] ret_from_fork at ffffffff81655118

I only kept part of the backtraces here; all CPUs except CPU 9, which seems to be looping in uksm_do_scan(), were offline.

In uksm_do_scan(), if try_down_read_slot_mmap_sem() keeps returning -EBUSY for a period of time, it looks like the function can spin in an endless loop for that whole time. (Is this possible?) The loop we are looking at is roughly:

while (rung->pages_to_scan && rung->vma_root.num &&
       likely(!freezing(current))) {

    ... ...

    err = try_down_read_slot_mmap_sem(slot);

    ... ...

    if (err == -EBUSY) {
        do {
            reset = advance_current_scan(rung);
            iter = rung->current_scan;
            ... ...
        } while (1);
    }
    if (iter->vma->vm_mm != busy_mm) {
        continue;
    }

We analysed the core: at that time uksmd was scanning rung[2], and there were two vma_slots in that rung. The mmap_sem.count of these two vma_slots' mm_structs are -4294967295 (0xFFFFFFFF00000001) and -4294967291 (0xFFFFFFFF00000005), which means the read semaphore could not be taken at that time, so try_down_read_slot_mmap_sem() keeps returning -EBUSY. Please see below for details.

In our test, swap was enabled in the system, and in some special situations we may not be able to take the mmap_sem read lock for a period of time.
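For reference, here is a minimal sketch of how we understand the trylock path (our paraphrase, not the verbatim uksm source): as long as another task holds mmap_sem for write, the trylock fails immediately and the helper reports -EBUSY on every call, which matches the negative count values shown below.

/*
 * Paraphrased sketch, not the verbatim uksm implementation: the helper
 * only try-locks the slot's mmap_sem for read.  While a writer holds the
 * semaphore (count = 0xFFFFFFFF00000001 in the dump below), the trylock
 * fails immediately, so every call returns -EBUSY without sleeping.
 */
static int try_down_read_slot_mmap_sem(struct vma_slot *slot)
{
	struct mm_struct *mm = slot->mm;

	if (down_read_trylock(&mm->mmap_sem))
		return 0;	/* read lock acquired */

	return -EBUSY;		/* writer active; caller will retry */
}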

I'm not quite sure whether our analysis is right. Is it possible to loop endlessly in the while (rung->pages_to_scan && rung->vma_root.num && ...) loop? Any idea about this problem?

Thanks.

crash> p uksm_scan_ladder[2]
$6 = {
  vma_root = {
    height = 1,
    rnode = 0xffff881fe2f6a8a8,
    enter_node = 0xffff881fe2f6a8a8,
    shift = 8,
    stores_size = 256,
    mask = 255,
    min = 2,
    num = 2,
    ... ...
  },

... ...

  current_scan = 0xffff882007254cf0,
  current_offset = 241,
  offset_init = 0,
  step = 606,
  flags = 1,
  pages_to_scan = 3212,
  cpu_ratio = -2500,
  cover_msecs = 200
}

crash> sradix_tree_node.stores
struct sradix_tree_node {
    [24] void *stores[];
}

crash> x/16g 0xffff881fe2f6a8a8+24
0xffff881fe2f6a8c0:  0xffff881fdd284170  0xffff882007254cf0
0xffff881fe2f6a8d0:  0x0000000000000000  0x0000000000000000
0xffff881fe2f6a8e0:  0x0000000000000000  0x0000000000000000
0xffff881fe2f6a8f0:  0x0000000000000000  0x0000000000000000
0xffff881fe2f6a900:  0x0000000000000000  0x0000000000000000
0xffff881fe2f6a910:  0x0000000000000000  0x0000000000000000
0xffff881fe2f6a920:  0x0000000000000000  0x0000000000000000
0xffff881fe2f6a930:  0x0000000000000000  0x0000000000000000

crash> vma_slot 0xffff881fdd284170
struct vma_slot {
  snode = 0xffff881fe2f6a8a8,
  sindex = 0,
  slot_list = {
    next = 0xffff881fdd284180,
    prev = 0xffff881fdd284180
  },
  fully_scanned_round = 576,
  dedup_num = 0,
  pages_scanned = 2635,
  last_scanned = 2635,
  pages_to_scan = 0,
  rung = 0xffffffff81efeb70 <uksm_scan_ladder+496>,
  rmap_list_pool = 0xffff8809ca928000,
  pool_counts = 0xffff880e88110000,
  pool_size = 4096,
  vma = 0xffff881e22dec948,
  mm = 0xffff881d1b6ef6c0,
  ctime_j = 4463343148,
  pages = 2097152,
  flags = 19,
  pages_cowed = 0,
  pages_merged = 0,
  pages_bemerged = 0,
  dedup_list = {
    next = 0xffff881fdd284218,
    prev = 0xffff881fdd284218
  }
}

crash> mm_struct 0xffff881d1b6ef6c0
struct mm_struct {
  mmap = 0xffff881de0020438,
  mm_rb = {
    rb_node = 0xffff8818321a2608
  },
  mmap_cache = 0xffff881d09571878,
  get_unmapped_area = 0xffffffff81019af0,
  unmap_area = 0xffffffff811a0ae0,
  mmap_base = 140047777320960,
  mmap_legacy_base = 47584893067264,
  task_size = 140737488351232,
  cached_hole_size = 0,
  free_area_cache = 140047777320960,
  highest_vm_end = 140731066691584,
  pgd = 0xffff881839c52000,
  mm_users = {
    counter = 60
  },
  mm_count = {
    counter = 7
  },

... ...

  mmap_sem = {
    count = -4294967295,
    wait_lock = {
      raw_lock = {
        {
          head_tail = 163580352,
          tickets = {
            head = 2496,
            tail = 2496
          }
        }
      }
    },
    wait_list = {
      next = 0xffff880f51dcfe70,
      prev = 0xffff880045373dd0
    }
  },
  mmlist = {
    next = 0xffff881d1b6ef758,
    prev = 0xffff881d1b6ef758
  },
  ... ...

crash> vma_slot 0xffff882007254cf0
struct vma_slot {
  snode = 0xffff881fe2f6a8a8,
  sindex = 1,
  slot_list = {
    next = 0xffff882007254d00,
    prev = 0xffff882007254d00
  },
  fully_scanned_round = 0,
  dedup_num = 0,
  pages_scanned = 5465,
  last_scanned = 5465,
  pages_to_scan = 0,
  rung = 0xffffffff81efeb70 <uksm_scan_ladder+496>,
  rmap_list_pool = 0xffff881c37150000,
  pool_counts = 0xffff881ac8848000,
  pool_size = 8192,
  vma = 0xffff881ff509b290,
  mm = 0xffff88047d5cc4c0,
  ctime_j = 4463606846,
  pages = 4194304,
  flags = 19,
  pages_cowed = 0,
  pages_merged = 0,
  pages_bemerged = 0,
  dedup_list = {
    next = 0xffff882007254d98,
    prev = 0xffff882007254d98
  }
}

crash> mm_struct 0xffff88047d5cc4c0
struct mm_struct {
  mmap = 0xffff88066954cf30,
  mm_rb = {
    rb_node = 0xffff8820030a51d8
  },
  mmap_cache = 0xffff881ff509b290,
  get_unmapped_area = 0xffffffff81019af0,
  unmap_area = 0xffffffff811a0ae0,
  mmap_base = 139674410536960,
  mmap_legacy_base = 47958259851264,
  task_size = 140737488351232,
  cached_hole_size = 0,
  free_area_cache = 139674410536960,
  highest_vm_end = 140732348211200,
  pgd = 0xffff882001615000,
  mm_users = {
    counter = 47
  },
  mm_count = {
    counter = 8
  },

... ...

  mmap_sem = {
    count = -4294967291,
    wait_lock = {
      raw_lock = {
        {
          head_tail = 14155992,
          tickets = {
            head = 216,
            tail = 216
          }
        }
      }
    },
    wait_list = {
      next = 0xffff88033386fe70,
      prev = 0xffff880aca73baf0
    }
  },

... ...

  owner = 0xffff880cdcf4c500,
  exe_file = 0xffff880f187f4200,

colo-ft commented 7 years ago

I forgot to mention that the reason all the other CPUs were going into stopping status is that we had executed an 'rmmod' command, which forces all CPUs into the stopping state.

So, as analysed above, if the process holding the write semaphore has been scheduled out because of the rmmod process, uksm will never get the mm_struct read semaphore, and it may get stuck in the while() loop.

naixia commented 7 years ago

There are several conditions that break the two while loops. A simple and direct hack is to always break the outer loop whenever busy_retry drops to zero, so that uksmd is less aggressive when racing with busy mms; see the sketch below.
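To make the idea concrete, here is an illustrative sketch written against the abridged loop quoted earlier in this issue, not the actual uksm patch; busy_retry and BUSY_RETRY are taken from the suggestion above and their exact values are assumptions.

/*
 * Illustrative sketch only, based on the abridged loop quoted above:
 * give each pass over a rung a bounded retry budget, and bail out of the
 * outer loop once it is exhausted, so uksmd backs off instead of
 * spinning against write-locked mmap_sems.
 */
busy_retry = BUSY_RETRY;	/* assumed retry budget, e.g. 100 */

while (rung->pages_to_scan && rung->vma_root.num &&
       likely(!freezing(current))) {
	... ...
	err = try_down_read_slot_mmap_sem(slot);
	if (err == -EBUSY) {
		if (--busy_retry == 0)
			break;	/* everything looks busy: leave the rung
				 * and let the scan thread retry later */
		... ...		/* otherwise advance to the next slot */
		continue;
	}
	... ...
}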

colo-ft commented 7 years ago

@naixia Yes, that is one way to fix this problem. We actually fixed it by checking whether all the vma_slots in this rung are busy, and breaking the outer loop if they are (a rough sketch of our approach is below) ~
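A rough sketch of that approach, again written against the abridged loop quoted earlier rather than the exact change we applied; the busy_slots counter is introduced here only for illustration.

/*
 * Rough sketch of the alternative fix described above: count how many
 * slots in the rung reported -EBUSY in a row, and break the outer loop
 * once every vma_slot in the rung has turned out to be busy.
 */
unsigned long busy_slots = 0;	/* hypothetical per-pass counter */

while (rung->pages_to_scan && rung->vma_root.num &&
       likely(!freezing(current))) {
	... ...
	err = try_down_read_slot_mmap_sem(slot);
	if (err == -EBUSY) {
		if (++busy_slots >= rung->vma_root.num)
			break;	/* every vma_slot in this rung is busy,
				 * stop spinning and try again later */
		... ...		/* advance to the next slot as before */
		continue;
	}
	busy_slots = 0;		/* got one slot, reset the counter */
	... ...
}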

Thanks

naixia commented 7 years ago

Will be fixed in the next release