Closed colo-ft closed 7 years ago
ksm.c does not check the return value of page_stable_node() at this place either.
lock_page(kpage); stable_tree_append(rmap_item, page_stable_node(kpage)); unlock_page(kpage);
It's definitely a race condition of the stable page. The cause might be NUMA page migration or memory hotplug, which uksm.c does not support for the time being. And I might not have the time to support that in the near future, very sorry.
However, if you can patch uksm to support it, I am very open to patch submission. :)
@naixia Thanks for your quick reply :) Yes, you are right, ksm.c doesn't do that either, so ksm has the same problem ? This problem is a little critical, can we just check the return value of page_stable_node() to avoid this problem temporarily, does it has any side effect ?
@naixia @dolohow We didn't do memory hotplug. :)
@colo-ft I am not saying ksm has the same problem and I have no idea what you are running. Anyway, without a clear construction of the race condition, any modification to the code is risky. You need to first make this bug reproducible. Maybe then I can tell if I can help you in this scenario.
@naixia We analysed the core dump and confirmed that this problem is caused by page migrate, just as your surmise. We believe the follow patch can fix this bug, it is based on 3.10 version uksm. We followed the ksm's modification which supports page migration, we have tested it, i'd like you to help reviewing it, thanks.
Subject: [PATCH 1/1] uksm: Lock page in get_uksm_page() to serialize with page
migration
Kernel panic when uksm tries to merge a page with a kpage in stable
tree while the kpage has been migrated:
PID: 427 TASK: ffff881026561700 CPU: 10 COMMAND: "uksmd"
#0 [ffff881026587878] machine_kexec at ffffffff8105282b
#1 [ffff8810265878d8] crash_kexec at ffffffff810f4a82
#2 [ffff8810265879a8] panic at ffffffff8163bdeb
#3 [ffff881026587a28] oops_end at ffffffff8164b19b
#4 [ffff881026587a50] die at ffffffff8101870b
#5 [ffff881026587a80] do_trap at ffffffff8164a840
#6 [ffff881026587ad0] do_invalid_op at ffffffff810151d4
#7 [ffff881026587b80] invalid_op at ffffffff8165401e
[exception RIP: stable_tree_append+658]
RIP: ffffffff811bda52 RSP: ffff881026587c38 RFLAGS: 00010246
RAX: ffffffff81efac68 RBX: ffff880ab4c4aaa0 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff880ab4c4aaa0
RBP: ffff881026587c70 R8: 2e00000000000000 R9: a801eba897000000
R10: 57fdfe57e3ea25c0 R11: ffff881eba897000 R12: ffff8820180d2730
R13: 0000000000000001 R14: ffff880ab4c4aaa0 R15: ffffea007aea25c0
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#8 [ffff881026587c78] cmp_and_merge_page at ffffffff811c1f33
#9 [ffff881026587d30] scan_vma_one_page at ffffffff811c4911
#10 [ffff881026587db8] uksm_do_scan at ffffffff811c59ab
#11 [ffff881026587e68] uksm_scan_thread at ffffffff811c7325
#12 [ffff881026587ec8] kthread at ffffffff810a5d9f
#13 [ffff881026587f50] ret_from_fork at ffffffff816527d8
According to the two commits in upstream from Hugh Dickins below,
commit 8aafa6a485ae77ce4a49eb1280f3d2c6074a03fb
commit c8d6553b9580188a1324486173d79c0f8642e870
uksm has to be serialized with page migration path.
1. When KSM migration is fully enabled, we shall want that to make sure that
the page just acquired cannot be migrated beneath us. Raised page count is
only effective when there is serialization to make sure migration notices.
So we should lock page in get_uksm_page() to serialize with move_to_new_page()
2. When the kpage in stable tree has been migrated and stable_node->kpfn has
changed, we have to set old_page->mapping to NULL and remove it from stable
tree the next time get_uksm_page() finds it.
---
mm/uksm.c | 38 +++++++++++++++++++++++++++++++-------
1 file changed, 31 insertions(+), 7 deletions(-)
diff --git a/mm/uksm.c b/mm/uksm.c
index 60c9789..d889b44 100644
--- a/mm/uksm.c
+++ b/mm/uksm.c
@@ -871,8 +871,7 @@ static void remove_node_from_stable_tree(struct stable_node *stable_node,
* but this is different - made simpler by uksm_thread_mutex being held, but
* interesting for assuming that no other use of the struct page could ever
* put our expected_mapping into page->mapping (or a field of the union which
- * coincides with page->mapping). The RCU calls are not for KSM at all, but
- * to keep the page_count protocol described with page_cache_get_speculative.
+ * coincides with page->mapping).
*
* Note: it is possible that get_uksm_page() will return NULL one moment,
* then page the next, if the page is in between page_freeze_refs() and
@@ -892,11 +891,14 @@ static struct page *get_uksm_page(struct stable_node *stable_node,
{
struct page *page;
void *expected_mapping;
+ unsigned long kpfn;
- page = pfn_to_page(stable_node->kpfn);
+again:
+ kpfn = stable_node->kpfn;
+ page = pfn_to_page(kpfn);
expected_mapping = (void *)stable_node +
(PAGE_MAPPING_ANON | PAGE_MAPPING_KSM);
- rcu_read_lock();
+
if (page->mapping != expected_mapping)
goto stale;
if (!get_page_unless_zero(page))
@@ -905,12 +907,26 @@ static struct page *get_uksm_page(struct stable_node *stable_node,
put_page(page);
goto stale;
}
- rcu_read_unlock();
+
+ lock_page(page);
+ if (page->mapping != expected_mapping) {
+ unlock_page(page);
+ put_page(page);
+ goto stale;
+ }
+ unlock_page(page);
return page;
stale:
- rcu_read_unlock();
+ /*
+ * We come here from above when page->mapping or !PageSwapCache
+ * suggests that the node is stale; but it might be under migration.
+ * We need smp_rmb(), matching the smp_wmb() in ksm_migrate_page(),
+ * before checking whether node->kpfn has been changed.
+ */
+ smp_rmb();
+ if (stable_node->kpfn != kpfn)
+ goto again;
remove_node_from_stable_tree(stable_node, unlink_rb, remove_tree_node);
-
return NULL;
}
@@ -4858,6 +4874,14 @@ void ksm_migrate_page(struct page *newpage, struct page *oldpage)
if (stable_node) {
VM_BUG_ON(stable_node->kpfn != page_to_pfn(oldpage));
stable_node->kpfn = page_to_pfn(newpage);
+ /*
+ * newpage->mapping was set in advance; now we need smp_wmb()
+ * to make sure that the new stable_node->kpfn is visible
+ * to get_ksm_page() before it can see that oldpage->mapping
+ * has gone stale (or that PageSwapCache has been cleared).
+ */
+ smp_wmb();
+ set_page_stable_node(oldpage, NULL);
}
}
#endif /* CONFIG_MIGRATION */
---
@colo-ft Looks good to me. I'll merge your patch back to my repo later. Thanks!
Hi @naixia . Do you have any plan to include this patch to all supported kernels?
@wongsyrone I will in a few days.
Thanks.
@colo-ft It's good to include the patch as an attachment next time since the pasted code replaces all tabs with spaces.
@naixia OK ~
Hi,
We got a kernel panic issue with uksm. The kernel log is: 4075.166236] ------------[ cut here ]------------ [ 4075.171023] kernel BUG at mm/uksm.c:2535! [ 4075.175169] invalid opcode: 0000 [#1] SMP [ 4075.180312] collected_len = 150989, LOG_BUF_LEN_LOCAL = 1048576 [ 4075.202826] kbox: no notify die func register. no need to notify [ 4075.208986] do nothing after die! [ 4075.212470] Modules linked in: ipt_REJECT sch_netem bridge stp llc dm_multipath kboxdriver(O) kbox(O) iptable_filter ocfs2_lockfs(OE) ocfs2(OE) ocfs2_adl(OE) jbd2 ocfs2_stack_o2cb(OE) ocfs2_dlm(OE) dev_connlimit(O) vhba(OE) ocfs2_nodemanager(OE) ocfs2_stackglue(OE) bum(O) ip_set nfnetlink prio(O) hotpatch(OE) nat(O) vport_vxlan(O) openvswitch(O) nf_defrag_ipv6 gre ipmi_devintf signo_catch(O) ipmi_si ipmi_msghandler pmcint(O) iTCO_wdt iTCO_vendor_support coretemp intel_rapl crc32_pclmul crc32c_intel be2net ghash_clmulni_intel aesni_intel vxlan kvm_intel(O) lrw gf128mul sb_edac glue_helper ip6_udp_tunnel ablk_helper cryptd edac_core sg kvm(O) lpc_ich i2c_i801 i2c_core udp_tunnel mfd_core pcspkr shpchp acpi_power_meter mei_me mei nf_conntrack_ipv4 nf_defrag_ipv4 vhost_net(O) tun(O) vhost(O) macvtap [ 4075.285575] macvlan vfio_pci irqbypass vfio_iommu_type1 vfio xt_sctp nf_conntrack_proto_sctp nf_nat_proto_sctp nf_nat nf_conntrack sctp libcrc32c ip_tables ext3 mbcache jbd dm_mod sd_mod lpfc ahci crc_t10dif crct10dif_generic libahci crct10dif_pclmul mpt3sas libata scsi_transport_fc raid_class scsi_tgt crct10dif_common scsi_transport_sas nbd(OE) [last unloaded: kbox] [ 4075.318225] CPU: 10 PID: 427 Comm: uksmd Tainted: G OE ---- ------- 3.10.0-327.36.58.7_4.x86_64 #1 [ 4075.328627] Hardware name: Huawei CH121 V3/IT11SGCA1, BIOS 3.18 04/05/2016 [ 4075.335657] task: ffff881026561700 ti: ffff881026584000 task.ti: ffff881026584000 [ 4075.343458] RIP: 0010:[] [] stable_tree_append+0x292/0x2a0
[ 4075.352499] RSP: 0018:ffff881026587c38 EFLAGS: 00010246
[ 4075.357976] RAX: ffffffff81efac68 RBX: ffff880ab4c4aaa0 RCX: 0000000000000000
[ 4075.365263] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff880ab4c4aaa0
[ 4075.372550] RBP: ffff881026587c70 R08: 2e00000000000000 R09: a801eba897000000
[ 4075.379839] R10: 57fdfe57e3ea25c0 R11: ffff881eba897000 R12: ffff8820180d2730
[ 4075.387127] R13: 0000000000000001 R14: ffff880ab4c4aaa0 R15: ffffea007aea25c0
[ 4075.394408] FS: 0000000000000000(0000) GS:ffff88103fe80000(0000) knlGS:0000000000000000
[ 4075.402819] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4075.408725] CR2: 00000000ffffffff CR3: 000000000195a000 CR4: 00000000001427e0
[ 4075.416012] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4075.423302] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 4075.430591] Stack:
[ 4075.432774] ffffea007ff703f0 ffffea007aea25c0 ffff880ab4c4aaa0 ffffea007aea3080
[ 4075.440566] 00000000df349a74 ffff881fde5d3cb0 ffffea007aea25c0 ffff881026587d28
[ 4075.448362] ffffffff811c1f33 ffff881026125720 ffff881ffcf46a40 ffff881026587cd0
[ 4075.456153] Call Trace:
[ 4075.458767] [] cmp_and_merge_page+0x623/0x2b40
[ 4075.465019] [] ? __mmu_notifier_invalidate_range_end+0xc6/0xe0
[ 4075.472828] [] ? follow_trans_huge_pmd+0x1e7/0x200
[ 4075.479421] [] ? follow_page_mask+0x475/0x600
[ 4075.485584] [] ? kmem_cache_alloc+0x1ba/0x1d0
[ 4075.491747] [] scan_vma_one_page+0x4c1/0x1430
[ 4075.497915] [] uksm_do_scan+0x12b/0x18f0
[ 4075.503649] [] uksm_scan_thread+0x1b5/0x1e0
[ 4075.509638] [] ? wake_up_atomic_t+0x30/0x30
[ 4075.515631] [] ? uksm_do_scan+0x18f0/0x18f0
[ 4075.521595] [] kthread+0xcf/0xe0
[ 4075.526636] [] ? kthread_create_on_node+0x140/0x140
[ 4075.533323] [] ret_from_fork+0x58/0x90
[ 4075.538875] [] ? kthread_create_on_node+0x140/0x140
[ 4075.545562] Code: 08 48 85 f6 48 89 70 08 74 04 48 89 7e 08 48 89 79 20 48 83 c1 20 48 89 48 10 e9 9c fe ff ff 0f 0b 48 89 d8 31 d2 e9 90 fe ff ff <0f> 0b 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48
[ 4075.566165] RIP [] stable_tree_append+0x292/0x2a0
[ 4075.572695] RSP
[ 4075.576856] ---[ end trace 23d1b2dfe6b9c6e2 ]---
[ 4075.691407] Kernel panic - not syncing: Fatal exception
[ 4075.822401] die even has been record!
The releated codes of our version are: 2523 /* 2524 stable_tree_append() - append a rmap_item to a stable node. Deduplication 2525 ratio statistics is done in this function. 2526 2527 / 2528 static void stable_tree_append(struct rmap_item rmap_item, 2529 struct stable_node stable_node, int logdedup) 2530 { 2531 struct node_vma node_vma = NULL, new_node_vma, node_vma_cont = NULL; 2532 unsigned long key = (unsigned long)rmap_item->slot; 2533 unsigned long factor = rmap_item->slot->rung->step; 2534 2535 BUG_ON(!stable_node); -----------------------------> We hit here. 2536 rmap_item->address |= STABLE_FLAG; 2537 2538 if (hlist_empty(&stable_node->hlist)) { 2539 uksm_pages_shared++; 2540 goto node_vma_new; 2541 } else { 2542 uksm_pages_sharing++; 2543 }
The upper caller is: 2777 static void cmp_and_merge_page(struct rmap_item rmap_item, u32 hash) 2778 { 2779 struct rmap_item tree_rmap_item; 2780 struct page page; 2781 struct page kpage = NULL; 2782 u32 hash_max; 2783 int err; 2784 unsigned int success1, success2; 2785 struct stable_node snode; 2786 int cmp; 2787 struct rb_node parent = NULL, *new; 2788 2789 remove_rmap_item_from_tree(rmap_item); 2790 page = rmap_item->page; 2791 2792 / We first start with searching the page inside the stable tree / 2793 kpage = stable_tree_search(rmap_item, hash); 2794 if (kpage) { 2795 err = try_to_merge_with_uksm_page(rmap_item, kpage, 2796 hash); 2797 if (!err) { 2798 / 2799 The page was successfully merged, add 2800 its rmap_item to the stable tree. 2801 page lock is needed because it's 2802 racing with try_to_unmap_ksm(), etc. 2803 / 2804 lock_page(kpage); 2805 snode = page_stable_node(kpage); 2806 stable_tree_append(rmap_item, snode, 1); ---> Here ;) 2807 unlock_page(kpage); 2808 put_page(kpage); 2809 return; / success */ 2810 } 2811 put_page(kpage)
I didn't find how to reproduce it, but i noticed that for ksm, every places it calles page_stable_node(), it will check its return value before uses it, but here, we doesn't checkout the snode and use it directly, Is this a problem ? Or does the codes has ensured that snode here can't be NULL ?
Thanks, Hailiang