dolohow / uksm

Ultra Kernel Samepage Merging

kernel BUG in UKSM gets triggered often #3

Closed TheCrazyLex closed 7 years ago

TheCrazyLex commented 8 years ago

I am using the UKSM patch for Kernel 4.7 and applying it on a clean 4.7.7 base. The patch applies cleanly.

The problem is that UKSM crashes pretty often for me because a "BUG_ON" in the code gets triggered. This happens mostly while compiling Android, and it aborts the compilation with a "memory allocation" failure.

The output in dmesg is as follows:

[ 49.269641] ------------[ cut here ]------------
[ 49.269701] kernel BUG at mm/uksm.c:4137!
[ 49.269740] invalid opcode: 0000 [#1] PREEMPT SMP
[ 49.269781] Modules linked in: ccm rfcomm fuse nf_conntrack_netbios_ns nf_conntrack_broadcast ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_broute bridge stp llc ebtable_nat ip6table_mangle ip6table_raw ip6table_security ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 iptable_mangle iptable_raw iptable_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables bnep binfmt_misc arc4 vfat fat iwlmvm intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp mac80211 kvm snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic irqbypass crct10dif_pclmul iTCO_wdt crc32_pclmul iTCO_vendor_support uvcvideo snd_hda_intel hp_wmi iwlwifi sparse_keymap snd_hda_codec ghash_clmulni_intel
[ 49.270619] videobuf2_vmalloc intel_cstate videobuf2_memops videobuf2_v4l2 btusb snd_hda_core intel_uncore videobuf2_core btrtl btbcm videodev btintel intel_rapl_perf snd_hwdep cfg80211 snd_seq bluetooth snd_seq_device media snd_pcm rtsx_pci_ms memstick rfkill snd_timer mei_me snd mei soundcore shpchp processor_thermal_device i2c_i801 joydev intel_soc_dts_iosf hp_wireless hp_accel int3403_thermal int340x_thermal_zone tpm_crb lis3lv02d input_polldev int3400_thermal acpi_thermal_rel tpm_tis acpi_pad tpm nfsd auth_rpcgss nfs_acl lockd grace sunrpc nouveau i915 rtsx_pci_sdmmc mmc_core mxm_wmi ttm i2c_algo_bit drm_kms_helper crc32c_intel drm r8169 serio_raw rtsx_pci mii wmi video fjes
[ 49.271358] CPU: 1 PID: 171 Comm: uksmd Not tainted 4.7.7-203.alex.fc24.x86_64 #1
[ 49.271419] Hardware name: HP HP Pavilion Gaming Notebook/80A9, BIOS F.80 06/14/2016
[ 49.276016] task: ffff880275aa9e00 ti: ffff880065b64000 task.ti: ffff880065b64000
[ 49.280594] RIP: 0010:[] [] uksm_del_vma_slot+0x529/0x680
[ 49.285069] RSP: 0018:ffff880065b67d28 EFLAGS: 00010286
[ 49.289162] RAX: 00000000ffffffff RBX: ffff88006b31c000 RCX: 0000000000000000
[ 49.293260] RDX: 0000000080000001 RSI: 000000000001b3e0 RDI: 00000000ffffffff
[ 49.297370] RBP: ffff880065b67d80 R08: ffff880234f52cd0 R09: 0000000180330032
[ 49.300988] R10: 0000000000000000 R11: 000000000001b201 R12: 0000000000000000
[ 49.303780] R13: ffff88006b31c000 R14: 0000000000000000 R15: ffff880066616600
[ 49.306177] FS: 0000000000000000(0000) GS:ffff880282440000(0000) knlGS:0000000000000000
[ 49.308307] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 49.310344] CR2: 00007fa52341a000 CR3: 0000000021c06000 CR4: 00000000003406e0
[ 49.312074] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 49.313787] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 49.315412] Stack:
[ 49.316826] 0000000000000000 0000000000000000 00000000a17e9340 ffff880065b67d70
[ 49.318256] ffffffffa10caaa9 ffff8800665f1240 ffffffffa1c6a320 ffffffffa201ed38
[ 49.319691] ffffffffa201edf8 0000000000000020 ffffffffa201ef80 ffff880065b67e60
[ 49.320970] Call Trace:
[ 49.322184] [] ? task_sched_runtime.part.81+0x59/0x140
[ 49.323404] [] uksm_do_scan+0xf89/0x2b40
[ 49.324616] [] ? del_timer_sync+0x48/0x50
[ 49.325728] [] ? schedule_timeout+0x136/0x280
[ 49.326798] [] uksm_scan_thread+0x154/0x180
[ 49.327856] [] ? uksm_do_scan+0x2b40/0x2b40
[ 49.328897] [] ? uksm_do_scan+0x2b40/0x2b40
[ 49.329931] [] kthread+0xd8/0xf0
[ 49.330888] [] ret_from_fork+0x1f/0x40
[ 49.331842] [] ? kthread_worker_fn+0x160/0x160
[ 49.332847] Code: 00 49 8b 76 40 48 8d 7e 30 48 89 75 d0 e8 00 c4 1e 00 48 8b 75 d0 48 8b 3d 55 88 e1 00 e8 f0 ce 00 00 49 8b 46 10 e9 77 fc ff ff <0f> 0b 49 8b 87 88 00 00 00 48 29 05 27 84 e1 00 e9 ca fe ff ff
[ 49.333903] RIP [] uksm_del_vma_slot+0x529/0x680
[ 49.334963] RSP
[ 49.340614] ---[ end trace 9e85fc3c702dfbb8 ]---

Thanks in advance!

TheCrazyLex commented 8 years ago

@naixia @dolohow please help :)

naixia commented 8 years ago

@TheCrazyLex Ok, I'll look into this.
Have you ever tried another kernel version like v4.6 or v4.5 and seen the same bug? I suspect it is related to some upstream kernel change. Can you help me to narrow it down?

naixia commented 8 years ago

@TheCrazyLex

It's weird, because this code path is very hot for every UKSM user. I think the logic itself should have been exercised by many users over the last several years.

So I want to know whether you are running under special conditions such as OOM, whether this could be caused by faulty RAM banks, or whether you have some other unusual setup or system config related to memory.

TheCrazyLex commented 8 years ago

@naixia Thank you for your very fast answer! I'll be glad to help you narrow it down.

I haven't tried any other kernel versions; I might try 4.8 later.

I wouldn't think my setup is special, actually: 8 GB RAM and 5.2 GB swap. As far as I could see, I wasn't near an OOM when this happened. UKSM helps me a lot during Android compilation; it seems a lot of duplicated pages are produced.

I ran some tests on my RAM banks today, they seem to be healthy. And the system runs normally when I disable UKSM, just that the memory usage is pretty high then.

Thank you!

TheCrazyLex commented 8 years ago

@naixia Please let me know whether you think it is worth testing the 4.8 patch on 4.8.1 :)

naixia commented 8 years ago

@TheCrazyLex I would suggest starting from v4.4 and trying to find a version N that does not trigger the bug while the next version (N+1) does. If you fail to find a good kernel for your workload, I would suggest creating an lxc rootfs or docker image of your build system for me so that I can reproduce the bug on my machine.
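
The search described above (find the last version that does not trigger the bug and the first one that does) can be sketched as a binary search. This is only an illustration under stated assumptions: `find_first_bad` and `triggers_bug` are hypothetical names, the version list is ordered oldest to newest, and once the bug appears it is assumed to stay in every later version. In practice `triggers_bug` would mean building and booting that kernel with the matching UKSM patch and running the Android build by hand.

```python
from typing import Callable, Optional, Sequence, Tuple


def find_first_bad(
    versions: Sequence[str],
    triggers_bug: Callable[[str], bool],
) -> Optional[Tuple[Optional[str], str]]:
    """Return (last_good, first_bad) for an oldest-to-newest version list.

    Returns None if the newest version does not trigger the bug (nothing
    to narrow down). last_good is None if every version triggers it.
    Assumes the bug, once introduced, is present in all later versions.
    """
    if not versions or not triggers_bug(versions[-1]):
        return None

    lo, hi = 0, len(versions) - 1
    # Invariant: versions[hi] triggers the bug; everything before lo does not.
    while lo < hi:
        mid = (lo + hi) // 2
        if triggers_bug(versions[mid]):
            hi = mid          # first bad version is at mid or earlier
        else:
            lo = mid + 1      # first bad version is after mid
    last_good = versions[hi - 1] if hi > 0 else None
    return last_good, versions[hi]
```

With five candidate versions this needs at most four test runs (one for the newest version, then up to three halving steps), which matters when each test is a full kernel build plus an Android compile.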

dolohow commented 8 years ago

I think I hit a similar bug:

[ 8214.542310] ------------[ cut here ]------------
[ 8214.542315] WARNING: CPU: 1 PID: 70 at mm/page_alloc.c:3430 __alloc_pages_nodemask+0xc2e/0xda0
[ 8214.542316] Modules linked in: overlay ctr ccm arc4 xt_conntrack iptable_filter iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack fuse amdkfd amd_iommu_v2 mousedev joydev input_leds radeon iTCO_wdt iTCO_vendor_support ppdev intel_rapl evdev x86_pkg_temp_thermal intel_powerclamp coretemp ath5k kvm_intel led_class mac80211 snd_hda_codec_realtek snd_hda_codec_generic i2c_algo_bit kvm drm_kms_helper snd_hda_codec_hdmi irqbypass syscopyarea ath snd_hda_intel sysfillrect sysimgblt cfg80211 snd_hda_codec fb_sys_fops snd_hwdep ttm psmouse snd_hda_core pcspkr drm rfkill snd_pcm snd_timer r8169 snd mei_me mei soundcore mii i2c_i801 i2c_smbus shpchp lpc_ich thermal fan battery parport_pc parport video button sch_fq_codel ip_tables
[ 8214.542345]  x_tables ext4 crc16 jbd2 mbcache hid_generic usbhid hid sd_mod serio_raw atkbd libps2 crct10dif_pclmul crc32_pclmul crc32c_intel ahci aesni_intel aes_x86_64 libahci glue_helper lrw libata gf128mul ablk_helper cryptd xhci_pci scsi_mod ehci_pci xhci_hcd ehci_hcd usbcore usb_common i8042 serio jitterentropy_rng sha256_ssse3 sha256_generic hmac drbg ansi_cprng
[ 8214.542360] CPU: 1 PID: 70 Comm: uksmd Not tainted 4.8.4-ck3-bfq-uksm #4
[ 8214.542361] Hardware name: MSI MS-7820/H81-P33(MS-7820), BIOS V1.6 03/30/2015
[ 8214.542362]  0000000000000286 00000000f6319a93 ffff88019440f930 ffffffff812d2160
[ 8214.542363]  0000000000000000 0000000000000000 ffff88019440f970 ffffffff81075b3b
[ 8214.542365]  00000d669440fa50 ffff880197279880 0000000000000000 0000000000000000
[ 8214.542366] Call Trace:
[ 8214.542370]  [<ffffffff812d2160>] dump_stack+0x63/0x83
[ 8214.542373]  [<ffffffff81075b3b>] __warn+0xcb/0xf0
[ 8214.542374]  [<ffffffff81075c6d>] warn_slowpath_null+0x1d/0x20
[ 8214.542376]  [<ffffffff8115452e>] __alloc_pages_nodemask+0xc2e/0xda0
[ 8214.542377]  [<ffffffff81056ab2>] ? __x2apic_send_IPI_dest+0x32/0x40
[ 8214.542378]  [<ffffffff8104e08b>] ? native_send_call_func_single_ipi+0x1b/0x20
[ 8214.542381]  [<ffffffff810d7d59>] ? generic_exec_single+0x79/0x120
[ 8214.542382]  [<ffffffff81069410>] ? tlbflush_read_file+0x80/0x80
[ 8214.542384]  [<ffffffff811b1145>] new_slab+0xa5/0x620
[ 8214.542385]  [<ffffffff810d7edb>] ? smp_call_function_single+0xdb/0x150
[ 8214.542386]  [<ffffffff81069410>] ? tlbflush_read_file+0x80/0x80
[ 8214.542387]  [<ffffffff811b37b9>] ___slab_alloc.constprop.28+0x2e9/0x3c0
[ 8214.542388]  [<ffffffff811a96e1>] ? cmp_and_merge_page+0x1431/0x2860
[ 8214.542389]  [<ffffffff81069cdc>] ? flush_tlb_page+0x5c/0xb0
[ 8214.542391]  [<ffffffff8116ed35>] ? __dec_node_page_state+0x15/0x20
[ 8214.542392]  [<ffffffff811a96e1>] ? cmp_and_merge_page+0x1431/0x2860
[ 8214.542393]  [<ffffffff811b38bb>] __slab_alloc.isra.22.constprop.27+0x2b/0x40
[ 8214.542394]  [<ffffffff811b3a2e>] kmem_cache_alloc+0x15e/0x1a0
[ 8214.542395]  [<ffffffff811a96a4>] ? cmp_and_merge_page+0x13f4/0x2860
[ 8214.542396]  [<ffffffff811a96e1>] cmp_and_merge_page+0x1431/0x2860
[ 8214.542397]  [<ffffffff811ab116>] scan_vma_one_page+0x606/0x15d0
[ 8214.542398]  [<ffffffff81037989>] ? sched_clock+0x9/0x10
[ 8214.542399]  [<ffffffff811ac241>] uksm_do_scan+0x161/0x2c40
[ 8214.542401]  [<ffffffff810c1ae8>] ? del_timer_sync+0x48/0x50
[ 8214.542403]  [<ffffffff815a8597>] ? schedule_timeout+0x237/0x420
[ 8214.542404]  [<ffffffff811aee74>] uksm_scan_thread+0x154/0x180
[ 8214.542405]  [<ffffffff811aed20>] ? uksm_do_scan+0x2c40/0x2c40
[ 8214.542406]  [<ffffffff811aed20>] ? uksm_do_scan+0x2c40/0x2c40
[ 8214.542408]  [<ffffffff81094af8>] kthread+0xd8/0xf0
[ 8214.542410]  [<ffffffff8109d768>] ? finish_task_switch+0x88/0x330
[ 8214.542411]  [<ffffffff815a987f>] ret_from_fork+0x1f/0x40
[ 8214.542412]  [<ffffffff81094a20>] ? kthread_worker_fn+0x170/0x170
[ 8214.542413] ---[ end trace 5c744655a6541f9f ]---

naixia commented 8 years ago

@dolohow It's not the same bug that @TheCrazyLex encountered. It's a warning about misuse of a GFP flag passed to kmem_cache_alloc(). It's easy to fix. I'll update the patch for v4.8 soon.

naixia commented 8 years ago

@dolohow I've updated the patch. Please give it a try and see if it's fixed.

dolohow commented 8 years ago

Thanks, I will test it and give you feedback.

naixia commented 8 years ago

@dolohow Have you hit the kernel warning again?

dolohow commented 8 years ago

Not yet, hopefully it won't show up.
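
For anyone who wants to check a saved log for a recurrence, here is a minimal sketch that pulls the source location out of `kernel BUG at ...` and `WARNING: ... at ...` lines. The two patterns are modelled only on the traces posted in this thread, so other kernel versions may word these lines differently; `scan_dmesg` and `KernelReport` are hypothetical names for this illustration.

```python
import re
from typing import List, NamedTuple


class KernelReport(NamedTuple):
    kind: str   # "BUG" or "WARNING"
    file: str   # source file named in the log line
    line: int   # line number named in the log line


# Patterns based on the two traces in this thread (an assumption, not a spec).
BUG_RE = re.compile(r"kernel BUG at (\S+):(\d+)!")
WARN_RE = re.compile(r"WARNING: CPU: \d+ PID: \d+ at (\S+):(\d+)")


def scan_dmesg(text: str) -> List[KernelReport]:
    """Return BUG/WARNING source locations in the order they appear."""
    reports = []
    for log_line in text.splitlines():
        for kind, pattern in (("BUG", BUG_RE), ("WARNING", WARN_RE)):
            match = pattern.search(log_line)
            if match:
                reports.append(
                    KernelReport(kind, match.group(1), int(match.group(2)))
                )
                break
    return reports
```

Passing the text of a saved `dmesg` dump to `scan_dmesg` gives a list that is empty when neither kind of report is present.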

naixia commented 7 years ago

@TheCrazyLex any updates?

TheCrazyLex commented 7 years ago

I got a little distracted by other things. I'll retry soon

On 31.12.2016 at 2:06 AM, "naixia" notifications@github.com wrote:

@TheCrazyLex https://github.com/TheCrazyLex any updates?


naixia commented 7 years ago

Closed because there has been no feedback for a long time and I cannot reproduce it.