elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

elasticsearch process causes CPU soft lockup (causing the server to hang) #30667

Closed oded-dd closed 6 years ago

oded-dd commented 6 years ago

Elasticsearch version: Version: 6.2.4-SNAPSHOT, Build: afa7fec/2018-03-23T00:31:31.900531Z

Plugins installed: [ ingest-attachment ingest-geoip mapper-murmur3 mapper-size repository-azure ]

JVM version: openjdk 1.8.0_171

OS version: Linux prod-elasticsearch-data-010 4.13.0-1011-azure #14-Ubuntu SMP Thu Feb 15 16:15:39 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior: We have an Elasticsearch cluster running. After the process has been up and running for a while (2-3 months), we experience a CPU soft lockup. A soft lockup is the symptom of a task or kernel thread using and not releasing a CPU for a longer period of time than allowed.

The technical reason behind a soft lockup involves CPU interrupts and the NMI watchdog. For each online CPU in the system, a watchdog kernel thread is created with the highest scheduling priority possible. This thread "wakes up" once per second, gets the current timestamp for the CPU it is responsible for, and saves it into a per-CPU data structure. A separate interrupt calls softlockup_tick(), which compares the current time to the timestamp last recorded by the watchdog. If the current time exceeds that saved timestamp by more than watchdog_thresh (softlockup_thresh on older kernels), a soft lockup is reported, because the real-time watchdog thread could not get onto the CPU. One common example is a thread with a higher priority than the watchdog thread attempting to acquire a spinlock: it can hold the CPU long enough for soft lockups to be reported. We will discuss this example further, but please note that many conditions can lead to a soft lockup; they do not always involve spinlock waits.
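To make the detection mechanism concrete, here is a minimal user-space sketch of the check described above. It is not the kernel's actual implementation; `NR_CPUS`, `per_cpu_touch_ts`, `watchdog_touch()` and `is_soft_lockup()` are illustrative stand-ins for the kernel's per-CPU bookkeeping.

```c
/* Simplified sketch of the soft-lockup check -- not real kernel code. */
#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define NR_CPUS 8                               /* illustrative CPU count */

static unsigned long per_cpu_touch_ts[NR_CPUS]; /* last time the watchdog thread ran, per CPU */
static unsigned long watchdog_thresh = 23;      /* seconds; stands in for kernel.watchdog_thresh */

/* Monotonic clock in whole seconds. */
static unsigned long now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (unsigned long)ts.tv_sec;
}

/* The per-CPU watchdog thread calls this whenever it actually gets scheduled. */
static void watchdog_touch(int cpu)
{
    per_cpu_touch_ts[cpu] = now_seconds();
}

/* The periodic timer/interrupt check calls this for its CPU. */
static bool is_soft_lockup(int cpu)
{
    /* If the high-priority watchdog thread has not run for longer than the
     * threshold, some task has been hogging the CPU without yielding. */
    return now_seconds() - per_cpu_touch_ts[cpu] > watchdog_thresh;
}

int main(void)
{
    watchdog_touch(0);
    /* Immediately after a touch, CPU 0 is of course not locked up. */
    printf("cpu0 soft lockup? %d\n", is_soft_lockup(0));
    return 0;
}
```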

A spinlock is a synchronization mechanism used to protect a resource -- typically a data structure -- from concurrent access by multiple threads. Unlike other synchronization mechanisms, a thread waiting on a spinlock continuously polls the lock until it obtains it. The spinlock is held until the thread releases it -- presumably because it no longer needs access to the resource.

Spinlocks are efficient when the lock is held for only a brief period of time, but because they prevent the CPU core from performing other work while the thread waits, they are inefficient when the thread must wait for an extended period.
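To illustrate the busy-waiting described above, here is a minimal test-and-set spinlock sketch using C11 atomics. It is deliberately simplified and is not the kernel's spinlock implementation; the waiting thread simply burns its CPU core until the flag is released.

```c
/* Minimal test-and-set spinlock sketch (C11 atomics) -- illustrative only. */
#include <stdatomic.h>

typedef struct {
    atomic_flag locked;
} spinlock_t;

#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

static void spin_lock(spinlock_t *lock)
{
    /* The waiting thread keeps its CPU busy, polling until the flag clears.
     * If the holder keeps the lock for many seconds, this loop is exactly
     * the kind of non-yielding work that trips the soft-lockup watchdog. */
    while (atomic_flag_test_and_set_explicit(&lock->locked, memory_order_acquire))
        ; /* spin */
}

static void spin_unlock(spinlock_t *lock)
{
    atomic_flag_clear_explicit(&lock->locked, memory_order_release);
}

int main(void)
{
    spinlock_t lock = SPINLOCK_INIT;

    spin_lock(&lock);
    /* ... critical section: access the protected data structure ... */
    spin_unlock(&lock);
    return 0;
}
```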

The kernel reports a soft lockup warning when it detects that a CPU has been busy for a long time without releasing a spinlock. Threads should not sleep while holding a spinlock, as that might lead to deadlock. Because the spinning threads do not sleep, the NMI watchdog thread never gets to run, the per-CPU timestamp never gets updated, and the kernel detects the condition and produces the warning.

In our case we see that the watchdog daemon sends a non-maskable interrupt (NMI) to all CPUs in the system, which in turn print the stack traces of their currently running tasks. When the interrupt is served it is given high priority: all processes are stopped and the entire CPU is used to serve the interrupt, which is why we experience the VM hang.

The VM hangs for 23s, immediately after which the soft lockup is triggered, and it takes another 23s for the VM to recover from the hang. This repeats over and over until we eventually have to manually reboot the server.

Steps to reproduce: N/A. It has already happened on 5 of the machines in our cluster.

Provide logs (if relevant):

May 14 05:24:21 localhost kernel: [6006808.160001] watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [java:5783]

May 14 05:24:21 localhost kernel: [6006808.160055] Modules linked in: ufs msdos xfs ip6table_filter ip6_tables iptable_filter nf_conntrack_ipv4 nf_defrag_ipv4 xt_owner xt_conntrack nf_conntrack iptable_security ip_tables x_tables udf crc_itu_t i2c_piix4 hv_balloon joydev i2c_core serio_raw ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi autofs4 btrfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd glue_helper hid_hyperv cryptd hyperv_fb pata_acpi hv_utils cfbfillrect cfbimgblt ptp cfbcopyarea hid hyperv_keyboard hv_netvsc pps_core

May 14 05:24:21 localhost kernel: [6006808.160055] CPU: 5 PID: 5783 Comm: java Not tainted 4.13.0-1011-azure #14-Ubuntu

May 14 05:24:21 localhost kernel: [6006808.160055] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090007 06/02/2017

May 14 05:24:21 localhost kernel: [6006808.160055] task: ffff8b91a48fc5c0 task.stack: ffffb5c4cd014000

May 14 05:24:21 localhost kernel: [6006808.160055] RIP: 0010:fsnotify+0x1f9/0x4f0

May 14 05:24:21 localhost kernel: [6006808.160055] RSP: 0018:ffffb5c4cd017e08 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff0c

May 14 05:24:21 localhost kernel: [6006808.160055] RAX: 0000000000000001 RBX: ffff8ba0f6246020 RCX: 00000000ffffffff

May 14 05:24:21 localhost kernel: [6006808.160055] RDX: ffff8ba0f6246048 RSI: 0000000000000000 RDI: ffffffff9bc57020

May 14 05:24:21 localhost kernel: [6006808.160055] RBP: ffffb5c4cd017ea8 R08: 0000000000000000 R09: 0000000000000000

May 14 05:24:21 localhost kernel: [6006808.160055] R10: ffffe93042d21080 R11: 0000000000000000 R12: 0000000000000000

May 14 05:24:21 localhost kernel: [6006808.160055] R13: ffff8ba0f6246048 R14: 0000000000000000 R15: 0000000000000000

May 14 05:24:21 localhost kernel: [6006808.160055] FS: 00007f154838a700(0000) GS:ffff8ba0fd740000(0000) knlGS:0000000000000000

May 14 05:24:21 localhost kernel: [6006808.160055] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033

May 14 05:24:21 localhost kernel: [6006808.160055] CR2: 00007f3fcc254000 CR3: 0000000165c38000 CR4: 00000000001406e0

May 14 05:24:21 localhost kernel: [6006808.160055] Call Trace:

May 14 05:24:21 localhost kernel: [6006808.160055] ? new_sync_write+0xe5/0x140

May 14 05:24:21 localhost kernel: [6006808.160055] vfs_write+0x15a/0x1b0

May 14 05:24:21 localhost kernel: [6006808.160055] ? syscall_trace_enter+0xcd/0x2f0

May 14 05:24:21 localhost kernel: [6006808.160055] SyS_write+0x55/0xc0

May 14 05:24:21 localhost kernel: [6006808.160055] do_syscall_64+0x61/0xd0

May 14 05:24:21 localhost kernel: [6006808.160055] entry_SYSCALL64_slow_path+0x25/0x25

May 14 05:24:21 localhost kernel: [6006808.160055] RIP: 0033:0x7f489076a2dd  # "This indicates the soft lockup error message stored in the RIP register"

May 14 05:24:21 localhost kernel: [6006808.160055] RSP: 002b:00007f15483872a0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001

May 14 05:24:21 localhost kernel: [6006808.160055] RAX: ffffffffffffffda RBX: 00007f1548389380 RCX: 00007f489076a2dd

May 14 05:24:21 localhost kernel: [6006808.160055] RDX: 0000000000001740 RSI: 00007f1548387310 RDI: 000000000000063c

May 14 05:24:21 localhost kernel: [6006808.160055] RBP: 00007f15483872d0 R08: 00007f1548387310 R09: 00007f418a55b0b8

May 14 05:24:21 localhost kernel: [6006808.160055] R10: 00000000005b31ee R11: 0000000000000293 R12: 0000000000001740

May 14 05:24:21 localhost kernel: [6006808.160055] R13: 00007f1548387310 R14: 000000000000063c R15: 00007f40000051e0

jasontedor commented 6 years ago

This does not appear to be an Elasticsearch issue, but rather a kernel/hypervisor issue. See this issue for a similar situation on the same kernel. You can consider increasing kernel.watchdog_thresh (this is only a guess). I am closing this as it is not an Elasticsearch issue, and not actionable for us.
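For anyone landing here, the tunable mentioned above is exposed as /proc/sys/kernel/watchdog_thresh (the file behind the kernel.watchdog_thresh sysctl). A minimal sketch of raising it follows; the value 30 is only an arbitrary illustration, not a recommendation.

```c
/* Sketch: raise kernel.watchdog_thresh by writing its /proc/sys file.
 * The value 30 (seconds) is an arbitrary example, not a recommendation. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/kernel/watchdog_thresh", "w");
    if (f == NULL) {
        perror("open /proc/sys/kernel/watchdog_thresh (requires root)");
        return 1;
    }
    fprintf(f, "30\n");   /* new watchdog threshold in seconds */
    fclose(f);
    return 0;
}
```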

oded-dd commented 6 years ago

@jasontedor thank you for your reply. It appears to be a kernel issue on Azure which should be fixed in the next kernel update. FYI, in case you have other customers experiencing the same behavior: https://bugs.launchpad.net/ubuntu/+source/linux-azure/+bug/1772264