elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Elasticsearch 1.5.2/2.3.4 periodic CPU spikes on windows 2012 #21834

Closed oldenbur closed 7 years ago

oldenbur commented 7 years ago

Elasticsearch version: 1.5.2 and 2.3.4

JVM version: 1.8.0_92

OS version: Windows 2012

Description of the problem including expected versus actual behavior: Periodically, an Elasticsearch instance running on some Windows systems experiences a sudden spike in CPU usage in the Elasticsearch process, and the indexing rate slows. Here is a Java Mission Control flight recording that includes such a period: https://dl.dropboxusercontent.com/u/90795372/flight_recording_Elasticsearch4000.jfr

The configuration for the affected instances uses the default value for index.store.type.

Using information from the flight recording, we found a netty issue that might be relevant: https://github.com/netty/netty/issues/3857

We suspect that the problem is related to other processes on the system interacting with WMI, because when those processes are stopped, the CPU spikes stop occurring.

Steps to reproduce:

  1. Install Elasticsearch 1.5.2 or 2.3.4 on a Windows system and apply indexing load
  2. Start another process that interacts with WMI
  3. Observe that the Elasticsearch process CPU usage periodically increases dramatically and the indexing rate slows
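
For step 1, one possible way to apply indexing load is to post batches to the `_bulk` REST endpoint in a loop. The following is only a minimal sketch, not the load generator used when the problem was observed: the index name `loadtest`, the type name `doc`, the document shape, the batch sizes, and the `http://localhost:9200` address are all assumptions made for illustration.

```python
import json
import urllib.request


def build_bulk_body(index, doc_type, docs):
    """Build a newline-delimited _bulk request body that indexes each document.

    The index and type names are supplied by the caller; the values used in
    run_load() below are hypothetical.
    """
    lines = []
    for doc in docs:
        # Action line: tells Elasticsearch to index the document that follows.
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        # Source line: the document itself.
        lines.append(json.dumps(doc))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


def send_bulk(base_url, body):
    """POST one bulk body to the node and return the parsed JSON response."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/_bulk",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def run_load(base_url="http://localhost:9200", batches=1000, batch_size=500):
    """Send the same batch repeatedly (assumed address and sizes)."""
    docs = [{"seq": i, "message": "load test document"} for i in range(batch_size)]
    body = build_bulk_body("loadtest", "doc", docs)
    for _ in range(batches):
        result = send_bulk(base_url, body)
        if result.get("errors"):
            print("bulk request reported errors")
```

Calling `run_load()` against a test node while the WMI-consuming process from step 2 is running should make any slowdown in indexing visible as longer request times.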

Provide logs (if relevant): In the attached log usaseclm01.log.txt, the user reported two periods where CPU activity spiked to 100%: between 16:35 and 16:44, then again between 16:55 and 17:08. During both of these windows, Elasticsearch appears to have been performing segment merges lasting roughly 20 minutes and was otherwise unresponsive. We infer this because, during normal activity, a set of indices is deleted and re-created every 5 minutes (this can be seen elsewhere in the log file), and that cycle is interrupted during these windows.

usaseclm01.log.txt
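
To locate windows like these in a log, one option is to scan for the recurring delete/re-create entries and report any interval between consecutive entries that is longer than expected. This is a rough sketch under two assumptions: that each log line starts with a timestamp of the form `[YYYY-MM-DD HH:MM:SS,mmm]`, and that the recurring entries can be identified by a fixed substring (the `marker` argument, which is hypothetical and must be chosen by looking at the actual log).

```python
import re
from datetime import datetime, timedelta

# Assumed line prefix: [2016-01-01 16:35:00,123]
TIMESTAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\]")


def find_gaps(lines, marker, max_gap):
    """Return (previous, current) timestamp pairs for consecutive marker
    lines that are more than max_gap apart."""
    gaps = []
    previous = None
    for line in lines:
        match = TIMESTAMP.match(line)
        # Skip lines without a leading timestamp or without the marker text.
        if not match or marker not in line:
            continue
        current = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        if previous is not None and current - previous > max_gap:
            gaps.append((previous, current))
        previous = current
    return gaps
```

For a cycle expected every 5 minutes, passing the lines of usaseclm01.log.txt with `max_gap=timedelta(minutes=6)` would list the stretches in which the cycle did not run, which can then be compared with the reported CPU spikes.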

jasontedor commented 7 years ago

I do not think this is Elasticsearch-specific, but rather due to WMI and SCOM.