elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Bizarre 100% CPU Utilization on 5.X - Windows 2008R2 #22202

Closed trevorndodds closed 7 years ago

trevorndodds commented 7 years ago

Hi, I'm running Elasticsearch 5.1.1 (previously 5.0.0) on Windows 2008 R2. I started on JVM 1.8.102 and then updated to JVM 1.8.112; the problem occurs on both.

I upgraded from 2.4.1 to 5.0 and then to 5.1.1. Since moving to 5.0, nodes have been dropping out of the cluster. I have 6 data nodes, and each one exhibits exactly the same problem, but at random times. My ingest and master nodes do not exhibit this issue, but as soon as I enable an ingest node as a data node it starts to drop. I never had this issue on 2.4.1.

Logging doesn't show anything other than the node leaving and rejoining. I turned logging up to debug, but the result is the same: during a spike the log stops outputting anything, and even the usual 10s Marvel metrics shipping stops logging. I'm not able to bring up hot threads via port 9200, as that connection times out. Even a jstack dump hangs until the Java process starts to normalize. The 100% CPU spike can last from a few seconds to a few minutes, but it does recover on its own. Disk I/O is essentially nil. I've reduced my cluster to 2 data nodes, made them managers, and removed 95% of my indexes to leave only the current daily indexes; the problem persists. No searches are occurring. I'm now starting to disable all other Windows services to see if there's some conflict.

I only have Process Explorer output and the thread dump taken as soon as jstack was able to connect.

These are all the threads in java.exe when running at 100%:

image

Some stack views from TIDs 10224 and 2936, in case they help.

image

threadDumps.txt

clintongormley commented 7 years ago

Hiya @trevorndodds

You say:

> as soon as I enable an ingest node as a data node

Do you mean ingest node as in the ingest processor? Or do you just mean the node receiving the request? If you mean the ingest processor, do you see the same problem when you have data only nodes or ingest only nodes? What do your ingest pipelines look like?

I assume you don't have any other processes running on the box which could produce this CPU spike?

Are you able to capture a hot-threads dump while the box is at 100% CPU?
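For anyone following along, the hot-threads dump requested here comes from the nodes hot threads API (`GET /_nodes/hot_threads`). Below is a minimal Python sketch, assuming the node listens on `http://localhost:9200` without authentication; the helper names `hot_threads_url` and `fetch_hot_threads` are hypothetical and exist only for this illustration. It uses a short client-side timeout so the caller gets `None` instead of hanging when the node is unresponsive.

```python
import urllib.parse
import urllib.request


def hot_threads_url(base_url, threads=10, interval="500ms", kind="cpu"):
    """Build the hot threads URL (helper name is hypothetical)."""
    query = urllib.parse.urlencode(
        {"threads": threads, "interval": interval, "type": kind}
    )
    return f"{base_url.rstrip('/')}/_nodes/hot_threads?{query}"


def fetch_hot_threads(base_url, timeout=5.0, **params):
    """Return the plain-text hot threads report, or None on any network error.

    URLError and socket timeouts are both OSError subclasses, so a node that
    refuses or stalls the connection yields None rather than an exception.
    """
    try:
        url = hot_threads_url(base_url, **params)
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except OSError:
        return None


if __name__ == "__main__":
    # Assumed address; adjust to the node under investigation.
    report = fetch_hot_threads("http://localhost:9200")
    print(report if report is not None else "node did not answer in time")
```

Running this in a loop during a suspected spike at least records when the HTTP layer stopped answering, even if no report can be captured.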

trevorndodds commented 7 years ago

@clintongormley my ingest nodes are not doing any processing just receiving the requests. What I meant was that if I turn a perfectly working node into a data node the instability issues occur. So the problem only seems to occur on nodes that also have the data role.

> I assume you don't have any other processes running on the box which could produce this CPU spike?

No, the spike is from the elasticsearch java process only.

> Are you able to capture a hot-threads dump while the box is at 100% CPU?

Nope, access to the Java process as a whole is locked up; jstack isn't even able to connect. I've been running Java VisualVM to see if I can capture anything live while I'm connected.
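As an illustration of the jstack attempts described above, here is a minimal Python sketch that runs `jstack -l <pid>` with a timeout, so a hung attach gives up instead of blocking. It assumes a JDK 8 `jstack` on the `PATH`; the function names and the `runner` parameter are hypothetical and only there to make the sketch testable. The `-F` (force) option is a JDK 8 `jstack` flag and may not exist in later JDKs.

```python
import subprocess


def jstack_command(pid, force=False):
    """Build the jstack argument list; -l adds lock information."""
    cmd = ["jstack"]
    if force:
        cmd.append("-F")  # JDK 8 option for a process that does not respond
    cmd += ["-l", str(pid)]
    return cmd


def try_thread_dump(pid, timeout=10.0, force=False, runner=subprocess.run):
    """Return the dump text, or None if jstack hangs, fails, or is missing."""
    try:
        result = runner(
            jstack_command(pid, force=force),
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return None
    return result.stdout if result.returncode == 0 else None
```

Calling `try_thread_dump(pid)` every few seconds and saving each non-`None` result with a timestamp would show how long the process stayed unreachable, which is roughly what was observed by hand here.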

On the positive side, I disabled many Windows services and so far I've been stable for 16 hours. It seems there could be some issue related to WMI, but I have no idea how WMI could be causing the Java process to max out the CPU.

Trevor

jasontedor commented 7 years ago

I think this duplicates #21834?

trevorndodds commented 7 years ago

Very possible, but I had no issues on 2.4.1

Also for me indexing completely stops until the node is evicted from the cluster (after timeout).

trevorndodds commented 7 years ago

Definitely somehow related to the Windows Management Instrumentation service. I started that service back up and within a few minutes the elasticsearch java process shot up.

image

trevorndodds commented 7 years ago

The problem in my case seems to be related to SCOM (System Center Operations Manager). I'm able to have the WMI service running without issues, but as soon as the SCOM healthservice starts running, the elasticsearch java process shoots up. So this is similar to #21834, in that SCOM is querying the WMI service heavily.

s1monw commented 7 years ago

@jasontedor do you think we can close this issue? It seems @trevorndodds found a reason for this on his side?

s1monw commented 7 years ago

@trevorndodds a colleague pointed out to me that SCOM has some Java monitoring capabilities. Do you have that enabled?

trevorndodds commented 7 years ago

@s1monw you can close it, thanks. Unfortunately, looking into SCOM is out of my control.