elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Client nodes don't join cluster back post OOM exception #8451

Closed satishmallik closed 9 years ago

satishmallik commented 10 years ago

In our setup, whenever a client node throws an OOM, it is booted out of the cluster and is never able to rejoin afterwards.

We saw the following in the logs:

java.lang.OutOfMemoryError: Java heap space
    at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
    at sun.nio.cs.StreamEncoder.<init>(StreamEncoder.java:195)
    at sun.nio.cs.StreamEncoder.<init>(StreamEncoder.java:175)
    at sun.nio.cs.StreamEncoder.forOutputStreamWriter(StreamEncoder.java:68)
    at java.io.OutputStreamWriter.<init>(OutputStreamWriter.java:133)
    at java.io.PrintStream.<init>(PrintStream.java:111)
    at java.io.PrintStream.<init>(PrintStream.java:175)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:384)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:473)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:203)
    at sun.net.www.http.HttpClient.New(HttpClient.java:290)
    at sun.net.www.http.HttpClient.New(HttpClient.java:306)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:995)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:931)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:849)
    at org.elasticsearch.marvel.agent.exporter.ESExporter.openConnection(ESExporter.java:325)
    at org.elasticsearch.marvel.agent.exporter.ESExporter.openExportingConnection(ESExporter.java:182)
    at org.elasticsearch.marvel.agent.exporter.ESExporter.exportXContent(ESExporter.java:248)
    at org.elasticsearch.marvel.agent.exporter.ESExporter.exportNodeStats(ESExporter.java:130)
    at org.elasticsearch.marvel.agent.AgentService$ExportingWorker.exportNodeStats(AgentService.java:349)
    at org.elasticsearch.marvel.agent.AgentService$ExportingWorker.run(AgentService.java:236)
    at java.lang.Thread.run(Thread.java:724)

............... [2014-11-04 09:34:50,844][DEBUG][action.search.type ] [ES-NODE-Q02] [3312] Failed to execute fetch phase org.elasticsearch.transport.NodeDisconnectedException: [ES-NODE-D02][inet[/10.0.0.8:9300]][search/phase/fetch/id] disconnected ...

[2014-11-04 09:41:35,957][DEBUG][action.admin.indices.alias.get] [ES-NODE-Q02] connection exception while trying to forward request to master node [[ES-NODE-M01][K13SBxlOQ06-axx0_UL1bg][es-node-m01][inet[/10.0.0.10:9300]]{updateDomain=0, tag=masternode, data=false, faultDomain=0, master=true}], scheduling a retry. Error: [org.elasticsearch.transport.NodeDisconnectedException: [ES-NODE-M01][inet[/10.0.0.10:9300]][indices/get/aliases] disconnected]

After the OOM exception, the client node is never able to rejoin the cluster. We had to restart the ES service on the client node to get it to join the cluster again.

Is it a known issue? Is it expected behavior?

clintongormley commented 10 years ago

Hi @satishmallik

After an OOM, the JVM is in an undefined state, and you should restart it. There is no point in continuing after that. The more important question that you need to answer is: why are you getting OOMs in the client?

It looks like you are running the Marvel exporter inside your node client, is that correct?

clintongormley commented 10 years ago

@satishmallik what else are you doing on this client node? how much heap does it have? I'd love to figure out what is using up all of the memory.

Could you send the output of these requests please:

curl localhost:9200/_nodes > nodes_info.json
curl localhost:9200/_nodes/stats > nodes_stats.json
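
Once the stats file is saved, a short script can pull out the heap figures for each node. This is a minimal sketch, assuming the response has the layout `nodes.<id>.jvm.mem.heap_used_in_bytes` and `heap_max_in_bytes` (field names may differ between Elasticsearch versions); `heap_usage` is a hypothetical helper name, not part of any Elasticsearch tooling.

```python
import json


def heap_usage(stats):
    """Return {node_name: heap used as a percentage of max heap}.

    Assumes each node entry carries jvm.mem.heap_used_in_bytes and
    jvm.mem.heap_max_in_bytes; nodes missing either field are skipped.
    """
    usage = {}
    for node_id, node in stats.get("nodes", {}).items():
        mem = node.get("jvm", {}).get("mem", {})
        used = mem.get("heap_used_in_bytes")
        limit = mem.get("heap_max_in_bytes")
        if used is None or not limit:
            continue
        usage[node.get("name", node_id)] = round(100.0 * used / limit, 1)
    return usage


if __name__ == "__main__":
    # nodes_stats.json is the file written by the second curl command.
    try:
        with open("nodes_stats.json") as fh:
            report = heap_usage(json.load(fh))
    except FileNotFoundError:
        report = {}
    # Print the fullest heaps first.
    for name, pct in sorted(report.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {pct}% of heap in use")
```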

satishmallik commented 10 years ago

Hi Clinton, we have a 9-node production cluster and a 3-node Marvel cluster. Of the 9 nodes, 3 are master nodes, 3 are data nodes, and 3 are client nodes. We hit the OOM on one of the client nodes. We have allocated 2G of RAM to the JVM on a client machine with 3.5G of RAM.

I am not able to attach the JSON files here. Can you share a location where I can upload them?

clintongormley commented 10 years ago

Hi @satishmallik

You can send them to me at clinton dot gormley at elasticsearch dot com

satishmallik commented 10 years ago

I sent them to your email address. Please have a look.

satishmallik commented 10 years ago

The Marvel exporter is running on each of the 9 nodes, including the client on which we saw the OOM.

clintongormley commented 10 years ago

Hi @satishmallik

I looked at your node info/stats. Couple of things to note:

And that's about all I could see. I didn't see anything unusual in the clients. Are you sending massive bulk requests through the clients? Or returning massive result sets?

I'd suggest doing a heap dump of a client that has full memory, to see what is taking up all of the heap.
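
One way to capture such a dump is the JDK's `jmap` tool. The sketch below only builds the command line and hands it to `subprocess`. The process id and output path are placeholders, and it assumes that `jmap` from the same JDK that runs Elasticsearch is on the PATH and is invoked as the same user as the Elasticsearch process.

```python
import subprocess


def jmap_dump_command(pid, out_file, live_only=True):
    """Build the argv for a binary heap dump of the JVM with the given pid."""
    opts = ["format=b", f"file={out_file}"]
    if live_only:
        # "live" asks jmap to dump only live (reachable) objects.
        opts.insert(0, "live")
    return ["jmap", "-dump:" + ",".join(opts), str(pid)]


def dump_heap(pid, out_file):
    """Run jmap; raises CalledProcessError if the dump fails."""
    subprocess.run(jmap_dump_command(pid, out_file), check=True)
```

The resulting `.hprof` file can then be opened in a heap analyzer to see which objects dominate the heap.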

satishmallik commented 10 years ago

Hi Clinton, thanks for looking into this.

A few questions:

1) Which JVMs are officially supported by Elasticsearch? With Azul we are seeing GC running quite frequently, and CPU/memory usage remains quite high.

2) I have to experiment with disabling swap on the ES nodes.

3) We had earlier deployed ES 1.1.3, which was throttling merges. We have already moved to ES 1.3.4, but we didn't delete our earlier index, which was created by ES 1.1.3. We are using Azure Blob store for our index storage.

Our client node is currently acting as both an ingestion and a query node. We are doing heavy bulk indexing and yes, we do return 1000 results. We are doing multi-term aggregation, but we are planning to move away from the multi-term aggregation model.

In general, how can we tell whether merges are being throttled during indexing? Apart from that, our segment count and deleted documents (25%) are quite high. We are not doing any optimization on the index. Any suggestions on controlling the segment count and deleted documents?

How can we control CPU/memory usage on the data nodes? The data nodes are dual-core machines with 14G of RAM.

clintongormley commented 10 years ago

> 1) Which JVMs are officially supported by Elasticsearch? With Azul we are seeing GC running quite frequently, and CPU/memory usage remains quite high.

OpenJDK and Oracle's Java

> 3) We had earlier deployed ES 1.1.3, which was throttling merges. We have already moved to ES 1.3.4, but we didn't delete our earlier index, which was created by ES 1.1.3. We are using Azure Blob store for our index storage.

Not sure what throughput you'll get on Azure, but it is worth trying to play with the throttling. By default we throttle to 20MB/s, but you can increase this value (see http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-store.html#store-throttling). You want to keep an eye on your search latencies, because search will fight with indexing for I/O. That's why we throttle by default.
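
As a rough illustration, the throttle can be raised through the cluster update settings API. This sketch assumes the 1.x setting name `indices.store.throttle.max_bytes_per_sec` and a node listening on `localhost:9200`. The helper names are hypothetical, and any value passed in (for example `"50mb"`) is an arbitrary example rather than a recommendation.

```python
import json
import urllib.request


def throttle_settings_body(max_bytes_per_sec, persistent=False):
    """Build the request body for PUT /_cluster/settings."""
    # Transient settings are lost on a full cluster restart; persistent ones survive it.
    scope = "persistent" if persistent else "transient"
    return {scope: {"indices.store.throttle.max_bytes_per_sec": max_bytes_per_sec}}


def apply_throttle(max_bytes_per_sec, host="http://localhost:9200"):
    """Send the setting to the cluster and return the parsed response."""
    body = json.dumps(throttle_settings_body(max_bytes_per_sec)).encode("utf-8")
    req = urllib.request.Request(
        host + "/_cluster/settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))
```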

> Our client node is currently acting as both an ingestion and a query node. We are doing heavy bulk indexing and yes, we do return 1000 results. We are doing multi-term aggregation, but we are planning to move away from the multi-term aggregation model.

ThousandS of results? or 1,000 total? If thousands, then you should look at using the scroll API instead.
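
The scroll pattern is a loop: run the search once with a `scroll` keep-alive, then keep asking for the next batch with the returned `_scroll_id` until a batch comes back empty. Below is a minimal sketch with the HTTP calls injected as functions, since the exact request format differs between Elasticsearch versions. `start_search` and `next_page` are hypothetical callables supplied by the caller, and only the `_scroll_id` and `hits.hits` response fields are assumed.

```python
def scroll_hits(start_search, next_page):
    """Yield every hit from a scrolled search, one batch at a time.

    start_search() -> first response dict (search issued with a scroll keep-alive)
    next_page(scroll_id) -> next response dict (scroll request)
    """
    response = start_search()
    while True:
        hits = response.get("hits", {}).get("hits", [])
        if not hits:
            # An empty batch means the scroll is exhausted.
            break
        for hit in hits:
            yield hit
        # Always use the most recent scroll id for the next request.
        response = next_page(response["_scroll_id"])
```

Because hits are yielded batch by batch, the client never holds the full result set in memory at once. If the first response of a given search type carries no hits, the loop would need adjusting to fetch the next page before checking for emptiness.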

> In general, how can we tell whether merges are being throttled during indexing? Apart from that, our segment count and deleted documents (25%) are quite high. We are not doing any optimization on the index. Any suggestions on controlling the segment count and deleted documents?

You can measure how much throttling is happening by looking at:

GET /_nodes/stats/indices/store,indexing

You shouldn't need to optimize. As long as there is enough I/O available, then the background merge process will handle things just fine.
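
To turn that stats response into numbers, something like the following could work. It is a sketch that assumes the response exposes `indices.store.throttle_time_in_millis` per node, and it treats `indices.indexing.throttle_time_in_millis` as optional because it may not exist in every version. `throttle_report` is a hypothetical helper.

```python
def throttle_report(stats):
    """Return {node_name: throttle times in milliseconds} from a node stats response.

    A value of None means the field was not present in the response.
    """
    report = {}
    for node_id, node in stats.get("nodes", {}).items():
        indices = node.get("indices", {})
        report[node.get("name", node_id)] = {
            "store_throttle_ms": indices.get("store", {}).get("throttle_time_in_millis"),
            "indexing_throttle_ms": indices.get("indexing", {}).get("throttle_time_in_millis"),
        }
    return report
```

Taking two samples some minutes apart and comparing them shows whether the throttle time is still growing while indexing is running.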

> How can we control CPU/memory usage on the data nodes? The data nodes are dual-core machines with 14G of RAM.

Use them less? :) You need to figure out where the memory is being used, etc. There's no simple answer to this. I suggest reading the Administration section of the Definitive Guide: http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/administration.html

clintongormley commented 9 years ago

No more info provided. Closing