elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

High CPU and Memory usage by elasticsearch #11822

Closed darshanmehta10 closed 8 years ago

darshanmehta10 commented 9 years ago

We are using the Java client to communicate with Elasticsearch (search requests only). Once the application is started, CPU and memory usage keep going up and don't come down unless the process is restarted.

The thread dump shows around 120 blocked threads with the stack trace below:

Thread 26356: (state = BLOCKED)
 - java.lang.Thread.sleep(long) @bci=0 (Compiled frame; information may be imprecise)
 - org.elasticsearch.common.netty.util.HashedWheelTimer$Worker.waitForNextTick() @bci=81, line=445 (Compiled frame)
 - org.elasticsearch.common.netty.util.HashedWheelTimer$Worker.run() @bci=43, line=364 (Compiled frame)
 - org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run() @bci=55, line=108 (Interpreted frame)
 - java.lang.Thread.run() @bci=11, line=724 (Interpreted frame)

We are using a TransportClient object and close it once the search is done. Following is the hot_threads output:

0.2% (795.3micros out of 500ms) cpu usage by thread 'elasticsearch[es1][scheduler][T#1]'
 10/10 snapshots sharing following 9 elements
   sun.misc.Unsafe.park(Native Method)
   java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
   java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
   java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:1090)
   java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:807)
   java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   java.lang.Thread.run(Thread.java:745)

0.1% (647.2micros out of 500ms) cpu usage by thread 'elasticsearch[es1][management][T#4]'
 10/10 snapshots sharing following 9 elements
   sun.misc.Unsafe.park(Native Method)
   java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
   java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:731)
   java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:644)
   java.util.concurrent.LinkedTransferQueue.poll(LinkedTransferQueue.java:1145)
   java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   java.lang.Thread.run(Thread.java:745)

0.1% (489.6micros out of 500ms) cpu usage by thread 'elasticsearch[es1][[transport_server_worker.default]][T#5]{New I/O worker #14}'
 10/10 snapshots sharing following 15 elements
   sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
   sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
   sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
   sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
   sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
   org.elasticsearch.common.netty.channel.socket.nio.SelectorUtil.select(SelectorUtil.java:68)
   org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.select(AbstractNioSelector.java:415)
   org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:212)
   org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
   org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
   org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
   org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
   java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
   java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
   java.lang.Thread.run(Thread.java:745)

Before closing the TransportClient, we also shut down the associated thread pool (shutdownNow()) to kill the threads, but there are still many blocked threads in the thread dump.

Let me know if anything else is required.

Thanks

clintongormley commented 9 years ago

Hi @darshanmehta10

What version are you on, and what does your code look like?

darshanmehta10 commented 9 years ago

Hi,

We are using Elasticsearch 1.4.2. We have a Java API which connects to a remote Elasticsearch server, executes the search and closes the TransportClient. The TransportClient object is cached for some time and is closed when it is evicted from the cache. Following is the code that creates the TransportClient:

// Build settings for the target cluster and point the client at its transport address
Settings settings = ImmutableSettings.settingsBuilder()
        .put("cluster.name", elasticSearch.getClusterName())
        .build();
TransportClient tc = new TransportClient(settings);
tc.addTransportAddress(new InetSocketTransportAddress(elasticSearch.getHost(), elasticSearch.getPort()));

Following is the code that closes the TransportClient:

tc.threadPool().shutdownNow();
tc.close();
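
A variant we could also try is to wait for the pool to drain before closing (a sketch only; it assumes ThreadPool in 1.4.2 exposes awaitTermination(long, TimeUnit), and the 10-second timeout is arbitrary):

tc.threadPool().shutdownNow(); // interrupt the pool's running tasks
try {
    // give the pool a bounded window to terminate (assumes awaitTermination is available)
    tc.threadPool().awaitTermination(10, java.util.concurrent.TimeUnit.SECONDS);
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}
tc.close();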

Below is a screenshot of a heap dump taken 3 hours after the application restart (CPU: 11%, memory: 18%): [screenshot hd1]. It contains 60k HashedWheelBucket objects, and the thread dump shows 120 blocked threads of the same kind.

Below is a screenshot of a heap dump taken 6 hours after the application restart (CPU: 18%, memory: 21%): [screenshot hd2]. There are now 94k such objects and 180 blocked threads.

It seems that closing the TransportClient doesn't kill off its threads even though we explicitly shut the thread pool down. This causes memory and CPU usage to eventually reach 100%, at which point a restart is required.

clintongormley commented 9 years ago

I know that a number of improvements have been made to the transport client in recent versions. Not sure if this is the same thing or not. /cc @spinscale ?

@darshanmehta10 Would probably be useful to have some examples of what you're doing between opening and closing the connection too.

darshanmehta10 commented 9 years ago

Hello,

Below is the method which uses the TransportClient object to run a query:

public Object execute(SearchQuery query, TransportClient client) {
    Object result;
    // Wrap the TransportClient in a Spring Data Elasticsearch template for this call
    ElasticsearchTemplate elasticSearchTemplate = new ElasticsearchTemplate(client);
    // Return the raw SearchResponse so it can be post-processed below
    SearchResponse response = elasticSearchTemplate.query(query, new ResultsExtractor<SearchResponse>() {
        @Override
        public SearchResponse extract(SearchResponse response) {
            return response;
        }
    });

    result = processResponse(response); // Extracts data from the response
    return result;
}

We are using the Spring wrapper around Ehcache (EhCacheCache) to store the TransportClient object and have configured a listener that closes the client in the overridden notifyElementRemoved, notifyElementExpired and notifyElementEvicted methods, roughly as sketched below.
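
The listener looks roughly like this (a simplified sketch; the class name is made up and some details are omitted; it assumes Ehcache 2.x's CacheEventListenerAdapter):

import net.sf.ehcache.CacheException;
import net.sf.ehcache.Ehcache;
import net.sf.ehcache.Element;
import net.sf.ehcache.event.CacheEventListenerAdapter;
import org.elasticsearch.client.transport.TransportClient;

// Sketch (hypothetical class name): close the cached TransportClient whenever its cache entry goes away.
public class TransportClientClosingListener extends CacheEventListenerAdapter {

    @Override
    public void notifyElementRemoved(Ehcache cache, Element element) throws CacheException {
        closeClient(element);
    }

    @Override
    public void notifyElementExpired(Ehcache cache, Element element) {
        closeClient(element);
    }

    @Override
    public void notifyElementEvicted(Ehcache cache, Element element) {
        closeClient(element);
    }

    private void closeClient(Element element) {
        Object value = element.getObjectValue();
        if (value instanceof TransportClient) {
            TransportClient client = (TransportClient) value;
            client.threadPool().shutdownNow();
            client.close();
        }
    }
}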

A day after the last restart (24 hours), the API is using 32% CPU and 22% memory.

Please let me know if anything else is required.

spinscale commented 9 years ago

Are you sure that you are instantiating only a single TransportClient? Can you try the same by reusing one TransportClient object? I just want to rule out that you have started dozens of TransportClients.
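
Something along these lines is what I mean by reusing one client (just a sketch; the holder class and its names are made up, and it assumes a single, known cluster):

import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

// Hypothetical holder: one TransportClient per JVM, created lazily and reused for every search.
public final class SharedClientHolder {

    private static volatile TransportClient client;

    private SharedClientHolder() {}

    public static TransportClient get(String clusterName, String host, int port) {
        if (client == null) {
            synchronized (SharedClientHolder.class) {
                if (client == null) {
                    Settings settings = ImmutableSettings.settingsBuilder()
                            .put("cluster.name", clusterName)
                            .build();
                    client = new TransportClient(settings)
                            .addTransportAddress(new InetSocketTransportAddress(host, port));
                }
            }
        }
        return client;
    }
}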

You should also check with netstat how many open connections there are.

The reason why I have the feeling that you are instantiating thousands of TransportClients is that each client usually has only one HashedWheelTimer, but you have tons of them.

darshanmehta10 commented 9 years ago

We are not using a single TransportClient. We can't really reuse the same TransportClient because we get the Elasticsearch cluster details at runtime. So we do the following when our API receives a request to query Elasticsearch:

We are using Spring Ehcache with a 30-minute timeout and close the TransportClient object when it is evicted. However, we believe that closing the TransportClient (and explicitly shutting down the thread pool) doesn't kill off all the request threads, which is why there are tons of HashedWheelTimer objects. netstat also shows plenty of open connections, which ideally should not be the case.

We found the following in the close() method of TransportClient: injector.getInstance(ThreadPool.class).shutdown(); This means the thread pool won't accept new tasks but will still let the existing ones finish. Maybe the existing threads are hung/blocked, and the count of such threads keeps increasing every time a new TransportClient is created?
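
To illustrate what we mean (a self-contained JDK example, not Elasticsearch code): shutdown() only stops new submissions and lets running tasks finish, and even shutdownNow() only interrupts them, so a task that ignores interruption keeps its thread alive:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ShutdownDemo {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(1);

        // A task that sleeps in a loop and ignores interrupts, similar to a hung worker thread.
        pool.submit(new Runnable() {
            @Override
            public void run() {
                while (true) {
                    try {
                        Thread.sleep(1000);
                    } catch (InterruptedException ignored) {
                        // interrupt swallowed, so even shutdownNow() cannot stop this task
                    }
                }
            }
        });

        pool.shutdown();    // no new tasks accepted; the running task keeps going
        pool.shutdownNow(); // running task is interrupted, but it ignores the interrupt
        boolean terminated = pool.awaitTermination(2, TimeUnit.SECONDS);
        System.out.println("Pool terminated: " + terminated); // prints false; the worker thread is still alive
        // Note: the JVM does not exit here because the pool thread is non-daemon and never finishes.
    }
}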

spinscale commented 9 years ago

You can try with Elasticsearch 1.5 and above. We changed the closing of thread pools there, see here.

eskibars commented 8 years ago

I'm going to close this since we haven't heard anything on it in over 6 months and it was potentially solved with #7868. If there's still an issue, we can reopen.