When I restarted the cluster, some of the primaries were restored properly, but some recovery tasks hang.
They are stuck in the 'TRANSLog'.replace('Log','LOG') stage for more than an hour, which is unexpected since I ran `_flush` before the restart, so there should be no remaining translog to replay.
The hot threads API shows the following:
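For reference, the restart procedure was roughly the following sketch (assuming curl against a node on the default HTTP port 9200; the allocation setting matches what appears in the logs further down):

```shell
# Disable shard allocation so shards are not shuffled around while nodes restart
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'

# Flush every index so the translog should be empty at shutdown
curl -XPOST 'http://localhost:9200/_flush'

# ... restart the nodes ...

# Re-enable allocation once the nodes have rejoined
curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'
```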
```
   0.0% (97.7micros out of 500ms) cpu usage by thread 'elasticsearch[IP][transport_client_timer][T#1]{Hashed wheel timer #1}'
     10/10 snapshots sharing following 5 elements
       java.lang.Thread.sleep(Native Method)
       org.jboss.netty.util.HashedWheelTimer$Worker.waitForNextTick(HashedWheelTimer.java:445)
       org.jboss.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:364)
       org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
```
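The hot threads output above was captured with a call along these lines (default HTTP port assumed):

```shell
curl -s 'http://localhost:9200/_nodes/hot_threads'
```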
Meanwhile, `/_cat/recovery` shows a lot of lines like the following.
Some relevant error logs from the data nodes:
```
[2016-05-12 11:54:30,666][WARN ][cluster.service ] [IP] failed to connect to node [{IP}{P9z8Kl0SSV6d8VaZWViNLg}{IP}{IP:9300}{box_type=hdd, river=_none, master=false, node_id=IP}]
ConnectTransportException[[IP][IP:9300] connect_timeout[30s]]; nested: SocketException[Connection reset by peer];
    at org.elasticsearch.transport.netty.NettyTransport.connectToChannels(NettyTransport.java:940)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:855)
    at org.elasticsearch.transport.netty.NettyTransport.connectToNode(NettyTransport.java:828)
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:243)
    at org.elasticsearch.cluster.service.InternalClusterService$UpdateTask.run(InternalClusterService.java:474)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:225)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:188)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketException: Connection reset by peer
    at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
    at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
    at org.jboss.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:152)
    at org.jboss.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105)
    at org.jboss.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79)
    at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
    at org.jboss.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
    at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    ... 3 more
```
```
[2016-05-12 11:59:37,265][INFO ][indices.breaker ] [IP] Updated breaker settings fielddata: [fielddata,type=MEMORY,limit=644245094/614.3mb,overhead=1.03]
[2016-05-12 11:59:37,266][INFO ][cluster.routing.allocation.decider] [IP] updating [cluster.routing.allocation.enable] from [ALL] to [NONE]
[2016-05-12 12:00:01,034][WARN ][netty.channel.socket.nio.AbstractNioSelector] Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space
    at java.lang.Integer.valueOf(Integer.java:832)
    at sun.nio.ch.EPollSelectorImpl.updateSelectedKeys(EPollSelectorImpl.java:106)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:84)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
    at org.jboss.netty.channel.socket.nio.SelectorUtil.select(SelectorUtil.java:68)
    at org.jboss.netty.channel.socket.nio.AbstractNioSelector.select(AbstractNioSelector.java:434)
    at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:212)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
    at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
[2016-05-12 12:00:54,273][WARN ][netty.channel.socket.nio.AbstractNioSelector] Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: Java heap space
[2016-05-12 12:00:54,940][DEBUG][action.admin.indices.stats] [IP] [indices:monitor/stats] failed to execute operation for shard [[log-2016-04-13][3], node[kji2YtUgQ_O6MCh40Tb26A], [P], v[15], s[INITIALIZING], a[id=ZggyMnC7Sfm3ZHhYv2YxUg], unassigned_info[[reason=CLUSTER_RECOVERED], at[2016-05-12T02:59:33.040Z]]]
[log-2016-04-13][[log-2016-04-13][3]] BroadcastShardOperationFailedException[operation indices:monitor/stats failed]; nested: IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]];
    at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:399)
    at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:376)
    at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.messageReceived(TransportBroadcastByNodeAction.java:365)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.doRun(MessageChannelHandler.java:299)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: [log-2016-04-13][[log-2016-04-13][3]] IllegalIndexShardStateException[CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED, RELOCATED]]
    at org.elasticsearch.index.shard.IndexShard.readAllowed(IndexShard.java:957)
    at org.elasticsearch.index.shard.IndexShard.acquireSearcher(IndexShard.java:791)
    at org.elasticsearch.index.shard.IndexShard.docStats(IndexShard.java:612)
    at org.elasticsearch.action.admin.indices.stats.CommonStats.<init>(CommonStats.java:131)
    at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:165)
    at org.elasticsearch.action.admin.indices.stats.TransportIndicesStatsAction.shardOperation(TransportIndicesStatsAction.java:47)
    at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$BroadcastByNodeTransportRequestHandler.onShardOperation(TransportBroadcastByNodeAction.java:395)
    ... 7 more
```
Out of curiosity, I also tried upgrading from 1.7.4 to 2.0.0, but a similar problem occurred. Do you have any idea why this is happening?