Open kkewwei opened 2 years ago
Pinging @elastic/es-search (Team:Search)
hi, @romseygeek, can you please confirm it? Many timeouts are used in ES, and some requests are very important (such as the leader check and follower check); when those time out, it is very difficult to troubleshoot.
In our production cluster, the pending-task counts of the transport worker threads on one data node were: 793, 387, 852, 43, 988, 119, 115, 809, 761, 46, 47. This caused the master's checks of the data node to time out, and the data node was kicked out of the cluster by the master.
The removal happened in the early morning, and since there was no indicator beforehand, the kick looked very strange. It would help if we could add an API to expose these metrics in ES.
Of course I will continue to check why there are so many tasks in the worker queues.
Pinging @elastic/es-distributed (Team:Distributed)
What version are you using @kkewwei and how are you counting the number of tasks pending on each worker thread?
The version: 7.9.1
The related code: TransportService.getEventLoop()
public List<Integer> getEventLoop() {
    if (transport instanceof TcpTransport) {
        TcpTransport tcpTransport = (TcpTransport) transport;
        return tcpTransport.getEventLoops();
    }
    return null;
}
Netty4Transport.getEventLoops():
public List<Integer> getEventLoops() {
    List<Integer> list = new ArrayList<>();
    for (EventExecutor eventExecutor : sharedGroup.getLowLevelGroup()) {
        if (eventExecutor instanceof SingleThreadEventExecutor) {
            int pendingTasks = ((SingleThreadEventExecutor) eventExecutor).pendingTasks();
            if (pendingTasks > 0) {
                list.add(pendingTasks);
            }
        }
    }
    return list;
}
We added an API that calls TransportService.getEventLoop() once a second.
We don't set http.netty.worker_count, so the transport and HTTP layers share the same workers.
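The once-a-second polling described above can be sketched with a plain ScheduledExecutorService. This is a minimal, self-contained illustration, not Elasticsearch code: the Supplier stands in for the hypothetical getEventLoop() call, and the class and method names are invented for the example.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

public class QueueDepthSampler {
    // Formats one sample of per-worker pending-task counts for logging.
    static String format(List<Integer> pendingPerWorker) {
        int max = pendingPerWorker.stream().mapToInt(Integer::intValue).max().orElse(0);
        int total = pendingPerWorker.stream().mapToInt(Integer::intValue).sum();
        return "pending=" + pendingPerWorker + " total=" + total + " max=" + max;
    }

    // Polls the supplier once a second and prints the summary.
    static ScheduledExecutorService start(Supplier<List<Integer>> queueDepths) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
            () -> System.out.println(format(queueDepths.get())), 0, 1, TimeUnit.SECONDS);
        return scheduler;
    }

    public static void main(String[] args) {
        // Fixed sample data standing in for tcpTransport.getEventLoops().
        System.out.println(format(List.of(793, 387, 852, 43)));
    }
}
```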
The version: 7.9.1
This version is past EOL. Newer versions have better logging about slow execution of transport worker tasks, and there have also been significant changes in the allocation of work to these threads which should solve the problem you're seeing.
IMO it would be reasonable to report these queue lengths in the node stats, but I doubt we'll work on it in the near future so I'm marking this as help wanted.
@DaveCTurner I would like to do it. Can you assign it to me?
Sure, go ahead. I don't think I can assign it to you in GitHub, but consider it assigned :)
The reason there are so many tasks in the work queue is that we enabled transport SSL (xpack.security.transport.ssl.enabled=true); it seems that SslHandler.decode is costly.
"elasticsearch[node1][transport_worker][T#13]" #57 daemon prio=5 os_prio=0 tid=0x00007f2a84013800 nid=0x7855 waiting for monitor entry [0x00007f2a737f7000]
java.lang.Thread.State: BLOCKED (on object monitor)
at sun.security.util.MemoryCache.put(Cache.java:361)
- waiting to lock <0x00007f2c2a848a48> (a sun.security.util.MemoryCache)
at sun.security.ssl.SSLSessionContextImpl.put(SSLSessionContextImpl.java:178)
at sun.security.ssl.ServerHandshaker.clientFinished(ServerHandshaker.java:1741)
at sun.security.ssl.ServerHandshaker.processMessage(ServerHandshaker.java:307)
at sun.security.ssl.Handshaker.processLoop(Handshaker.java:979)
at sun.security.ssl.Handshaker.process_record(Handshaker.java:914)
at sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:1025)
- locked <0x00007f2c14992888> (a sun.security.ssl.SSLEngineImpl)
at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:907)
at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781)
- locked <0x00007f2c14992948> (a java.lang.Object)
at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624)
at io.netty.handler.ssl.SslHandler$SslEngineType$3.unwrap(SslHandler.java:282)
at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1372)
at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1267)
at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1314)
at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:501)
at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:440)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- None
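For context: the monitor the threads are blocked on in the trace above is the JDK's synchronized SSL session cache (sun.security.util.MemoryCache). The actual cause here turned out to be the JDK version, but as a general mitigation the session cache can be bounded through the standard javax.net.ssl.SSLSessionContext API; a minimal sketch (the cache size and timeout values are arbitrary examples):

```java
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSessionContext;

public class SessionCacheTuning {
    public static void main(String[] args) throws Exception {
        SSLContext ctx = SSLContext.getDefault();
        // Bound the server-side session cache so the underlying MemoryCache
        // stays small; cached sessions also expire after the timeout (seconds).
        SSLSessionContext serverSessions = ctx.getServerSessionContext();
        serverSessions.setSessionCacheSize(1000);
        serverSessions.setSessionTimeout(300);
        System.out.println(serverSessions.getSessionCacheSize());
    }
}
```

The cache size can also be set process-wide via the `javax.net.ssl.sessionCacheSize` system property.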
What exact JDK version are you using? For instance JDK-8259886 indicates that a performance issue in this area was found and fixed recently.
The Java version is 1.8.0_112, which is older than 8u291, so that looks like the reason. Thank you very much for the reminder.
@DaveCTurner, I have opened a PR to report the metrics; I hope you can review it.
We find that the took in the response is smaller than the timeout measured externally in production. Part of the reason is that some threads of the worker pool are relatively busy, so search requests are pending in the worker queues. Since there is no API available to get worker-queue statistics, we had to add metrics and rebuild from source. In fact, there is serious queuing here: 97, 64, 68, 108, 133, 61, 62, 123, 70, 66, 118, 110, 65, 52, 95, where each number is the pending-task count of one worker thread. It would help if we could add an API to get these metrics.
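Summarizing the per-worker counts listed above makes the queuing concrete: every one of the 15 workers has a non-empty queue, so a request arriving on any connection has to wait behind earlier tasks. A small sketch computing the summary from those numbers:

```java
import java.util.IntSummaryStatistics;
import java.util.stream.IntStream;

public class BacklogSummary {
    public static void main(String[] args) {
        // The per-worker pending-task counts reported above.
        int[] pending = {97, 64, 68, 108, 133, 61, 62, 123, 70, 66, 118, 110, 65, 52, 95};
        IntSummaryStatistics stats = IntStream.of(pending).summaryStatistics();
        // min > 0 means no worker is idle: all queues are backlogged.
        System.out.println("workers=" + stats.getCount()
            + " total=" + stats.getSum()
            + " min=" + stats.getMin()
            + " max=" + stats.getMax());
        // prints: workers=15 total=1292 min=52 max=133
    }
}
```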