elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

We should add monitoring of whether the netty worker thread pool is busy #80340

Open kkewwei opened 2 years ago

kkewwei commented 2 years ago

In production, we find that the took value in a response is smaller than the timeout measured externally. Part of the reason is that some threads of the netty worker thread pool are quite busy, so search requests sit pending in the worker queues. Yet there is no API available to get statistics about the worker queues; we had to add the metrics ourselves and rebuild from source. In fact, there is serious queuing here: 97, 64, 68, 108, 133, 61, 62, 123, 70, 66, 118, 110, 65, 52, 95, where each number is the count of pending tasks on one of the worker threads.

It would be good if we could add an API to expose those metrics.

elasticmachine commented 2 years ago

Pinging @elastic/es-search (Team:Search)

kkewwei commented 2 years ago

Hi @romseygeek, can you please confirm this? Many timeouts are used in ES, and some requests are very important (such as the leader check and follower check); when this happens, it's very difficult to troubleshoot.

kkewwei commented 2 years ago

In our production cluster, the counts of pending tasks on the worker threads of a data node were: 793, 387, 852, 43, 988, 119, 115, 809, 761, 46, 47, which caused the master's checks of the data node to time out, so the data node was kicked out of the cluster by the master.

The removal happened in the early morning, and since there was no indicator beforehand, the node being kicked out looked very strange. It would help if we could add an API to expose those metrics in ES.

Of course, I will continue to investigate why there are so many tasks in the worker queues.

elasticmachine commented 2 years ago

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner commented 2 years ago

What version are you using @kkewwei and how are you counting the number of tasks pending on each worker thread?

kkewwei commented 2 years ago

The version: 7.9.1

The related code, added in our patched build (TransportService.getEventLoop()):

public List<Integer> getEventLoop() {
    if (transport instanceof TcpTransport) {
        // Netty4Transport (a TcpTransport) exposes the per-event-loop queue lengths below.
        TcpTransport tcpTransport = (TcpTransport) transport;
        return tcpTransport.getEventLoops();
    }
    return null;
}

Netty4Transport.getEventLoops():

public List<Integer> getEventLoops() {
    List<Integer> list = new ArrayList<>();
    for (EventExecutor eventExecutor : sharedGroup.getLowLevelGroup()) {
        if (eventExecutor instanceof SingleThreadEventExecutor) {
            // Number of tasks queued on this event loop that have not run yet.
            int pendingTasks = ((SingleThreadEventExecutor) eventExecutor).pendingTasks();
            if (pendingTasks > 0) {
                list.add(pendingTasks);
            }
        }
    }
    return list;
}

We added an API that calls TransportService.getEventLoop() every second.

We don't set http.netty.worker_count, so transport and HTTP share the same worker threads.
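
For illustration, here is a minimal sketch of how such a once-per-second sampler could be wired up. EventLoopQueuePoller is a hypothetical class name, and getEventLoop() is the patched method shown above, not a standard TransportService API:

import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.elasticsearch.transport.TransportService;

public class EventLoopQueuePoller {
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final TransportService transportService; // patched TransportService exposing getEventLoop()

    public EventLoopQueuePoller(TransportService transportService) {
        this.transportService = transportService;
    }

    public void start() {
        // Sample the per-event-loop pending-task counts once per second.
        scheduler.scheduleAtFixedRate(() -> {
            List<Integer> pending = transportService.getEventLoop();
            if (pending != null && pending.isEmpty() == false) {
                // Each entry is the pending-task count of one busy transport_worker event loop.
                System.out.println("busy transport_worker queues: " + pending);
            }
        }, 0, 1, TimeUnit.SECONDS);
    }

    public void stop() {
        scheduler.shutdownNow();
    }
}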

DaveCTurner commented 2 years ago

The version: 7.9.1

This version is past EOL. Newer versions have better logging about slow execution of transport worker tasks, and there have also been significant changes in the allocation of work to these threads which should solve the problem you're seeing.

IMO it would be reasonable to report these queue lengths in the node stats, but I doubt we'll work on it in the near future so I'm marking this as help wanted.

kkewwei commented 2 years ago

@DaveCTurner I would like to do it. Can you assign it to me?

DaveCTurner commented 2 years ago

Sure, go ahead. I don't think I can assign it to you in Github but consider it assigned :)

kkewwei commented 2 years ago

The reason there are so many tasks in the worker queue is that we enabled transport SSL (xpack.security.transport.ssl.enabled=true). It seems that SslHandler.decode is costly:

"elasticsearch[node1][transport_worker][T#13]" #57 daemon prio=5 os_prio=0 tid=0x00007f2a84013800 nid=0x7855 waiting for monitor entry [0x00007f2a737f7000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at sun.security.util.MemoryCache.put(Cache.java:361)
        - waiting to lock <0x00007f2c2a848a48> (a sun.security.util.MemoryCache)
        at sun.security.ssl.SSLSessionContextImpl.put(SSLSessionContextImpl.java:178)
        at sun.security.ssl.ServerHandshaker.clientFinished(ServerHandshaker.java:1741)
        at sun.security.ssl.ServerHandshaker.processMessage(ServerHandshaker.java:307)
        at sun.security.ssl.Handshaker.processLoop(Handshaker.java:979)
        at sun.security.ssl.Handshaker.process_record(Handshaker.java:914)
        at sun.security.ssl.SSLEngineImpl.readRecord(SSLEngineImpl.java:1025)
        - locked <0x00007f2c14992888> (a sun.security.ssl.SSLEngineImpl)
        at sun.security.ssl.SSLEngineImpl.readNetRecord(SSLEngineImpl.java:907)
        at sun.security.ssl.SSLEngineImpl.unwrap(SSLEngineImpl.java:781)
        - locked <0x00007f2c14992948> (a java.lang.Object)
        at javax.net.ssl.SSLEngine.unwrap(SSLEngine.java:624)
        at io.netty.handler.ssl.SslHandler$SslEngineType$3.unwrap(SslHandler.java:282)
        at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1372)
        at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1267)
        at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1314)
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:501)
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:440)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at java.lang.Thread.run(Thread.java:745)

   Locked ownable synchronizers:
        - None
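
The blocked frame above is the JDK's server-side SSL session cache (sun.security.util.MemoryCache), whose put() is serialized on a single monitor. Purely as an illustration of the knobs involved, and not the actual fix (which turned out to be a JDK upgrade, see below), the standard javax.net.ssl API lets you bound that cache; Elasticsearch builds its SSLContext internally, so the standalone demo class below is hypothetical:

import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSessionContext;

public class SessionCacheDemo {
    public static void main(String[] args) throws Exception {
        // Bound the server-side session cache that backs sun.security.util.MemoryCache.
        SSLContext sslContext = SSLContext.getDefault();
        SSLSessionContext serverSessions = sslContext.getServerSessionContext();
        serverSessions.setSessionCacheSize(1000); // cap the number of cached sessions
        serverSessions.setSessionTimeout(300);    // seconds before a cached session expires
        System.out.println("session cache size limit: " + serverSessions.getSessionCacheSize());
    }
}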

DaveCTurner commented 2 years ago

What exact JDK version are you using? For instance JDK-8259886 indicates that a performance issue in this area was found and fixed recently.
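
As an aside, the JVM version a node runs on can be checked via the nodes info API (for example GET _nodes/jvm) or, on the host itself, with a one-liner; the class name below is just for the example:

public class PrintJvmVersion {
    public static void main(String[] args) {
        // Prints e.g. "1.8.0_112"; the update release number follows the underscore.
        System.out.println(System.getProperty("java.version"));
    }
}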

kkewwei commented 2 years ago

The Java version is "1.8.0_112", which is older than 8u291, so that looks like the reason. Thank you very much for the reminder.

kkewwei commented 2 years ago

@DaveCTurner, I have opened a PR to report the metrics; I hope you can review it.