GeyserMC / MCProtocolLib

A library for communication with a Minecraft client/server.
MIT License

nioEventLoopGroup thread prevents JVM from exiting #699

Closed Luluno01 closed 1 year ago

Luluno01 commented 2 years ago

Hi, I ran into a problem similar to #408 with both 1.18.2-1 and 1.18.2-2-SNAPSHOT. I'm not sure what I did wrong, but here is what happened:

The dependencies in my build.gradle:

implementation("com.github.GeyserMC:opennbt:1.4")
implementation("com.github.steveice10:packetlib:2.1")
implementation("com.github.GeyserMC:mcauthlib:6f3d6aada5")
implementation("com.github.steveice10:mcprotocollib:1.18.2-1")  // Same with 1.18.2-2-SNAPSHOT

Code snippet:

val threadsBefore = Thread.getAllStackTraces().keys

val protocol = MinecraftProtocol("whosyourdaddy")
val sessionService = SessionService()
val client = TcpClientSession("localhost", 25565, protocol, null)
client.setFlag(MinecraftConstants.SESSION_SERVICE_KEY, sessionService)
client.addListener(object: SessionAdapter() {
    override fun packetReceived(session: Session, packet: Packet) {
        if (packet is ClientboundLoginPacket) {
            session.send(ServerboundChatPacket("Dummy client connected!"))
        } else if (packet is ClientboundChatPacket) {
            println("Received chat message: ${packet.message}")
            println("Session alive: ${session.isConnected}")
            session.disconnect("Done")
        }
    }

    override fun disconnected(event: DisconnectedEvent) {
        println("Client disconnected: ${event.reason}")
        event.cause?.printStackTrace()
    }
})
client.connect()
delay(2000)
val threadsAfter = Thread.getAllStackTraces().keys
delay(1000)
println(threadsAfter.subtract(threadsBefore))

And the output:

Received chat message: TranslatableComponentImpl{key="multiplayer.player.joined", args=[TextComponentImpl{content="whosyourdaddy", style=StyleImpl{color=null, obfuscated=not_set, bold=not_set, strikethrough=not_set, underlined=not_set, italic=not_set, clickEvent=ClickEvent{action=suggest_command, value="/tell whosyourdaddy "}, hoverEvent=HoverEvent{action=show_entity, value=ShowEntity{type=KeyImpl{namespace="minecraft", value="player"}, id=91c854ef-9614-3273-bf16-73b2960d59ac, name=TextComponentImpl{content="whosyourdaddy", style=StyleImpl{color=null, obfuscated=not_set, bold=not_set, strikethrough=not_set, underlined=not_set, italic=not_set, clickEvent=null, hoverEvent=null, insertion=null, font=null}, children=[]}}}, insertion="whosyourdaddy", font=null}, children=[]}], style=StyleImpl{color=NamedTextColor{name="yellow", value="#ffff55"}, obfuscated=not_set, bold=not_set, strikethrough=not_set, underlined=not_set, italic=not_set, clickEvent=null, hoverEvent=null, insertion=null, font=null}, children=[]}
Session alive: true
Received chat message: TranslatableComponentImpl{key="chat.type.text", args=[TextComponentImpl{content="whosyourdaddy", style=StyleImpl{color=null, obfuscated=not_set, bold=not_set, strikethrough=not_set, underlined=not_set, italic=not_set, clickEvent=null, hoverEvent=null, insertion=null, font=null}, children=[]}, TextComponentImpl{content="Dummy client connected!", style=StyleImpl{color=null, obfuscated=not_set, bold=not_set, strikethrough=not_set, underlined=not_set, italic=not_set, clickEvent=null, hoverEvent=null, insertion=null, font=null}, children=[]}], style=StyleImpl{color=null, obfuscated=not_set, bold=not_set, strikethrough=not_set, underlined=not_set, italic=not_set, clickEvent=null, hoverEvent=null, insertion=null, font=null}, children=[]}
Session alive: false
Client disconnected: Done
[Thread[DefaultDispatcher-worker-6,5,main], Thread[DefaultDispatcher-worker-7,5,main], Thread[nioEventLoopGroup-3-1,10,main], Thread[defaultEventLoopGroup-2-1,10,main]]

Note the extra NIO thread at the end, which prevents the JVM from exiting.

nikijaz commented 2 years ago

👀

petersv5 commented 1 year ago

There seem to be two non-daemon threads that are still active in the hung state. I suspect that TcpClientSession has created an EventLoopGroup that is never shut down. There is a hook to stop it on Runtime exit, but the runtime will only exit once all non-daemon threads have stopped.

I think an explicit shutdown of the EventLoopGroup in TcpClientSession may be needed.
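
For context, a minimal sketch of what an explicit shutdown of a Netty EventLoopGroup looks like (plain Netty, not MCProtocolLib API):

import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;

import java.util.concurrent.TimeUnit;

public class ExplicitShutdownSketch {
    public static void main(String[] args) {
        // Netty worker threads are non-daemon by default and are created lazily
        // when the group is first used.
        EventLoopGroup group = new NioEventLoopGroup();

        // ... bootstrap channels on this group, do I/O ...

        // Without a call like this, the worker threads stay alive and keep the JVM running.
        group.shutdownGracefully(100, 500, TimeUnit.MILLISECONDS).syncUninterruptibly();
    }
}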

petersv5 commented 1 year ago

Working ugly hack in #744 as a proof of concept.

petersv5 commented 1 year ago

See a proof of concept patch in #744.

Luluno01 commented 1 year ago

> See a proof of concept patch in #744.

Hi, thanks for your PoC. I tested your patch by inserting a TcpClientSession.shutdown() call at the end of the original reproducer. It does shut down the nioEventLoopGroup; however, the defaultEventLoopGroup is still left running. Maybe a similar patch that allows explicitly shutting down all event loop groups is needed?

Luluno01 commented 1 year ago

Here is the updated reproducer:

// build.gradle.kts
plugins {
    kotlin("jvm") version "1.8.20"
}

group = "com.example"
version = "1.0-SNAPSHOT"

repositories {
    mavenCentral()
    maven("https://jitpack.io")
    maven("https://repo.opencollab.dev/maven-releases")
}

dependencies {
    implementation("com.github.GeyserMC:opennbt:1.5")
    implementation("com.github.GeyserMC:mcauthlib:6621fd081c")
    implementation("com.github.petersv5:MCProtocolLib:expose_eventloopgroup_shutdown-SNAPSHOT")
    testImplementation(kotlin("test"))
}

tasks.test {
    useJUnitPlatform()
}

kotlin {
    jvmToolchain(11)
}
// Main.kt
import com.github.steveice10.mc.protocol.MinecraftProtocol
import com.github.steveice10.mc.auth.service.SessionService
import com.github.steveice10.mc.protocol.MinecraftConstants
import com.github.steveice10.mc.protocol.packet.ingame.clientbound.ClientboundLoginPacket
import com.github.steveice10.mc.protocol.packet.ingame.clientbound.ClientboundPlayerChatPacket
import com.github.steveice10.mc.protocol.packet.ingame.clientbound.ClientboundSystemChatPacket
import com.github.steveice10.packetlib.Session
import com.github.steveice10.packetlib.event.session.DisconnectedEvent
import com.github.steveice10.packetlib.event.session.SessionAdapter
import com.github.steveice10.packetlib.packet.Packet
import com.github.steveice10.packetlib.tcp.TcpClientSession

fun delay(millis: Long) {
    Thread.sleep(millis)
}

fun main() {
    val threadsBefore = Thread.getAllStackTraces().keys

    val protocol = MinecraftProtocol("whosyourdaddy")
    val sessionService = SessionService()
    val client = TcpClientSession("localhost", 25565, protocol, null)
    client.setFlag(MinecraftConstants.SESSION_SERVICE_KEY, sessionService)
    client.addListener(object: SessionAdapter() {
        override fun packetReceived(session: Session, packet: Packet) {
            when (packet) {
                is ClientboundLoginPacket -> println("Client connected")
                is ClientboundPlayerChatPacket -> {
                    println("Received player chat message: ${packet.content}")
                    println("Session alive: ${session.isConnected}")
                    session.disconnect("Done")
                }
                is ClientboundSystemChatPacket -> {
                    println("Received system chat message: ${packet.content}")
                    println("Session alive: ${session.isConnected}")
                    session.disconnect("Done")
                }
            }
        }

        override fun disconnected(event: DisconnectedEvent) {
            println("Client disconnected: ${event.reason}")
            event.cause?.printStackTrace()
        }
    })
    client.connect()
    delay(2000)
    val threadsAfter = Thread.getAllStackTraces().keys
    delay(1000)
    println("After disconnect, before shutdown")
    println(threadsAfter.subtract(threadsBefore))
    // [Thread[nioEventLoopGroup-3-1,10,main], Thread[defaultEventLoopGroup-2-1,10,main]]
    TcpClientSession.shutdown()
    delay(2000)
    val threadsFinal = Thread.getAllStackTraces().keys
    println("After shutdown")
    println(threadsFinal.subtract(threadsBefore))
    // [Thread[defaultEventLoopGroup-2-1,10,main]]
    // Process does not terminate
}

petersv5 commented 1 year ago

All EventLoopGroups have to be shut down to remove all non-daemon threads. I am not sure who creates the defaultEventLoopGroup thread.

For the fix for Geyser (where the above patch is sufficient, in combination with the other PR to Geyser) I used jdb and put a breakpoint on Thread.start() to find the cause of each EventLoopGroup. Can you try that with the test code above?
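
For example, a minimal jdb session along those lines, attaching to the reproducer's JVM (the port and classpath are placeholders; MainKt is the class Kotlin generates for Main.kt):

java -agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005 -cp <classpath> MainKt
jdb -attach localhost:5005

Then, inside jdb:

stop in java.lang.Thread.start
cont
where

stop sets the breakpoint, cont resumes the suspended VM, and where prints the stack each time the breakpoint fires, which shows who is creating each thread.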

petersv5 commented 1 year ago

I think this is caused by the inheritance from TcpSession: the group is created on construction of an instance unless the public class member USE_EVENT_LOOP_FOR_PACKETS has been set to false first. This is the case with Geyser, but not with the reproduction test above. As far as I can tell, TcpClientSession never uses this part of TcpSession, but still inherits it.

In general there is a lot of infrastructure in TcpSession that is not reused in TcpClientSession. I chose to ignore it as I did not understand what was intended, and made the minimum change needed for Geyser to work. As I said in the PR, I do not understand the reason for the choices that were originally made, so the PR is very "reactive" on my part.
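
For illustration, the Geyser-style opt-out would look roughly like this, assuming USE_EVENT_LOOP_FOR_PACKETS is a public static boolean on TcpSession (verify against your version):

import com.github.steveice10.packetlib.tcp.TcpSession;

public class NoPacketEventLoopSketch {
    public static void main(String[] args) {
        // Opt out of the shared packet event loop before any TcpSession subclass
        // is constructed, so the defaultEventLoopGroup thread should never be created.
        TcpSession.USE_EVENT_LOOP_FOR_PACKETS = false;

        // ... build MinecraftProtocol / TcpClientSession and connect, as in the
        // reproducer above ...
    }
}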

Luluno01 commented 1 year ago

> All EventLoopGroups have to be shut down to remove all non-daemon threads. I am not sure who creates the defaultEventLoopGroup thread.
>
> For the fix for Geyser (where the above patch is sufficient, in combination with the other PR to Geyser) I used jdb and put a breakpoint on Thread.start() to find the cause of each EventLoopGroup. Can you try that with the test code above?

I added a breakpoint at the Thread constructor and another at Thread.start(). When running my reproducer above, the defaultEventLoopGroup-2-1 thread is instantiated by com.github.steveice10.packetlib.tcp.TcpSession#channelRead0 on thread nioEventLoopGroup-3-1. Here is the call stack:

<init>:395, Thread (java.lang)
<init>:708, Thread (java.lang)
<init>:629, Thread (java.lang)
<init>:60, FastThreadLocalThread (io.netty.util.concurrent)
newThread:121, DefaultThreadFactory (io.netty.util.concurrent)
newThread:105, DefaultThreadFactory (io.netty.util.concurrent)
execute:32, ThreadPerTaskExecutor (io.netty.util.concurrent)
execute:57, ThreadExecutorMap$1 (io.netty.util.internal)
doStartThread:975, SingleThreadEventExecutor (io.netty.util.concurrent)
startThread:944, SingleThreadEventExecutor (io.netty.util.concurrent)
execute:827, SingleThreadEventExecutor (io.netty.util.concurrent)
execute:815, SingleThreadEventExecutor (io.netty.util.concurrent)
channelRead0:386, TcpSession (com.github.steveice10.packetlib.tcp)
channelRead0:29, TcpSession (com.github.steveice10.packetlib.tcp)
channelRead:99, SimpleChannelInboundHandler (io.netty.channel)
invokeChannelRead:379, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:365, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:357, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:324, ByteToMessageDecoder (io.netty.handler.codec)
channelRead:296, ByteToMessageDecoder (io.netty.handler.codec)
channelRead:103, ByteToMessageCodec (io.netty.handler.codec)
invokeChannelRead:379, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:365, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:357, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:324, ByteToMessageDecoder (io.netty.handler.codec)
channelRead:296, ByteToMessageDecoder (io.netty.handler.codec)
channelRead:103, ByteToMessageCodec (io.netty.handler.codec)
invokeChannelRead:379, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:365, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:357, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:324, ByteToMessageDecoder (io.netty.handler.codec)
fireChannelRead:311, ByteToMessageDecoder (io.netty.handler.codec)
callDecode:432, ByteToMessageDecoder (io.netty.handler.codec)
channelRead:276, ByteToMessageDecoder (io.netty.handler.codec)
channelRead:103, ByteToMessageCodec (io.netty.handler.codec)
invokeChannelRead:379, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:365, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:357, AbstractChannelHandlerContext (io.netty.channel)
channelRead:286, IdleStateHandler (io.netty.handler.timeout)
invokeChannelRead:379, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:365, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:357, AbstractChannelHandlerContext (io.netty.channel)
channelRead:1410, DefaultChannelPipeline$HeadContext (io.netty.channel)
invokeChannelRead:379, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:365, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:919, DefaultChannelPipeline (io.netty.channel)
read:166, AbstractNioByteChannel$NioByteUnsafe (io.netty.channel.nio)
processSelectedKey:719, NioEventLoop (io.netty.channel.nio)
processSelectedKeysOptimized:655, NioEventLoop (io.netty.channel.nio)
processSelectedKeys:581, NioEventLoop (io.netty.channel.nio)
run:493, NioEventLoop (io.netty.channel.nio)
run:986, SingleThreadEventExecutor$4 (io.netty.util.concurrent)
run:74, ThreadExecutorMap$2 (io.netty.util.internal)
run:30, FastThreadLocalRunnable (io.netty.util.concurrent)
run:833, Thread (java.lang)

So I guess the event loop is used in my case.

Luluno01 commented 1 year ago

> I think this is caused by the inheritance from TcpSession: the group is created on construction of an instance unless the public class member USE_EVENT_LOOP_FOR_PACKETS has been set to false first. This is the case with Geyser, but not with the reproduction test above. As far as I can tell, TcpClientSession never uses this part of TcpSession, but still inherits it.
>
> In general there is a lot of infrastructure in TcpSession that is not reused in TcpClientSession. I chose to ignore it as I did not understand what was intended, and made the minimum change needed for Geyser to work. As I said in the PR, I do not understand the reason for the choices that were originally made, so the PR is very "reactive" on my part.

The group is created here

https://github.com/GeyserMC/MCProtocolLib/blob/98a66e8e2d623f589a3b35fe40f8fefa7060d4b7/src/main/java/com/github/steveice10/packetlib/tcp/TcpSession.java#L39

And started here

https://github.com/GeyserMC/MCProtocolLib/blob/98a66e8e2d623f589a3b35fe40f8fefa7060d4b7/src/main/java/com/github/steveice10/packetlib/tcp/TcpSession.java#L386

But it is never shut down.

Because TcpSession.eventLoop is a private member, no one except TcpSession itself can shut down the event loop. It is, however, never shut down in TcpSession. So I believe this is a bug in the implementation of TcpSession. My proposal would be to make TcpSession.eventLoop protected, or to add a TcpSession.shutdown method that shuts down TcpSession.eventLoop. Then anyone who inherits from TcpSession can override TcpSession.shutdown and call super.shutdown() in it to have everything shut down properly.
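
A rough sketch of that proposal (the field and method names below are illustrative, not the library's actual members; since the groups appear to be class-level fields, a static shutdown is shown):

import io.netty.channel.EventLoopGroup;

class TcpSessionSketch {
    private static EventLoopGroup packetEventLoop; // created lazily for packet handling

    static synchronized void shutdown() {
        if (packetEventLoop != null) {
            packetEventLoop.shutdownGracefully();
            packetEventLoop = null;
        }
    }
}

class TcpClientSessionSketch extends TcpSessionSketch {
    private static EventLoopGroup ioEventLoopGroup; // NIO/epoll/kqueue socket group

    static synchronized void shutdown() {
        if (ioEventLoopGroup != null) {
            ioEventLoopGroup.shutdownGracefully();
            ioEventLoopGroup = null;
        }
        TcpSessionSketch.shutdown(); // also tear down the inherited packet event loop
    }
}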

petersv5 commented 1 year ago

Yeah, the TcpSession EventLoopGroup is used if USE_EVENT_LOOP_FOR_PACKETS is true.

If this is to be a more general fix, then the entire use of a class variable should be reconsidered for a generic framework like TcpSession. Alternatively, make it reference counted so that it is shut down on the last use. The current pattern is not very good, as it relies on all users ending at the same time, which is only true when Geyser is the only user in the process.
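
A sketch of the reference-counting idea (purely illustrative): each session acquires the shared group when it starts and releases it when it ends, and only the last release shuts the group down.

import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;

final class SharedEventLoopGroup {
    private static EventLoopGroup group;
    private static int users;

    static synchronized EventLoopGroup acquire() {
        if (users++ == 0) {
            group = new NioEventLoopGroup();
        }
        return group;
    }

    static synchronized void release() {
        if (--users == 0) {
            group.shutdownGracefully();
            group = null;
        }
    }
}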

The best solution is perhaps to make the application (caller) supply the EventLoopGroup. Then the lifetime of the EventLoopGroup and indeed which functions share a thread pool is up to the application.

This is a bigger change than the minimum needed to make Geyser work, though. It is up to someone else to decide the scope of MCProtocolLib.
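
To illustrate the caller-supplied group idea: the extra constructor parameter below does not exist in MCProtocolLib, it only shows how the lifetime decision would move to the application.

import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.util.concurrent.DefaultThreadFactory;

public class CallerOwnedGroupSketch {
    public static void main(String[] args) {
        // The application creates and owns the I/O threads.
        EventLoopGroup ioGroup =
                new NioEventLoopGroup(0, new DefaultThreadFactory("mc-client-io"));

        // Hypothetical overload; today's constructor is
        // TcpClientSession(host, port, protocol, proxy):
        // TcpClientSession client = new TcpClientSession("localhost", 25565, protocol, null, ioGroup);
        // client.connect();
        // ... disconnect when done ...

        // The application, not the library, decides when the worker threads go away.
        ioGroup.shutdownGracefully().syncUninterruptibly();
    }
}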

petersv5 commented 1 year ago

An alternative solution is in #748. It makes the threads daemon threads and thus does not require exposing a new API.
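
A minimal sketch of the daemon-thread approach (not the actual #748 patch): a group built with a daemon ThreadFactory no longer keeps the JVM alive once every non-daemon thread has finished.

import io.netty.channel.EventLoopGroup;
import io.netty.channel.nio.NioEventLoopGroup;
import io.netty.util.concurrent.DefaultThreadFactory;

public class DaemonGroupSketch {
    public static void main(String[] args) {
        // Worker threads created by this factory are daemon threads.
        EventLoopGroup group = new NioEventLoopGroup(
                0, new DefaultThreadFactory("tcpClientSession", /* daemon = */ true));

        // ... use the group for client connections ...

        // No explicit shutdown is required for the JVM to exit, but idle daemon
        // threads linger until the process ends.
    }
}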

Luluno01 commented 1 year ago

> An alternative solution is in #748. It makes the threads daemon threads and thus does not require exposing a new API.

Tested, and it does shut down the event loop when the main thread finishes. However, I think it is still necessary to allow the application to fully control when to stop the threads, considering the application might want to do something else after connecting to and disconnecting from an MC server. In that case, the application may not want the threads to keep running in the background.

petersv5 commented 1 year ago

At the moment there is a global pool for TcpClientSession and one for TcpSession (if used). These are created on first use and then remain, as designed.

The current API offers no way to terminate these worker threads. They will lie dormant, consuming mostly swappable memory for their stacks. The threads should not be running if there is no work.

Luluno01 commented 1 year ago

I changed the tail of the reproducer to

// Sleep for 7 seconds
println("After disconnect")
val newThreads = threadsAfter.subtract(threadsBefore)
println(newThreads)
newThreads.forEach {
    println(it.name + it.state)
}

And the output is

[Thread[tcpClientSession-3-1,5,main], Thread[tcpClientSession-1-1,5,main]]
tcpClientSession-3-1RUNNABLE
tcpClientSession-1-1WAITING

However, when inspected in the debugger, the tcpClientSession-3-1 thread is shown in the RUNNING state.

Is it safe or a good practice to ignore these two threads?

petersv5 commented 1 year ago

It would be interesting to see where the two threads are at this point. Is it the packet handling event loop from TcpSession that is not blocking, or is it the I/O handling event loop from TcpClientSession?

Also interesting is which strategy was selected in TcpClientSession.createTcpEventLoopGroup().

Luluno01 commented 1 year ago

> It would be interesting to see where the two threads are at this point. Is it the packet handling event loop from TcpSession that is not blocking, or is it the I/O handling event loop from TcpClientSession?
>
> Also interesting is which strategy was selected in TcpClientSession.createTcpEventLoopGroup().

The one waiting (tcpClientSession-1-1) is awaiting new tasks:

park:-1, Unsafe (jdk.internal.misc)
park:341, LockSupport (java.util.concurrent.locks)
block:506, AbstractQueuedSynchronizer$ConditionNode (java.util.concurrent.locks)
unmanagedBlock:3463, ForkJoinPool (java.util.concurrent)
managedBlock:3434, ForkJoinPool (java.util.concurrent)
await:1623, AbstractQueuedSynchronizer$ConditionObject (java.util.concurrent.locks)
take:435, LinkedBlockingQueue (java.util.concurrent)
takeTask:243, SingleThreadEventExecutor (io.netty.util.concurrent)
run:52, DefaultEventLoop (io.netty.channel)
run:986, SingleThreadEventExecutor$4 (io.netty.util.concurrent)
run:74, ThreadExecutorMap$2 (io.netty.util.internal)
run:30, FastThreadLocalRunnable (io.netty.util.concurrent)
run:833, Thread (java.lang)

The other one, tcpClientSession-3-1, is in an epoll wait:

wait:-1, WEPoll (sun.nio.ch)
doSelect:111, WEPollSelectorImpl (sun.nio.ch)
lockAndDoSelect:129, SelectorImpl (sun.nio.ch)
select:146, SelectorImpl (sun.nio.ch)
select:68, SelectedSelectionKeySetSelector (io.netty.channel.nio)
select:810, NioEventLoop (io.netty.channel.nio)
run:457, NioEventLoop (io.netty.channel.nio)
run:986, SingleThreadEventExecutor$4 (io.netty.util.concurrent)
run:74, ThreadExecutorMap$2 (io.netty.util.internal)
run:30, FastThreadLocalRunnable (io.netty.util.concurrent)
run:833, Thread (java.lang)

I'm not sure why this thread is in the RUNNABLE/RUNNING state (sorry, I'm not familiar with how epoll works in NIO).

petersv5 commented 1 year ago

WEPoll should block the thread. The Java runtime and debugger are not going to be aware that the thread is blocked in a system call, but the kernel should be. If you look with "ps -L", I expect that this thread (if you can identify it) is in "interruptible sleep" (S), which more or less means it is suspended until it either reaches its specified timeout or receives a Unix signal. It should not be scheduled on any CPU core. It will be in a blocking "epoll" syscall.
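
For example, on Linux something like this shows the per-thread state (the PID is a placeholder; column support varies between ps versions):

ps -L -o lwp,stat,wchan:20,comm -p <jvm-pid>

Threads blocked in epoll should show state S (interruptible sleep).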

Luluno01 commented 1 year ago

Okayyyy, that makes sense now. Thanks for the explanation.

petersv5 commented 1 year ago

I guess this can be closed now that https://github.com/GeyserMC/MCProtocolLib/pull/748 was merged?

onebeastchris commented 1 year ago

Yup! Thanks again ;)