Closed Luluno01 closed 1 year ago
There seems to be two non-daemon threads that are still active in the hung state. I suspect that the TcpClientSession has created an EventLoopGroup that is never shut down. There is a hook to stop it on Runtime exit, but the runtime will only exit if all non-daemon threads have stopped.
I think an explicit shutdown of the EventLoopGroup in TcpClientSession may be needed.
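The JVM-exit rule at play here can be demonstrated without Netty: a pool backed by non-daemon threads keeps the runtime alive until it is explicitly shut down. A minimal sketch using a plain `ExecutorService` in place of an `EventLoopGroup` (not MCProtocolLib code):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class NonDaemonDemo {
    public static void main(String[] args) throws InterruptedException {
        // Executors.newFixedThreadPool uses non-daemon threads by default,
        // just like Netty's NioEventLoopGroup.
        ExecutorService pool = Executors.newFixedThreadPool(1);
        pool.submit(() -> System.out.println("task ran"));

        // Without this explicit call, main() would return but the JVM would
        // hang, because the pool's worker thread is still alive and non-daemon.
        pool.shutdown();
        boolean done = pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("pool terminated: " + done);
    }
}
```

A Runtime shutdown hook cannot help here, because the hook only runs once the JVM starts exiting, which the non-daemon thread itself prevents.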
Working ugly hack in #744 as a proof of concept.
See a proof of concept patch in #744.
Hi, thanks for your PoC. I tested your patch by inserting a `TcpClientSession.shutdown()` call at the end of the original reproducer. It does shut down the `nioEventLoopGroup`. However, the `defaultEventLoopGroup` is still left running. Maybe a similar patch that allows explicitly shutting down all event loop groups is needed?
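A more general fix could walk every group the library has lazily created from a single shutdown entry point. A hypothetical sketch, using plain executors in place of Netty's `EventLoopGroup`s (the names here are invented, not library API):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AllGroupsShutdown {
    // Registry of every pool the "library" has lazily created.
    private static final List<ExecutorService> GROUPS = new CopyOnWriteArrayList<>();

    static ExecutorService newGroup() {
        ExecutorService g = Executors.newFixedThreadPool(1);
        GROUPS.add(g);
        return g;
    }

    /** Shuts down every registered group, not just the most recent one. */
    static void shutdownAll() throws InterruptedException {
        for (ExecutorService g : GROUPS) {
            g.shutdown();
            g.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService nioLike = newGroup();     // stands in for nioEventLoopGroup
        ExecutorService defaultLike = newGroup(); // stands in for defaultEventLoopGroup
        nioLike.submit(() -> {});
        defaultLike.submit(() -> {});
        shutdownAll();
        System.out.println("all terminated: "
                + (nioLike.isTerminated() && defaultLike.isTerminated()));
    }
}
```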
Here is the updated reproducer:
```kotlin
// build.gradle.kts
plugins {
    kotlin("jvm") version "1.8.20"
}

group = "com.example"
version = "1.0-SNAPSHOT"

repositories {
    mavenCentral()
    maven("https://jitpack.io")
    maven("https://repo.opencollab.dev/maven-releases")
}

dependencies {
    implementation("com.github.GeyserMC:opennbt:1.5")
    implementation("com.github.GeyserMC:mcauthlib:6621fd081c")
    implementation("com.github.petersv5:MCProtocolLib:expose_eventloopgroup_shutdown-SNAPSHOT")
    testImplementation(kotlin("test"))
}

tasks.test {
    useJUnitPlatform()
}

kotlin {
    jvmToolchain(11)
}
```
```kotlin
// Main.kt
import com.github.steveice10.mc.protocol.MinecraftProtocol
import com.github.steveice10.mc.auth.service.SessionService
import com.github.steveice10.mc.protocol.MinecraftConstants
import com.github.steveice10.mc.protocol.packet.ingame.clientbound.ClientboundLoginPacket
import com.github.steveice10.mc.protocol.packet.ingame.clientbound.ClientboundPlayerChatPacket
import com.github.steveice10.mc.protocol.packet.ingame.clientbound.ClientboundSystemChatPacket
import com.github.steveice10.packetlib.Session
import com.github.steveice10.packetlib.event.session.DisconnectedEvent
import com.github.steveice10.packetlib.event.session.SessionAdapter
import com.github.steveice10.packetlib.packet.Packet
import com.github.steveice10.packetlib.tcp.TcpClientSession

fun delay(millis: Long) {
    Thread.sleep(millis)
}

fun main() {
    val threadsBefore = Thread.getAllStackTraces().keys
    val protocol = MinecraftProtocol("whosyourdaddy")
    val sessionService = SessionService()
    val client = TcpClientSession("localhost", 25565, protocol, null)
    client.setFlag(MinecraftConstants.SESSION_SERVICE_KEY, sessionService)
    client.addListener(object : SessionAdapter() {
        override fun packetReceived(session: Session, packet: Packet) {
            when (packet) {
                is ClientboundLoginPacket -> println("Client connected")
                is ClientboundPlayerChatPacket -> {
                    println("Received player chat message: ${packet.content}")
                    println("Session alive: ${session.isConnected}")
                    session.disconnect("Done")
                }
                is ClientboundSystemChatPacket -> {
                    println("Received system chat message: ${packet.content}")
                    println("Session alive: ${session.isConnected}")
                    session.disconnect("Done")
                }
            }
        }

        override fun disconnected(event: DisconnectedEvent) {
            println("Client disconnected: ${event.reason}")
            event.cause?.printStackTrace()
        }
    })
    client.connect()
    delay(2000)
    val threadsAfter = Thread.getAllStackTraces().keys
    delay(1000)
    println("After disconnect, before shutdown")
    println(threadsAfter.subtract(threadsBefore))
    // [Thread[nioEventLoopGroup-3-1,10,main], Thread[defaultEventLoopGroup-2-1,10,main]]
    TcpClientSession.shutdown()
    delay(2000)
    val threadsFinal = Thread.getAllStackTraces().keys
    println("After shutdown")
    println(threadsFinal.subtract(threadsBefore))
    // [Thread[defaultEventLoopGroup-2-1,10,main]]
    // Process does not terminate
}
```
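The thread-diffing technique the reproducer relies on translates directly to plain Java, independent of MCProtocolLib. A self-contained sketch, with an ordinary sleeping thread standing in for the client's leaked worker:

```java
import java.util.HashSet;
import java.util.Set;

public class ThreadDiff {
    public static void main(String[] args) throws InterruptedException {
        // Snapshot the live threads before doing anything.
        Set<Thread> before = new HashSet<>(Thread.getAllStackTraces().keySet());

        // Start something that leaves a thread behind (stand-in for the client).
        Thread leaked = new Thread(() -> {
            try { Thread.sleep(60_000); } catch (InterruptedException ignored) {}
        }, "leaked-worker");
        leaked.start();
        Thread.sleep(100); // let it show up in the thread list

        // The set difference is exactly the threads created in between.
        Set<Thread> after = new HashSet<>(Thread.getAllStackTraces().keySet());
        after.removeAll(before);
        for (Thread t : after) {
            System.out.println(t.getName() + " daemon=" + t.isDaemon());
        }
        leaked.interrupt();
        leaked.join();
    }
}
```

Printing `isDaemon()` for each leftover thread immediately shows which ones will block JVM exit.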
All EventLoopGroups have to be shut down to remove all non-daemon threads. I am not sure who creates the defaultEventLoopGroup thread.
For the fix for Geyser (where the above patch is sufficient, in combination with the other PR to Geyser) I used jdb and put a breakpoint on Thread.start() to find the cause of each EventLoopGroup. Can you try that with the test code above?
I think this is caused by the inheritance from TcpSession: the event loop is created on construction of an instance unless the public class member USE_EVENT_LOOP_FOR_PACKETS has been set to false first. That is the case with Geyser, but not in the reproduction test above. As far as I can tell, TcpClientSession never uses this part of TcpSession, but still inherits it.
In general there is a lot of infrastructure in TcpSession that is not reused in TcpClientSession. I chose to ignore it, as I did not understand what was intended, and made the minimum change needed for Geyser to work. As I said in the PR, I do not understand the reasons for the choices that were originally made, so the PR is very "reactive" on my part.
I added a breakpoint at the `Thread` constructor and another breakpoint at `Thread.start()`. When running my reproducer above, the `defaultEventLoopGroup-2-1` thread is instantiated by `com.github.steveice10.packetlib.tcp.TcpSession#channelRead0` in thread `nioEventLoopGroup-3-1`. Here is the call stack:
```
<init>:395, Thread (java.lang)
<init>:708, Thread (java.lang)
<init>:629, Thread (java.lang)
<init>:60, FastThreadLocalThread (io.netty.util.concurrent)
newThread:121, DefaultThreadFactory (io.netty.util.concurrent)
newThread:105, DefaultThreadFactory (io.netty.util.concurrent)
execute:32, ThreadPerTaskExecutor (io.netty.util.concurrent)
execute:57, ThreadExecutorMap$1 (io.netty.util.internal)
doStartThread:975, SingleThreadEventExecutor (io.netty.util.concurrent)
startThread:944, SingleThreadEventExecutor (io.netty.util.concurrent)
execute:827, SingleThreadEventExecutor (io.netty.util.concurrent)
execute:815, SingleThreadEventExecutor (io.netty.util.concurrent)
channelRead0:386, TcpSession (com.github.steveice10.packetlib.tcp)
channelRead0:29, TcpSession (com.github.steveice10.packetlib.tcp)
channelRead:99, SimpleChannelInboundHandler (io.netty.channel)
invokeChannelRead:379, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:365, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:357, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:324, ByteToMessageDecoder (io.netty.handler.codec)
channelRead:296, ByteToMessageDecoder (io.netty.handler.codec)
channelRead:103, ByteToMessageCodec (io.netty.handler.codec)
invokeChannelRead:379, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:365, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:357, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:324, ByteToMessageDecoder (io.netty.handler.codec)
channelRead:296, ByteToMessageDecoder (io.netty.handler.codec)
channelRead:103, ByteToMessageCodec (io.netty.handler.codec)
invokeChannelRead:379, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:365, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:357, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:324, ByteToMessageDecoder (io.netty.handler.codec)
fireChannelRead:311, ByteToMessageDecoder (io.netty.handler.codec)
callDecode:432, ByteToMessageDecoder (io.netty.handler.codec)
channelRead:276, ByteToMessageDecoder (io.netty.handler.codec)
channelRead:103, ByteToMessageCodec (io.netty.handler.codec)
invokeChannelRead:379, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:365, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:357, AbstractChannelHandlerContext (io.netty.channel)
channelRead:286, IdleStateHandler (io.netty.handler.timeout)
invokeChannelRead:379, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:365, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:357, AbstractChannelHandlerContext (io.netty.channel)
channelRead:1410, DefaultChannelPipeline$HeadContext (io.netty.channel)
invokeChannelRead:379, AbstractChannelHandlerContext (io.netty.channel)
invokeChannelRead:365, AbstractChannelHandlerContext (io.netty.channel)
fireChannelRead:919, DefaultChannelPipeline (io.netty.channel)
read:166, AbstractNioByteChannel$NioByteUnsafe (io.netty.channel.nio)
processSelectedKey:719, NioEventLoop (io.netty.channel.nio)
processSelectedKeysOptimized:655, NioEventLoop (io.netty.channel.nio)
processSelectedKeys:581, NioEventLoop (io.netty.channel.nio)
run:493, NioEventLoop (io.netty.channel.nio)
run:986, SingleThreadEventExecutor$4 (io.netty.util.concurrent)
run:74, ThreadExecutorMap$2 (io.netty.util.internal)
run:30, FastThreadLocalRunnable (io.netty.util.concurrent)
run:833, Thread (java.lang)
```
So I guess the event loop is used in my case.
The group is created here and started here, but is never shut down.
Because `TcpSession.eventLoop` is a private member, no one except `TcpSession` itself can shut down the event loop. It is, however, never shut down in `TcpSession`, so I believe this is a bug in the implementation of `TcpSession`. My proposal would be making `TcpSession.eventLoop` protected, or adding a `TcpSession.shutdown` that shuts down `TcpSession.eventLoop`. Then anyone who inherits from `TcpSession` can override `TcpSession.shutdown` and call `super.shutdown` in it to have everything shut down properly.
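The override-and-super pattern proposed above would look roughly like this. A sketch only: `BaseSession`/`ClientSession` are simplified stand-ins for `TcpSession`/`TcpClientSession`, with plain executors in place of Netty event loops:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class BaseSession {
    // In the real library this would be the (currently private) TcpSession.eventLoop.
    protected final ExecutorService eventLoop = Executors.newFixedThreadPool(1);

    /** Subclasses override this and call super.shutdown() last. */
    public void shutdown() {
        eventLoop.shutdown();
    }
}

class ClientSession extends BaseSession {
    private final ExecutorService ioLoop = Executors.newFixedThreadPool(1);

    @Override
    public void shutdown() {
        ioLoop.shutdown();  // shut down what this subclass owns...
        super.shutdown();   // ...then let the base class clean up its own loop
    }
}

public class ShutdownPattern {
    public static void main(String[] args) {
        ClientSession s = new ClientSession();
        s.shutdown();
        System.out.println("base loop shut down: " + s.eventLoop.isShutdown());
    }
}
```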
Yeah, the `TcpSession` EventLoopGroup is used if `USE_EVENT_LOOP_FOR_PACKETS` is true.
If this is to be a more general fix, then the entire use of a class variable should be reconsidered for a generic framework like TcpSession. Or make it reference counted, such that it is shut down on the last use. The current pattern is not very good, as it relies on all users ending at the same time; that only holds when Geyser is the sole user in the process.
The best solution is perhaps to make the application (caller) supply the EventLoopGroup. Then the lifetime of the EventLoopGroup, and indeed which functions share a thread pool, is up to the application.
This is a bigger change than the minimum needed to make Geyser work, though. It is up to someone else to decide the scope of MCProtocolLib.
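The reference-counting idea can be sketched with a shared pool that is created on first acquire and shut down on the last release (illustrative only; `acquire`/`release` are invented names, not library API):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Reference-counted shared pool: shut down only when the last user releases it. */
public class RefCountedPool {
    private static ExecutorService shared;
    private static int refs = 0;

    static synchronized ExecutorService acquire() {
        if (refs++ == 0) {
            shared = Executors.newFixedThreadPool(1); // first user creates the pool
        }
        return shared;
    }

    static synchronized void release() {
        if (--refs == 0) {
            shared.shutdown(); // last user gone: pool shuts down
            shared = null;
        }
    }

    public static void main(String[] args) {
        ExecutorService a = acquire(); // first user
        ExecutorService b = acquire(); // second user shares the same pool
        System.out.println("same pool: " + (a == b));
        release();                     // pool stays up: one user remains
        System.out.println("still running: " + !a.isShutdown());
        release();                     // last release triggers shutdown
        System.out.println("shut down: " + a.isShutdown());
    }
}
```

This removes the "all users must end at the same time" assumption: each session (or embedding application) holds a reference only while it needs the pool.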
An alternative solution is in #748 . That makes the threads daemon threads and thus does not require exposing a new API.
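The daemon-thread approach can be sketched with a custom `ThreadFactory`: workers marked as daemons no longer block JVM exit, so no new shutdown API is needed (a minimal illustration, not the #748 patch itself):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

public class DaemonPoolDemo {
    public static void main(String[] args) throws Exception {
        ThreadFactory daemonFactory = r -> {
            Thread t = new Thread(r, "daemon-worker");
            t.setDaemon(true); // the JVM may exit even while this thread runs
            return t;
        };
        ExecutorService pool = Executors.newFixedThreadPool(1, daemonFactory);

        // Ask the worker to report its own Thread object.
        Thread worker = pool.submit(() -> Thread.currentThread()).get();
        System.out.println("daemon: " + worker.isDaemon());
        // Note: no pool.shutdown() is required for the process to terminate,
        // but daemon threads are killed abruptly at exit rather than being
        // given a chance to finish in-flight work.
    }
}
```

The trade-off is visible in the final comment: daemon threads cannot do orderly cleanup at exit, which is why explicit shutdown control may still be desirable.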
Tested, and it does shut down the event loop when the main thread finishes. However, I guess it is still necessary to allow the application to fully control when to stop the threads, considering the application might want to do something else after connecting to and disconnecting from an MC server. In that case, the application may not want the threads to keep running in the background.
At the moment there is a global pool for TcpClientSession and one for TcpSession (if used). These are created on first use and then remain, as designed.
The current API offers no way to terminate these worker threads. They will lie dormant, consuming mostly swappable memory for their stacks. Still, the threads should not be running if there is no work.
I changed the tail of the reproducer to:

```kotlin
// Sleep for 7 seconds
println("After disconnect")
val newThreads = threadsAfter.subtract(threadsBefore)
println(newThreads)
newThreads.forEach {
    println(it.name + it.state)
}
```
And the output is:

```
[Thread[tcpClientSession-3-1,5,main], Thread[tcpClientSession-1-1,5,main]]
tcpClientSession-3-1RUNNABLE
tcpClientSession-1-1WAITING
```
While, when inspected in the debugger, the `tcpClientSession-3-1` thread is in the RUNNING state. Is it safe or good practice to ignore these two threads?
It would be interesting to see where the two threads are at this point. Is it the packet-handling event loop from TcpSession that is not blocking, or is it the IO-handling event loop from TcpClientSession?
Also interesting is which strategy was selected in TcpClientSession.createTcpEventLoopGroup().
The one waiting (`tcpClientSession-1-1`) is awaiting new tasks:
```
park:-1, Unsafe (jdk.internal.misc)
park:341, LockSupport (java.util.concurrent.locks)
block:506, AbstractQueuedSynchronizer$ConditionNode (java.util.concurrent.locks)
unmanagedBlock:3463, ForkJoinPool (java.util.concurrent)
managedBlock:3434, ForkJoinPool (java.util.concurrent)
await:1623, AbstractQueuedSynchronizer$ConditionObject (java.util.concurrent.locks)
take:435, LinkedBlockingQueue (java.util.concurrent)
takeTask:243, SingleThreadEventExecutor (io.netty.util.concurrent)
run:52, DefaultEventLoop (io.netty.channel)
run:986, SingleThreadEventExecutor$4 (io.netty.util.concurrent)
run:74, ThreadExecutorMap$2 (io.netty.util.internal)
run:30, FastThreadLocalRunnable (io.netty.util.concurrent)
run:833, Thread (java.lang)
```
The other one, `tcpClientSession-3-1`, is in epoll wait:
```
wait:-1, WEPoll (sun.nio.ch)
doSelect:111, WEPollSelectorImpl (sun.nio.ch)
lockAndDoSelect:129, SelectorImpl (sun.nio.ch)
select:146, SelectorImpl (sun.nio.ch)
select:68, SelectedSelectionKeySetSelector (io.netty.channel.nio)
select:810, NioEventLoop (io.netty.channel.nio)
run:457, NioEventLoop (io.netty.channel.nio)
run:986, SingleThreadEventExecutor$4 (io.netty.util.concurrent)
run:74, ThreadExecutorMap$2 (io.netty.util.internal)
run:30, FastThreadLocalRunnable (io.netty.util.concurrent)
run:833, Thread (java.lang)
```
Not sure why this thread is in RUNNABLE/RUNNING state (sorry, I'm not familiar with how epoll works in NIO).
WEPoll should block the thread. The Java runtime and debugger are not going to be aware that the thread is blocked in a system call, but the kernel is. If you look with `ps -L` I expect that this thread (if you can identify it) is in "interruptible sleep" (S), which more or less means it is suspended until it either reaches its specified timeout or receives a Unix signal. It should not be scheduled on any CPU core. It will be in a blocking syscall, `epoll`.
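This RUNNABLE-while-blocked behavior is easy to reproduce with NIO directly: `Thread.getState()` reports `RUNNABLE` for a thread that is parked inside a selector's native poll call, because to the JVM it is "executing" a method even though the kernel has it suspended:

```java
import java.io.IOException;
import java.nio.channels.Selector;

public class SelectStateDemo {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        Thread poller = new Thread(() -> {
            try {
                // Blocks in the OS poll/epoll syscall for up to 5 seconds.
                selector.select(5000);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }, "poller");
        poller.start();

        Thread.sleep(200); // give the thread time to enter select()
        // Blocked in a native I/O call, yet the JVM reports RUNNABLE:
        System.out.println("state: " + poller.getState());

        selector.wakeup(); // unblock immediately instead of waiting out the timeout
        poller.join();
        selector.close();
    }
}
```

Only JVM-level blocking (monitors, `LockSupport.park`, `Object.wait`) produces BLOCKED/WAITING/TIMED_WAITING states; blocking inside native code does not.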
Okayyyy, that makes sense now. Thanks for the explanation.
I guess this can be closed now that https://github.com/GeyserMC/MCProtocolLib/pull/748 was merged?
Yup! Thanks again ;)
Hi, I had a similar problem with #408 with both 1.18.2-1 and 1.18.2-2-SNAPSHOT. Not sure what I did wrong but here is what happened:
The dependencies in my `build.gradle`:

Code snippet:
And the output:
Note the extra NIO thread at the end, which prevents the JVM from exiting.