H-uru / Plasma

Cyan Worlds's Plasma game engine
http://h-uru.github.io/Plasma/
GNU General Public License v3.0

Attempt to fix crash after receiving a KickedOff message #1521

Open · dgelessus opened 8 months ago

dgelessus commented 8 months ago

The client reliably crashes after the server sends an Auth2Cli_KickedOff message. I assume this went unnoticed because DIRTSAND never sends that message, and even on other servers it doesn't occur in normal gameplay.

I spent a while staring at this in the debugger. Apparently the underlying issue is that the auth server socket is closed by its own notify callback (by the NetCliAuthDisconnect call in RecvMsg<Auth2Cli_KickedOff>). The async socket code doesn't notice the disconnect in this case, so the disconnect notify callback is never called and the auth connection is never shut down properly. This causes the client to hang on exit until it tries to send something to the auth server (usually a ping), which then crashes from trying to write to a closed socket.
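
To make the failure mode concrete, here's a minimal standalone sketch of it as I understand it (not the actual NetCli/async socket code; all names here are made up). The read loop only ever reports disconnects that it detects itself, so a close issued from inside its own message callback slips past unnoticed:

```cpp
#include <cstdio>
#include <functional>

struct FakeSocket {
    bool open = true;
    std::function<void(FakeSocket&)> onMessage;     // analogous to the notify callback
    std::function<void(FakeSocket&)> onDisconnect;  // never fired in the buggy path

    void close() { open = false; }  // roughly what NetCliAuthDisconnect ends up doing

    // One iteration of the read loop.
    void pumpOnce() {
        if (!open)
            return;      // closed locally: silently stop reading,
                         // without ever calling onDisconnect
        onMessage(*this);
        // The buggy loop only checks for a *remote* hangup here, so a
        // close() issued re-entrantly inside onMessage goes unnoticed.
    }

    void send() {
        if (!open)
            std::puts("write to closed socket -> crash in the real client");
    }
};

int main() {
    FakeSocket s;
    s.onDisconnect = [](FakeSocket&) { std::puts("disconnect notify (never reached)"); };
    s.onMessage = [](FakeSocket& sock) {
        std::puts("got KickedOff, closing from inside the callback");
        sock.close();    // socket closed by its own notify callback
    };
    s.pumpOnce();        // handler closes the socket
    s.send();            // later ping: write to a closed socket
}
```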

This PR works around the issue by closing the auth connection asynchronously, outside the notify callback, and by tightening the locking in the async socket code so that nothing else can manipulate a socket while its callbacks are running. This seems to fix the hang/crash, but I'm not confident that it's a complete or good solution, so I've made this a draft for now.
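
For illustration, here's the general shape of that workaround as a hedged sketch rather than the actual diff; deferToIoLoop and the queue are invented names standing in for whatever mechanism the async layer uses to run work outside callbacks:

```cpp
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>

std::mutex gDeferredLock;
std::queue<std::function<void()>> gDeferredOps;

// Queue work to run on the socket I/O loop after all callbacks have returned.
void deferToIoLoop(std::function<void()> op) {
    std::lock_guard<std::mutex> lock(gDeferredLock);
    gDeferredOps.push(std::move(op));
}

// What the I/O loop does once per iteration, outside any notify callback.
void drainDeferredOps() {
    for (;;) {
        std::function<void()> op;
        {
            std::lock_guard<std::mutex> lock(gDeferredLock);
            if (gDeferredOps.empty())
                return;
            op = std::move(gDeferredOps.front());
            gDeferredOps.pop();
        }
        op();  // runs with no socket locks held
    }
}

int main() {
    // Inside the Auth2Cli_KickedOff handler one would queue the disconnect
    // instead of calling it directly, e.g.:
    //     deferToIoLoop([] { NetCliAuthDisconnect(); });
    deferToIoLoop([] { std::puts("closing auth connection outside the callback"); });
    drainDeferredOps();  // the I/O loop picks it up after the callback returns
}
```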

I feel like the right solution would be to fire the close notify callback some other way, without relying on the read operation to detect that the socket was closed, but I have no idea how...
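
For what it's worth, one hypothetical shape for that (again, all names invented) would be to make the local close path schedule the notify itself, so nothing depends on a read operation noticing that the socket is gone:

```cpp
#include <cstdio>
#include <functional>
#include <queue>

// Stand-in for however the async layer runs work on the I/O thread.
std::queue<std::function<void()>> gIoQueue;
void postToIoLoop(std::function<void()> fn) { gIoQueue.push(std::move(fn)); }

struct Socket {
    bool open = true;
    std::function<void(Socket*)> notifyDisconnect;

    // A local close schedules the disconnect notify itself; posting it to
    // the I/O loop keeps it from running re-entrantly inside another callback.
    void close() {
        if (!open)
            return;
        open = false;
        postToIoLoop([this] {
            if (notifyDisconnect)
                notifyDisconnect(this);
        });
    }
};

int main() {
    Socket s;
    s.notifyDisconnect = [](Socket*) { std::puts("disconnect notify fired"); };
    s.close();                   // even a close from inside a callback works now
    while (!gIoQueue.empty()) {  // one tick of the I/O loop
        gIoQueue.front()();
        gIoQueue.pop();
    }
}
```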

dgelessus commented 8 months ago

Welp, I just hit a deadlock with these changes active: the main thread was sending a transaction that sends an auth message (i.e. it held the plNglTrans s_critsect and was waiting on the auth socket fCritsect), while one of the socket I/O worker threads was dispatching an auth message whose handler sends a transaction (i.e. it held the auth socket fCritsect and was waiting on the plNglTrans s_critsect).
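
That's a textbook lock-order inversion. A minimal repro, with plain std::mutex standing in for the two critical sections:

```cpp
#include <chrono>
#include <mutex>
#include <thread>

std::mutex transLock;   // stands in for the plNglTrans s_critsect
std::mutex socketLock;  // stands in for the auth socket fCritsect

// Main thread: locks the transaction machinery, then needs the socket to send.
void mainThreadSendsTransaction() {
    std::lock_guard<std::mutex> trans(transLock);
    std::this_thread::sleep_for(std::chrono::milliseconds(10));  // widen the window
    std::lock_guard<std::mutex> sock(socketLock);
}

// I/O worker: holds the socket lock for the callback, whose handler
// then starts a transaction.
void ioThreadDispatchesMessage() {
    std::lock_guard<std::mutex> sock(socketLock);
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    std::lock_guard<std::mutex> trans(transLock);
}

int main() {
    std::thread t1(mainThreadSendsTransaction);
    std::thread t2(ioThreadDispatchesMessage);
    t1.join();  // neither join returns: each thread waits on the other's lock
    t2.join();
}
```

The standard escapes are a fixed global lock order or taking both locks at once with std::scoped_lock, but neither retrofits easily when the socket layer can call back into the transaction layer.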

So we can't just keep the socket locked for the entire notify callback. I suppose I could revert the locking changes: in my testing, 8a73bfc9d9e48cfbdce55924be9347c0aecc0079 on its own already avoids the crash, but without the extra locking the crash could still occur with some unlucky timing/scheduling...