**ctm** opened 3 days ago
I have not been able to reproduce it, but code inspection strongly suggests this is a harmless race condition that gets hit if the client receives an Ok `LoginResult` and sends a ping that is received by `NickMapper` before `NickMapper` updates `*act` at the end of `make_initial_contact`.
This race condition is a side effect of how I'm using actix-web's actor system. What I'm doing is safe, because a ping doesn't do anything and is sent asynchronously with respect to everything else. As such, I can simply detect the `Ping` as a special case in `NewConnection` and not call `warn!`.
Currently, lobby connections have only two states: `NewConnection` and `LoggedIn`, and `Lobby::process` doesn't transition from one to the other until it receives info back from `NickMapper`. But before `NickMapper` returns success or failure to `process`, it has already sent the status to the client. That's where the race comes in, because after success or failure is sent to the client, we also send some bookkeeping info. Without sending the bookkeeping there would still be a race, but it would essentially never be hit, because the time it takes for the success packet to travel through the network to the client is always going to be longer than non-I/O CPU cycles. However, gathering and sending the bookkeeping info is I/O, and as mb2 has added features, the amount of bookkeeping info we send has increased. So…
If we aren't comfortable with simply ignoring a `Ping` that arrives while we're still a `NewConnection`, we could choose not to ping until after we've received the end of the bookkeeping info, which is currently not demarcated, IIRC. FWIW, we could also change things so the ping timer itself doesn't start running until we receive a `LoginResult`, although that would require a minor change to the `Pingable` trait, because `Pingable` starts the timer in `new_pinger`, and it's less complicated for `Lobby` to have a `Pinger` than an `Option<Pinger>`.
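For concreteness, the `Option<Pinger>` alternative being weighed above might look something like this minimal sketch. All names are hypothetical (the real `Pingable` trait starts an actix interval timer; here the "timer" is just a flag), but it shows the extra layer of indirection that deferring the timer would buy:

```rust
// Sketch of the deferred-ping alternative (all names hypothetical): hold an
// Option<Pinger> and only start pinging once a LoginResult has arrived, so a
// Ping can never race the NewConnection -> LoggedIn transition.
struct Pinger {
    started: bool,
}

impl Pinger {
    fn new() -> Self {
        // In the real code, new_pinger starts the ping timer immediately;
        // here we just record that it is running.
        Pinger { started: true }
    }
}

struct Lobby {
    pinger: Option<Pinger>, // None until login succeeds
}

impl Lobby {
    fn new() -> Self {
        Lobby { pinger: None }
    }

    fn on_login_result(&mut self) {
        // Only now does the ping timer start.
        self.pinger = Some(Pinger::new());
    }

    fn is_pinging(&self) -> bool {
        self.pinger.as_ref().map_or(false, |p| p.started)
    }
}

fn main() {
    let mut lobby = Lobby::new();
    assert!(!lobby.is_pinging());
    lobby.on_login_result();
    assert!(lobby.is_pinging());
    println!("ok");
}
```

The cost is exactly the one noted above: every use of the pinger now has to go through the `Option`.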
D'oh! Turns out we send a `DoneLoadingState` message when we've sent the last of the bookkeeping information, so I'll just make the lobby not send a ping until it's received that message, and I'll also move the sending of `DoneLoadingState` from the bottom of `login_and_create_common` to its own handler and only invoke that handler after `*act` has been updated.
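The key point in that plan is the ordering: update the actor's state first, then send `DoneLoadingState`. A toy model of that ordering, with invented identifiers (the real code updates `*act` inside an actix actor and sends over a websocket):

```rust
// Toy model of the intended ordering (identifiers invented for illustration):
// update the actor's state first, and only then send DoneLoadingState, so by
// the time the client can react, the actor is no longer a NewConnection.

#[derive(Debug, Clone, Copy, PartialEq)]
enum ConnState {
    NewConnection,
    LoggedIn,
}

struct Actor {
    state: ConnState,
    outbox: Vec<&'static str>, // stand-in for messages sent to the client
}

impl Actor {
    fn finish_login(&mut self) {
        // 1. Update *act first ...
        self.state = ConnState::LoggedIn;
        // 2. ... then tell the client the bookkeeping info is complete.
        self.outbox.push("DoneLoadingState");
    }
}

fn main() {
    let mut actor = Actor { state: ConnState::NewConnection, outbox: vec![] };
    actor.finish_login();
    // Anything the client sends after seeing DoneLoadingState is handled
    // by a LoggedIn actor, never a NewConnection one.
    assert_eq!(actor.state, ConnState::LoggedIn);
    assert_eq!(actor.outbox, vec!["DoneLoadingState"]);
    println!("ok");
}
```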
Some of what I wrote this morning was incorrect. The biggest thing is that right after the call to `warn!` was a call to `ctx.close(None);`, which meant that the other side got hosed and the player would have had to log in again. So, I could have removed that `close` and special-cased receiving a `Ping`, except that has a subtle bug that's a little time consuming to explain, but I'll try.
The subtle bug is a side effect of us having three different classes of requests: pre-login, lobby, and table. There is nothing that guarantees that the byte representation of a lobby ping wouldn't happen to be the byte representation of a valid pre-login request. If that were to happen, an errant lobby ping would invoke this hypothetical pre-login request. So, I went whole hog and refactored the state loading into a separate `LoadState` request and—most importantly—put the sending of the "you have successfully logged in" message into `LoadState`. That completely eliminates the race condition, but…
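The byte-representation hazard described above can be demonstrated in miniature. This uses a made-up one-byte wire format, not mb2's actual serde encoding, but it shows the shape of the problem: two independent request enums with nothing forcing their encodings to be disjoint.

```rust
// Illustration of the ambiguity (made-up wire format, hypothetical types):
// nothing ties the request classes to disjoint byte representations, so one
// byte sequence can decode as a valid message in more than one of them.

#[derive(Debug, PartialEq)]
enum LobbyRequest {
    Ping,
}

#[derive(Debug, PartialEq)]
enum PreLoginRequest {
    Hello,
}

// Both decoders happen to accept tag byte 0 in this invented encoding.
fn decode_lobby(bytes: &[u8]) -> Option<LobbyRequest> {
    match bytes {
        [0] => Some(LobbyRequest::Ping),
        _ => None,
    }
}

fn decode_pre_login(bytes: &[u8]) -> Option<PreLoginRequest> {
    match bytes {
        [0] => Some(PreLoginRequest::Hello),
        _ => None,
    }
}

fn main() {
    let wire = [0u8];
    // The same bytes are a valid Ping *and* a valid pre-login request, so
    // an errant lobby ping could silently invoke the wrong handler.
    assert_eq!(decode_lobby(&wire), Some(LobbyRequest::Ping));
    assert_eq!(decode_pre_login(&wire), Some(PreLoginRequest::Hello));
    println!("ok");
}
```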
…I forgot to send the "you have failed" message and didn't even test a failing login, so although I've deployed my "fix", I'm reverting it and will have to include the error path before I can deploy. Ugh.
I've fixed the error path (and tested it!) and it's in master, waiting to deploy. It'll go out after this evening's game is over.
Ugh. It happened again after my fix:

```
2024-07-01T23:46:57.910527526Z WARN [mb2] NewConnection: SerdeDeCustom, as lobby: Ok(Ping), as table: Err(DeserializeUnexpectedEnd)
```
That's 5:46 pm MDT, i.e. about twenty minutes before the start of this evening's game.
FWIW, looking at the complete log file, I see:

```
2024-07-01T23:46:57.985828900Z INFO [actix_web::middleware::logger] FA83616 891.361964 101 18 "GET /ws/ HTTP/1.1" INM:- E:-
```

i.e. a WebSocket connection closed 0.07 seconds after that `Ping`, and mb2 closes the websocket that sent the ping, so `FA83616` identifies the connection (without revealing the IP address). Slogging around in the database shows an `ExpiredSession` message being sent shortly before the errant ping:
```
mb2=> select id, created_at, username, substr(message::text, 0, 160) from login_result_messages where created_at >= '2024-07-01T23:25:57.812622618Z' and created_at <= '2024-07-01T23:46:57.910527526Z' order by created_at;
  id   |          created_at           |  username  |                              substr
-------+-------------------------------+------------+------------------------------------------------------------------------
 ...
 83960 | 2024-07-01 23:46:57.809962+00 |            | {"LoginResult": {"Err": {"ExpiredSession": [1948681912099405644, 13]}}}
```
So, this strongly suggests that there's a race condition in the other direction, where `NickMapper` says "hey, that session is no longer valid; therefore you're no longer logged in," and before that message gets to the client, the client (thinking it's still logged in) sends a `Ping`.
It's a little late in the day for me to fix this today, but at least I have a good idea of what's going on.
Fix problem that caused `Ping` to be received in `NewConnection`.
That's 6:00:37 pm MDT on Saturday, June 29th, and I was watching WSER, so I know my laptop wasn't closed (although I could have had an internet interruption), so it probably didn't come from me.
My guess is that this is easily reproducible, in which case it's trivial to fix. However, it points out that we may want to start emitting the connection digest with at least some warnings (like this one), since that would make it easier to figure out what happened. (For the reader: the connection digest uniquely identifies a connection in a way that does not allow the IP address of the other side to be revealed or recovered.)