**ctm** opened 3 days ago
I have not been able to reproduce it, but code inspection strongly suggests this is a harmless race condition that gets hit if the client receives an Ok `LoginResult` and sends a ping that is received by `NickMapper` before `NickMapper` updates `*act` at the end of `make_initial_contact`.
This race condition is a side effect of how I'm using actix-web's actor system. What I'm doing is safe, because a ping doesn't do anything and is sent asynchronously with respect to everything else. As such, I can simply detect the `Ping` as a special case in `NewConnection` and not call `warn!`.
Currently, lobby connections have only two states: `NewConnection` and `LoggedIn`, and `Lobby::process` doesn't transition from one to the other until it receives info back from `NickMapper`. But before `NickMapper` returns success or failure to `process`, it has already sent the status to the client. That's where the race comes in, because after success or failure is sent to the client, we also send some bookkeeping info. Without sending the bookkeeping there would still be a race, but it would essentially never be hit, because the time it takes for the success packet to travel through the network to the client is always going to be longer than non-I/O CPU cycles. However, gathering and sending the bookkeeping info is I/O, and as mb2 has added features, the amount of bookkeeping info we send has increased. So…
If we aren't comfortable with simply ignoring a `Ping` that arrives while we're still a `NewConnection`, we could choose not to ping until after we've received the end of the bookkeeping info, which is currently not demarcated, IIRC. FWIW, we could also change things so the ping timer itself doesn't start running until we receive a `LoginResult`, although that would require a minor change to the `Pingable` trait, because `Pingable` starts the timer in `new_pinger`, and it's less complicated for `Lobby` to have a `Pinger` than an `Option<Pinger>`.
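For concreteness, the `Option<Pinger>` alternative being weighed above might look something like this minimal sketch. All names are hypothetical (the real `Pingable` trait starts an actix interval timer; here the "timer" is just a flag), but it shows the extra layer of indirection that deferring the timer would buy:

```rust
// Sketch of the deferred-ping alternative (all names hypothetical): hold an
// Option<Pinger> and only start pinging once a LoginResult has arrived, so a
// Ping can never race the NewConnection -> LoggedIn transition.
struct Pinger {
    started: bool,
}

impl Pinger {
    fn new() -> Self {
        // In the real code, new_pinger starts the ping timer immediately;
        // here we just record that it is running.
        Pinger { started: true }
    }
}

struct Lobby {
    pinger: Option<Pinger>, // None until login succeeds
}

impl Lobby {
    fn new() -> Self {
        Lobby { pinger: None }
    }

    fn on_login_result(&mut self) {
        // Only now does the ping timer start.
        self.pinger = Some(Pinger::new());
    }

    fn is_pinging(&self) -> bool {
        self.pinger.as_ref().map_or(false, |p| p.started)
    }
}

fn main() {
    let mut lobby = Lobby::new();
    assert!(!lobby.is_pinging());
    lobby.on_login_result();
    assert!(lobby.is_pinging());
    println!("ok");
}
```

The cost is exactly the one noted above: every use of the pinger now has to go through the `Option`.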
D'oh! Turns out we send a `DoneLoadingState` message when we've sent the last of the bookkeeping information, so I'll just make the lobby not send a ping until it's received that message, and I'll also move the sending of `DoneLoadingState` from the bottom of `login_and_create_common` to its own handler and only invoke that handler after `*act` has been updated.
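The key point in that plan is the ordering: update the actor's state first, then send `DoneLoadingState`. A toy model of that ordering, with invented identifiers (the real code updates `*act` inside an actix actor and sends over a websocket):

```rust
// Toy model of the intended ordering (identifiers invented for illustration):
// update the actor's state first, and only then send DoneLoadingState, so by
// the time the client can react, the actor is no longer a NewConnection.

#[derive(Debug, Clone, Copy, PartialEq)]
enum ConnState {
    NewConnection,
    LoggedIn,
}

struct Actor {
    state: ConnState,
    outbox: Vec<&'static str>, // stand-in for messages sent to the client
}

impl Actor {
    fn finish_login(&mut self) {
        // 1. Update *act first ...
        self.state = ConnState::LoggedIn;
        // 2. ... then tell the client the bookkeeping info is complete.
        self.outbox.push("DoneLoadingState");
    }
}

fn main() {
    let mut actor = Actor { state: ConnState::NewConnection, outbox: vec![] };
    actor.finish_login();
    // Anything the client sends after seeing DoneLoadingState is handled
    // by a LoggedIn actor, never a NewConnection one.
    assert_eq!(actor.state, ConnState::LoggedIn);
    assert_eq!(actor.outbox, vec!["DoneLoadingState"]);
    println!("ok");
}
```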
Some of what I wrote this morning was incorrect. The biggest thing is that right after the call to `warn!` was a call to `ctx.close(None);`, which meant that the other side got hosed and the player would have had to log in again. So, I could have removed that `close` and special-cased receiving a `Ping`, except that has a subtle bug that's a little time consuming to explain, but I'll try.
The subtle bug is a side effect of us having three different classes of requests: pre-login, lobby, and table. There is nothing that guarantees that the byte representation of a lobby ping wouldn't happen to be the byte representation of a valid pre-login request. If that were to happen, an errant lobby ping would invoke this hypothetical pre-login request. So, I went whole hog and refactored the state loading into a separate `LoadState` request and—most importantly—put the sending of the "you have successfully logged in" message into `LoadState`. That completely eliminates the race condition, but…
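The byte-representation hazard described above can be demonstrated in miniature. This uses a made-up one-byte wire format, not mb2's actual serde encoding, but it shows the shape of the problem: two independent request enums with nothing forcing their encodings to be disjoint.

```rust
// Illustration of the ambiguity (made-up wire format, hypothetical types):
// nothing ties the request classes to disjoint byte representations, so one
// byte sequence can decode as a valid message in more than one of them.

#[derive(Debug, PartialEq)]
enum LobbyRequest {
    Ping,
}

#[derive(Debug, PartialEq)]
enum PreLoginRequest {
    Hello,
}

// Both decoders happen to accept tag byte 0 in this invented encoding.
fn decode_lobby(bytes: &[u8]) -> Option<LobbyRequest> {
    match bytes {
        [0] => Some(LobbyRequest::Ping),
        _ => None,
    }
}

fn decode_pre_login(bytes: &[u8]) -> Option<PreLoginRequest> {
    match bytes {
        [0] => Some(PreLoginRequest::Hello),
        _ => None,
    }
}

fn main() {
    let wire = [0u8];
    // The same bytes are a valid Ping *and* a valid pre-login request, so
    // an errant lobby ping could silently invoke the wrong handler.
    assert_eq!(decode_lobby(&wire), Some(LobbyRequest::Ping));
    assert_eq!(decode_pre_login(&wire), Some(PreLoginRequest::Hello));
    println!("ok");
}
```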
…I forgot to send the "you have failed" message and didn't even test a failing login, so although I've deployed my "fix", I'm reverting it and will have to include the error path before I can deploy. Ugh.
I've fixed the error path (and tested it!) and it's in master, waiting to deploy. It'll go out after this evening's game is over.
Ugh. It happened again after my fix:

```
2024-07-01T23:46:57.910527526Z WARN [mb2] NewConnection: SerdeDeCustom, as lobby: Ok(Ping), as table: Err(DeserializeUnexpectedEnd)
```
That's 5:46 pm MDT, i.e. about twenty minutes before the start of this evening's game.
FWIW, looking at the complete log file, I see:

```
2024-07-01T23:46:57.985828900Z INFO [actix_web::middleware::logger] FA83616 891.361964 101 18 "GET /ws/ HTTP/1.1" INM:- E:-
```

i.e. a WebSocket connection closed 0.07 seconds after that `Ping`, and mb2 closes the websocket that sent the ping, so `FA83616` identifies the connection (without revealing the IP address). Slogging around in the database shows an `ExpiredSession` message being sent shortly before the errant ping:
```
mb2=> select id, created_at, username, substr(message::text, 0, 160) from login_result_messages where created_at >= '2024-07-01T23:25:57.812622618Z' and created_at <= '2024-07-01T23:46:57.910527526Z' order by created_at;
  id   |          created_at           |  username  |                              substr
-------+-------------------------------+------------+------------------------------------------------------------------------
 ...
 83960 | 2024-07-01 23:46:57.809962+00 |            | {"LoginResult": {"Err": {"ExpiredSession": [1948681912099405644, 13]}}}
```
So, this strongly suggests that there's a race condition in the other direction, where `NickMapper` says "hey, that session is no longer valid; therefore you're no longer logged in," and before that message gets to the client, the client (thinking it's still logged in) sends a `Ping`.
It's a little late in the day for me to fix this today, but at least I have a good idea of what's going on.
Fix problem that caused `Ping` to be received in `NewConnection`.
That's 6:00:37 pm MDT on Saturday, June 29th, and I was watching WSER, so I know my laptop wasn't closed (although I could have had an internet interruption), so it probably didn't come from me.
My guess is that this is easily reproducible, in which case it's trivial to fix. However, it points out that we may want to start emitting the connection digest with at least some warnings (like this one), since that would make it easier to figure out what happened. (For the reader: the connection digest uniquely identifies a connection in a way that does not allow the IP address of the other side to be revealed or recovered.)