Handle dropped gamestate after UDP download

Chomenor commented 4 weeks ago

Currently, when a UDP download completes, there is a possibility that the gamestate message from the server is dropped due to network packet loss. Normally dropped gamestates are detected and resent by checking if client messages contain an old serverid value, but after downloads the client can actually have the correct serverid (if the map didn't change). As a result, the gamestate is never resent and the client can hang with "awaiting gamestate" message.

This fix adds an extra check for dropped gamestate for clients that just finished a UDP download, based on detecting if the client is sending messages with no move commands. Since this check is not quite as standard as the serverid method, I added some extra checks to minimize any chance of false positives in non-standard clients. This fix has no effect on non-UDP downloading clients.

ec- commented 3 weeks ago

I'm reworked retransmission and bring back serverId update, please check https://github.com/ec-/Quake3e/commit/f8e7b429b9098e3fc009967cb5f6e5dc4dec9260

Chomenor commented 3 weeks ago

One issue I can see is that this forces a gamestate retransmit if a map restart occurs while the client is loading the map. For example, on a server with a short warmup, this may cause slow loading clients to be sent for a second map load after the warmup completes.

May I ask what is the compatibility reason for bringing back the serverid change during map restart?

ec- commented 3 weeks ago

May I ask what is the compatibility reason for bringing back the serverid change during map restart?

It is needed to discard outdated commands for CS_ACTIVE clients

One issue I can see is that this forces a gamestate retransmit if a map restart occurs while the client is loading the map

This is expected and intentional: client must acknowledge gamestate, new map / map restart == new gamestate

For example, on a server with a short warmup, this may cause slow loading clients to be sent for a second map load after the warmup completes

Is this a real issue? Because there is no map restart from server-side at warmup end, it just sends map_restart command to all clients

Chomenor commented 2 weeks ago

It is needed to discard outdated commands for CS_ACTIVE clients

I thought we had already addressed this with client->lastUsercmd.serverTime = sv.time - 1 to prevent pre-restart move commands being executed. Can you be more specific about what was wrong with that approach?

Is this a real issue? Because there is no map restart from server-side at warmup end, it just sends map_restart command to all clients

That behavior appears to be specific to baseq3a. I assume it is an issue to some degree for most other mods.

ec- commented 2 weeks ago

I thought we had already addressed this with client->lastUsercmd.serverTime = sv.time - 1 to prevent pre-restart move commands being executed. Can you be more specific about what was wrong with that approach?

Move commands only, there's also client commands clc_clientCommand so for example you might execute outdated kill command when you shouldn't, or execute SV_ClientEnterWorld() relying on old serverId/wrong gamestate

That behavior appears to be specific to baseq3a. I assume it is an issue to some degree for most other mods.

That's correct, although that extra re-load is very unlikely and not a big deal comparing to maintaining overall coherency

Chomenor commented 2 weeks ago

you might execute outdated kill command when you shouldn't

It's a reliable command, so if you happened to (coincidentally) call it right before a map restart, won't the client just keep resending and it will still be executed post-restart anyway?

execute SV_ClientEnterWorld() relying on old serverId/wrong gamestate

Commands from wrong gamestate were already dropped, I don't understand how it relates to map restart.

Chomenor commented 1 week ago

The current q3e version does seem to have fixed the issue with dropped gamestates after downloads, but it does it in a way that may not be ideal for client compatibility. The server currently expects the client to acknowledge the exact message number of the gamestate, or a gamestate retransmit is forced. Normally the client does sent the correct acknowledge, due to these lines;

https://github.com/ec-/Quake3e/blob/9fd02e419dbb1f5689fd989c66f17001e8ad44dd/code/client/cl_main.c#L2056-L2058

However, if you comment out these lines, the client will go into an infinite loop receiving gamestates when trying to join a server running the current (post f8e7b429b9098e3fc009967cb5f6e5dc4dec9260) version of q3e. If you add NET_Sleep(0); ahead of these lines it also causes a loop.

I'm concerned this is adding an unnecessary dependency on the client, where potentially subtle changes can break compatibility with newer q3e versions (but still work on every other server). Sometimes players have old versions of unusual clients and it's hard to account for the behavior of every possible client.

I've updated this PR to restore a more traditional serverid-based gamestate check for non-downloading clients, with a separate message acknowledge-based check for downloading clients (with guardrails to avoid gamestate loops).

ec- commented 1 week ago

At User_Move() you might have serverId = sv.serverId but incorrect messageAcknowledge which will place client into game so exact gamestate acknowledge is important
Issue with multiple CL_WritePacket() has been fixed there https://github.com/ec-/Quake3e/commit/8567b041edbea781648e5e8698d680c33e2a76d6
Issue with drifting chan->outgoingSequence which complicates gamestate acknowledge has beed fixed there https://github.com/ec-/Quake3e/commit/909ac2b369ad813586907fdb90dc3ae57b7b1b9d

Chomenor commented 1 week ago

I'm just looking at the first point to make sure we're on the same page, ignoring the no-snapshot or client changes for now. Are you saying the current version of this PR has a problem with incorrectly placing clients into the game?

As far as I know correct serverID + clc_move should always mean client is ready to be placed in the game. It's not apparent to me there is any need or even value in additional messageAcknowledge based checks before spawning the client. Other engines (vanilla, ioq3, etc.) have no such check either and I'm not aware of any issues linked to it.

Could you provide steps for how this "serverId = sv.serverId but incorrect messageAcknowledge" scenario would unfold? It's possible I missed something but at the moment I don't understand how this would happen.

ec- commented 5 days ago

It is not correct to allow client to join by simple serverId == sv.serverId check as it may be a reply to any previous gamestate client received - so exact serverId == sv.serverId && cl->messageAcknowledge == cl->gamestateMessageNum check is mandatory

Chomenor commented 5 days ago

My understanding of the connection process is like this:

1) Client connects or map changes 2) Client is set to CS_PRIMED. Gamestate is sent, and resent 1 at a time if previous one is confirmed dropped. 3) Correct serverId + clc_move confirms gamestate for current map has been loaded. Client is set to CS_ACTIVE. Pending configstring updates are sent.

I still don't understand the problem. The serverId does not allow just any gamestate, it has to be one from the current map. The way the connection process works, it should generally always be the most recent one sent, but even if an older version did somehow get loaded it would likely still be fine because of the way SV_UpdateConfigstrings takes care of modified configstrings.

ec- commented 4 days ago

client could receive incorrect configstrings because csUpdated[] now is reset after sending (to avoid potential overflows)
dedicated gamestateAcked variable is needed to skip sending snapshots to clients

Chomenor commented 4 days ago

client could receive incorrect configstrings because csUpdated[] now is reset after sending (to avoid potential overflows)

Seems questionable this is very useful. For an overflow to happen because of extra csUpdated values accumulated prior to a gamestate resend, but not due to the (likely much longer) map loading time, seems very unlikely...

dedicated gamestateAcked variable is needed to skip sending snapshots to clients

Why is it needed to skip sending snapshots to clients? Why is dedicated variable needed instead of just skipping for non CS_ACTIVE?

ec- commented 4 days ago

Seems questionable this is very useful. For an overflow to happen because of extra csUpdated values accumulated prior to a gamestate resend, but not due to the (likely much longer) map loading time, seems very unlikely...

Gamestate1: CS 1 "0" ... CS 1 "1" - csUpdated[1] = qtrue ... CS X "1" - csUpdated[X] = qtrue -> potential overflow till client confirms Gamestate#1 so all csUpdated[X] must be reset after sending Gamestate1
Gamestate2: CS 1 "1"
Client received Gamestate1, achnowledge to server is lost, server sends Gamestate2, if server accepts Gamestate1 acknowledge then client will have incorrect CS 1 "0"

Why is it needed to skip sending snapshots to clients?

To avoid drifting netchan.outgoingSequence

Why is dedicated variable needed instead of just skipping for non CS_ACTIVE?

It might be not needed but potential side-effects needs to be analyzed

Chomenor commented 4 days ago

potential overflow till client confirms Gamestate#1 so all csUpdated[X] must be reset after sending Gamestate1

Could you explain what you mean by the potential overflow? I know there is always a remote chance of an overflow when using csUpdated, since it causes commands to be sent when the client goes to CS_ACTIVE which technically could cause a command count or snapshot size overflow. Clearing csUpdated values between gamestate retransmits could help reduce the amount of commands in heavy packet loss (repeated gamestate retransmit) situations, but I suspect the chance of an overflow is so rare to begin with that it doesn't make more than a marginal difference.

Client received Gamestate1, achnowledge to server is lost, server sends Gamestate2, if server accepts Gamestate1 acknowledge

With the original gamestate check it shouldn't really be possible for the acknowledge to be "lost". The way the check works is:

Server sends gamestate. cl->gamestateMessageNum is set to the outgoing message number of the gamestate.
Server sends a stream of snapshots containing higher message numbers to the client. Client sends back a stream of messages containing the serverID and acknowledge number of most recent server message.
If the client sends a message with the acknowledge number of one of the snapshots rather than the gamestate (higher than cl->gamestateMessageNum), and the wrong serverID, then the server knows the gamestate is dropped. There is no way for the client to receive the old gamestate at this point because the network code ignores out of order messages and the old gamestate has a lower message number than the acknowledged snapshot.

To avoid drifting netchan.outgoingSequence

That is part of how the gamestate drop detection works. Without some form of incremented message number you can't reliably tell the difference between a dropped gamestate and one that is just delayed. It may be possible to do a different implementation that periodically retransmits the gamestate (with the same message number) instead of relying on drop detection, but I wouldn't assume it's worth the trouble of using a new untested approach if the original way works well enough.

ec- commented 4 days ago

Could you explain what you mean by the potential overflow?

every csUpdated is +1 to cl->reliableSequence -> more chance to overflow

With the original gamestate check it shouldn't really be possible for the acknowledge to be "lost"

Packet drop could happen on client -> server side and vice versa so server could send first gamestate, client receives that, ack to server is lost, server sends second gamestate, second gamestate is lost but client sends command again outgoingSequence + 1 -- so in total client received gamestate1 but server thinks its acknowledged gamestate2 - which is totally wrong so exact gamestate ack is mandatory

That is part of how the gamestate drop detection works

There is no need to send snapshots till client acknowledges gamestate, it just disrupts gamestate acknowledge, it also seems to help with delta compression disruption as I no longer see Delta request from out of date packet messages during map spawn/map restart

Chomenor commented 4 days ago

every csUpdated is +1 to cl->reliableSequence -> more chance to overflow

Yes but there is still the question of rarity. Clearing csUpdated between gamestate retransmit may make negligible practical difference in overflow chances.

ack to server is lost, server sends second gamestate

In general, server sending second gamestate should imply the first gamestate is actually lost, not just the "ack" is lost. I thought I explained that in my message?

There is no need to send snapshots till client acknowledges gamestate

Dropped gamestate handling appears to be broken since 909ac2b369ad813586907fdb90dc3ae57b7b1b9d. Any dropped gamestate results in a stuck client, as retransmit condition cl->messageAcknowledge - cl->gamestateMessageNum > 0 is unreachable. As noted an alternative implementation could work without sending snapshots but that hasn't been done.

I no longer see Delta request from out of date packet messages

I suspect the issue may be somewhere else. It could be related to changes in q3e to checks for creating delta frames here.

ensiform commented 3 days ago

Wasn't csupdated created because of changes made by ioq3 in the way that commands happen at certain time, I don't think the vanilla had that.

Chomenor commented 3 days ago

Here is the commit and bug tracker link. It's an optimization so that if the same configstring changes multiple times while a client is loading the map, the server only sends the final version and can skip the intermediate versions.

ec- / Quake3e

Handle dropped gamestate after UDP download #286