Open just-bank opened 7 years ago
Very interesting find!
What would be the usage-case of using NetEvent::Guaranteed over GuaranteedOrdered?
@Areloch
NetEvent::Guaranteed
assumed to be "top priority" and always sent before any GuaranteedOrdered
.
Like you have lots of "regular" GuaranteedOrdered
events in queue scheduled to be sent and you are posting Guaranteed
- its added on top of the GuaranteedOrdered
queue so its delivered first (together with unguaranteed, which are packed/sent only once).
But in reality I've never seen any usage of this, every project I've seen was simply doing:
class MyNetEvent : public NetEvent
{ ... }
and never re-declared mGuaranteeType
, which by default is set to:
NetEvent() { mGuaranteeType = GuaranteedOrdered; }
Stock T3D don't use Guaranteed
, and I've never used it by myself in AW and LiF.
Using read/writeCussedU32 doesn't completely solve problem as that just makes the packet sequence number range larger, but it'll break eventually if you exceed 2^31 sequenced events over a GameConnection
.
NetEvent::Guaranteed
was intentionally designed to not care about duplicate events, however NetEvent::GuaranteedOrdered
does and why it exists, that's up to a higher level usercode to decide what to do about it. So that should be left alone in case anybody wants to, or is already using it.
I believe overall problem is that the sequence number range stored in the NetEvent
packets, 7bits, is just too tiny for massive packet loss and high packet usage throughput. Here's why. Torque doesn't actually just use 7bits for sequence numbers, it's actually using the full 32bit (S32) value (really should be U32 for sanity sake), so effectively both sides keep track of sequence number and wraparound generation count (mSeqCount >> 7)
. Because of so much packet loss one side ends up having a different generation count than its peer, most likely the server side cause it sends the most data compared to the client. So then eventually the client gets desynced and waits forever for the right sequence that it'll never receive
The 7bit sequence number made sense back in the 1990s when everybody was on pretty much dial-up, ISDN, and best up to 1.5Mbit ADSL. Also Tribes games didn't really have much going on in terms of network traffic (~15kbytes/sec at most per client downstream) and they were more well aware of the limitations than we are, such as max garaunteed packets with no ACK before the engine loses track (7bits is 128 unique packets). So with a 80% packet loss and just making it exchange NetEvents as fast as possible it'll definitely cause wraparound generation sequence number desyncs eventually.
Case in point, this code checks for sequence number wrap around and accounts for it: https://github.com/GarageGames/Torque3D/blob/development/Engine/source/sim/netEvent.cpp#L352-L354 The last while loop is what does the guaranteed ordered events: https://github.com/GarageGames/Torque3D/blob/development/Engine/source/sim/netEvent.cpp#L369-L381
Notice that it's checking for packets received, waiting queue, and only processing them in sequence order, and not just the 7bit sequence number, it's using the full wraparound generation + 7bit sequence numbers. So it makes sense why reworking the code to use up to full 32bit sequence numbers across packets would make it easily recoverable.
Going ahead with 32bit sequence numbers in packets is the quickest fix, but not necessarily the complete solution. Because you still have to account for wraparound. Because networking technologies are becoming faster not just in terms of bytes per second, but packets per second. Not only that, but because if you're using Torque for long-life running game servers, like MMO, then it'll have a higher chance of happening than you think it generally would.
What needs to be done is that NetEvent has to hold off sending new ordered NetEvents when oldest and newest pending ordered packet not ack'd sequence number difference (NewestSentSeqNum - OldestSentSeqNum)
is more than (SeqNumBitSize >> 2)
, at most, to avoid wraparound generation desync.
if(!bstream->writeFlag(ev->mSeqCount == prevSeq + 1))
bstream->writeInt(ev->mSeqCount & 0x7F, 7);
Is a neat space-saving trick where it won't pack the next, literal in sequence, NetEvent's sequence number and it's apart of of the same network packet that'll be going out. So you might want to keep that similar procedure in there.
We have found a very nasty bug in TNL (Torque Network Library) code, which was since the very beginning (TGE1.3+ or earlier), so I will call this a "15++ y.o. bug" in Torque.
Issue:
GuaranteedOrdered
events are stopped from being processed which is caused by packed loss.How to replicate:
Start two instances of Torque engine (any engine version, T2, TGE1.3+ or latest T3D). Connect one to another via GameConnection. Call
%conn.setSimulatedNetParams(0.8, 0);
on both sides, so we simulate a 80% of packet loss. Our test functions:Now, on any side send
'test'
cmd to another side so its start looping (doing infinite ping-pong).It could take some time until it stops working (it took 17 minutes for me in a first run to catch the bug, second run - 35 minutes). In production we hit it quite often, with 500+
NetConnection
s online the bug makes itself felt every few hours.What happens:
As example we take "10" as current event sequence count. Server sends 'test' event[1] to the client:
... but the packet gets lost! Server re-sends the same event[2] to the client by writing to
BitStream
themSeqCount
:as the event is not the next one (we are re-sending).
Client receives event[2] and unpacking the
seq
via:Going down in the code we call
evt->unpack()
and do some checks:and process it inside the
loop.
The client sends ACK for that packet we have just processed, and all is fine until that packet with ACK is dropped by some reason (by simulating packet drop or some real network issues). The server will resend the same event AGAIN: Server re-sends the same event[3] to the client by writing to
BitStream
themSeqCount
:as the event is not the next one (we are re-sending).
Client receives event[3] and unpacking the
seq
via:Now is the "magic" part of code:
Now our
while
loop:We have
fucked upstuck. Any further event we will receive from the server will be unpacked but never processed. The events from the client to the server are sent and processed fine (in case it will not lockout the same way).Solution:
The simplest solution is to use
writeCussedU32()
: replace codewith:
replace code
with
remove the following:
as we don't need it anymore.
replace the whole
while
loop:with:
Side note:
Even with this change we can't prevent events with
mGuaranteeType == NetEvent::Guaranteed
from being processed more than once. Those are sent together withNetEvent::GuaranteedOrdered
in the same packet, so it will call:When our ACK packet (we have sent to server) is dropped or lost, the server will resend it to client and the same
Guaranteed
event will be processed again! This could lead to some undefined behaviour in game-play, depending on how this type of events are used in engine by game developers.For us we have decided to remove the
NetEvent::Guaranteed
from our engine and always useNetEvent::GuaranteedOrdered
. From what I see, there are no use of this atm in T3D MIT, so it can be safely (?) removed.NetEvent::Unguaranteed
are working the same way as it was before - they are never being send more than once.Thinking aloud
Guaranteed
type of events.writeCussedU32()
gives disadvantage to store large numbers).The End.
Before I submit PR, can someone comment on this?
I think this should be applied on any/all Torque engine(s) (T2D, OpenTNL, etc), so after we finalize what we do with T3D we can spread this info to other projects.