Quicr / old-quicrq


Long delays probably caused by packet losses #61

Closed: huitema closed this issue 2 years ago

huitema commented 2 years ago

We see some fairly long transmission delays when using the prototype client over AWS. This graph, for example, shows the delay between packet arrivals, with a series of spikes:

[graph: delay between packet arrivals, showing repeated spikes]

We see spikes in delay about twice per second. Not shown on that graph is a 2-second spike happening at T=27.

Traces show that some packets are delivered late, causing queues before the jitter buffer.

The most likely explanation for the 2-second spike is a double loss: the original packet was lost, and then the repair was also lost. The second retransmission timer would be longer than the first one. We see at least 2 retransmissions per second -- the 200 and 300 ms spikes. Assuming 100 packets per second, that's a 2% loss rate. That means double losses will occur for 0.04% of packets, i.e., about once every 25 seconds, which is more or less what we see.
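A quick back-of-the-envelope check of that estimate as a small C snippet; the 2% loss rate and 100 packets per second are the assumptions stated above, and independent losses are assumed:

```c
#include <stdio.h>

int main(void)
{
    double p = 0.02;          /* observed loss rate: ~2 retransmissions/s at 100 pps */
    double pps = 100.0;       /* assumed packet rate */
    double p_double = p * p;  /* original and repair both lost: 0.04% */
    double interval_s = 1.0 / (p_double * pps); /* expected gap between double losses */

    printf("double-loss rate: %.2f%%, about one every %.0f seconds\n",
           100.0 * p_double, interval_s);
    return 0;
}
```

This prints "double-loss rate: 0.04%, about one every 25 seconds", matching the spikes observed in the trace.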

Probably need to revisit the way losses are corrected, in order to smooth delays.

huitema commented 2 years ago

There are some original design decisions that need to be revisited. The original design went for simplicity: if a datagram is lost, it is resent over the control stream. This is simpler because quicrq can avoid dealing with repeated packets and potential spurious-repeat issues, but it has downsides, chiefly the long retransmission delays observed above.

A first potential fix is to send the repeats as datagrams instead of over the stream. That requires some management of transmissions to avoid spurious repeats, i.e., keeping track of which packets were sent successfully.

A second potential fix would be some form of FEC. If the goal is to guarantee "at most one repeat", then we could do a very simple form of FEC, for example sending each packet twice at some interval.

suhasHere commented 2 years ago

Agree, going with datagrams for retransmissions would be better here. Also, we should work on a simple FEC option of repeating the same data twice, delayed by 10 ms, for the first phase.

Also, maybe the control stream would help if it's expanded to per frame/object instead of per group? Is that the case today?

huitema commented 2 years ago

Yes, something like repeat after 10 ms, that's what I have in mind. Statistics sometimes show correlated losses over 16 consecutive packets, so something like "wait for 16 packets or 10 ms and then repeat" would make sense.
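A minimal sketch of that rule, assuming a queue of pending repeats; the names are illustrative and the 16-packet/10 ms constants mirror the discussion above, but nothing here is actual quicrq code:

```c
#include <stdbool.h>
#include <stdint.h>

#define REPEAT_DELAY_US  10000  /* repeat after 10 ms ... */
#define REPEAT_DISTANCE  16     /* ... or after 16 more packets have been sent */

typedef struct pending_repeat {
    struct pending_repeat *next;  /* queue of copies waiting to be resent */
    uint64_t packet_number;
    uint64_t sent_time_us;        /* time the original datagram was sent */
    uint64_t sent_sequence;       /* send counter at original transmission */
} pending_repeat_t;

/* The queued copy is released once either threshold is crossed,
 * so the repeat lands outside a burst of correlated losses. */
bool should_send_repeat(const pending_repeat_t *p,
                        uint64_t now_us, uint64_t current_sequence)
{
    return (now_us - p->sent_time_us >= REPEAT_DELAY_US) ||
           (current_sequence - p->sent_sequence >= REPEAT_DISTANCE);
}
```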

There is probably also an issue for relays. If they are receiving a repeated packet, they can guess that it is already late. If that packet got lost again, the end-to-end delay would probably exceed the acceptable bound. So those packets should probably get the same kind of FEC or pseudo-FEC protection as the repeated packets.

huitema commented 2 years ago

This reminds me of some real-time scheduling algorithms. Consider a list of tasks, each of which has a deadline. The scheduler arranges the tasks to try to execute them in time, before their deadlines. We have something similar here, with a "time limit" for each fragment.
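For illustration, an earliest-deadline-first pick over pending fragments, in the spirit of that analogy; the fragment type and its fields are hypothetical, not the quicrq data structures:

```c
#include <stddef.h>
#include <stdint.h>

typedef struct fragment {
    uint64_t deadline_us;  /* "time limit" by which the fragment should be sent */
    /* ... payload, object id, offset, etc. ... */
} fragment_t;

/* Send next the fragment whose deadline expires first. */
fragment_t *next_to_send(fragment_t *frags, size_t n)
{
    fragment_t *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (best == NULL || frags[i].deadline_us < best->deadline_us) {
            best = &frags[i];
        }
    }
    return best;
}
```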

huitema commented 2 years ago

There is a 1-1 mapping between control stream and media stream, and I think we should keep it that way.

suhasHere commented 2 years ago

I think we should bring in the BESTBEFORE metadata from the QUICR proposal at some point. It's a sender-marked TTL after which relays need not cache or even try retransmitting.

On control stream and media stream: what does a stream mean in the context of datagrams here?

huitema commented 2 years ago

We should not mix control stream discussion into this issue. The control stream is there to handle subscription. "Subscribe" is done by opening a stream and sending the subscribe message on it. The response allows the server to state how the media will be sent -- for example, datagrams, rush, plain stream, etc. Other control messages can handle the end of the stream, e.g., "finished after object number N". More control messages may pop up as we refine the protocol, e.g., to indicate whether a Group of Objects is dropped. Closing the control stream indicates end of interest (for the client) or end of transmission (for the server).
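To make that exchange concrete, here is a rough sketch of the message set described above; the enum and field names are illustrative assumptions, not the quicrq wire format:

```c
#include <stdint.h>

typedef enum {
    CTRL_SUBSCRIBE,     /* client opens the stream and asks for a media flow */
    CTRL_SUBSCRIBE_OK,  /* server states how the media will be sent */
    CTRL_FIN_OBJECT,    /* "finished after object number N" */
    CTRL_GROUP_DROPPED  /* a Group of Objects is dropped */
} ctrl_msg_type_t;

typedef struct {
    ctrl_msg_type_t type;
    uint8_t  delivery_mode;  /* CTRL_SUBSCRIBE_OK: datagram, rush, plain stream... */
    uint64_t object_number;  /* CTRL_FIN_OBJECT: last object number */
    uint64_t group_id;       /* CTRL_GROUP_DROPPED: which group */
} ctrl_msg_t;
```

Closing the control stream then carries the "end of interest" or "end of transmission" signal without any extra message.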

huitema commented 2 years ago

BESTBEFORE looks like a good idea, but I hope it does not imply a common clock. Also, that needs to be rationalized with priority flags, indications of drops, etc. Maybe something to do next.

suhasHere commented 2 years ago

Agree, we don't need a common clock. It's an integer indicating how many milliseconds from now the media can be treated as alive.
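A small sketch of how a relay could handle such a relative TTL without a shared clock; the field and function names are assumptions, not the QUICR proposal's:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t received_at_ms;  /* local clock when the object arrived */
    uint32_t best_before_ms;  /* sender-marked: "alive for this many ms from now" */
} cached_object_t;

/* Once expired, the relay stops caching or retransmitting the object. */
bool object_expired(const cached_object_t *obj, uint64_t now_ms)
{
    return now_ms - obj->received_at_ms >= obj->best_before_ms;
}

/* When forwarding, re-mark the remaining lifetime for the next hop,
 * so each hop only needs its own clock. */
uint32_t remaining_ttl_ms(const cached_object_t *obj, uint64_t now_ms)
{
    uint64_t elapsed = now_ms - obj->received_at_ms;
    return elapsed >= obj->best_before_ms
        ? 0 : (uint32_t)(obj->best_before_ms - elapsed);
}
```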

huitema commented 2 years ago

The first set of commits for PR #62 makes sure that lost fragments are repeated as datagrams. This provides modest gains. In the "triangle test with loss", we observe the following measures of end-to-end delay, considering the delay in microseconds between the capture of a frame and the first time at which the received object can be "played":

| Mode | Forward as | Repair with | Average (µs) | Max (µs) | STD (µs) |
| --- | --- | --- | ---: | ---: | ---: |
| Stream | Object | Stream | 144,373 | 535,537 | 112,426 |
| Stream | Fragment | Stream | 90,047 | 364,905 | 60,164 |
| Datagram | Object | Stream | 55,331 | 238,972 | 39,365 |
| Datagram | Fragment | Stream | 43,825 | 185,942 | 24,586 |
| Datagram | Fragment | Datagram | 41,310 | 184,048 | 24,409 |

Each of these steps shows a modest improvement in the metrics. (There was a Max Delay improvement in initial results, but it disappeared after a slight improvement in the implementation. The pre-improvement version changed packet sizes and packet numbers, which changed the loss pattern and made simulations harder to compare.) Analysis of the "max delay" events shows that they correspond to large objects, sent as multiple segments.

[histogram of end-to-end delays]

The histogram above shows three big groups.

huitema commented 2 years ago

Redundancy was tried, but the results are dubious. It remains available as an option.