Closed: huitema closed this issue 2 years ago
There are some original decisions that need to be revisited. The original design went for simplicity: if a datagram is lost, it is resent over the control stream. This is simpler because QUICRP can avoid dealing with duplicate packets and potential spurious-repeat issues, but it has a few downsides.
First potential fix is to send the repeats as datagrams instead of streams. That requires some management of transmissions to avoid spurious repeats, i.e., keeping track of packets sent successfully.
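A minimal sketch of what that tracking could look like (all names are hypothetical, not the actual QUICRP code): hold each datagram until it is acknowledged, and only datagrams still pending when the repeat timer fires become repeat candidates.

```python
# Hypothetical sketch: suppress spurious repeats by tracking which
# datagrams were acknowledged before the repeat timer fires.

class RepeatTracker:
    def __init__(self):
        self.pending = {}  # datagram_id -> payload awaiting acknowledgment

    def on_send(self, datagram_id, payload):
        self.pending[datagram_id] = payload

    def on_ack(self, datagram_id):
        # Acknowledged in time: no repeat needed for this datagram.
        self.pending.pop(datagram_id, None)

    def repeats_due(self):
        # Datagrams still unacknowledged are candidates for repeat-as-datagram.
        due = list(self.pending.items())
        self.pending.clear()
        return due
```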
Second potential fix would be some form of FEC. If the goal is to guarantee "at most one repeat", then we could do a very simple form of FEC, for example repeating packet twice at some interval.
Agree that going with datagrams for retransmissions would be better here. Also, for the first phase, we should work on a simple FEC option: repeat the same data twice, with the copy delayed by 10 ms.
Also, maybe the control stream would help if it were expanded to per frame/object instead of per group? Is that the case today?
Yes, something like "repeat after 10 ms" is what I have in mind. Statistics show correlated losses, sometimes over 16 consecutive packets, so something like "wait for 16 packets or 10 ms and then repeat" would make sense.
There is probably also an issue for relays. If they are receiving a repeated packet, they can guess that it is already late. If that packet got lost again, the end-to-end delay would probably exceed the acceptable bound. So those packets should probably get the same kind of FEC or pseudo-FEC protection as the repeated packets.
This reminds me of some real-time scheduling algorithms. Consider a list of tasks, each of which has a deadline. The scheduler arranges the tasks to try to execute them in time, before their deadlines. We have something similar here, with a "time limit" for each fragment.
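The classic instance of that idea is earliest-deadline-first. A toy sketch of the analogy, purely illustrative and not the QUICRP scheduler:

```python
# Earliest-deadline-first: always send the fragment whose "time limit"
# is nearest. A heap keeps the pending fragments ordered by deadline.
import heapq

def edf_order(fragments):
    """fragments: list of (deadline_ms, fragment_id); returns the send order."""
    heap = list(fragments)
    heapq.heapify(heap)
    order = []
    while heap:
        _, fragment_id = heapq.heappop(heap)
        order.append(fragment_id)
    return order
```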
There is a 1-1 mapping between control stream and media stream, and I think we should keep it that way.
I think we should bring in BESTBEFORE metadata from the QUICR proposal at some point. It is a sender-marked TTL after which relays need not cache, or even try retransmitting.
On control stream and media stream: what does a stream mean in the context of datagrams here?
We should not mix the control stream discussion into this issue. The control stream is there to handle subscription. "Subscribe" is done by opening a stream and sending the subscribe message on it. The response allows the server to state how the stream will be sent -- for example, datagrams, rush, plain stream, etc. Other control messages can handle the end of the stream, e.g., "finished after object number N". More control messages may pop up as we refine the protocol, e.g., to indicate whether a Group of Objects is dropped. Closing the control stream indicates end of interest (for the client) or end of transmission (for the server).
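A rough sketch of the message set that description implies. All names and fields here are hypothetical illustrations of the roles described above, not an actual wire format:

```python
# Hypothetical control-stream message types, following the description above.
from dataclasses import dataclass
from enum import Enum

class TransportMode(Enum):
    DATAGRAM = 1   # objects sent as QUIC datagrams
    RUSH = 2       # rush-style delivery
    STREAM = 3     # plain stream delivery

@dataclass
class Subscribe:        # client opens the control stream with this
    media_name: str

@dataclass
class SubscribeOk:      # server states how the media will be sent
    mode: TransportMode

@dataclass
class FinAfter:         # "finished after object number N"
    final_object: int

@dataclass
class GroupDropped:     # indicates a Group of Objects was dropped
    group_id: int
```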
BESTBEFORE looks like a good idea, but I hope it does not imply a common clock. Also, it needs to be rationalized with priority flags, indications of drops, etc. Maybe something to do next.
Agree, we don't need a common clock. It is an integer indicating how many milliseconds from now the media can be treated as alive.
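Since the TTL is relative to reception, the relay-side check reduces to local clock arithmetic. A minimal sketch, with hypothetical names:

```python
# Relative BESTBEFORE check at a relay: the TTL counts milliseconds from
# the moment the relay received the object, so no common clock is needed.
def still_alive(received_at_ms, bestbefore_ms, now_ms):
    """True while the relay should still cache or retransmit the object."""
    return now_ms - received_at_ms < bestbefore_ms
```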
The first set of commits for PR #62 makes sure that lost fragments are repeated as datagrams. This provides modest gains. In the "triangle test with loss", we observe the following measures of end-to-end delay, i.e., the delay in microseconds between the capture of a frame and the first time at which the received object can be "played":
| Mode | Forward as | Repair with | Average (µs) | Max (µs) | STD (µs) |
|---|---|---|---|---|---|
| Stream | Object | Stream | 144,373 | 535,537 | 112,426 |
| Stream | Fragment | Stream | 90,047 | 364,905 | 60,164 |
| Datagram | Object | Stream | 55,331 | 238,972 | 39,365 |
| Datagram | Fragment | Stream | 43,825 | 185,942 | 24,586 |
| Datagram | Fragment | Datagram | 41,310 | 184,048 | 24,409 |
Each of these steps shows a modest improvement in the metrics. (There was a Max Delay improvement in initial results, but it disappeared after a slight improvement in the implementation. The pre-improvement version changed packet sizes and packet numbers, which changed the loss pattern and made simulations harder to compare.) Analysis of the "max delay" events shows that they correspond to large objects, sent as multiple segments.
The histogram above shows three big groups:
- objects in which at least one fragment incurred two losses, either on the same link or on two successive links.

The next step is to try to reduce the third group by using some form of FEC, or redundancy.
Redundancy was tried, but the results were dubious. It remains available as an option.
We see some fairly long transmission delays when using the prototype client over AWS. This graph, for example, shows the delay between packet arrivals, with series of spikes: we see spikes in delay regularly, about 2 spikes per second. Not shown on that graph is a 2-second spike happening at T=27.
Traces show that some packets are delivered late, causing queues before the jitter buffer.
The most likely explanation for the 2-second interval is a double loss: the original packet was lost, and then the repair was also lost. The second timer would be longer than the first one. We see at least 2 retransmissions per second -- the 200 and 300 ms spikes. Assuming 100 packets per second, that's a 2% loss rate. But then, double losses will occur for 0.04% of packets, i.e., once every 25 seconds, which is more or less what we see.
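The arithmetic checks out, assuming independent losses:

```python
# Back-of-the-envelope check of the double-loss estimate above.
loss_rate = 0.02            # 2% packet loss
packets_per_second = 100

double_loss_rate = loss_rate ** 2                # 0.0004, i.e. 0.04% of packets
double_losses_per_second = double_loss_rate * packets_per_second
seconds_between_double_losses = 1 / double_losses_per_second   # ~25 s
```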
We probably need to revisit the way losses are corrected, in order to smooth out the delays.