technogeek00 closed this issue 3 years ago
August 26th Meeting
Commonality between LL-HLS and LL-DASH can be achieved through two different mechanisms
Common encoding requirements
There are then two different means of achieving commonality
The discrete part implementation, while simple, is non-optimal, since it requires each DASH segment to be short and to contain a keyframe, which reduces the compression efficiency for both formats.
The byte-range addressing is far more efficient, as it allows longer GOPs to be used for better compression efficiency, while simultaneously reducing the LL-HLS player request rate for media objects.
First revision implemented. For greatest interoperability we are stating that a single addressable object representing the total CMAF segment, with individually addressable CMAF chunks, is the best approach. We additionally speak to the origin serving requirements, as they are important to the production flow of the CMAF content.
TWG Call 2020/11/02 - Will has an additional set of constraints captured in a stand-alone document, we will want to review and adopt them as appropriate.
There is a preview of my blog post here which describes the requirements.
Interop for low latency streaming_ LL-HLS with byte range addressing.pdf
The new requirement (in addition to those listed at the start of this issue which still stand)
The origin server and any proxy cache (CDN) between client and origin must all understand and abide by the convention established in https://tools.ietf.org/html/rfc8673. Under this convention, the client should never make an open-ended range request if it is expecting an aggregated response from a fixed offset. It should instead send a request with a very large number as the last-byte-pos in the range request. The LL-HLS client SHOULD use 9007199254740991 for this purpose. This would signal the server (or origin) to begin a 206 response that starts at the requested offset and aggregates over time until the object is completely transferred.
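As a minimal sketch of the request side of that convention (the helper name `aggregating_range_header` is made up for illustration and is not taken from any player), a client could build the Range header like this:

```python
# Value the requirement above says an LL-HLS client SHOULD send as last-byte-pos.
# It equals 2**53 - 1.
RFC8673_LAST_BYTE_POS = 9007199254740991


def aggregating_range_header(first_byte_pos: int) -> dict:
    """Hypothetical helper: Range header for an RFC 8673 style request.

    The range is deliberately closed with a very large last-byte-pos
    instead of being left open-ended ("bytes=N-").
    """
    if first_byte_pos < 0:
        raise ValueError("first_byte_pos must be non-negative")
    return {"Range": f"bytes={first_byte_pos}-{RFC8673_LAST_BYTE_POS}"}


if __name__ == "__main__":
    # Example: start reading at an assumed byte offset of 490000.
    print(aggregating_range_header(490000))
```

The offset used in the example is arbitrary; in practice it would come from the BYTERANGE of the part the player wants to start from.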
The problem I see with LL-HLS byte-range addressing and the use of 4s GOPs is that only the first part will be independent. The 3 other parts can be advertised as independent in the playlist, even though they are technically not independent as they don't include IDRs, in order to allow the Apple players to switch bitrate on any of the 4 parts in the segment. However, this will not work in other environments like browsers, where switching can happen only on IDRs. It might be different at some point with the AV1 S-frames, but as of now I believe this is the situation with AVC and HEVC.
I don't believe that using different GOP structures for LL-HLS parts and segments will work, as it is basically doubling the encoding cost. Therefore a trade-off should be found to use a single streamset and still allow a decently fast bitrate switching across all platforms. 2s GOPs might be an option, but considering the thinness of buffer levels, 1s GOPs will probably be the right length.
LL-DASH will have a similar encoding cost problem, although clearly not as significant, with the need to produce additional resync track(s) with 1s GOPs on top of the regular 2s/4s GOP segments.
To summarize: it would be great if we had some data on bitrate switching to study. If we see that 2s GOPs are too long, then 1s might be the universal solution to keep encoding costs at a reasonable level?
@Nicholas - the GOP spacing is decoupled from whether you use byte-range addressing for LL-HLS, or discrete part addressing. The core requirement for byte-range efficiency - that the "media segment for both HLS and DASH is a direct concatenation of the parts" - is true irrespective of how many GOPs there are per segment.
The GOP spacing defines the switch interval for the live stream, irrespective of whether it is described by HLS or DASH. In a managed network with little instability, 4s GOPs would probably work. Over an unstable LTE connection, 1s GOPs would allow quicker changes. In my opinion, this spec should stay away from recommending any particular GOP length and, if anything, indicate that it should be chosen with respect to the level of instability. If we need to indicate a range, we could maybe go with 1s-2s.
I realize in my comment on Sept 8th 2.iv I wrote "Example: segments of 4s duration, with one IDR at the start. Each segment is divided into 4x1s parts.". This was not intended to represent an optimal configuration, merely an example (it mapped to the Apple ref stream at the time, which has since changed). If we want to update this to be more of a recommended config, then I might correct this to "Example: segments of 4s duration, in which each segment is a concatenation of 4x1s independent parts. Each part begins with an IDR and contains a single 1s GOP"
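To make the "concatenation of 4x1s independent parts" example concrete, here is a rough sketch of how byte-range addressed part entries could be derived from the sizes of the parts. The part sizes, the segment file name and the `part_tags` helper are invented for illustration; the attribute names follow the EXT-X-PART tag in the HLS-bis draft, but treat the exact output as an assumption rather than a reference playlist.

```python
from dataclasses import dataclass


@dataclass
class Part:
    duration: float    # seconds
    size: int          # bytes occupied by this part inside the segment
    independent: bool  # True if the part begins with an IDR


def part_tags(segment_uri: str, parts: list) -> list:
    """Hypothetical helper: one EXT-X-PART line per part.

    Because the segment is a direct concatenation of its parts, each
    part's offset is simply the sum of the sizes of the parts before it.
    """
    lines = []
    offset = 0
    for part in parts:
        tag = (
            f'#EXT-X-PART:DURATION={part.duration:.3f},'
            f'URI="{segment_uri}",BYTERANGE="{part.size}@{offset}"'
        )
        if part.independent:
            tag += ",INDEPENDENT=YES"
        lines.append(tag)
        offset += part.size
    return lines


if __name__ == "__main__":
    # Made-up sizes for a 4s segment built from four 1s parts, each a 1s GOP.
    parts = [
        Part(1.0, 250000, True),
        Part(1.0, 240000, True),
        Part(1.0, 260000, True),
        Part(1.0, 255000, True),
    ]
    for line in part_tags("segment100.mp4", parts):
        print(line)
```

The same segment URI appears in every part entry, so a DASH client and an LL-HLS client end up addressing the same object.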
The switch interval is one of the concerns, the other is the cost of encoding several GOP sizes. While this is a small concern with DASH resync tracks where we can encode just one additional rendition, the LL-HLS approach where two GOP sizes are used for parts and full segments is not gonna fly in real life, as it potentially doubles the encoding costs if the bitrate ladder is too large for a single encoder instance (which will probably happen in most cases). It also significantly complicates the packager logic. I just don't predict wide industry support for this approach.
@nicoweilelemental - why do you think of "the LL-HLS approach where two GOP sizes are used for parts and full segments"? If you examine the Apple reference stream at https://ll-hls-test.apple.com/cmaf/master.m3u8 you will note that they have modified it so that now they use the exact same GOP structure for their segments and parts.
@wilaw If they are using only one GOP structure for both segments and parts, that's fine. But that's not what I understood from your earlier sentence "The Apple reference software currently adjusts the segments to have a different GOP duration than the parts, which means that the bitstream of the segments differs from that of the parts". Do you take it back?
@nicoweilelemental Yes, I take that back. It was true when I wrote that comment on Sept 8th based on a stream analysis I had done in July. However I re-analyzed it in October in preparation for my talk (basically intending to use it as an example of what needed to be fixed) and was pleasantly surprised to find that they had indeed changed the segment structure to be a concatenation of the parts. The segment https://ll-hls-test.apple.com/cmaf/media1/fileSequence216389.mp4 is attached here
If you analyze it, you will see that it is comprised of 4 GOPs.
@wilaw So that's good. We are losing the encoding efficiency of 2s segments but each part is truly independent with a 1s GOP, the bitrate switching can happen every second on all platforms, and the parts can also be resync segments for LL-DASH. We could recover the encoding efficiency of 2s GOPs when switching to AV1 and using S-frames for the switches.
Does LL-HLS also permit a mode where not every part is a full GOP? i.e. can you operate with, say, 2 or even 4 second GOPs but 1 second parts? Clearly that reduces switching opportunities for the client but would keep the packaging delay down due to 1 second CMAF chunks. Just trying to understand what the hard constraints are now for LL-HLS and what flexibility remains to optimise and balance encoding efficiency, backwards compatibility, CDN cache usage, CDN load, resilience to instability, overall latency etc.
I think our spec should aim to separate (a) constraints required for a single encoding to be usable for both LL-HLS and LL-DASH and (b) guidelines on one or more operating points within those constraints.
@poolec - yes, when LL-HLS defines partial segments https://tools.ietf.org/html/draft-pantos-hls-rfc8216bis-08#page-11 it places no constraints on GOP length although it does make some recommendations. This in turn means that only some parts start with an IDR and these parts are optionally labeled as "independent" in the playlist. The spec encourages the description of independent parts to minimize join and switch times.
INDEPENDENT
The value is an enumerated-string whose value is YES if the
Partial Segment contains an independent frame. This attribute is
OPTIONAL; however every Partial Segment containing an independent
frame SHOULD carry it to increase the efficiency with which
clients can join and switch Renditions.
The spec then continues to define some recommended values for segment duration and GOPs
The recommended Target Duration is six seconds.
The recommended GOP size is between one and two seconds. Smaller GOPs allow faster switching between Renditions.
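To illustrate how the GOP length decides which parts can honestly be labeled INDEPENDENT, here is a small sketch. It assumes fixed-length closed GOPs, a segment that starts with an IDR, and a part duration that divides the segment duration evenly; the function name `independent_parts` is invented for this example.

```python
def independent_parts(segment_ms: int, part_ms: int, gop_ms: int) -> list:
    """Hypothetical helper: for each part, True if it starts on a GOP boundary.

    Durations are integer milliseconds to avoid floating point modulo issues.
    """
    if segment_ms % part_ms != 0:
        raise ValueError("part duration must divide the segment duration")
    part_count = segment_ms // part_ms
    return [(index * part_ms) % gop_ms == 0 for index in range(part_count)]


if __name__ == "__main__":
    # 4s segments with 1s parts, for three GOP lengths discussed in this thread.
    for gop_ms in (1000, 2000, 4000):
        print(gop_ms, independent_parts(4000, 1000, gop_ms))
```

With 1s GOPs every part is a switch point, with 2s GOPs every other part is, and with 4s GOPs only the first part of each segment is.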
@wilaw On another topic than the GOPs/parts: while I'm confident that CDNs and origins can support RFC8673 and 9007199254740991 as the last-byte-pos value, I'm worried that this isn't widely supported in ISPs' proxy caches, and therefore would negatively impact the user experience where it's not supported. What is your opinion on how we can lower this risk?
@nicoweilelemental - two thoughts here:
The other issue with specifying the RFC8673 approach at this stage is that Roger Pantos said about it, on 25 Aug, "We’ll take a closer look at it once iOS 14 et al are in the can. Assuming it works out we can put a reference to that part of RFC 8673 into the EXT-X-PRELOAD-HINT section of the HLS spec." on the hls-interest list - so it may need to be caveated till Apple's implementations have been updated? I guess other players like hls.js can be updated earlier.
@piersoh - I've had off-IETF-list conversations with Apple engineering and they seem supportive of the approach. However, your point is valid and until Apple actually deploy it on their client base we may want to be careful about how we add it to our document. Maybe as a SHOULD versus a MUST, instead of disregarding it altogether? If there are alternate solutions to making byte-range addressing work, I'd be happy to consider them. Or we could omit byte-range addressing altogether until Apple's AVPlayer position is clear? Regarding player implementations, I know that THEO player have already implemented it. Exoplayer dev have contacted me for an investigation too.
That's good to hear. I think that the RFC8673 approach seems to be the best option available, so putting it down as a SHOULD would be prudent. The RFC8673 behaviour on the server would only come into play if a client effectively signals it through its use of the special last-byte-pos (9007199254740991).
Another issue is when using HTTP/1.1 for LL-HLS - whilst not officially supported in the RFC, Apple's apps can play back LL-HLS over H/1.1, as can other apps. So the server/CDN's behaviour could benefit from a more detailed description specific to H/1.1 - ideally it would be good to have the GET H/1.1 request in the case of BYTERANGE-START=0 respond with a Transfer-Encoding: chunked header, like LL-DASH, as opposed to just a no content-length response (as is the case with H/2), which would imply the need for a connection close.
Agreed. I think all CDNs today will insert a Transfer-Encoding: chunked header when returning a 200 aggregating response under HTTP/1.1, as that is a requirement of valid protocol support. It may be worth explicitly stating this in the spec and detailing the two types of responses that a client may see depending on whether it connects with H1.1 or H2 (or H3 for that matter).
Another update on the use of RFC8673 - it is actually only required in the edge case of a start or switch not occurring at the start of a segment. If the player only starts at segment boundaries, or segments only have one independent part, then it is possible for a player to play a byte-range addressed LL-HLS stream without actually making any byte-range requests. This sounds counter-intuitive but it isn't. The player simply makes requests for the complete segment and these will be returned either as a 200 H2 aggregating response, or a 200 H1.1 CTE response.
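That decision can be sketched roughly as follows. This is only an illustration of the logic described in the comment above, not the behaviour of any particular player; `headers_for_join` is a made-up name and the byte offset is assumed to come from the part's BYTERANGE in the playlist.

```python
# Large last-byte-pos value named earlier in this issue for RFC 8673 style requests.
LAST_BYTE_POS = 9007199254740991


def headers_for_join(part_byte_offset: int) -> dict:
    """Hypothetical helper: extra request headers when joining a segment.

    Offset 0 means the player starts at the segment boundary, so a plain
    GET for the whole segment is enough and no Range header is sent.
    Any other offset is the mid-segment edge case, which uses a closed
    range ending in the very large last-byte-pos.
    """
    if part_byte_offset < 0:
        raise ValueError("part_byte_offset must be non-negative")
    if part_byte_offset == 0:
        return {}
    return {"Range": f"bytes={part_byte_offset}-{LAST_BYTE_POS}"}


if __name__ == "__main__":
    print(headers_for_join(0))       # start of segment: no byte-range request
    print(headers_for_join(490000))  # mid-segment join at an assumed offset
```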
Another item is how much we need to say about the use of HTTP priorities WRT LL-HLS's proposed use of them in the current HLS-bis draft and potential updates coming as a result of the new Extensible HTTP priority IETF draft. Roger Pantos has discussed this with the draft authors and has come to an agreed way forward - see this GitHub issue.
Implemented in first published document
Use Case Description
A single in the clear CMAF presentation that is being made available in real time with low latency
Working Notes
Open Questions
Resync points?