SRGSSR / pillarbox-documentation

Technical cross-platform documentation for Pillarbox
https://srgssr.github.io/pillarbox-documentation
MIT License

LSVS stream validation #69

Closed defagos closed 4 months ago

defagos commented 5 months ago

As a member of the stream team, I want the player team to validate the new LSVS streams so that we can go live.

Acceptance criteria

Tasks

defagos commented 4 months ago

The streams can be tested on MMF under the stream-team-dev section:

and

defagos commented 4 months ago

Early feedback:

samderdritte commented 4 months ago

Thanks @defagos for the early feedback.

We can adjust as you request. We could adjust both:

I would suggest we don't do both at the same time so we can more easily identify the root cause. Which one would you want us to do first?

defagos commented 4 months ago

@samderdritte Yes, repeating the key is unnecessary here, and removing the repetition would already help a lot. Chunk size adjustments can be made later if they make sense, but I would rather go with the key repeat setting first, thanks.

waliid commented 4 months ago

After a couple of seconds, the HLS DVR stream fails on Apple platforms.

Here are the logs:

#Version: 1.0
#Software: AppleCoreMedia/1.0.0.21E236 (iPhone; U; CPU OS 17_4_1 like Mac OS X; fr_ch)
#Date: 2024/05/07 09:07:09.009
#Fields: date time uri cs-guid s-ip status domain comment cs-iftype
2024/05/07 09:07:08.008 https://srfinfo-lsvs.akamaized.net/out/v1/e7bc6a6b7839440f93e21be3e8a76a6e/index_2.m3u8?start=1715058421 AA2722C4-C838-4E78-9DE3-A006BECC9075 - -12317 "CoreMediaErrorDomain" "Removing media file from EVENT playlist." -

samderdritte commented 4 months ago

@samderdritte Yes, repeating the key is here unnecessary and would already help a lot. Chunk size adjustments can be made later if they make sense, but I would rather go with the key repeat setting first, thanks.

The change is done. You should be able to test as soon as any remaining caches are flushed.

defagos commented 4 months ago

According to the error information reported by @waliid just above, the likely reason why DVR playback fails after a short while seems related to #EXT-X-PLAYLIST-TYPE:EVENT:

#EXTM3U
#EXT-X-VERSION:5
#EXT-X-TARGETDURATION:2
#EXT-X-PLAYLIST-TYPE:EVENT
#EXT-X-MEDIA-SEQUENCE:329351
#EXT-X-DISCONTINUITY-SEQUENCE:1
#EXT-X-KEY:METHOD=SAMPLE-AES,URI="skd://srg.live.ott.irdeto.com/licenseServer/streaming/v1/SRG/getckc?contentId=SRFinfoDRM&keyId=f49029dd-f277-4d0f-96b9-fb06988b4899",KEYFORMAT="com.apple.streamingkeydelivery",KEYFORMATVERSIONS="1",IV=0x7AB2243C94F7A59F379749683D008E2F
#EXT-X-PROGRAM-DATE-TIME:2024-05-07T05:11:11.460Z
#EXTINF:2.000,
index_2_329351.ts?m=1714475375
#EXT-X-KEY:METHOD=SAMPLE-AES,URI="skd://srg.live.ott.irdeto.com/licenseServer/streaming/v1/SRG/getckc?contentId=SRFinfoDRM&keyId=f49029dd-f277-4d0f-96b9-fb06988b4899",KEYFORMAT="com.apple.streamingkeydelivery",KEYFORMATVERSIONS="1",IV=0x7AB2243C94F7A59F379749683D008E2F
#EXT-X-PROGRAM-DATE-TIME:2024-05-07T05:11:13.460Z
#EXTINF:2.000,
...

According to the HLS specification, segments may only be appended to an EVENT playlist; existing entries must not be removed or changed. Here segments are clearly being removed. Playback therefore fails when a playlist update is received, at which point the mutation is detected and the error -12317 is thrown.

@samderdritte To fix this issue, all you likely need to do is remove #EXT-X-PLAYLIST-TYPE:EVENT from the child playlists. This is also what our existing LSVS child playlists currently look like.
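The append-only rule can be checked by comparing two successive snapshots of the same child playlist. The following is a minimal sketch, not the check AVFoundation performs: the function names and the segment names (seg_1.ts and so on) are hypothetical, and only plain segment URIs are compared.

```python
def segment_uris(playlist_text):
    """Return the segment URIs of a media playlist, in order of appearance."""
    lines = (line.strip() for line in playlist_text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def is_event_playlist(playlist_text):
    """True if the playlist declares #EXT-X-PLAYLIST-TYPE:EVENT."""
    lines = (line.strip() for line in playlist_text.splitlines())
    return "#EXT-X-PLAYLIST-TYPE:EVENT" in lines


def violates_event_rule(previous, current):
    """True if `current` is an EVENT playlist that is not `previous` plus appended segments."""
    if not is_event_playlist(current):
        return False
    old, new = segment_uris(previous), segment_uris(current)
    # An append-only update must keep every old segment, in the same order, as a prefix.
    return new[:len(old)] != old


# Hypothetical snapshots: the second one dropped seg_1.ts while still being tagged EVENT.
HEADER = "#EXTM3U\n#EXT-X-TARGETDURATION:2\n#EXT-X-PLAYLIST-TYPE:EVENT\n"
FIRST = HEADER + "#EXTINF:2.000,\nseg_1.ts\n#EXTINF:2.000,\nseg_2.ts\n"
SLIDING = HEADER + "#EXTINF:2.000,\nseg_2.ts\n#EXTINF:2.000,\nseg_3.ts\n"
APPENDED = FIRST + "#EXTINF:2.000,\nseg_3.ts\n"

print(violates_event_rule(FIRST, SLIDING))   # True: a segment was removed
print(violates_event_rule(FIRST, APPENDED))  # False: segments were only appended
```

A sliding DVR window without the EVENT tag passes this check, which matches the fix suggested above.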

samderdritte commented 4 months ago

@defagos The EXT-X-PLAYLIST-TYPE flag is now set to "none", meaning that the tag will no longer appear in the playlists.

defagos commented 4 months ago

Thanks for the quick fixes.

EXT-X-PLAYLIST-TYPE removal fixed playback issues we had, as expected. Redundant key information removal also made the playlist shrink to ~350 kB.

samderdritte commented 4 months ago

Great to hear! Let me know if more issues come up during your tests.

defagos commented 4 months ago

Other issues found on Apple platforms with our Pillarbox player. These need to be further investigated client-side:

These issues are not experienced with the SRF Info stream currently in production.

waliid commented 4 months ago

I've found another issue on the Apple side: after seeking several times in the stream, the player is completely lost and we get the following error: Unable to get playlist before long download timer

Entire log

#Version: 1.0
#Software: AppleCoreMedia/1.0.0.21E236 (iPhone; U; CPU OS 17_4_1 like Mac OS X; fr_ch)
#Date: 2024/05/07 14:56:37.037
#Fields: date time uri cs-guid s-ip status domain comment cs-iftype
2024/05/07 14:56:34.034 https://srfinfo-lsvs.akamaized.net/out/v1/e7bc6a6b7839440f93e21be3e8a76a6e/index_5_0.m3u8?start=1715079346 E5B8A796-8EEF-4551-8346-1EEBD29CC9FD - -16839 "CoreMediaErrorDomain" "The operation couldn’t be completed. (CoreMediaErrorDomain error -16839 - Unable to get playlist before long download timer.)" wifi-infra
2024/05/07 14:56:34.034 https://srfinfo-lsvs.akamaized.net/out/v1/e7bc6a6b7839440f93e21be3e8a76a6e/index_2.m3u8?start=1715079346 E5B8A796-8EEF-4551-8346-1EEBD29CC9FD - -16839 "CoreMediaErrorDomain" "The operation couldn’t be completed. (CoreMediaErrorDomain error -16839 - Unable to get playlist before long download timer.)" wifi-infra

samderdritte commented 4 months ago

Thanks @defagos and @waliid for the updates.

How can we support you with these issues from our side?

Are the manifest and the child playlists behaving as you expect, and do they contain all the information you need?

defagos commented 4 months ago

Well, I think this is rather a question for you 😉

The iOS / tvOS player sticks to the HLS standard and Apple authoring specifications, so, provided the streams you deliver match the standard, things should work.

Where things can usually be a bit more informal (most notably audible and legible rendition characteristics in the master playlist) what I see matches our expectations (AUTOSELECT, DEFAULT, LANGUAGE or NAME for example). In media playlists the fix you made yesterday definitely helped a lot.

Note that the issues reported above with I-frames and random seeks seem lifted this morning, not sure if you updated something on your side.

There remains the error reported by @waliid, which probably requires further investigation. If needed we can help with debugging, of course.

The question of the segment size is also still open. In its documentation Apple usually recommends 6 seconds, which would make the media playlists even smaller, but I guess we can start with 2 seconds and adjust later if needed.
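To give an order of magnitude for the playlist size argument, here is a small sketch that counts the segment entries a media playlist needs to cover a DVR window. The 7200-second window is an assumption made for illustration and the function name is hypothetical; byte sizes are not estimated since they depend on URI length and on the tags emitted per segment.

```python
import math


def segments_in_window(window_seconds, segment_seconds):
    """Number of segment entries needed to cover the whole DVR window."""
    return math.ceil(window_seconds / segment_seconds)


DVR_WINDOW_SECONDS = 7200  # assumed 2-hour DVR window

for segment_seconds in (2, 6):
    count = segments_in_window(DVR_WINDOW_SECONDS, segment_seconds)
    print(f"{segment_seconds} s segments -> {count} entries")
# 2 s segments -> 3600 entries
# 6 s segments -> 1200 entries
```

With 6-second segments the playlist lists a third of the entries for the same window.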

samderdritte commented 4 months ago

Well... :)

We kind of rely on AWS/Elemental to know what they are doing and that they stick to the Apple spec. Otherwise I can try to open a ticket and get Jeff Bezos to start a cage fight with Tim Cook.

Joking aside: There are still things in MediaPackage we could try to adjust. Already good to hear that removing the EXT-X-PLAYLIST-TYPE helped a big chunk.

The error which @waliid found seems to be linked to Apple FairPlay DRM, but that's about the only thing I found and nobody from Apple has commented on it.

Agree with you on the segment size. In our discussions with AWS they told us that the industry is currently fairly confident that 2s segments should work stably in today's environment. We want to lower the segment size mainly because the business units get a lot of complaints from the audience that our streams lag 30-40 seconds behind competitors. With 2s segments we should be much closer to other OTT competitors (our first measurements showed that we are already faster than Salt.tv, for example). But if we find out that 6s segments solve our issues, then we are more than happy to go live with 6s and have the latency discussion with the business units later.

Another thing, and this was behind my question regarding your expectations of the manifest, is the "flexible" DVR window of 2hrs/7200s. I could not track down the origin of this feature, but the idea is that the query parameter dw=7200 can be added to the URL so that the window does not need to be calculated by the app which is feeding the player. However, it is important to know that dw is not a parameter accepted by the packager: it only accepts the "start" and "end" parameters for DVR (this is the same today with Azure).

So what we do is the following: every request to the CDN with dw=7200 is translated into a Unix timestamp starttime = date.now() - 7200, and dw is replaced by start=starttime as query parameter. This start parameter is then propagated to the child playlists. The calculation is only done once, when the master playlist is requested; all subsequent requests to child playlists contain the calculated start parameter. So if a user watches the stream for four hours without interruption, the start parameter will by then be 6hrs in the past.

Now, what we can influence on the packager is the size of the DVR storage. Currently, this "startover window" is set to the same 7200s as the DVR parameter. As a consequence, the child playlists will not contain segments older than 7200s. Hence, even if the start= parameter of a child playlist is 4hrs in the past, the first segment in the returned playlist will only be 7200s before now(). You can observe this behavior by looking at the EXT-X-PROGRAM-DATE-TIME timestamps.
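The translation described above can be sketched as follows. This is only an illustration of the described behavior, not the actual CDN rule: the function names and the example.com URL are hypothetical, and the 7200-second startover window is the value quoted in this comment.

```python
import time
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def translate_dw(url, now=None):
    """Replace a dw=<seconds> query parameter with start=<unix timestamp>."""
    now = int(time.time()) if now is None else int(now)
    parts = urlsplit(url)
    rewritten = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key == "dw":
            # start is computed once, relative to the time of the master playlist request
            rewritten.append(("start", str(now - int(value))))
        else:
            rewritten.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(rewritten)))


def first_segment_time(start, now, startover_window=7200):
    """Oldest segment time the packager can return: start, clamped to the startover window."""
    return max(start, now - startover_window)


request_time = 1715065621  # hypothetical time of the master playlist request
print(translate_dw("https://example.com/master.m3u8?dw=7200", now=request_time))
# https://example.com/master.m3u8?start=1715058421

# Four hours later the start parameter is 6 hours old, but the playlist
# still begins only 7200 seconds before "now".
later = request_time + 4 * 3600
print(later - 1715058421)                       # 21600 seconds, i.e. 6 hours
print(later - first_segment_time(1715058421, later))  # 7200
```

Increasing `startover_window` in this sketch makes the clamp less restrictive, so the returned playlist would start further back in time.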

What we could do is increase the startover window (max possible value is 14 days into the past). This would then, however, mean that the child playlists would contain segments even older than 7200s. The child playlists would grow over time. In most cases this is actually the desired behavior: take for example a football match in LSVE, where the start param for the DVR is set at a fixed point in the past so that people joining late can go back to the beginning of the game.

Regarding I-frames and random seeks: no, we did not make any other changes since yesterday. One thing I could think of is that you tested shortly after yesterday's changes and the packager needed some time to actually propagate them. It could be that the 2hrs of DVR led to problems after the change, and that everything only works properly once the full DVR is based on the same settings.

Happy to do a more detailed debug session with you, if needed. And please let us know if you want to try either of these:

defagos commented 4 months ago

Otherwise I can try to open a ticket and get Jeff Bezos to start a cage fight with Tim Cook.

Still waiting on another fight first 😉

Will have a deeper look at the rest of your comment afterwards, but we would definitely be interested in testing another stream with 6 second chunks (it would be better if you can keep the current stream tests as they are), just to compare the results. Thanks in advance.

samderdritte commented 4 months ago

we would definitely be interested in testing another stream with 6 second chunks (it would be better if you can keep the current stream tests as they are), just to compare the results. Thanks in advance.

Sure thing: https://play-mmf.herokuapp.com/mmf/subtitles_2019/media/urn:rts:video:_streamteam_1_lsvs_on_aws_dvr_6s (not sure if you need to wait for 2hrs for the DVR storage to fill up)

This is an exact copy of the other stream, only differences:

amtins commented 4 months ago

@samderdritte I've noticed that request processing time is a factor of 10+ in stream delivery.

Production

Origin

CDN

This may seem unimportant, but on some players it can be enough to trigger the bitrate selection algorithm and thus degrade the playback experience.

Network requests aside, the streams are stable and play well, although the web player still doesn't support TTML.

defagos commented 4 months ago

An interesting article from Zattoo (from 2021 but a lot is probably still relevant today).

For information BlueTV currently uses a segment size of 4 seconds.

defagos commented 4 months ago

@samderdritte Would it be possible to deliver a DVR stream without any DRM so that we can perform some tests? Thanks in advance.

samderdritte commented 4 months ago

@defagos of course:

I have created a stream without DRM: https://play-mmf.herokuapp.com/mmf/subtitles_2019/media/urn:rts:video:_streamteam_1_lsvs_on_aws_nodrm_dvr_2s

Additionally, I have also created a version with 72hours DVR storage, if you want to compare that: https://play-mmf.herokuapp.com/mmf/subtitles_2019/media/urn:rts:video:_streamteam_1_lsvs_on_aws_dvr_2s_3daysdvr

Did I understand correctly that we are not facing any issues with the DASH streams (except for the TTML subtitles, but that's the same as today)? In the links above, I only made the changes to the HLS streams; the DASH streams are unchanged. Let me know if we need to adjust them as well.

I would like to understand the causes of the Unable to get playlist before long download timer error. Are you in Geneva tomorrow? I will be on site all day for LSVE testing. It would be great if you had 30mins to discuss this issue.

defagos commented 4 months ago

@samderdritte Thanks for the additional test streams. The stability issues observed during seeks also affect the stream without DRM. The issue is therefore related to the stream itself, not to the DRM.

The result page has been updated accordingly.

amtins commented 4 months ago

@samderdritte could you remove all #EXT-X-PROGRAM-DATE-TIME tags except the first one? After that, could you test it again @defagos? I tried to do it on my own by proxying the playlist but I was rejected by the server, it hurts my feelings...

samderdritte commented 4 months ago

@amtins #EXT-X-PROGRAM-DATE-TIME tags are now set to appear once at the beginning of the child playlists without repetition. Note: the packager allows for an interval setting for this tag (I have now set the interval equal to playlist length, but if you would need a shorter interval, the repetition interval could be changed according to your requirements).
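With a single #EXT-X-PROGRAM-DATE-TIME at the top of the playlist, the wall-clock time of later segments can still be extrapolated by adding up the #EXTINF durations. Below is a minimal sketch of that extrapolation. The function name and segment names are hypothetical, and discontinuities are ignored for simplicity.

```python
from datetime import datetime, timedelta


def segment_start_times(playlist_text):
    """Return (uri, start time) pairs, extrapolating from the last PROGRAM-DATE-TIME seen."""
    current = None   # wall-clock start of the next segment, if known
    duration = None  # duration announced by the pending #EXTINF
    result = []
    for raw in playlist_text.splitlines():
        line = raw.strip()
        if line.startswith("#EXT-X-PROGRAM-DATE-TIME:"):
            stamp = line.split(":", 1)[1]
            current = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
        elif line.startswith("#EXTINF:"):
            duration = float(line.split(":", 1)[1].split(",", 1)[0])
        elif line and not line.startswith("#"):
            result.append((line, current))
            if current is not None and duration is not None:
                current += timedelta(seconds=duration)
            duration = None
    return result


# Hypothetical playlist with a single date tag followed by 2-second segments.
PLAYLIST = """#EXTM3U
#EXT-X-TARGETDURATION:2
#EXT-X-PROGRAM-DATE-TIME:2024-05-07T05:11:11.460Z
#EXTINF:2.000,
segment_a.ts
#EXTINF:2.000,
segment_b.ts
#EXTINF:2.000,
segment_c.ts
"""

for uri, start in segment_start_times(PLAYLIST):
    print(uri, start.isoformat())
```

The second segment resolves to 05:11:13.460 UTC, the same value the repeated tag carried in the earlier playlist excerpt.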

defagos commented 4 months ago

@samderdritte Thanks. Tested, and the result is still the same without the repeated #EXT-X-PROGRAM-DATE-TIME tags: the stream fails to play after seeking to another location on Apple platforms.

Here are a few Charles captures, all obtained with the following scenario:

  1. Open a stream.
  2. Seek to another distant location, in a continuous motion and not too fast. Each move generates a new seek request which cancels previous ones, a naive approach implemented in Letterbox but also in other players.
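The "each new seek cancels the previous one" pattern from step 2 can be sketched as follows. This is a platform-neutral illustration using asyncio, not the Letterbox or Pillarbox implementation: the Seeker class and drag_slider function are hypothetical, and a sleep stands in for the playlist and segment requests a real seek would trigger.

```python
import asyncio


class Seeker:
    """Keeps at most one seek in flight; a new seek cancels the pending one."""

    def __init__(self, seek_duration=0.05):
        self.seek_duration = seek_duration
        self.completed = []
        self.cancelled = []
        self._pending = None

    async def _perform(self, position):
        try:
            # Stands in for the network requests triggered by a seek.
            await asyncio.sleep(self.seek_duration)
            self.completed.append(position)
        except asyncio.CancelledError:
            self.cancelled.append(position)
            raise

    def seek(self, position):
        if self._pending is not None and not self._pending.done():
            self._pending.cancel()
        self._pending = asyncio.create_task(self._perform(position))
        return self._pending


async def drag_slider(positions):
    """Issue one seek per slider move, then wait for the last one to finish."""
    seeker = Seeker()
    last = None
    for position in positions:
        last = seeker.seek(position)
        await asyncio.sleep(0)  # yield so the new seek starts before the next move
    await last
    return seeker


seeker = asyncio.run(drag_slider([10, 20, 30]))
print(seeker.cancelled)  # [10, 20]
print(seeker.completed)  # [30]
```

Only the last position completes; every intermediate seek is cancelled mid-flight. On the network side this corresponds to a burst of client-cancelled requests for each slider drag.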

Azure (current production stream)

Here is the relevant capture.

You can observe that successive seeks behave gracefully (200 status codes). No error is raised.

AWS 2-second stream (with DRM)

Here is the relevant capture.

You can observe that successive seeks lead to IO: Stream cancelled by CLIENT errors. After a while these turn into Socket: Broken pipe errors, at which point playback fails and an error is reported client-side by the Apple player.

AWS 6-second stream (with DRM)

Here is the relevant capture.

You can observe that successive seeks lead to IO: Stream cancelled by CLIENT errors. After a while these turn into IO: Stream reset by SERVER with error code PROTOCOL_ERROR (0x1) errors, at which point playback fails and an error is reported client-side by the Apple player.

Video

Here is a video capturing the last two scenarios. A modified version of Pillarbox is used so that the final player error is displayed to the user. You can observe the events in Charles following a seek made in the app on the right.

https://github.com/SRGSSR/pillarbox-documentation/assets/170201/fff4ef7f-8990-4e61-a54f-64ea85cd4f26

samderdritte commented 4 months ago

@defagos Our quota increase request with AWS has been approved. We can now adjust the size of the child playlists up to 480 minutes. I have now set it to 120min/7200s for the HLS streams. This should match what we observed on the current Azure streams. (We can now do tests with different playlist sizes to find the sweet spot for HLS.)

Could you repeat your tests to see if there has been an improvement related to the seeking issue?

2s (DRM): https://play-mmf.herokuapp.com/mmf/subtitles_2019/media/urn:rts:video:_streamteam_1_lsvs_on_aws_dvr_drm_2s
2s (no DRM): https://play-mmf.herokuapp.com/mmf/subtitles_2019/media/urn:rts:video:_streamteam_1_lsvs_on_aws_dvr_nodrm_2s
6s (DRM): https://play-mmf.herokuapp.com/mmf/subtitles_2019/media/urn:rts:video:_streamteam_1_lsvs_on_aws_dvr_drm_6s
6s (no DRM): https://play-mmf.herokuapp.com/mmf/subtitles_2019/media/urn:rts:video:_streamteam_1_lsvs_on_aws_dvr_nodrm_6s

defagos commented 4 months ago

@samderdritte Thanks for the updated streams. Sadly the seek issue is still experienced with all of them.

samderdritte commented 4 months ago

@defagos ok :(

Could you provide new exports of the Charles files? AWS Support could not read the captures which you provided earlier. They will analyze them with their experts.

defagos commented 4 months ago

Sure, maybe they will need to use the beta version of Charles to open the file.

Is it sufficient if I provide a single capture for the 2s (no DRM) case? The issue seems to be identical with all streams anyway and DRM does not seem to be the culprit here.

Also if they need a small iOS sample code we can provide one.

defagos commented 4 months ago

Here is the capture for the 2s (no DRM) stream already. Let me know if you need other captures.

I am running Charles 5.0b13 on macOS and the file can be opened.

samderdritte commented 4 months ago

Thank you, @defagos! I am able to open the capture with the v5.0b13. Will send them to AWS and tell them to use the beta.

If you have the sample iOS code, I can share that with them as well. I guess anything which helps them reproduce the error will be good.

defagos commented 4 months ago

A few links talking about a similar issue (but without further information):

defagos commented 4 months ago

Here is a small iOS sample project to reproduce the issue.

This implementation was kept as simple as possible. It provides a slider to perform continuous seek requests, each one cancelling pending ones. The slider makes it possible to seek into a hardcoded range of 2 hours and does not reflect the current playback position. Other streams are provided in the source code for comparison (just update the URL of the player item). Their DVR windows are different but the 2-hour slider range suffices to reproduce the issue.

/cc @samderdritte

defagos commented 4 months ago

It seems that the behavior is exacerbated when Charles is used. So we have to be careful to understand what is intrinsic and what is not.

defagos commented 4 months ago

Another update. I think there are several issues layered on top of each other, which makes the situation harder to understand.

First there are issues with Charles, the tool we usually use to inspect network traffic. I moved to Proxyman and I get a stable behavior.

Moving between tools revealed a lot of issues with networking on iOS devices as well. Several issues are documented regarding proxy configuration for Proxyman but I had strange configuration issues recently which often were solved with a restart. In some cases I even had to reset network settings to be able to get a network connection again. So I am not sure that we can completely rule out issues with our test devices, especially since their configuration is tweaked very often.

In the end I could achieve pretty stable behavior with the test LSVS streams on iOS. I could still have playback fail once but could not reproduce the issue afterwards.

Maybe the cascading failed requests we see are related to the use of HTTP/2.0 (our current production streams are served over HTTP/1.1). In any case I am not sure there is a serious blocker after all the tests I made, with devices whose network settings were properly reset first.

I'll update our Confluence status page accordingly and inform our streaming team.

samderdritte commented 4 months ago

Thank you, @defagos for the extensive testing and the update!

We had a meeting yesterday with the Broadband PM and it was decided that we will proceed with the goLive migration.

We can still tweak the packaging settings in the future if we think that one setting or another (e.g. child playlist size) will yield better performance.

defagos commented 4 months ago

I think that HTTP/2.0 is the culprit, as suggested above:

Since there is no way to force the HTTP version client-side on iOS, I guess the only thing we need to do to fix the issue is to disable HTTP/2.0 support server-side. Things should then work smoothly.

This might be an issue in AVFoundation itself, though. Maybe this could be worth a bug report.

Remark: When switching HTTP/2.0 support on or off in Charles it is helpful to restart Charles itself and the simulator, otherwise the updated setting might not be correctly applied.