SRGSSR / srgletterbox-apple

The official SRG SSR media playback experience
https://srgssr.github.io/marketing/letterbox/
MIT License

SwissTXT livestream playback issues #82

Open defagos opened 6 years ago

defagos commented 6 years ago

SwissTXT livestreams with timeshift support sometimes randomly restart at the beginning. As investigated by Vinh, it might be related to long playlists containing inconsistencies:

Hi, I just had a chat with Markus Sollenberger and he had some insights. Apparently there were issues with jumping streams before, which were debugged with your team at the beginning of 2016? The playlists can apparently become very big (multiple days), which might confuse some players when, for whatever reason, a single chunk is missing at the end. As a result, some players (incl. the AVPlayer?) jump to the beginning of the playlist. The workaround was to add a “dw=0” parameter to the CDN requests to indicate live streaming, which reduces the playlist to just the current chunks, or to provide a start and end date. But never request just the playlist without either range or dw params.

If this is true, we probably need the IL team to provide us with correct URLs with short time ranges.
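
The rule from the email ("never just the playlist without either range or dw params") could be checked on the client side before a URL is handed to the player. The following is only a minimal sketch: the function name `has_bounded_window` and the `example.com` URLs are made up for illustration, and it assumes the range is carried by a query parameter named `start` next to `dw`.

```python
from urllib.parse import urlsplit, parse_qs


def has_bounded_window(url: str) -> bool:
    """Return True if the URL carries a dw or a start query parameter.

    Assumption: the range is expressed with a `start` parameter
    (optionally followed by an end date), the window with `dw`.
    """
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return "dw" in query or "start" in query


# Hypothetical URLs, for illustration only.
print(has_bounded_window("https://example.com/i/stream/master.m3u8?dw=0"))
print(has_bounded_window("https://example.com/i/stream/master.m3u8"))
```

A check like this would only flag URLs that are likely to return a multi-day playlist; it does not say anything about whether the stream itself contains inconsistencies.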

vpdn commented 6 years ago

Initial test report: Mario (Sport) tested the BVB vs Real game on a 3G connection yesterday. The stream with "dw=0" played fine (Build 100), whereas the iPad with dw=0 removed (Build 99) jumped once to the beginning of the stream. So the parameter really seems to have an impact and is critical for SwissTXT livestreams.

vpdn commented 6 years ago

And here's the note in the old sport app code explaining why the "dw=0" parameter was filtered out of the stream URLs. However, I think the condition is wrongly inverted: the workaround is not disabled in production but applied only in production. It should probably be "!= production" instead of "== production".

image

sebastiennoir commented 6 years ago

@defagos @pietrini @SebastienChauvin @pyby I got questions asking whether this dw parameter handling is currently performed in Letterbox or not. Can you describe the current situation in this issue?

pyby commented 6 years ago

Letterbox doesn't change Akamai URL parameters, except for `__b__` (start bit rate).

The dw=0 parameter turns a DVR stream into a live-only stream. The logic is on the backend side.

If SRF wants to remove the DVR stream, we should open a ticket for the AIS team (remove the DVR stream in fullDVR and add highlights as chapters). This would remove a nice feature, though: timeshift playback during the live event.

SwissTXT knows that there is an issue with their DVR streams.

vpdn commented 6 years ago

For clarification: removing the dw=0 parameter seems to have been a workaround in the SRF Sport app for a bug where the player wouldn't detect a fullDVR stream as such (making scrubbing impossible). The "fix" wasn't a business requirement but an attempt at a technical workaround. As long as Letterbox plays all (SwissTXT) streams fine, there's no need to do anything at all on our side.

defagos commented 6 years ago

We have contacted SwissTXT. The issue probably arises when a long playlist is played (without start or end date, and without window length parameter). In our Letterbox implementation, all streams are provided with such parameters. Playlists are therefore always truncated, and this problem should never occur when using our component.

A stress test is planned at SwissTXT to discover whether this intuition is correct or not.

defagos commented 6 years ago

We have scheduled an investigation of potential livestream issues ahead of the Olympics. I think we will then have a better idea of whether these issues are sufficiently mitigated by the current livestream setups and the way Letterbox works.

defagos commented 6 years ago

We talked with Akamai and sent them a detailed Charles capture, which they could compare with their server-side logs. Mohammad Moghal found something that needs further investigation:

I have reviewed this today and I do see some abnormalities. I noticed the ‘EXT-X-DISCONTINUITY’ tag in a few manifest files. This suggests that there were a few stop/starts and this is not consistent across the different bitrates. I would have to look further into this and will get back to you with an update.

defagos commented 6 years ago

Here is Mohammad's detailed answer:

I have done some more investigation on the stream jumps today and I have noticed that these jumps are related to stream drops and the extensive DVR window.

The EXT-X-DISCONTINUITY tag is what causes the streams to jump and this tag is added to the manifest whenever the connection breaks to the Entry Point.

I was able to correlate all of these breaks to the entry points. In order to avoid this I would suggest using the 'dw' query parameter or maybe use the 'start' parameter to control the DVR window. I would also suggest firing up the stream at least 30 minutes before you expect to start the stream.

Analysis details

I have reviewed the second Charles capture sent in by Samuel. Please find my detailed analysis below.

The stream enc18aww has a DVR window of 4320 mins (3 days). The first manifest (index_2000_av-p.m3u8) was requested at 10:40:19 UTC on 05/02/2018.

The manifest looked something like this:

--Start--
Segment151774678  =   Sun, 04 Feb 2018 12:19:40 GMT  0
.
.
Segment151774953 =  Sun, 04 Feb 2018 13:05:30 GMT  0
Discontinuity tag
Segment151781014 =  Mon, 05 Feb 2018 05:55:40 GMT  0
.
.
Segment151782720 =  Mon, 05 Feb 2018 10:40:00 GMT  0
--End--

This correlates to the connection breaks of the primary stream to the entry point below.

2018/02/04-12:19:19+0000 Announcing [enc18aww_gop_250@68260]
2018/02/04-12:19:19+0000 Announcing [enc18aww_gop_500@68260]
2018/02/04-12:19:20+0000 Announcing [enc18aww_gop_1200@68260]
2018/02/04-12:19:20+0000 Announcing [enc18aww_gop_2000@68260]
2018/02/04-12:19:20+0000 Announcing [enc18aww_gop_3500@68260]
2018/02/04-13:05:27+0000 Unannouncing [enc18aww_gop_250@68260]
2018/02/04-13:05:27+0000 Unannouncing [enc18aww_gop_500@68260]
2018/02/04-13:05:27+0000 Unannouncing [enc18aww_gop_1200@68260]
2018/02/04-13:05:27+0000 Unannouncing [enc18aww_gop_2000@68260]
2018/02/04-13:05:27+0000 Unannouncing [enc18aww_gop_3500@68260]
2018/02/05-05:55:23+0000 Announcing [enc18aww_gop_250@68260]
2018/02/05-05:55:23+0000 Announcing [enc18aww_gop_500@68260]
2018/02/05-05:55:23+0000 Announcing [enc18aww_gop_1200@68260]
2018/02/05-05:55:24+0000 Announcing [enc18aww_gop_2000@68260]
2018/02/05-05:55:24+0000 Announcing [enc18aww_gop_3500@68260]
2018/02/05-14:05:30+0000 Unannouncing [enc18aww_gop_250@68260]
2018/02/05-14:05:30+0000 Unannouncing [enc18aww_gop_500@68260]
2018/02/05-14:05:30+0000 Unannouncing [enc18aww_gop_1200@68260]
2018/02/05-14:05:30+0000 Unannouncing [enc18aww_gop_2000@68260]
2018/02/05-14:05:30+0000 Unannouncing [enc18aww_gop_3500@68260]

The player requested the final three segments in the playlist as it should have and after that another manifest was requested. The manifest requested was for the higher bit-rate but for the backup stream (index_3500_av-b.m3u8). The manifest for the backup stream had a few breaks in it and I could correlate the breaks at the entry point that happened on the 5th February. I was not able to capture the ones on the 4th February as there is a possibility that the entry point may have changed due to some reason as they are dynamically assigned.

The second manifest looked something like this:

--Start--
Segment151774678 = Sun, 04 Feb 2018 12:19:40 GMT  0
.
.
Segment151774692 = Sun, 04 Feb 2018 12:22:00 GMT  0
Discontinuity Tag
Segment151774708 =  Sun, 04 Feb 2018 12:24:40 GMT  0
.
.
Segment151774957 = Sun, 04 Feb 2018 13:06:10 GMT  0
Discontinuity tag
Segment151781014 =  Mon, 05 Feb 2018 05:55:40 GMT  0
.
.
Segment151781027 =  Mon, 05 Feb 2018 05:57:50 GMT  0
Discontinuity tag
Segment151781040 = Mon, 05 Feb 2018 06:00:00 GMT  0
.
.
Segment151781044 =  Mon, 05 Feb 2018 06:00:40 GMT  0
Discontinuity tag
Segment151781049 =  Mon, 05 Feb 2018 06:01:30 GMT  0
.
.
Segment151782721 =  Mon, 05 Feb 2018 10:40:10 GMT  0
--End--

The player did not request the final (third-to-last) segment from the playlist but instead requested the final 2 segments before the second break highlighted above. This is where the stream automatically jumped into the DVR window, to content that had been played a day earlier.

segment151774956 : Sun, 04 Feb 2018 13:06:00 GMT  0
segment151774957: Sun, 04 Feb 2018 13:06:10 GMT  0

Below are the encoder disconnects for the backup stream that I was able to see from the logs I had from one of the Entry Points

2018/02/05-05:59:37+0000 Announcing [enc18aww_gop_250@68260]
2018/02/05-05:59:37+0000 Announcing [enc18aww_gop_500@68260]
2018/02/05-05:59:38+0000 Announcing [enc18aww_gop_1200@68260]
2018/02/05-05:59:38+0000 Announcing [enc18aww_gop_2000@68260]
2018/02/05-05:59:38+0000 Announcing [enc18aww_gop_3500@68260]
2018/02/05-06:00:30+0000 Unannouncing [enc18aww_gop_250@68260]
2018/02/05-06:00:30+0000 Unannouncing [enc18aww_gop_500@68260]
2018/02/05-06:00:30+0000 Unannouncing [enc18aww_gop_1200@68260]
2018/02/05-06:00:30+0000 Unannouncing [enc18aww_gop_2000@68260]
2018/02/05-06:00:30+0000 Unannouncing [enc18aww_gop_3500@68260]
2018/02/05-06:01:10+0000 Announcing [enc18aww_gop_250@68260]
2018/02/05-06:01:10+0000 Announcing [enc18aww_gop_500@68260]
2018/02/05-06:01:11+0000 Announcing [enc18aww_gop_1200@68260]
2018/02/05-06:01:11+0000 Announcing [enc18aww_gop_2000@68260]
2018/02/05-06:01:11+0000 Announcing [enc18aww_gop_3500@68260]
2018/02/05-14:06:03+0000 Unannouncing [enc18aww_gop_250@68260]
2018/02/05-14:06:03+0000 Unannouncing [enc18aww_gop_500@68260]
2018/02/05-14:06:03+0000 Unannouncing [enc18aww_gop_1200@68260]
2018/02/05-14:06:03+0000 Unannouncing [enc18aww_gop_2000@68260]
2018/02/05-14:06:03+0000 Unannouncing [enc18aww_gop_3500@68260]

The third manifest that was played in the capture was for the same bitrate but for the primary stream (index_3500_av-p.m3u8), and looked like this:

--Start--
Segment151774678 =  Sun, 04 Feb 2018 12:19:40 GMT  0
.
.
Segment151774953 = Sun, 04 Feb 2018 13:05:30 GMT  0
Discontinuity
Segment151781014 =  Mon, 05 Feb 2018 05:55:40 GMT  0
.
.
Segment151782722 =  Mon, 05 Feb 2018 10:40:20 GMT  0
--End--

The segments that were played after this manifest were again not the final few segments but the ones after the discontinuity highlighted above. The following segments were played after the manifest:

segment151781031 = Mon, 05 Feb 2018 05:58:30 GMT  0
segment151781032 = Mon, 05 Feb 2018 05:58:40 GMT  0
segment151781033 = Mon, 05 Feb 2018 05:58:50 GMT  0

After this, the stream played as it should have, with the segments requested in chronological order.

As mentioned above the dw parameter will allow you to restrict your DVR window to a certain time frame. In this case if we had kept it to 30 minutes we would not have experienced any of the breaks that occurred. If we look at a particular case of a football game you could set this to the start time of the game.

Please do let me know if you want to discuss this in more detail on a call or if you have any further questions.
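
In the listings above, each segment number multiplied by 10 matches the Unix timestamp of the date printed next to it (for example 151774678 and Sun, 04 Feb 2018 12:19:40 GMT). Assuming this 10-second convention holds for the whole stream, the following sketch converts segment numbers to times and checks which of the backup playlist discontinuities would still be inside a given DVR window. The function names are made up for illustration; the segment numbers are copied from the second manifest.

```python
from datetime import datetime, timedelta, timezone

# Assumption inferred from the listings: one segment number per 10 seconds
# of Unix time.
SEGMENT_SECONDS = 10

# First segment after each discontinuity in the backup playlist
# (second manifest), and the last segment of that playlist.
BACKUP_RESUME_SEGMENTS = [151774708, 151781014, 151781040, 151781049]
BACKUP_LIVE_EDGE = 151782721


def segment_time(number: int) -> datetime:
    """Convert a segment number to its UTC date."""
    return datetime.fromtimestamp(number * SEGMENT_SECONDS, tz=timezone.utc)


def discontinuities_in_window(resume_segments, live_edge, window):
    """Return the resume segments that are not older than `window`."""
    oldest = segment_time(live_edge) - window
    return [n for n in resume_segments if segment_time(n) >= oldest]


print(segment_time(151774678).isoformat())
print(discontinuities_in_window(
    BACKUP_RESUME_SEGMENTS, BACKUP_LIVE_EDGE, timedelta(minutes=30)))
print(discontinuities_in_window(
    BACKUP_RESUME_SEGMENTS, BACKUP_LIVE_EDGE, timedelta(minutes=4320)))
```

With a 30-minute window, none of the four discontinuities remains in the playlist, whereas the 4320-minute window keeps all of them, which is consistent with the conclusion of the analysis.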

defagos commented 6 years ago

Here is my answer to Mohammad's analysis:

Thank you very much for your investigations. I reproduced the issue without start or window length parameters, but in our implementation for SwissTXT events (for which users reported jumps) we always have at least:

  1. A start parameter (live event with DVR capabilities)
  2. Start and end parameters (event VOD)
  3. A dw=0 parameter (live event without DVR capabilities)

From my understanding our users reported issues in case 1. It seems plausible to me that just adding a start parameter wouldn't fix the issue in all cases (there should intuitively be no difference between a stream without start parameter and a stream with a start parameter near its beginning, as the playlists would be almost the same length). dw=0 is known to fix the issue but disables DVR entirely, whereas other dw values provide us with a sliding window, which is kind of awkward for sports events and probably not what we want.

I can ask SwissTXT for a new stream and try to reproduce the jumps with a start time parameter if this can provide you with more information about what is going wrong. Just let me know.

I find it strange that the actual window as displayed by our player (see screenshots I attached to my original report) is close to 6 hours, not 3 days, though. Might it be related to the fact that SwissTXT piles up several events one after the other on the same stream (Christoph, please correct me if I am wrong)? The jumps I experienced always led me to the “Livestream will begin shortly” screen, near the location where a new event is added to the pile. This makes me think that something special is occurring at those locations, something that would lead to the discontinuities you have discovered. What do you think?
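
The three cases listed in the answer above (start only, start and end, dw=0) could be produced by a small URL helper such as the one below. This is a sketch, not the Letterbox or backend implementation: `with_window` is a made-up name, the `end` parameter name is an assumption (the thread only names `start` and `dw` explicitly), and the timestamp values and `example.com` URL are placeholders.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode


def with_window(url, *, start=None, end=None, dw=None):
    """Return `url` with its window parameters replaced by the given ones.

    Other query parameters are kept as they are. The `end` parameter name
    and the value formats are assumptions made for this sketch.
    """
    parts = urlsplit(url)
    query = [(key, value)
             for key, value in parse_qsl(parts.query, keep_blank_values=True)
             if key not in ("start", "end", "dw")]
    for name, value in (("start", start), ("end", end), ("dw", dw)):
        if value is not None:
            query.append((name, str(value)))
    return urlunsplit(parts._replace(query=urlencode(query)))


base = "https://example.com/i/stream/master.m3u8"  # hypothetical URL
print(with_window(base, start=1517746780))                  # case 1
print(with_window(base, start=1517746780, end=1517749530))  # case 2
print(with_window(base, dw=0))                              # case 3
```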

defagos commented 6 years ago

We talked with SwissTXT: nothing can be done before the Olympics, and they don't see what could be wrong.

I'll monitor the Olympics streams, which are readily available, and check whether I can reproduce the issue with them.

vpdn commented 6 years ago

The jumps I experienced always led me to the “Livestream will begin shortly” screen

That was also my experience when I debugged the streams. The location the player jumps to does not seem to be totally random. Most of the time when a jump happened, I saw the aforementioned splash screen.

defagos commented 6 years ago

Michael dug up an old email (mid-2017) from Akamai whose conclusion seems to be the same, but which also proposes a solution:

We strongly suspect that the issue is related to the #EXT-X-DISCONTINUITY tag being present at different time locations in the primary vs. the backup encoder. As a reference you can consult the KB 19194: https://control.akamai.com/search/kb/19194 where the HLS RFC and the Apple approach are explained.

The behavior of the HLS.js Player is probably very similar to the one of iOS, and another publicly available player to test with might be this http://demo.jwplayer.com/developer-tools/http-stream-tester/

As mentioned in the session, a quick workaround could be to add the argument ?dw=100 at the end of the URL, but this would shorten the time shift to only around 100 seconds, which is not suitable for your use case. Example: http://players.akamai.com/hls/#/Main.html?url=http%3A//srgssruni23cww-lh.akamaihd.net/i/enc23cuni_ww@118762/master.m3u8?dw=100

Next steps:

You could check whether removing the discontinuity tag fixes the playback issue. This can be done either with a developer proxy tool like Fiddler/Charles, or by asking PS to remove the #EXT-X-DISCONTINUITY tag by modifying the advanced metadata configuration. The internal KB 000019194 https://control.akamai.com/search/kb/19194 also contains instructions for advanced metadata. (Testing the removal with Fiddler/Charles is likely the fastest way to verify this before PS changes the configuration permanently.) Since HLS.js is open source, another option could be to modify the source code of the player to ignore the #EXT-X-DISCONTINUITY tag or handle it differently, instead of having PS remove it on our side.

We hope this summary helps you further. Attached you can find the playlists and the console logs we collected with the Chrome developer tools. Please don’t hesitate to come back to us for further information or support.

Kind Regards and a nice week end

Dario Ferroni Technical Support Engineer
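
For a local test, the tag removal suggested in the email amounts to dropping `#EXT-X-DISCONTINUITY` lines from the media playlist before it reaches the player. A minimal sketch of that transformation follows; the function name and the sample playlist (including the segment file names) are invented for illustration, and this is not how Akamai implements the removal on its side.

```python
def strip_discontinuities(playlist: str) -> str:
    """Remove #EXT-X-DISCONTINUITY lines from an HLS media playlist.

    Only exact tag lines are dropped, so other tags such as
    #EXT-X-DISCONTINUITY-SEQUENCE are left untouched.
    """
    kept = [line for line in playlist.splitlines()
            if line.strip() != "#EXT-X-DISCONTINUITY"]
    return "\n".join(kept) + "\n"


# Invented sample playlist; the segment file names are hypothetical.
SAMPLE = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXT-X-MEDIA-SEQUENCE:100
#EXTINF:10.0,
segment151774956.ts
#EXTINF:10.0,
segment151774957.ts
#EXT-X-DISCONTINUITY
#EXTINF:10.0,
segment151781014.ts
"""

print(strip_discontinuities(SAMPLE))
```

The same rule could be expressed as a response body rewrite in Charles or Fiddler for a quick test.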

defagos commented 6 years ago

A quick update about this issue.

During the Olympics, the problem was still experienced by some internal users, but we did not get major negative feedback from our users.

We had a workshop in Bern this week, and we could talk with Markus and Christoph from SwissTXT. After some negotiations, we agreed to give the EXT-X-DISCONTINUITY removal strategy a chance.

SwissTXT will first remove the tag on a test encoder. If that works, Ahmed from SRF proposed extending this setting to all SRF livestreams for a wider real-world test. Finally, if everything goes well, all 130 Akamai configurations will be updated. Since this costs us money (Akamai teams must do the configuration), we will start small and expand only if successful.

The change will be made in the coming days according to Markus on Slack:

I had a call with Akamai and they will set up the removed discontinuity tag on Encoder 0 stream (Test) and if that works, we will do some A/B testing on 2 streams in production.

defagos commented 6 years ago

Akamai sent us the following interesting documentation about EXT-X-DISCONTINUITY:

As per HLS RFC, the discontinuity tag needs to be consistent across all bitrates: https://tools.ietf.org/html/draft-pantos-http-live-streaming-19#section-6.2.4. Apple does not formally recognize the concept of “primary” and “backup” so “alternate” streams must also have consistent discontinuity tags. If the discontinuity tags are inconsistent across bitrates, live playback will jump back to DVR on Apple Devices.

Inconsistent Discontinuity tag on a bitrate might occur on Stream Packaging stream id, if only one or few bitrates get disconnected or if backup disconnects and primary is fine or vice versa. The playback will jump from live to DVR when the player switches from a bitrate playlist that has no EXT-X-DISCONTINUITY tag to a bitrate playlist that has one.

This issue is specific to iOS devices and, as Apple has restricted DVR scrubbing on live streams, we do not anticipate they will fix the issue. They do not expect a DVR playlist when playing live content. The quickest and easiest mitigation is to use the “dw” query parameter with a smaller window (e.g. ?dw=100 for 100 seconds). In this case, the issue might still occur, but playback will jump back by only a few minutes and any discontinuity tags will soon fall outside the short DVR window.

There is another mitigation that can be applied through metadata, which allows Edge servers to read through the playlist and remove any discontinuity tag that might be present. Limited tests have been performed with this metadata and we have seen positive results.

Customers should be aware that this is not as per Apple specifications.
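
The consistency requirement described in this documentation can be checked offline on captured playlists. The sketch below locates each discontinuity by the media sequence number of the segment that follows it and compares the result across variants. It is a simplification under stated assumptions: the function names and sample playlists are invented, and it assumes all compared playlists share the same media sequence numbering.

```python
def discontinuity_sequence_numbers(playlist: str) -> list:
    """Return the media sequence number of the first segment after each
    #EXT-X-DISCONTINUITY tag in an HLS media playlist."""
    sequence = 0
    pending = False
    found = []
    for raw in playlist.splitlines():
        line = raw.strip()
        if line.startswith("#EXT-X-MEDIA-SEQUENCE:"):
            sequence = int(line.split(":", 1)[1])
        elif line == "#EXT-X-DISCONTINUITY":
            pending = True
        elif line and not line.startswith("#"):
            # A non-comment, non-empty line is a segment URI.
            if pending:
                found.append(sequence)
                pending = False
            sequence += 1
    return found


def consistent_across_variants(playlists: dict) -> bool:
    """Return True if all playlists have discontinuities at the same places."""
    positions = [discontinuity_sequence_numbers(text)
                 for text in playlists.values()]
    return all(p == positions[0] for p in positions)


# Invented sample playlists (hypothetical segment names).
PRIMARY = """#EXTM3U
#EXT-X-TARGETDURATION:10
#EXT-X-MEDIA-SEQUENCE:100
#EXTINF:10.0,
p_100.ts
#EXTINF:10.0,
p_101.ts
#EXTINF:10.0,
p_102.ts
"""
BACKUP = """#EXTM3U
#EXT-X-TARGETDURATION:10
#EXT-X-MEDIA-SEQUENCE:100
#EXTINF:10.0,
b_100.ts
#EXTINF:10.0,
b_101.ts
#EXT-X-DISCONTINUITY
#EXTINF:10.0,
b_102.ts
"""

print(discontinuity_sequence_numbers(BACKUP))
print(consistent_across_variants({"primary": PRIMARY, "backup": BACKUP}))
```

Run against the playlists from a Charles capture, a `False` result would point at the situation described above, where switching from one playlist to another can trigger a jump.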

The local test setup will be deployed very soon.

defagos commented 6 years ago

We encountered discontinuities on the FIFA World Cup France - Belgium SRF stream. The logs for post-mortem analysis can be downloaded here, with Akamai headers enabled.

defagos commented 6 years ago

For information, Mohammad is not our official Akamai contact anymore, though Markus seems to be still in touch with him since he knows more about our problems.

defagos commented 6 years ago

For information (again), here is how to set up Akamai pragma headers with Charles:

  1. Open Tools > Rewrite... and set the locations for which rewriting should be enabled. [screenshot]

  2. Create a header append rule which will be applied to these locations. [screenshot]

The header list to set is: akamai-x-cache-on, akamai-x-cache-remote-on, akamai-x-check-cacheable, akamai-x-get-cache-key, akamai-x-get-extracted-values, akamai-x-get-true-cache-key, akamai-x-serial-no, akamai-x-get-request-id
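
Outside Charles, the same values can be attached to a request from a script. The sketch below assumes the values are sent as a comma-separated `Pragma` request header, which is the usual way Akamai debug pragmas are passed; the function name and the `example.com` URL are made up for illustration, and no request is actually sent.

```python
import urllib.request

# Values copied from the list above.
AKAMAI_DEBUG_PRAGMAS = [
    "akamai-x-cache-on",
    "akamai-x-cache-remote-on",
    "akamai-x-check-cacheable",
    "akamai-x-get-cache-key",
    "akamai-x-get-extracted-values",
    "akamai-x-get-true-cache-key",
    "akamai-x-serial-no",
    "akamai-x-get-request-id",
]


def akamai_debug_headers() -> dict:
    """Return the request headers enabling Akamai debug response headers."""
    return {"Pragma": ", ".join(AKAMAI_DEBUG_PRAGMAS)}


# Build (but do not send) a request for a hypothetical playlist URL.
request = urllib.request.Request(
    "https://example.com/i/stream/master.m3u8",
    headers=akamai_debug_headers(),
)
print(request.get_header("Pragma"))
```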