Dash-Industry-Forum / Content-Steering

A standardized means of steering DASH players between substitutable content sources by way of a remote steering server.

TTL and DCSM reloading #27

Open bbert opened 1 year ago

bbert commented 1 year ago

I have some considerations on content steering specification about TTL.

  1. First, the specification is somewhat unclear on whether the player SHOULD or SHALL reload the steering manifest after the specified TTL interval.
  2. Before clarifying whether it must be SHALL or SHOULD, I'd like to consider the use case in which the DCSM could be refreshed before the TTL interval expires. The recommended value for TTL is 300 seconds, and in some cases it would be valuable to force the players to refresh the manifest without waiting for the TTL interval. As an example, a new CDN server could be allocated that we would like to prioritize as soon as possible. Instead of globally reducing the TTL to a very low value and overloading the steering server, one could decide to force the players to update the DCSM when necessary. This would obviously require an external control means to steer the players, and I don't know how, or whether, this should be part of the content steering specification. As a control mechanism, one potential solution is, for example, to standardize a CMSD response header key that forces the players to reload the DCSM. Whatever the solution for controlling DCSM reloading, if this makes sense we should consider adding some text to the specification to cover the use case where the client can reload the DCSM before the TTL delay has elapsed.

haudiobe commented 1 year ago

2023/10/06 TF meeting

gwendalsimon commented 1 year ago

> Before clarifying if it must be SHALL or SHOULD, I'd like to consider the use case for which the DCSM could be refreshed before TTL interval. The recommended value for TTL is 300 seconds, and in some cases, it would be valuable to force the players to refresh the manifest without waiting for TTL interval. As an example, a new CDN server could be allocated and which we would like to prioritize as soon as possible. Instead of globally reducing the TTL to a very low value and overload the steering server, one could decide to force the players to update the DCSM when necessary.

I am not really convinced we need such a method to break the TTL (besides its complexity, see below). With a 300-second TTL, player reloads are spread roughly evenly across the 5-minute window, so players reach the steering server one by one. In this particular use case, the TTL guarantees a graceful redirection of players to the new CDN, whereas forcing an update could generate a request storm on the new CDN.
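
To make the load argument concrete, here is a rough back-of-the-envelope sketch. The audience size and the 5-second window in which a forced reload would reach all players are hypothetical numbers chosen for illustration; only the 300-second TTL comes from this thread. It also assumes reloads are spread uniformly over the window.

```python
def steering_requests_per_second(players: int, window_seconds: float) -> float:
    """Average request rate when `players` reloads are spread evenly over a window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return players / window_seconds


PLAYERS = 300_000          # hypothetical audience size
TTL_SECONDS = 300          # recommended TTL mentioned in this thread
FORCED_WINDOW_SECONDS = 5  # hypothetical: time for a forced reload to reach all players

steady_rate = steering_requests_per_second(PLAYERS, TTL_SECONDS)
forced_rate = steering_requests_per_second(PLAYERS, FORCED_WINDOW_SECONDS)

print(f"TTL-paced reloads: {steady_rate:.0f} req/s")
print(f"Forced reloads:    {forced_rate:.0f} req/s ({forced_rate / steady_rate:.0f}x)")
```

Under these assumptions the forced reload multiplies the request rate by the ratio of the two windows, and the same burst would then land on whichever CDN the steering server puts first.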

> This would require obviously an external control mean to steer the players and I don't know how and if this should be part of the content steering specification.

I would expect the steering server to be a stateless service, which does not store any information about the players. Furthermore, the steering server is not expected to know which players are still watching the session.

> As a control mechanism, one potential solution is for example to standardize a CMSD response header key to force the players to reload the DCSM.

A CMSD message is issued by the CDN edge server. It cannot be the trigger to reload the DCSM since the decision to force the reload would come from the steering server... unless the steering server could ask the CDNs to send a CMSD message on its behalf.

burak-kara commented 1 year ago

> As an example, a new CDN server could be allocated and which we would like to prioritize as soon as possible. Instead of globally reducing the TTL to a very low value and overload the steering server, one could decide to force the players to update the DCSM when necessary.

I remember this video from Apple WWDC22. They explain Pathway Cloning (starting from 8:16 with the background story), which is used to introduce a new CDN to the system. The idea still relies on the DCSM being updated at each TTL. They add a PATHWAY-CLONES field to the DCSM.

I tried to think of edge cases in which we would want the new CDN to join the system before the TTL expires (preferably without any delay). But for such cases, the player has the second (and subsequent) pathways in the PATHWAY-PRIORITY list as a backup.

bbert commented 1 year ago

Thanks @gwendalsimon and @burak-kara for your comments.

@burak-kara yes, I know about pathway cloning, but the use case was precisely to update the DCSM in order to get new pathways before the TTL delay has elapsed.

@gwendalsimon I agree with you that the steering server should preferably be stateless, and that it is difficult to ask CDNs to send CMSD messages.

Let's tackle this issue in another way. In fact, the use case would be to enable a player to learn about new pathways when it encounters issues with the currently available pathways.

A potential solution is to complete the client behaviour specification by adding the possibility for the client to refresh the DCSM when it encounters playback problems and has already switched to all of the available pathways.
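
A minimal sketch of that proposal, under stated assumptions: `SteeringState`, `on_pathway_failure` and the `fetch_dcsm` callback are hypothetical names, not part of the specification or of any player. The sketch also assumes that pathways which already failed stay excluded after the early reload, so only newly announced pathways are tried.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set


@dataclass
class SteeringState:
    """Hypothetical client-side view of the most recent DCSM."""
    pathway_priority: List[str]
    failed_pathways: Set[str] = field(default_factory=set)

    def next_pathway(self) -> Optional[str]:
        """Return the highest-priority pathway that has not failed yet."""
        for pathway in self.pathway_priority:
            if pathway not in self.failed_pathways:
                return pathway
        return None


def on_pathway_failure(
    state: SteeringState,
    failed: str,
    fetch_dcsm: Callable[[], Dict],
) -> Optional[str]:
    """Pick the next pathway; reload the DCSM early only when all known ones failed."""
    state.failed_pathways.add(failed)
    candidate = state.next_pathway()
    if candidate is not None:
        return candidate  # a known pathway is still untried: no early reload
    # Every known pathway has failed: reload the DCSM without waiting for the TTL.
    dcsm = fetch_dcsm()
    state.pathway_priority = list(dcsm["PATHWAY-PRIORITY"])
    return state.next_pathway()  # None if the server announced nothing new


# Usage with a stubbed steering server that announces a new pathway "cdn-c".
state = SteeringState(pathway_priority=["cdn-a", "cdn-b"])
stub = lambda: {"TTL": 300, "PATHWAY-PRIORITY": ["cdn-c", "cdn-a", "cdn-b"]}
first = on_pathway_failure(state, "cdn-a", stub)   # falls back to "cdn-b"
second = on_pathway_failure(state, "cdn-b", stub)  # reloads, then picks "cdn-c"
```

Because the reload happens only after every known pathway has failed, the steering server sees extra requests only from players that are already in trouble.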

By the way, I think we should explain more precisely in the client steering behaviour what is meant by "If the client encounters playback problems". When should a client make a BaseURL or Location switch?:

Or is it completely open to player implementation? @dsilhavy do you have any opinion on that?

haudiobe commented 9 months ago

I encourage everyone to review the latest specification here: https://members.dashif.org/wg/Interoperability/document/4810

haudiobe commented 8 months ago

We should check what the IOP says. Do we reload the MPD in case of repeated segment 404s? IOP and MPEG-DASH recommend reloading the MPD. That may resolve the issue for Bertrand.

@dsilhavy please let us know how you have implemented this. Then we fix the spec, then we check whether Bertrand's issue still exists, and then we fix the spec even more.

bbert commented 8 months ago

> We should check what the IOP says. Do we reload the MPD in case of repeated segment 404? IOP and MPEG-DASH recommends to reload the MPD. That may resolve the issue for bertrand.

But if the MPD uses the same BaseURL as the segments, the player would not be able to refresh the MPD either. Please consider the use case where the player streams the content (MPD + segments) from one CDN and needs to be redirected to a newly created CDN/pathway to avoid playback failure.

dsilhavy commented 8 months ago

This is what dash.js does today:

As of today, we are not refreshing the manifest in case of repeated segment 404s. We are also not refreshing the DCSM.

It would be great if we could also collect the relevant parts of the specifications that dash.js should implement to improve the current behavior.

haudiobe commented 8 months ago

Live TF 2024/03/01

Accepted that the spec details need to be collected.

haudiobe commented 3 days ago

IOP WG 2024/10/29

We suggest updating the client behaviour.

Please comment, we will update the spec.

thasso commented 3 days ago

As @dsilhavy was explaining what dash.js does, let me try to generalise the list.

The player receives a 404 response on a segment download and can perform the following actions:

Generally I would be okay with adding a Content Steering update to the list. It's a reasonable thing to do. My question would be whether we want to formalise the client behaviour beyond just allowing this option. If we just add this to the content steering spec, it might be unclear to an implementer in which order the client is supposed to go through the list above.

In the IOP Guidelines we basically quote the MPEG spec and say in 4.8.2.1:

> Similarly, if the DASH access client receives an HTTP client error (i.e. messages with 4xx error code) for the request of a Media Segment, the requested Media Segment may not be available anymore or may not be available yet. In both these cases the client should check if the precision of the time synchronization to a globally accurate time standard or to the time offered in the MPD is sufficiently accurate. If the clock is believed accurate, or the error re-occurs after any correction, the client should check for an update of the MPD. If multiple BaseURL elements are available, the client may also check for alternative instances of the same content that are hosted on a different server.

This is in itself already ambiguous, since it is not clear whether the client should prioritise multiple BaseURL entries over retry behaviour or manifest updates. That said, I would propose the following order:

  1. Clock Sync (unless the client is sure that the clock is correct) and try again
  2. Retry according to client's retry configuration
  3. Use an alternative BaseURL and try again
  4. Content Steering Update and try again
  5. Manifest Update and try again
  6. Blocklist the rendition (for a configurable period of time) and try again with a different rendition
  7. Terminate the streaming session with an error

The implementation may decide to perform step 4 (and step 5) in parallel with ongoing segment download retries rather than synchronously.
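
The order proposed above can be sketched as a simple fallback chain. This is only an illustration of the sequencing, not how dash.js or ExoPlayer are implemented: `recover_from_segment_error` and the step names are hypothetical, each step is modelled as a callable that performs its action, retries the segment and reports success, and everything runs synchronously (the parallel variant is not shown).

```python
from typing import Callable, List, Tuple

# A recovery step performs its action (resync the clock, switch BaseURL, ...),
# retries the segment request, and returns True if that retry succeeded.
RecoveryStep = Tuple[str, Callable[[], bool]]


def recover_from_segment_error(steps: List[RecoveryStep]) -> str:
    """Run the steps in order and return the name of the first one that recovers.

    Returns "terminate" when every step fails, which corresponds to ending
    the streaming session with an error.
    """
    for name, attempt in steps:
        if attempt():
            return name
    return "terminate"


# Usage with stubbed steps: the first three fail, the steering update recovers.
steps: List[RecoveryStep] = [
    ("clock_sync", lambda: False),
    ("retry", lambda: False),
    ("alternative_base_url", lambda: False),
    ("content_steering_update", lambda: True),
    ("manifest_update", lambda: True),
    ("blocklist_rendition", lambda: True),
]
result = recover_from_segment_error(steps)
print(result)  # content_steering_update
```

Keeping the order in a plain list makes it easy for an implementation to reorder steps, for example to swap the alternative BaseURL and the Content Steering update.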

We should do the Content Steering update before the Manifest Update. @bbert mentioned above already that in case Manifest+Segments are coming from the same CDN and there is an issue, the player will not be able to do the Manifest update.

I added 2. and 6. to the list because this is something that I think is reasonable behaviour and there are popular implementations out there (ExoPlayer is one of them) that implement this as well.

What I am not sure of is whether we should first try alternative BaseURLs or first (synchronously) update Content Steering. In the end I think it is a matter of how much time the client has available. If the client has enough buffer, it can easily get an update from the content steering server first. If the client is very close to running out of buffer, it might be better to use an alternative BaseURL. I would also assume here that the list of alternative BaseURLs is already sorted based on the last pathway priority response from the steering server. In this case, going to the next in the list is probably a reasonable and fast choice?
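
One way to express that trade-off is a buffer-based check, sketched here under assumptions: the function name, the estimated steering round-trip time and the 2-second safety margin are all hypothetical and would need tuning in a real player.

```python
def choose_first_action(
    buffer_seconds: float,
    steering_round_trip_seconds: float,
    safety_margin_seconds: float = 2.0,  # hypothetical default
) -> str:
    """Decide what to try first after a segment error.

    If the buffer can absorb a steering request plus a safety margin, ask the
    steering server first; otherwise switch to the next alternative BaseURL
    right away, since that needs no extra round trip.
    """
    if buffer_seconds >= steering_round_trip_seconds + safety_margin_seconds:
        return "content_steering_update"
    return "alternative_base_url"


comfortable = choose_first_action(buffer_seconds=20.0, steering_round_trip_seconds=1.0)
starving = choose_first_action(buffer_seconds=1.5, steering_round_trip_seconds=1.0)
```

Here `comfortable` selects the steering update and `starving` selects the alternative BaseURL.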

@bbert you also asked here whether we should further clarify when the client should do a BaseURL or Location switch. Personally, I think this should only happen in the error case, mostly because it keeps things simple, and because many of the other properties may depend on the client and the client's network rather than on something upstream.