Clock drift - Githubissues

ZmGorynych commented 5 years ago

The timing model disallows clock drift in sec. 5.2.4. This is not a realistic requirement for linear -- there always is a small drift (i.e., very small difference in duration of a second) which in traditional MPEG-2 TS case is fixed by adjusting the wall clock time given PCR (which is the encoder time when a packet is written) and the wall clock time at the decoding device. If this is not done, a drift will eventually (within a relatively long time) start causing playback issues. In order to avoid the problem, we introduced the prft box, which is essentially a PCR equivalent. The proposed methods of mitigating the drift on a packager are unrealistic:

Silence / black picture create undesirable user experience not conveying the artistic intent;
You need in-the-clear access and possibly a just-in-time encoder and some stream rewriting in order to repeat a picture. You may not necessarily have the keys to the content
Telling the vendor to fix encoder implementation is not necessarily a good approach (you are just passing the work to the encoder vendor who may or may not be able to fix things, and inviting a situation where you have separate packager code for each encoder). Moreover, its timing may actually be derived from the timing at the acquisition (genlock).

I would suggest not touching the timing written by the encoder and doing shift compensation using the prft box at the DASH client.

sandersaares commented 4 years ago

I consider it unreasonable to expect DASH clients to implement and to correctly implement clock drift control. It is a complex topic difficult to understand, test and to even obtain test data for. There are endless complexities like different clocks drifting at different rate or the drift rate changing in time. Even designing a test harness capable of testing clock drift compensation would be impractical for client developers.

As such, it will likely be that only the most sophisticated DASH clients will implement clock drift compensation correctly, if any at all. While prft makes it easier to create DASH services, it just passes the buck to an audience far less capable of dealing with it.

I can appreciate the theoretical basis and do not dispute that it could be corrected on the client side. However, this is just not the practical reality for interoperable scenarios. Implementing this on the client side is impractical. I did a quick search of the code of some popular players and got no hits. Nobody implements this prft based compensation in widely used free players. Therefore, I believe it justified to say that this mechanism is not usable in interoperable DASH scenarios. The only place this problem can be solved in real world interoperable scenarios is on the service side.

Furthermore, the DASH standard sets a clear expectation that DASH timing is defined in relation to wall clock time. Accepting drift (even if technically not in nonconformance) would undermine this principle and lead to special casing. Clock drift is a defect upstream that needs to be stamped out. Those scenarios where it cannot be eliminated might be valid DASH but should not be considered interoperable.

sandersaares commented 4 years ago

Copy-pasting @ZmGorynych comment from https://github.com/Dash-Industry-Forum/DASH-IF-IOP/issues/231 to keep the discussion in one place:

What you are suggesting is impossible in operation and works perfectly only in a lab environment. In any deployment multiple clocks will inevitably very slowly drift apart. You cannot practically solve this -- you have a large distributed system where the service is de-facto driven by the genlock at the acquisition point at the van/studio, and consumed by multitude of devices with their own slightly different clocks and different time sources and protocols. A SHALL statement prohibiting this scenario is an affront to our credibility as an industry forum. I would rather say "if you are doing low-latency linear, please use prft and remember that slow clock drift may accumulate over sufficiently long time". Let the implementers sort it out. They may indeed chose to ignore it, but it's their informed choice, not ours.

sandersaares commented 4 years ago

Let the implementers sort it out.

We are the implementers. We cannot afford to assume that there is someone else who can just sort out the mess.

In any deployment multiple clocks will inevitably very slowly drift apart.

The problem is not about multiple clocks. An encoder effectively produces content according to a single clock per representation. This clock may drift from wall clock time, which causes issues. We can assume wall clock time is globally synchronized.

My viewpoint is that DASH-IF needs to state that the duty of the encoder is to keep an accurate clock and to regularly apply any stretching/compressing required to compensate for drift that has occurred. Each representation can be treated independently (assuming regular drift compensation interval - measured in seconds - the representations cannot drift apart enough from each other for any noticeable effects to occur).

I do not think this is rocket science - why do you claim it can only work in a lab environment? Perhaps we speak of different things? I again emphasize fixing it on the service side has nothing to do with multiple clocks (which I agree might be less practical). We have wall clock time and all participants in the DASH ecosystem need to track it in real time.

I also point out that DASH already assumes that playback of dynamic presentations is tied to wall clock time - the device and the service must be synchronized no matter what, as the anchor between the MPD timeline and the wall clock is fixed.

The drifting clock in the encoder is the root cause of the drift and it is also where the drift can be solved with minimal negative impact on the whole ecosystem. Pushing the problem down the pipe to client side does not have any advantages over solving the problem on the encoder side, as far as I can tell, and has significant disadvantages.

sandersaares commented 4 years ago

Copy-pasting @bmesander comment from Dash-Industry-Forum/DASH-IF-IOP#231 to keep the discussion in one place:

If you specify clock synchronization, you must specify a tolerance. Also consider that even if time is tightly synchronized, any two systems may well be in a different integer second around the top of each second.

sandersaares commented 4 years ago

Clock drift chapter updated in latest version to better outline the situation, that the workarounds are merely workarounds and illustrate with a picture: https://dashif-documents.azurewebsites.net/Guidelines-TimingModel/master/Guidelines-TimingModel.html#no-clock-drift

sandersaares commented 4 years ago

Proposed resolution: close issue.

Rationale: text was clarified according to received comments, to relate to CMAF, to better outline the root cause of the issue and to explicitly mention what is a solution and what is an imperfect workaround.

While on alternative approaches have proponents, these alternatives require more discussion and elaboration if they are to be integrated into the guidelines - perhaps proponents of the alternative approaches can raise the topic at the next F2F for detailed discussion.

haudiobe commented 4 years ago

(IOPv5 20/02/05): Keep it open for discussion, but we need concrete proposals. If no further comments are received by end of February, the issue will be closed.

ZmGorynych commented 4 years ago

I would suggest the following: (a) recommend usage of prft in live deployments; (b) recommend client implementers to look at the deltas between a client-side estimate of prft vs the actual value of prft and adjust if there is a developing drift. Unsure what should be the tolerance. @bmesander -- thoughts?

ojw28 commented 4 years ago

Telling the vendor to fix encoder implementation is not necessarily a good approach (you are just passing the work to the encoder vendor who may or may not be able to fix things, and inviting a situation where you have separate packager code for each encoder). Moreover, its timing may actually be derived from the timing at the acquisition (genlock).

How is this different to passing the work to the DASH client implementors, who may also not be able to fix things :)? Clock drift needs to be fixed somewhere. The closer to source it can be fixed, the fewer problems it's likely to cause.

I agree with @sandersaares comments. It feels like it should be possible to correct drift (i.e., keep any error within an acceptable tolerance and therefore prevent unbounded drift) somewhere on the serving side. The closer to source the better, but timestamp adjustments in the packager seems like a plausible solution if necessary. If you make correcting the clock drift the client's job, you turn a nice clean timing model that the client (and client implementor) can reason about into something that's instantly significantly more complicated.

If you specify clock synchronization, you must specify a tolerance. Also consider that even if time is tightly synchronized, any two systems may well be in a different integer second around the top of each second.

I agree that it might be helpful to specify a tolerance. Although it's unclear to me what an acceptable tolerance would be. I'm not I follow the comment about integer seconds. Why would a DASH client ever be highly sensitive to being in a different whole second compared to the serving side? Why would it be doing any calculations based on whole seconds at all, rather than, say, milliseconds? I'm not sure if I've just missed the point.

ZmGorynych commented 4 years ago

It is impossible to eliminate the drift at the server side -- in a large-scale deployment with a mix of mobile, IPTV, and STB clients you do not know whether the client is synchronized to the same time source as the distribution or the contribution encoder.

If you look at the older IPTV clients, you may notice that typically they are tracking both the difference between the consecutive PCRs and the difference between the wall-clock arrival times. I think following the prft and looking at the wall clock time is exactly the same .

If you miraculously know what all of your clients are synchronized to, or take care of this using UTCTiming, I think your packager will need to follow the clock and periodically insert small (order of magnitude of milliseconds) gaps between segments (i.e., use S@t).

Anyhow, assuming that a single upstream component will take care of a problem is an exceptionally bad practice from reliability standpoint. If you want to provide high quality experience, every element in your chain needs to be resilient to whatever input it gets and provide valid output.

ojw28 commented 4 years ago

If you want to provide high quality experience, every element in your chain needs to be resilient to whatever input it gets and provide valid output.

This seems like a rather extreme position. We're all operating with finite resources (time, money, run-time resource). In this context it's necessary to decide what a practical, reasonable and necessary amount of resiliency is at each point in the chain. If a problem can be reliably handled at a single point in the chain, is it a good use of resource (that could be spent on something else) to also handle the problem at every other point in the chain, just in case, and furthermore to do that for all possible problems?

I do not think this is rocket science - why do you claim it can only work in a lab environment? Perhaps we speak of different things? I again emphasize fixing it on the service side has nothing to do with multiple clocks (which I agree might be less practical). We have wall clock time and all participants in the DASH ecosystem need to track it in real time.

It would be helpful to answer this question. As has already been pointed out, DASH already assumes the device and service are synchronized. Given the existence of this synchronization, why is it theoretically impossible for the server side to correct the drift to the same level of accuracy as the synchronization itself?

ZmGorynych commented 4 years ago

We are operating thousands of channels, some of them have issues. They take different routes, use different vendors, and herding the cats makes sense but is operationally hard to achieve. The viewer does not care about specs or whose equipment malfunctions. Hence you have to have bulletproof components, otherwise you may get to 1-2 9's, but not 5 9's you need for a production system.

In terms of clock drift -- just as an example, we've observed drift issue between encoded streams coming from different data centers.

The motivation for UTCTiming descriptor was that many mobile devices used GPS as time source, while the CDN was on NTP, and Akamai observed mismatches of 1 sec (if not more), which resulted in 404's. This was the moment where we realized that assumption of precise global synchronization was overly optimistic.

sandersaares commented 4 years ago

What I see in your comments @ZmGorynych is a strongly motivated claim that synchronizing clocks across a large fleet of different systems from different vendors is difficult. No argument from me to that!

However, CMAF says different tracks follow same timeline (implies synchronized encoders) and DASH defines wall clock relative client operation (which implies synchronized clock, as no temporal coordination could otherwise occur between client and service).

If the assumption that such a clock exists is not practical, MPEG needs to remove the wall clock from DASH timing calculations, adjust CMAF to allow encoders of different tracks to drift and switch to some kind of live-edge-relative DASH timing model (for example). This seems like a big ask but if this is as big a problem as you say, perhaps a big problem deserves a big solution.

Yes, it may be difficult to eliminate significant clock drift (I cannot agree with "impossible") and yes, prft and other tools may theoretically allow for alternative drift-tolerant timing models to be introduced but the DASH-IF interoperability guidelines are not the place for such innovation. We need to keep to the practical reality that is implemented by ecosystem participants.

Likely the most useful thing DASH-IF could do here is to focus its energy on creating good test vectors and validation tools to enable drifting services to be identified and to promote good clock synchronization behavior in clients:

It would be nice to have content with diffferent UTCTiming descriptors that have intentional "wrong" date/time combined with test streams that reject requests that do not come at "expected" (synchronized to "wrong" values) times.
DASH-IF should have some long-term runnable stream validator that you can let run over 24 hours (or even weeks) against some DASH service to determine whether its representations start drifting in any direction.

haudiobe commented 4 years ago

Solved by profile

haudiobe commented 4 years ago

Please check clause 6.4 here: https://1drv.ms/w/s!AiNJEPgowJnWgotJG4uaEqFkZ3r1wQ?e=9T5DOu

sandersaares commented 4 years ago

The linked looks fine and I am OK with this being part of a profile but I still can't help but wonder about the implications of clock drift to CMAF/DASH.

As I understand the MPEG standards, they do not allow for clock drift (yes there are compensating mechanisms like prft but that doesn't remove the big picture statements like synchronized tracks). If MPEG wants to allow clock drift, it should remove the generic constraints like saying that CMAF tracks all use a common timeline (because if they are produced by different encoders that drift relative to each other, they really don't) and the entire concept of wall clock from DASH (if you can't trust client and server to track time, you must use relative time - but DASH is not built on relative time right now).

Dash-Industry-Forum / Guidelines-TimingModel

Clock drift #5