WICG / video-rvfc

video.requestVideoFrameCallback() incubation
https://wicg.github.io/video-rvfc/

Use abs-capture-time header extension instead of RTCP SR for calculating captureTime on remote sources when available #86

Open murillo128 opened 6 months ago

murillo128 commented 6 months ago

The current approach of using the RTCP SR synchronization timestamp for captureTime has two flaws:

In WebRTC we already have a working solution that would allow supporting it in both cases:

https://w3c.github.io/webrtc-extensions/#dom-rtcrtpcontributingsource-capturetimestamp

Should we specify that, if the abs-capture-time header extension is available, it should be used instead of the RTCP SR value?
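
For concreteness, a sketch of how the abs-capture-time-derived timestamps already surface on a receiver through webrtc-extensions. Assumptions: `pc` is an existing RTCPeerConnection, the extension URI is the one libwebrtc uses, and the fields are simply absent when the extension was not negotiated.

```js
// Sketch: read the abs-capture-time-derived timestamps that webrtc-extensions
// exposes on synchronization/contributing sources. Assumes `pc` is an existing
// RTCPeerConnection; the extension URI is libwebrtc's and is an assumption here.
const ABS_CAPTURE_TIME_URI =
  'http://www.webrtc.org/experiments/rtp-hdrext/abs-capture-time';

const caps = RTCRtpReceiver.getCapabilities('video');
const extensionSupported =
  !!caps && caps.headerExtensions.some(ext => ext.uri === ABS_CAPTURE_TIME_URI);

if (extensionSupported) {
  const receiver = pc.getReceivers().find(r => r.track.kind === 'video');
  for (const source of receiver.getSynchronizationSources()) {
    // captureTimestamp / senderCaptureTimeOffset are the webrtc-extensions fields;
    // they are absent when abs-capture-time was not negotiated for this stream.
    if (source.captureTimestamp !== undefined) {
      console.log('captureTimestamp:', source.captureTimestamp,
                  'senderCaptureTimeOffset:', source.senderCaptureTimeOffset);
    }
  }
}
```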

tguilbert-google commented 6 months ago

@drkron, any thoughts? I don't have much WebRTC experience.

murillo128 commented 6 months ago

pinging the usual suspects @fippo @henbos @alvestrand @jan-ivar @aboba

aboba commented 6 months ago

In practice, captureTime is only available via RVFC for locally captured frames. Are you looking to obtain it on the remote peer as well?

murillo128 commented 6 months ago

captureTime is already supported for remote WebRTC sources:

captureTime, of type DOMHighResTimeStamp

For video frames coming from a local source, this is the time at which the frame was captured by the camera. For video frames coming from remote source, the capture time is based on the RTP timestamp of the frame and estimated using clock synchronization. This is best effort and can use methods like using RTCP SR as specified in RFC 3550 Section 6.4.1, or by other alternative means if use by RTCP SR isn't feasible.

However, relying on RTCP RR or RRTR doesn't provide insightful information in an SFU scenario. Using the abs-capture-time value would be the best option in this case.
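
As a point of reference, this is roughly how the current, RTCP-SR-derived value is consumed through rVFC today; captureTime and receiveTime are existing VideoFrameCallbackMetadata members that are only present for WebRTC sources, and both sit in the local performance.now() clock domain:

```js
// Sketch: observing the current best-effort captureTime for a remote WebRTC
// track. Behind an SFU, this value depends on whatever RTCP SR information
// the SFU forwards or rewrites, which is the problem described above.
const video = document.querySelector('video');

function onFrame(now, metadata) {
  if (metadata.captureTime !== undefined && metadata.receiveTime !== undefined) {
    console.log('capture -> receive (ms):',
                metadata.receiveTime - metadata.captureTime);
  }
  video.requestVideoFrameCallback(onFrame);
}
video.requestVideoFrameCallback(onFrame);
```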

Arctunix commented 6 months ago

When it comes to the "remote" part of captureTime, the current definition of it is very difficult to utilize in practice:

a) RFC 3550 Section 6.4.1 provides the sender with RTT estimates, but what we need are RTT estimates at the receiver. This means that the receiver must either also send its own RTCP Sender Report, or send an RTCP Extended Report with a Receiver Reference Time Report Block and get a DLRR Report Block back (see RFC 3611; the resulting calculation is sketched after this list).

Note that even if the receiver does send its own SR, it may still not be sufficient. WebRTC is (if I remember correctly) implemented to always put the Delay Since Last SR response into a separate RTCP Receiver Report even if the receiver is sending media. This leads us to the awkward situation where the receiver has to "cheat" and use RTT estimations and NTP timestamps from a completely different set of RTCP reports (i.e. from completely different SSRCs) than the ones involved with each video frame in VideoFrameCallbackMetadata.

b) As @murillo128 mentioned above, RFC 3550 Section 6.4.1 and its derivatives are unable to "look beyond" RTCP-terminating mixers.
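
For reference, the receiver-side round-trip calculation that (a) leads to looks roughly like this; it happens inside the RTCP stack (nothing here is exposed to JavaScript), and the names are mine:

```js
// RFC 3611, Section 4.5: the RRTR sender computes RTT from a DLRR block as
// RTT = A - LRR - DLRR, all in the 32-bit "compact NTP" format
// (upper 16 bits = seconds, lower 16 bits = 1/65536-second fraction).
function receiverRttMs(arrivalCompactNtp, lastRrTimestamp, delaySinceLastRr) {
  const rtt = (arrivalCompactNtp - lastRrTimestamp - delaySinceLastRr) >>> 0; // mod 2^32
  return (rtt / 65536) * 1000; // 1/65536-second units -> milliseconds
}
```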

I believe that it would be more useful to redefine captureTime so that it's always based on timestamps from the capture system's reference clock, rather than having to be re-synced to the "local" system's reference clock. This would leave things as-is for the "local" case while allowing abs-capture-time (and possibly "timestamps baked into video frame headers") to be used for the "remote" case.

For example, changing the text from:

For video frames coming from a local source, this is the time at which the frame was captured by the camera. For video frames coming from remote source, the capture time is based on the RTP timestamp of the frame and estimated using clock synchronization. This is best effort and can use methods like using RTCP SR as specified in RFC 3550 Section 6.4.1, or by other alternative means if use by RTCP SR isn't feasible.

To say something along the lines of:

For video frames coming from a local source, this is the time at which the frame was captured by the camera. For video frames coming from a remote source, this is the timestamp set by the system that originally captured the frame, with its reference clock being the capture system's NTP clock (the same clock used to generate NTP timestamps for RTCP sender reports on that system).

In an ideal world, VideoFrameCallbackMetadata would have a full set of properties for the "remote" case:

1) Capture timestamp from the original capture system's reference clock. This is what's proposed here.

2) Estimated clock offset between the original capture system's reference clock and the local system's reference clock. This lets us calculate the one-way delay when combined with (1).

3) CSRC or SSRC associated with (1) and (2). Knowing the timestamps, but not where they come from, is problematic when mixers are involved.

This is basically RTCRtpContributingSource, but on a per-frame basis.
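
Purely as an illustration of the shape being asked for (none of these property names exist in any spec; they just mirror RTCRtpContributingSource on a per-frame basis):

```js
// Hypothetical only: a per-frame analogue of RTCRtpContributingSource.
const hypotheticalRemoteFrameMetadata = {
  captureTime: 1234567.8,        // (1) capture timestamp, in the capture system's reference clock (ms)
  captureClockOffset: -42.5,     // (2) estimated offset between that clock and the local clock (ms)
  synchronizationSource: 0x1234, // (3) SSRC (or CSRC) that (1) and (2) belong to
};
```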

drkron commented 6 months ago

The neat thing (when it works) with the current definition is that all timestamps use the same reference and can be compared to performance.now(). This makes it very simple to calculate glass-to-glass delay, receive-to-render delay, etc.

I would suggest that absoluteCaptureTime be added next to the capture timestamp. This timestamp would then be the unaltered capture timestamp in the sender's NTP clock.
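
To make the comparison concrete, a sketch of the delay math that the current, locally re-synced definition enables; expectedDisplayTime, receiveTime and captureTime are existing VideoFrameCallbackMetadata members, while absoluteCaptureTime is only the hypothetical addition suggested above:

```js
// Sketch: with captureTime on the performance.now() timeline, per-frame delays
// are simple subtractions inside the rVFC callback.
video.requestVideoFrameCallback(function onFrame(now, metadata) {
  if (metadata.captureTime !== undefined) {
    const glassToGlassMs = metadata.expectedDisplayTime - metadata.captureTime;
    const receiveToRenderMs = metadata.expectedDisplayTime - metadata.receiveTime;
    console.log({ glassToGlassMs, receiveToRenderMs });

    // Hypothetical: an unaltered sender-NTP-clock value, as suggested above.
    // It would need an estimated clock offset before it could be compared
    // with performance.now()-based values like the ones above.
    // const absoluteCaptureTime = metadata.absoluteCaptureTime;
  }
  video.requestVideoFrameCallback(onFrame);
});
```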