cta-wave / device-playback-task-force


Every sample S[k,s] shall be rendered and the samples shall be rendered in increasing presentation time order. #65

Closed jpiesing closed 1 year ago

jpiesing commented 4 years ago

Many sections have a general observation along these lines:

Every sample S[k,s] shall be rendered and the samples shall be rendered in increasing presentation time order.

IMHO this is wrong for video and impossible to implement for audio.

I recommend this be moved from a general observation to being specific to video and both instances of "sample" replaced by "frame".

Audio is largely a TBD in section 11.

jpiesing commented 4 years ago

For audio, this is a duplicate of #74.

haudiobe commented 2 years ago

accepted. We will move to video.

gitwjr commented 1 year ago

Jan 31, 2023: @haudiobe @jpiesing
DPCTF Spec v1.36 was edited by @gitwjr per the above comment. Sections 8.2-8.12; 8.15-8.23; and 9.2-9.6 were edited by deleting the text from the General sections and adding it to Video and Audio sections with minor text changes ("samples" to "video frame" in video, and "samples" to "audio samples" in audio) per discussions. Added comments in 8.3; 8.4; 8.17; 8.18; 8.23; 9.3 due to the "General" text differing from the text in the above comment. Those sections should be reviewed to ensure the text is correct.

yanj-github commented 1 year ago

For video: "Every video frame" is fine but I am a bit confused about what S[k,s] stands for?

For audio: How about "Every audio segment"? @cta-source any suggestion please? Maybe we can mention audio segment duration is 20ms somewhere?

jpiesing commented 1 year ago

For video: "Every video frame" is fine but I am a bit confused about what S[k,s] stands for?

Difficulty in getting my head around that terminology was the reason why I never contributed to the spec. I don't understand the difference between upper and lower case s in this text.

gitwjr commented 1 year ago
 @yanj-github  
 "Maybe we can mention audio segment duration is 20ms somewhere?"

9.2.5.4 states "It is preferred to keep to a measurement resolution of 20 mS, thus, the quantitative requirements in this specification are kept to multiples of 20 mS, such as “+40 mS”." This perhaps could be stated more clearly. Should it be stated somewhere else? 10.3.54 is the section on Audio but it is currently TBD.

gitwjr commented 1 year ago
  @yanj-github @jpiesing 
  For video:
  "Every video frame" is fine but I am a bit confused about what S[k,s] stands for?

From 4.3 Abbreviations, and a little digging:

gitwjr commented 1 year ago

Call on 2023/02/28: The order of the audio playback is trivial. Leave terminology for Audio as "audio sample". Leave open for now for further review.

haudiobe commented 1 year ago

@jpiesing will write a proposal to add a note to clause 8.2.5.3

jpiesing commented 1 year ago

Proposal:

Note: In ISOBMFF, an audio sample is "a compressed section of audio in decoding order", i.e. conceptually similar to an access unit in other MPEG specifications. An ISOBMFF sample will be encoded from many uncompressed samples.

gitwjr commented 1 year ago

@haudiobe Please review Jon's recommended text and comment.

gitwjr commented 1 year ago

@cta-source @haudiobe Mike will propose some text and ask Thomas to review before entering into the spec.

mbergman42 commented 1 year ago

My proposal:

The current text including edits:

If the track is an audio track, then the following additional observations are expected:

1) The mediaTime of the presented sample shall match the one reported by the currentTime value within the tolerance of +/- 20ms.
2) Every audio sample S[k,s] shall be rendered and the audio samples shall be rendered in increasing presentation time order.

Note: In ISOBMFF, an audio sample is "a compressed section of audio in decoding order", i.e. conceptually similar to an access unit in other MPEG specifications. An ISOBMFF sample will be encoded from many uncompressed samples.

Proposal, to replace all of the above:

For audio tracks, the following observations are expected: 1) When examined in 20 mS test sample periods (“test samples” of 20 mS), the time from the start of the track (T0) to the start of the test sample (which is equivalent to playout mediaTime) shall match the time reported by the playout currentTime value within the tolerance of +40/-120 ms. 2) When examined as a continuous sequence of timestamped samples of the audio stream, the 20 mS test samples are a complete rendering of the source audio track and are rendered in increasing presentation time order.

@haudiobe ; @jpiesing ; @gitwjr ; @yanj-github your thoughts?

yanj-github commented 1 year ago

Thanks @mbergman42

Proposal, to replace all of the above:

    For audio tracks, the following observations are expected:

        1. When examined in 20 mS test sample periods (“test samples” of 20 mS), the time from the start of the track (T0) to the start of the test sample (which is equivalent to playout mediaTime) shall match the time reported by the playout currentTime value within the tolerance of +40/-120 ms.
        2. When examined as a continuous sequence of timestamped samples of the audio stream, the 20 mS test samples are a complete rendering of the source audio track and are rendered in increasing presentation time order.
  1. New tolerance +40/-120 ms is reasonable in theory.
  2. Looks good to me.

mbergman42 commented 1 year ago

The "+40/-120mS" element is in 9.2.5.4, 9.3.5.4, 9.4.5.4 and 9.6.5.4.

The 40 mS would be from the detected audio sync, which I think you are currently taking from samples at the end of the track. My intent was that within the overall audio track sync, no sample should be presented outside these tolerances. So "playout currentTime value" may be the problem (not the 40 mS value).

I guess you develop some time basis from the end of the track samples, then measure from there. So maybe,

1. When examined in 20 mS test sample periods (“test samples” of 20 mS), the time from the start of the track (T0) to the start of the test sample (which is equivalent to playout mediaTime) shall match the expected time based on overall track synchronization within the tolerance of +40/-120 ms.

@yanj-github

yanj-github commented 1 year ago

Thanks Mike, I am a bit lost here. The "+40/-120mS" element in 9.2.5.4, 9.3.5.4, 9.4.5.4 and 9.6.5.4 is checking the Audio-Video Synchronization. Correct me if I have completely misunderstood, but I assumed the currentTime check in the original wording is different from the Audio-Video Synchronization check. The observation "The mediaTime of the presented sample shall match the one reported by the currentTime value within the tolerance of +/- 20ms." is meant to compare the audio media time with the corresponding HTML currentTime reported by the test status QR code.

mbergman42 commented 1 year ago

Sorry, I was in the wrong context. I agree that the 40/120 value isn't the right approach.

I'm not sure why the 40 mS is too small. I may not understand how the startup delay works in this context. We may be working off of different timing baselines. My timing baseline (in my assumptions) is the baseline you develop from synchronizing using the PN signal detection. I think you planned to sync against the last few samples. That timebase should not rely on startup delay. If it does, then I need to understand better.

When examined in 20 mS test sample periods (“test samples” of 20 mS), the time from the start of the track (T0) to the start of the test sample (which is equivalent to playout mediaTime) shall match the time reported by the playout currentTime value within the tolerance of +40/-120 ms.

Maybe the "which is equivalent to playout mediaTime" part is no longer true, if the timebase depends on syncing to the end of the audio? So delete the part in parentheses?

yanj-github commented 1 year ago

@mbergman42 The way we capture time differences is to work out the corresponding audio sample that is presented at the same recording time as the currentTime QR code, and then compare the audio media time (the start of the test sample: 0 ms, 20 ms, etc.) with the currentTime reported by the QR code. Startup delay is measured from the play() event to the 1st audio sample.

On the device that I used for testing, it shows a 169 ms start-up delay on currentTime. The currentTime reports then catch up with the media time; however, the delay is still bigger than the expected tolerance, ranging from 45 ms to 90 ms. This is what I see from the test result of one device, which I don't believe is representative of all devices, as this can be a device issue.

The main issue I can see from the result is that the duration of audio playback is exactly 30 s, but the duration between the reporting of currentTime=0 and currentTime=30s is longer than 30 seconds.

At currentTime=0 no audio sample is played
At currentTime=25.97 audio_media_time=100
…
At currentTime=29786.706 audio_media_time=29840
At currentTime=29999.999 no audio sample is played

I can see a test failure on this device, but I am afraid I am not knowledgeable enough to comment on what the expected tolerance should be here. If the currentTime and audio_media_time are expected to match within +/- 20ms, then we can keep it. Can anyone help please?
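The comparison described above can be sketched as follows. This is a minimal illustration, not the test suite's code: the function name `out_of_tolerance` is hypothetical, and it assumes both values in the figures quoted above are in milliseconds.

```python
from typing import List, Tuple

TOLERANCE_MS = 20.0  # the +/- 20 ms tolerance from the original observation text


def out_of_tolerance(
    pairs: List[Tuple[float, float]], tolerance_ms: float = TOLERANCE_MS
) -> List[Tuple[float, float, float]]:
    """Return (currentTime, audio_media_time, difference) for each failing pair.

    Each input pair is (currentTime_ms, audio_media_time_ms) captured at the
    same point in the recording. The difference is media time minus currentTime.
    """
    failures = []
    for current_time_ms, media_time_ms in pairs:
        diff = media_time_ms - current_time_ms
        if abs(diff) > tolerance_ms:
            failures.append((current_time_ms, media_time_ms, diff))
    return failures


# The two data points quoted above, assumed to be in milliseconds.
observed = [(25.97, 100.0), (29786.706, 29840.0)]
failures = out_of_tolerance(observed)
for current_time_ms, media_time_ms, diff in failures:
    print(f"currentTime={current_time_ms} media_time={media_time_ms} diff={diff:.3f} ms")
```

Under that unit assumption, both quoted points differ by more than 20 ms, which is consistent with the failure reported for this device.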

mbergman42 commented 1 year ago

"The way we capture time differences is to work out the correspondent audio sample that is presented at the same recording time as currentTime QR code."

OK, it sounds like you’re doing audio sample verification using video as the “base” time.

The way I did it here was without referring to video. I take PN01, slice it into 20 mS slices, and for each slice, measure the delay from (audio) T=0 to the location of that slice in the recorded audio. So, I'm just verifying recorded audio against PN01.

I think the test requirement (“Every sample S[k,s] shall be rendered and the samples shall be rendered in increasing presentation time order”) doesn’t require reference to video, if we’re making audio observations?

If so, I think this can be done without involving the video. My test program was:

  1. Get first 20 mS sample from PN01
  2. Locate this first sample in the recorded audio and take that as “T=0”
  3. Get a next 20 mS sample in PN01, find the delay from T=0 before the sample appears in the recorded audio. This delay should match 1:1 with the location in PN01.
  4. Repeat step 3 until end of track.
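The four steps above can be sketched as below. This is not the actual test program: it uses a brute-force correlation search as a stand-in for whatever detection was really used, a synthetic random sequence in place of PN01, and a hypothetical 2 kHz sample rate so the demo runs quickly. The names `find_slice` and `verify_slices` are illustrative only.

```python
import random
from typing import List, Sequence


def find_slice(recording: Sequence[float], piece: Sequence[float]) -> int:
    """Return the offset in `recording` where `piece` correlates best."""
    best_offset, best_score = -1, float("-inf")
    for offset in range(len(recording) - len(piece) + 1):
        window = recording[offset:offset + len(piece)]
        score = sum(r * p for r, p in zip(window, piece))
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset


def verify_slices(reference: Sequence[float], recording: Sequence[float],
                  slice_len: int) -> List[int]:
    """Return, per slice, (measured delay from T=0) - (position in reference)."""
    # Steps 1-2: locate the first slice and take its position as T=0.
    t0 = find_slice(recording, reference[:slice_len])
    errors = []
    # Steps 3-4: repeat for every following whole slice until end of track.
    for begin in range(slice_len, len(reference) - slice_len + 1, slice_len):
        found = find_slice(recording, reference[begin:begin + slice_len])
        errors.append((found - t0) - begin)
    return errors


# Synthetic stand-in for PN01: 400 random +/-1 samples, 40 samples per slice
# (20 ms at a hypothetical 2 kHz rate, chosen only to keep the demo fast).
rng = random.Random(1)
reference = [rng.choice((-1.0, 1.0)) for _ in range(400)]
recording = [0.0] * 37 + reference + [0.0] * 15  # 37 samples of lead-in silence
errors = verify_slices(reference, recording, slice_len=40)
print(errors)  # all zeros means every slice sits where the reference says it should
```

An error of zero for a slice corresponds to the "match 1:1 with the location in PN01" condition in step 3; a real recording would need a tolerance rather than an exact match.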

"The main issue I can see from the result is the duration of audio playback is 30s exactly, but the duration between reporting of currentTime=0 till currentTime=30s is longer than 30 seconds."

Maybe this is clock skew? I had to deal with skew between the playout system’s audio clock frequency (DAC clock) compared to the recording system’s audio clock frequency (ADC clock). The slight difference between these clocks means that after some number of samples, the skew builds up and the samples don’t match within 20 mS anymore. That clock run-out or skew should not be counted for “tolerance” in this test, in my opinion.

DAC/ADC clocks for laptops are (at best) on the order of +/-100 ppm. At that clocking tolerance, the worst-case difference between the playout and recording clocks is 200 ppm, so the skew could reach about 6 ms over a 30 second track (200 ppm x 30 s).

My solution was to re-sync the timebase on each successful verification of a sample. That way, I was eliminating the skew each 20 mS. This is similar to how it would be done in communications systems for aligning sending and receiving clocks.
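The effect of re-syncing can be illustrated with a small simulation. This is a sketch under stated assumptions, not the original program: the 200 ppm clock difference, the 5 ms offset of the first slice, and the function names are all hypothetical.

```python
from typing import List

SLICE_MS = 20.0  # length of one test slice


def errors_from_t0(positions_ms: List[float]) -> List[float]:
    """Error of each slice against a timebase fixed at the first slice."""
    t0 = positions_ms[0]
    return [(pos - t0) - i * SLICE_MS for i, pos in enumerate(positions_ms)]


def errors_with_resync(positions_ms: List[float]) -> List[float]:
    """Error of each slice against a timebase re-synced at the previous slice."""
    errors = [0.0]  # the first slice defines the initial timebase
    for prev, pos in zip(positions_ms, positions_ms[1:]):
        errors.append((pos - prev) - SLICE_MS)
    return errors


# Hypothetical recording: 30 s of slices, with the recording clock differing
# from the playout clock by an assumed 200 ppm.
skew = 200e-6
positions = [5.0 + i * SLICE_MS * (1 + skew) for i in range(1500)]

max_fixed = max(abs(e) for e in errors_from_t0(positions))
max_resync = max(abs(e) for e in errors_with_resync(positions))
print(f"fixed timebase: {max_fixed:.3f} ms, re-synced: {max_resync:.3f} ms")
```

With a fixed timebase the error grows linearly with position in the track, while the re-synced error stays at the size of one slice's worth of skew, which is why the skew no longer counts against the tolerance.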

I don’t know if you’re seeing clock skew, but that would be my guess.

mbergman42 commented 1 year ago

This is the text in the new version of the DPCTF spec as discussed in the DPCTF meeting today (2023-05-17). This text is intended to resolve the fact that audio testing isn't relative to "frames" and that startup delay for audio needs to be defined differently from video. This text is used for most other sections (e.g., "as per 8.2.5.2" is used in multiple places), and a version of it is used but modified for certain tests.

8.2.5.2 Video

If the track is a video track, then the following additional observations are expected:

1) Every video frame S[k,s] shall be rendered such that it fills the entire video output window following the properties in clause 5.2.2.
2) The presented sample shall match the one reported by the currentTime value within the tolerance of +/- (2/framerate + 20ms).
3) Every video frame S[k,s] shall be rendered and the video frames shall be rendered in increasing presentation time order.
4) Video start-up delay: The start-up delay should be sufficiently low. As user agents may pre-load the first frame, the time to first frame is not relevant, but what is relevant is that once hitting play, the second frame is rendered within the considered start-up delay. In addition, there may be missing frames at start up. Hence, TR [k, x] – Ti < TSMax where x is the first detected frame change after the play() event. [Editor: moved from 8.5.2.1 General.]
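As a worked illustration of observations 2 and 4, the sketch below evaluates the tolerance and the start-up inequality. It assumes the framerate is in frames per second, that "2/framerate" is in seconds, and that all times are in milliseconds. The function names and the TSMax value used in the example are hypothetical and not taken from the spec.

```python
def video_tolerance_ms(framerate: float) -> float:
    """Tolerance of +/- (2/framerate + 20 ms), returned in milliseconds."""
    return 2000.0 / framerate + 20.0


def startup_delay_ok(tr_ms: float, ti_ms: float, ts_max_ms: float) -> bool:
    """Check TR[k, x] - Ti < TSMax, with all times in milliseconds."""
    return tr_ms - ti_ms < ts_max_ms


for fps in (25.0, 50.0):
    print(f"{fps:g} fps -> +/- {video_tolerance_ms(fps):g} ms")

# Hypothetical values: play() at 1000 ms, first detected frame change at 1090 ms,
# and an assumed TSMax of 120 ms.
print(startup_delay_ok(tr_ms=1090.0, ti_ms=1000.0, ts_max_ms=120.0))
```

Under these assumptions the tolerance works out to two frame periods plus 20 ms, so it tightens as the framerate rises.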

8.2.5.3 Audio

If the track is an audio track, then the following additional observations are expected:

1) When examined as a continuous sequence of timestamped audio samples of the audio stream, the 20 mS test audio samples shall be a complete rendering of the source audio track and are rendered in increasing presentation time order. [Editor: This text was worked out between Yan and me in a sidebar.]
2) Audio start-up-delay: The start-up delay shall be sufficiently low. TR [k, 1] – Ti < TSMax. [Editor: This text was worked out between Yan and me in a sidebar.]
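One possible reading of audio observation 1 as an automated check is sketched below. It is an assumption about how "complete rendering" and "increasing presentation time order" might be evaluated, not the observation framework's implementation. The function name and the input format (media times of detected 20 ms test samples, in detection order) are hypothetical.

```python
from typing import List, Tuple


def audio_observations(detected_ms: List[float], track_ms: float,
                       slice_ms: float = 20.0) -> Tuple[List[float], bool]:
    """Return (missing test samples, whether samples were in increasing order).

    detected_ms lists the media time of each detected 20 ms test sample in
    the order it was found in the recording.
    """
    expected = [i * slice_ms for i in range(int(track_ms // slice_ms))]
    missing = sorted(set(expected) - set(detected_ms))
    in_order = all(a < b for a, b in zip(detected_ms, detected_ms[1:]))
    return missing, in_order


# A 100 ms track has test samples starting at 0, 20, 40, 60 and 80 ms.
print(audio_observations([0.0, 20.0, 40.0, 60.0, 80.0], track_ms=100.0))
print(audio_observations([0.0, 20.0, 60.0, 80.0], track_ms=100.0))  # one missing
print(audio_observations([0.0, 20.0, 40.0, 80.0, 60.0], track_ms=100.0))  # out of order
```

A track passes this reading of the observation when the missing list is empty and the order flag is true.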

mbergman42 commented 1 year ago

On review and discussion in the DPCTF Test Suite meeting 2023-06-20, we agreed to 1) use the above text for automated test observations in the spec, and 2) do not attempt to document manual observations (it's a trivial exercise for the tester to verify at the human perception level, and we'll be bringing the OF up to full A/V test capability anyway).

I'll work with @gitwjr to update the spec with this language.

mbergman42 commented 1 year ago

The text is incorporated in v1.47 of the DPCTF spec. Recommend closing this as the action has been taken.

gitwjr commented 1 year ago

Closed per above comments.