cta-standards / R4WG20-QoE-Metrics

Issue tracking repository for the R4-WG20 QoE Initiative

Comments made by Tony Stott, Head of Streaming Video at Sky #22

Open mlevine84 opened 5 years ago

mlevine84 commented 5 years ago

1) Pg5: “A Playback Session starts when a user attempts to play media (audio or video) and ends when the media completes.” I would describe it in relation to the customer, since the end of the session may not coincide with the end of the media. The last definition has been truncated, so I'm not sure what is meant there.

2) Pg10: Playback Failure Percentage. This must explicitly measure failure once playback has commenced, as opposed to failure beforehand. It must also measure only terminal failure not initiated or caused by the customer; e.g. the customer not having entitlement is not counted as a failure. The naming of the metric seems at odds with the other metrics, in that it is an average failure rate but is termed a ‘percentage’.
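
To make the intended rule concrete, here is a rough TypeScript sketch; the field names and error codes are purely illustrative, not taken from the draft:

```typescript
// Illustrative only: every field and error code below is hypothetical.
interface SessionEnd {
  playbackStarted: boolean;                       // did a first frame ever render?
  endReason: "completed" | "userExit" | "error";  // how the session terminated
  errorCode?: string;                             // e.g. "ENTITLEMENT_DENIED"
}

// Errors we would treat as customer-initiated rather than delivery failures.
const CUSTOMER_CAUSED = new Set(["ENTITLEMENT_DENIED", "PARENTAL_PIN_ABORT"]);

// A session counts toward Playback Failure Percentage only if playback had
// commenced and the terminal error was not caused by the customer.
function isPlaybackFailure(s: SessionEnd): boolean {
  return (
    s.playbackStarted &&
    s.endReason === "error" &&
    s.errorCode !== undefined &&
    !CUSTOMER_CAUSED.has(s.errorCode)
  );
}
```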

3) Pg10: Average Initial Startup Time. This must measure up to any ads or promos before the content starts.

4) Pg11: Average Playback Stalled Count. This references playbackStallCount before it is defined. I question this being a first-order measure; it's better described in terms of the buffering ratio below. Specifically, it is the average count of stalls per session when it should be the average as a ratio of viewing time, otherwise comparing it meaningfully between sets is fraught (see the sketch after point 5).

5) Pg11: Average Stalled Time Percentage. Key metric, but a bit of a mouthful. Can we rename it to something simpler, say buffering ratio? The suggested denominator is watched time, which includes seeking (and perhaps restart time). This is incorrect: a customer who seeks a lot, pauses and restarts will produce a high value, which is not indicative of a bad experience or a failing delivery chain. In Conviva land, this is the difference between “rebuffering ratio” and “connection induced rebuffering ratio” – I would rather we got this right from the start. Watched time also includes start-up time. This skews the calculation, since stalled time should only be measured once playback has commenced, so you have a numerator that excludes start-up time and a denominator that includes it. As an example, under the current definition, if I see a 1s stall in the first 2s of playback after an 8s start-up, I get a buffering ratio of 10%, which doesn't sound too bad, when it should be 33% (1/(1+2)).
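
For illustration, a minimal sketch of the two denominators using the numbers from the example above (the field names are mine, not the spec's); it also shows the viewing-time normalisation I'm suggesting for point 4:

```typescript
// Illustrative sketch; field names are hypothetical, not from the draft spec.
interface Session {
  startupTimeSec: number; // play intent to first frame
  playingTimeSec: number; // content actually rendering, after start-up
  stalledTimeSec: number; // rebuffering after playback commenced
  stallCount: number;     // number of stall events after playback commenced
}

const s: Session = { startupTimeSec: 8, playingTimeSec: 2, stalledTimeSec: 1, stallCount: 1 };

// As currently drafted: stalled time over watched time, which includes start-up.
const asDrafted = s.stalledTimeSec / (s.startupTimeSec + s.playingTimeSec); // 1 / (8 + 2) = 10%

// What I'm proposing: stalled time over post-start-up viewing time only.
const bufferingRatio = s.stalledTimeSec / (s.stalledTimeSec + s.playingTimeSec); // 1 / (1 + 2) ≈ 33%

// Point 4: normalise the stall count by viewing time rather than per session,
// so sessions of very different lengths can be compared meaningfully.
const stallsPerViewingHour = s.stallCount / ((s.stalledTimeSec + s.playingTimeSec) / 3600);
```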

6) Pg11: I would say the key metrics missing from the normative stats are: average session length, average start failures and average restart time.

7) Pg12: We need to define how long a session can continue for, i.e. if a session is recorded at 3 weeks we can presume nobody is actually watching it. Where should this cut-off be? The same goes for buffering: if we see 1hr of buffering we can conclude the reporting has failed and nobody is watching. This needs to be defined, as there is big variance right now, causing metrics to be incomparable. Also, I don’t understand why a session ends when “a measurement timeout expires” – isn’t this just bad reporting?
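
As a straw man for discussion, something like the cut-off below; both caps are placeholder values of mine, not proposed normative limits:

```typescript
// Placeholder caps for discussion only; the actual limits need to be agreed.
const MAX_SESSION_SEC = 8 * 3600;     // e.g. treat >8h sessions as reporting failures
const MAX_SINGLE_STALL_SEC = 30 * 60; // e.g. a 30-minute stall means nobody is watching

// Sessions beyond the caps are excluded from the aggregates instead of skewing
// them, on the assumption that the client stopped reporting correctly.
function isReportable(sessionDurationSec: number, longestStallSec: number): boolean {
  return sessionDurationSec <= MAX_SESSION_SEC && longestStallSec <= MAX_SINGLE_STALL_SEC;
}
```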

8) Pg15: “time is specified in seconds” – it should be in milliseconds. Events must be frame-accurate, i.e. 40ms for us in Europe. I have a load of things I want to see in this table, but the main ones are buffer size, bandwidth, device, OS + version, app version and a list of errors encountered.

We also need to be clear on why we are reporting everything as averages. Averages are easy to calculate, but are they the most useful cut? We find averages mask issues. With millions of plays, we find most customers have an okay experience. But the ones who contact us, give negative NPS and churn are the ones in the bottom 1%. Surely these are the experiences we should be tracking and working to fix? As such, shouldn’t we change the definitions to the 99th or 95th percentile?
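
To illustrate the point, a quick sketch with made-up per-session buffering ratios, showing how the mean hides the tail that a 95th percentile exposes:

```typescript
// Made-up sample data, purely to illustrate mean vs. tail percentiles.
const bufferingRatios = [0.0, 0.0, 0.01, 0.0, 0.02, 0.0, 0.35];

// Nearest-rank percentile over the sorted values.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const idx = Math.max(0, Math.ceil((p / 100) * sorted.length) - 1);
  return sorted[Math.min(idx, sorted.length - 1)];
}

const mean = bufferingRatios.reduce((sum, r) => sum + r, 0) / bufferingRatios.length; // ≈ 0.054
const p95 = percentile(bufferingRatios, 95);                                          // 0.35

// The mean looks healthy, while the sessions that generate contacts and churn
// sit in the tail that only the percentile view surfaces.
```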

gheikkila commented 5 years ago

Copy of issue #3

mlevine84 commented 5 years ago
  1. WG edited the "Playback Session" definition. Comment resolved.

  2. WG feels the definition is broad enough because the term "playback" captures the user's intent of initiating playback. Any further error analysis is up to the analytics system. The metric is not a rate.

  3. The document already does this.

  4. This has been heavily discussed in the group, exactly with your arguments. The reason for (so far) keeping the non-normalized "stalls/session" is that many vendors seem to like it, as it's easy to explain to (upper) management.

  5. Valid point. WG will review this further later.

  6. WG will discuss this further.

  7. Valid point. WG will review this further later.

  8. Valid point. Increase decimal places from 2 to 3. This doesn't imply an increase in accuracy; we are specifying the maximum precision.

  9. WG has attempted to address this in Annex B.

mlevine84 commented 5 years ago

Comments 5-8: Assigned to Neel and Steve

heff commented 4 years ago
  1. I think this is a good point, but in the interest of publishing something I think we should address it after the first version. I think we'll get lots of similar good feedback after publishing.

  2. The wording in the doc says "When time is specified, it is in seconds with precision of up to three decimal places." So it is inclusive of milliseconds. We could decide to require millisecond precision.

njadia commented 4 years ago

Comment # 8 - Addressed by WG, standardized on the unit. Completed.

@njadia Update diagram for "Playback of one asset with incomplete ..." change "Poll of Playhead time" to "Poll for a heartbeat"

@heff To propose language to address Comment #7 and to add language noting that the WG acknowledges the point raised but, after much debate, decided to leave it open.

Comment # 6 - WG agrees to include the requested metrics in the future version of the spec.

Comment # 5 - WG will review and deliberate on this offline; @heff will come with a proposal, and the WG agreed to conclude on this at the Nov 13th meeting.

heff commented 4 years ago

For Number 5 I proposed an option in the Google doc for discussion at the next meeting.

For Number 6 I created issues for the additionally requested metrics for discussion later on if we should include them (we haven't committed to including them just yet).

For Number 7 I added some language to the doc at the end of the Standardized Playback Session Metrics description to note that there is not currently a cap on session length, but we'll discuss adding one in a later version.