cta-standards / R4WG20-QoE-Metrics

Issue tracking repository for the R4-Wg20 QoE Initiative

Feedback on proposed QoE metrics #3

Closed tonystott closed 5 years ago

tonystott commented 5 years ago

These comments are from Tony Stott, Head of Streaming Video Performance at Sky:

Pg5: “A Playback Session starts when a user attempts to play media (audio or video) and ends when the media completes.” I would define this in relation to the customer, since the end of the session may not coincide with the end of the media. The last definition has been truncated, so I am not sure what is meant here.

Pg10: Playback Failure Percentage. This must explicitly measure failure once playback has commenced, as opposed to failure beforehand. It must also measure only terminal failures not initiated or caused by the customer; e.g. a customer not having entitlement should not be counted as a failure. The naming of the metric also seems at odds with the other metrics in that it is an average failure rate but is termed a ‘percentage’.

Pg10: Average Initial Startup Time. This must measure up to any ads or promos shown before the content starts.

Pg11: Average Playback Stalled Count. playbackStallCount is referenced before it is defined. I question this being a first-order measure; it is better described in terms of the buffering ratio below. Specifically, this is the average count of stalls per session, when it should be expressed as a ratio of viewing time; otherwise comparing it meaningfully between data sets is fraught.
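
To make that concrete, here is a minimal sketch (the field names `stall_count` and `viewing_time_s` are mine, not from the draft) showing why a raw average of stall counts is misleading compared with a per-viewing-time rate:

```python
# Sketch only: illustrative session records, not the spec's data model.
sessions = [
    {"stall_count": 2, "viewing_time_s": 300},    # 5-minute clip, 2 stalls
    {"stall_count": 2, "viewing_time_s": 7200},   # 2-hour film, 2 stalls
]

# Averaging raw counts treats both sessions as identical experiences.
avg_stall_count = sum(s["stall_count"] for s in sessions) / len(sessions)

# Normalising by viewing time distinguishes them.
stalls_per_hour = [
    s["stall_count"] / (s["viewing_time_s"] / 3600) for s in sessions
]

print(avg_stall_count)    # 2.0 for both shapes of session
print(stalls_per_hour)    # [24.0, 1.0] -- very different experiences
```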

Pg11: Average Stalled Time Percentage. This is a key metric, but the name is a bit of a mouthful. Can we rename it to something simpler, say “buffering ratio”? The suggested denominator is watched time, which includes seeking (and perhaps restart time). This is incorrect: a customer who seeks a lot, pauses and restarts will produce a high value that is not indicative of a bad experience or a failing delivery chain. In Conviva land, this is the difference between “rebuffering ratio” and “connection induced rebuffering ratio” – I would rather we got this right from the start. Watched time also includes start-up time, which skews the calculation, since stalled time should only be measured once playback has commenced; you end up with a numerator that excludes start-up time and a denominator that includes it. As an example, under the current definition, if I see a 1s stall in the first 2s of playback after an 8s start-up, I would get a buffering ratio of 10% (1/(8+2)), which doesn’t sound too bad, when it should be 33% (1/(1+2)).
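
A small sketch of that worked example (variable names are mine) comparing the two denominators:

```python
startup_s = 8.0   # initial start-up time
stall_s = 1.0     # re-buffering time after playback began
played_s = 2.0    # time spent actually rendering content

# Current draft definition: denominator is "watched time", which includes start-up.
ratio_with_startup = stall_s / (startup_s + played_s)   # 1 / 10 = 10%

# Proposed definition: only time after playback has commenced counts.
ratio_after_start = stall_s / (stall_s + played_s)      # 1 / 3 ≈ 33%

print(f"{ratio_with_startup:.0%}")  # 10%
print(f"{ratio_after_start:.0%}")   # 33%
```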

Pg11: I would say the key metrics missing from the normative stats are: average session length, average start failures and average restart time.

Pg12: We need to define the maximum length of time a session can continue for; if a session is recorded at 3 weeks, we can presume nobody is actually watching it. Where should this cut-off be? The same applies to buffering: if we see 1 hr of buffering, we can conclude the reporting has failed and nobody is watching. This needs to be defined, as there is a big variance right now, which makes the metrics incomparable. Also, I don’t understand why a session ends when “a measurement timeout expires” – isn’t this just bad reporting?
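
As a sketch of what I mean, implausible sessions would simply be excluded before aggregation; the cut-off values below are placeholders for whatever the spec eventually mandates, not proposals from the document:

```python
# Placeholder thresholds -- the actual cut-offs are exactly what needs defining.
MAX_SESSION_S = 12 * 3600   # e.g. treat sessions longer than 12 h as reporting failures
MAX_STALL_S = 15 * 60       # e.g. treat > 15 min of continuous buffering as abandoned

def is_plausible(session):
    """Exclude sessions whose telemetry suggests nobody is actually watching."""
    return (session["duration_s"] <= MAX_SESSION_S
            and session["stalled_s"] <= MAX_STALL_S)

sessions = [
    {"duration_s": 5400, "stalled_s": 12},                # normal 90-minute session
    {"duration_s": 3 * 7 * 24 * 3600, "stalled_s": 30},   # "3-week" session: discard
    {"duration_s": 4000, "stalled_s": 3600},              # 1 h of buffering: discard
]

usable = [s for s in sessions if is_plausible(s)]
print(len(usable))  # 1
```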

Pg15: “time is specified in seconds” – it should be in milliseconds. Events must be frame-accurate, i.e. 40 ms for us in Europe. There are a load of things I would like to see in this table, but the main ones are buffer size, bandwidth, device, OS + version, app version and a list of errors encountered.

We also need to be clear on why we are reporting everything as averages. Averages are easy to calculate, but are they the most useful cut? We find that averages mask issues. With millions of plays, most customers have an okay experience, but the ones who contact us, give negative NPS and churn are the ones in the bottom 1%. Surely these are the experiences we should be tracking and working to fix? As such, shouldn’t we change the definitions to the 99th or 95th percentile?
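
A toy illustration with made-up numbers (not real Sky data) of how an average hides exactly the sessions we care about:

```python
# 9,900 sessions with a 0.5% buffering ratio, 100 sessions (the bottom 1%) with 40%.
ratios = [0.005] * 9900 + [0.40] * 100
ratios.sort()

mean = sum(ratios) / len(ratios)
p95 = ratios[int(0.95 * len(ratios))]
p99 = ratios[int(0.99 * len(ratios))]

print(f"mean: {mean:.2%}")  # ~0.9% -- looks healthy
print(f"p95:  {p95:.2%}")   # 0.50% -- still looks healthy
print(f"p99:  {p99:.2%}")   # 40.00% -- the sessions that drive contacts, negative NPS and churn
```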

gheikkila commented 5 years ago

Copy of issue #22