Netflix / vmaf

Perceptual video quality assessment based on multi-method fusion.

VMAF reference definition #238

Closed yilin9999 closed 5 years ago

yilin9999 commented 6 years ago

Hi, we have read the FAQ and some of the discussions, but we still have a question about the reference definition.

Why does VMAF use 1080p/3H as its reference?

I think there are two possible aspects of this calibration.

Which one does VMAF apply?

li-zhi commented 6 years ago

The first one is closer to the truth. Regarding viewing distance: a study has found that in the UK the average household TV viewing distance is greater than 3H. So in this case, VMAF is a pessimistic estimate.

yilin9999 commented 6 years ago

Hi li-zhi,

Thanks for your helpful explanation.

I have read the VMAF definition in the following two descriptions:

  1. https://github.com/Netflix/vmaf/blob/master/resource/doc/models.md

    • This model is trained using subjective data collected in a lab experiment, based on the absolute categorical rating (ACR)
  2. https://medium.com/netflix-techblog/toward-a-practical-perceptual-video-quality-metric-653f208b9652

    • In standardized subjective testing, the methodology we used is referred to as the Double Stimulus Impairment Scale (DSIS) method.
    • A consistent metric. Since VMAF incorporates full-reference elementary metrics, VMAF is highly dependent on the quality of the reference.

Questions:

  1. I think ACR results can indicate the "absolute quality" of a video. The previous VMAF, trained with the DSIS method, only represented the distortion level. Could the current VMAF, trained with ACR, indicate "absolute quality" across different titles?

  2. In the "Predict Quality on a Cellular Phone Screen" part of the first link, the phone model does not apply a fixed viewing distance, but the reference resolution is fixed at 1080p. In my opinion, the viewing distance for a phone screen is much longer than for a TV (>4H), so I think the proper reference resolution should be <1080p. That means the VMAF score with the phone model will be much higher. Is that right?

li-zhi commented 6 years ago

@yilin9999 These are good questions.

On 1): what we found is that when ACR scores are passed to the SVM model for training and the hyper-parameters are set properly (meaning without overfitting), the learned model still behaves like a degradation metric at the very high quality end (for example, when two identical videos are passed as the ref/dis pair). Looking at the overall correlation, we found that the model trained on ACR scores behaves better than the one trained on DSIS scores when cross-validated on other datasets.

On 2): this is a valid point. The assumption of 1080p and 3H corresponds to an angular resolution of 60 pixels/degree. If the viewing distance is > 4H, the corresponding planar resolution should be smaller than 1080p. The current phone model is piggybacked on the default model (hence the 1080p reference) by applying a polynomial transform to the predicted scores (i.e. a point operator, learned by curve-fitting the default vs. phone scores of the same video viewed in the two conditions). Perhaps a better way is to downsample to a resolution lower than 1080p in the training process. On the other hand, one needs to be cautious in applying the 60 pixels/degree heuristic, since it is a conservative estimate. The downsampling operation may be overkill. It would be interesting to compare the prediction accuracy of the current phone model with that of a model trained with a lower reference resolution.
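As a rough illustration of the viewing-distance arithmetic above, here is a minimal sketch. It is not part of the VMAF codebase: the function names are hypothetical, and it assumes a flat screen viewed head-on, with angular density measured at the screen centre. Under that approximation, 1080 lines at 3H comes out a little under 60 pixels/degree (the quoted figure is a round heuristic), and keeping the same density at 4H needs fewer than 1080 lines.

```python
import math


def pixels_per_degree(lines: float, distance_in_heights: float) -> float:
    """Pixels per degree of visual angle at the screen centre.

    `lines` is the vertical resolution; `distance_in_heights` is the
    viewing distance expressed in picture heights (e.g. 3 for "3H").
    """
    # One pixel spans 1/lines of the picture height, seen from
    # distance_in_heights picture heights away.
    pixel_angle_rad = 2 * math.atan(1 / (2 * lines * distance_in_heights))
    return 1 / math.degrees(pixel_angle_rad)


def equivalent_lines(ref_lines: float, ref_distance: float,
                     new_distance: float) -> float:
    """Vertical resolution giving the same angular density at a new distance.

    In this approximation the density depends only on the product
    lines * distance, so the product is held constant.
    """
    return ref_lines * ref_distance / new_distance


if __name__ == "__main__":
    # Reference condition discussed in the thread: 1080p at 3H.
    print(f"{pixels_per_degree(1080, 3):.1f} pixels/degree")  # about 56.5
    # Same angular density at 4H (assumed phone-like viewing distance).
    print(f"{equivalent_lines(1080, 3, 4):.0f} lines")  # 810
```

This only covers the geometry. It says nothing about how scores would change if a model were retrained with a lower reference resolution.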