Closed yilin9999 closed 5 years ago
The first one is more close to the truth. Regarding viewing distance: a study has found that in UK the average household TV viewing distance is greater than 3H. So in this case, VMAF is a pessimistic estimation.
Hi li-zhi,
Thanks for your helpful explanation.
I have read the VMAF definition in the following two descriptions:
https://github.com/Netflix/vmaf/blob/master/resource/doc/models.md
https://medium.com/netflix-techblog/toward-a-practical-perceptual-video-quality-metric-653f208b9652
Questions:
I think ACR result can indicate the "absolute quality" for the video. Instead of only representing the distortion level in the previous VMAF with DSIS method, could current VMAF with ACR indicate “absolute quality” for different titles?
The part of “Predict Quality on a Cellular Phone Screen” in the first link, the phone model doesn’t apply the fixed viewing distance, but reference’s resolution is fixed in 1080p. In my opinion, the viewing distance of the phone screen is much longer than TV (>4H), so I think the proper reference’s resolution is supposed to be <1080p. it means the VMAF score with phone model will be much higher. Is that right?
@yilin9999 These are good questions.
On 1): what we found is that when passing ACR scores to the SVM model for training, when the hyper-parameters are set properly (meaning without incurring overfit), the learned model will still behave like a degradation metric at the very high quality end (for example, when passing two identical videos as ref/dis pair). When looking at the overall correlation, we found that the model trained by ACR behave better than DSIS, when cross-validated on other datasets.
On 2): this is a valid point. The assumption of 1080p and 3H corresponds to the angular resolution of 60 pixels/degree. If the viewing distance is > 4H, the corresponding planar resolution should be smaller than 1080p. The current phone model is piggybacked on the default model (hence 1080p reference) by applying a polynomial transform to the predicted scores (i.e. a point operator, learned from curve-fitting the default vs. phone scores of the same video viewed in two conditions). Perhaps a better way is to downsample to a lower resolution than 1080p in the training process. On the other hand, one needs to be cautious in applying the 60 pixel/degree heuristic, since it is a conservative estimation. The downsampling operation may be an overkill. It would be interesting to compare the prediction accuracy of the current phone model with the one trained with a lower reference resolution.
Hi, We have read the FAQ and some discussions, and still had a question about the reference definition.
Why does VMAF use 1080p/3H as its reference?
In my opinion, I think there are two possible aspects of this calibration.
First is resolution-driven, 1080p is mainstream TV high quality, and based on the ITU-R BT 2022, section 1.3.2, the corresponding viewing distance is 3H
Second is viewing-distance-driven, 3H is the common comfortable distance for watching TV, and based on the ITU-R BT 2022, section 1.3.2, the corresponding resolution is 1080p.
I am curious about which one does VMAF apply?