Netflix / vmaf

Perceptual video quality assessment based on multi-method fusion.

VMAF between reference and itself not 100? #76

Closed mht13 closed 5 years ago

mht13 commented 7 years ago

Hi, I have noticed in multiple cases that comparing the reference file with itself does not give a score of 100. Sometimes it is 99.99, but in other cases it is, say, 98.7 or 99.2, while SSIM in all cases gives exactly 1. Shouldn't VMAF give exactly 100 in those cases? If not, is that related to the perceptual component in the data fusion? And if so, would that represent some rough level of uncertainty between VMAF and DMOS (say, 1% for what should be absolutely lossless, and some other number for other resolutions)? Is this a known behavior? If so, are there details on the statistical deviations between VMAF and DMOS for the training sets provided? Any thoughts on this?

Btw, vmaf_feature_motion seems to be causing this.

li-zhi commented 7 years ago

Hi @mht13 : you are right. VMAF doesn't guarantee that you get a perfect score in this case, but you should get a score close to it. A similar thing happens with many machine learning-based predictors (another example is VQM-VFD).

We are also trying to add a confidence interval to VMAF's prediction. If you have any ideas, please let us know.

vmaf_feature_motion does have an impact on the score in this case. Intuitively, the lower the motion, the more likely you are to notice an artifact. So you are more likely to get an imperfect score for static scenes.

mht13 commented 7 years ago

Hi @li-zhi

What follows is an extremely rough suggestion.

Ideally I'd like to see a rigorous uncertainty quantification approach, but that is technically somewhat non-trivial. I have some expertise on this so if you are interested we can chat about it and perhaps do something together.

The other, way simpler, thing that I would do is the following. You are getting a PDF using data fusion and SVM. My understanding is that you are reporting pointwise values (from a statistical point of view) with no confidence intervals, at least as far as is publicly known. That goes back to me wondering how you analyze the 1-5 scores -- you probably want to do a proper multinomial analysis (which is what we do). That would give you confidence intervals for at least your scores. Perhaps you are doing so already, and if so I apologize for being redundant/naive.

I have noticed from your preprint on the arXiv that you run tests with respect to the number of observers, which is great.

So, given that you already have all that data, I'd do a KDE for each of them (there are lots of details on how to do so, but it is more or less known). At that point you are dealing with continuous functions, so you can do a standard (self, in this case) convergence test, in whatever norm you find appropriate (I think it should be a weighted one, and that is what we do). I can give more suggestions if this is something that interests you, but these are standard convergence tests, with the difference that now you are dealing with continuous functions as opposed to discrete data from "random" (even if carefully selected) samples. The relative self-differences in the KDE should converge, at a perhaps unknown a priori convergence rate. The self-differences should give you a rough idea of the uncertainty in your final model.
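
To make this a bit more concrete, here is a very rough sketch of the kind of self-convergence test I have in mind (synthetic scores standing in for real observer data; the kernel, bandwidth and weight are all placeholders that would need tuning):

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Synthetic stand-in for raw observer scores on a [1, 5] scale.
all_scores = np.clip(rng.normal(3.8, 0.6, size=200), 1, 5)

grid = np.linspace(1, 5, 401)
# Placeholder weight emphasizing the high end of the scale; in practice
# it would be derived from the metric itself.
weight = (grid - 1.0) / 4.0

def kde_on_grid(scores, bandwidth=0.25):
    """Density estimate of the score distribution, using a compact-support kernel."""
    kde = KernelDensity(kernel="epanechnikov", bandwidth=bandwidth)
    kde.fit(scores[:, None])
    return np.exp(kde.score_samples(grid[:, None]))

# Self-convergence: re-estimate the density with more and more observers and
# check how much the estimate still changes, in a weighted L2 norm.
subset_sizes = [25, 50, 100, 200]
densities = [kde_on_grid(all_scores[:n]) for n in subset_sizes]

for i in range(1, len(subset_sizes)):
    diff = np.sqrt(np.trapz(weight * (densities[i] - densities[i - 1]) ** 2, grid))
    print(f"{subset_sizes[i - 1]:>3} -> {subset_sizes[i]:>3} observers: "
          f"weighted self-difference = {diff:.4f}")
# The difference between the last two sets gives a rough error estimate.
```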

Hope that helps.

li-zhi commented 7 years ago

@mht13 thanks for the comments and suggestions. I have several questions:

1) When you suggested a multinomial analysis, did you have in mind binning the scores into five bins 1, 2, ..., 5? I am not sure this is appropriate, since we collected continuous scores within [1, 5]. But the very issue of continuous vs. discrete scores is up for debate.

2) The suggested KDE approach is interesting. Do you suggest that I throw in the processed scores (e.g. MOS) or the raw scores as samples? It would be nice if you could give more detailed suggestions on the convergence tests and analysis. Also, do you have a rough idea of the computational complexity? And how does it reconcile with the support vector regression that is actually being used?

3) My original thought was that to derive a confidence interval for VMAF scores, there are two pieces of uncertainty one needs to quantify: (a) the confidence of the subjective scores -- this was listed as future work in the arXiv paper; I have some ideas but haven't had the bandwidth to execute them; and (b) the confidence of the SVR prediction -- I googled the literature and the problem seems non-trivial. Again, if you have any suggestions, please let me know.

mht13 commented 7 years ago

@li-zhi

First off, I run a startup on content-adaptive video compression (fastech.io). There are many answers to these questions in our tech report/white paper. We are redoing many of those experiments and have more data than what was written at the time. Feel free to email me at tiglio@fastech.io

With regards to your questions, these are my thoughts:

1) I am not sure that I understand your question, but let me give it a try. I am not suggesting binning the scores, but rather at least doing a proper multinomial analysis (I can help out with that; I know how to do it). The subtlety is, if you have multiple scores, which are exclusive, how to establish confidence intervals for each of them. Continuous scores require a lot of data to establish confidence intervals, so I would stick to the 1-5 or 1-9 score system. We can discuss this further if it is something that interests you (a rough sketch of what I mean is at the end of this comment).

2) Again, I am not sure of what the question is. In my opinion, you would ideally want to propagate all the errors properly from the raw scores. Now, doing that properly is a lot of work and highly non trivial. So, to start with I would use the processed scores and do a KDE (here you want to use a compact support kernel, not the standard Gaussian kernels), and there is the issue of the width of the kernel that you choose. But that is more or less known.

The convergence tests, in my mind, should be standard ones. You choose a norm (which should be consistent with your metric) and do self-convergence tests. Namely, you take your highest-quality data (you already did this kind of analysis in your preprint on the arXiv) and look at the self-differences; they should go down, and a rough estimate of your errors is the difference between the last two sets. While doing this you might want to use a weighted norm which is essentially the VMAF metric (you have to turn that into a weight, but I think it is doable).

About computational complexity, it is really cheap, since you would be doing that offline anyway and you have the data already.

Now, again, if you want to do end-to-end error propagation you can do that. It is a lot of work, starting with the SVM and data fusion, but doable (and I think it is worthwhile).

3) I would suggest not going into every step, because that will be a nightmare. Instead, you can take the final VMAF scores and do a self-convergence test. That is easier than it appears. First, you need to turn your statistics into continuous functions. The KDE will have some error, which is not too difficult to quantify. Then you do a classical numerical-analysis self-convergence test.
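
Regarding 1), a very rough sketch of the multinomial analysis I have in mind (hypothetical counts of how many observers gave each score to one clip, using statsmodels' simultaneous confidence intervals):

```python
import numpy as np
from statsmodels.stats.proportion import multinomial_proportions_confint

# Hypothetical counts of observers giving each score 1..5 to a single clip.
counts = np.array([1, 3, 10, 18, 8])

# Simultaneous 95% confidence intervals for the five category proportions.
ci = multinomial_proportions_confint(counts, alpha=0.05, method="goodman")

for score, (lo, hi) in enumerate(ci, start=1):
    print(f"score {score}: proportion in [{lo:.3f}, {hi:.3f}]")
```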

Please let me know if you want me to elaborate on any of these points.

Cheers

li-zhi commented 7 years ago

About computational complexity, it is really cheap, since you would be doing that offline anyway and you have the data already.

I am confused: shouldn't we have a prediction interval for each VMAF score calculated? Why should this be offline?

mht13 commented 7 years ago

It is done offline anyway (though now there is a possibility of training your own set).

You provide a default metric. That is being done offline. If you would like to have CI online, that is a whole new discussion.

mht13 commented 7 years ago

Let me clarify this a little bit, since I realize I might have been confusing. You are building a model offline anyway, with the expectation (or actual realization) that when people use it online it will be trustworthy (training vs. validation). So you might as well add confidence intervals within the model. Essentially, my suggestion is that you provide error bars with the model, you validate it, and that's it. Nobody (I think) expects you to provide CIs online. For example, if one person runs one instance of VMAF, you cannot get any sense of a CI from one point. So you might as well do it offline and validate it.

You have some spreads already in the plots that you show vs. MOS. Just quantify them, and that's it. You can do it in an agnostic way with the final model, as I suggested. Doing it step by step in every process and coming up with rigorous CIs would be quite some work. But given that you have a model, why not add some CIs in an agnostic way? That is not expensive at all; the most expensive parts are actually doing the perceptual tests, training the model and, well, actually computing it online.

nandakd commented 7 years ago

Great discussion! I strongly agree with the general need here - which is better quantifying VMAF accuracy and spread, since the correlation metrics tell only a small part of the story.

Assume the subjective scores are the golden reference (either using the joint estimation technique in your paper or conventional mean, z-score, subject rejection methods). If we consider VMAF as a prediction of MOS scores, how do we accurately describe the relationship between VMAF and MOS?

Initial thoughts - use the best-fit regression equation (linear/poly/exp/log) and the standard error of the regression to calculate the 95% prediction interval (95% confidence that the prediction will lie within this interval). However, is the accuracy of VMAF uniform across its entire range? For instance, I would assume higher accuracy in the middle ranges and lower accuracy in the low range. How do we capture this?
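
For example, a rough sketch of that idea with synthetic data (a real analysis would of course use the actual VMAF/MOS pairs):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for (VMAF, MOS) pairs.
vmaf = rng.uniform(20, 100, size=300)
mos = 1 + 4 * (vmaf / 100) + rng.normal(0, 0.25, size=300)

# Linear fit MOS ~ VMAF and a global 95% prediction interval from the
# standard error of the regression residuals.
coeffs = np.polyfit(vmaf, mos, deg=1)
resid = mos - np.polyval(coeffs, vmaf)
s = resid.std(ddof=2)
print(f"global 95% prediction interval: +/- {1.96 * s:.3f} MOS")

# Check whether accuracy is uniform across the range:
# residual spread per VMAF bin.
bins = np.linspace(20, 100, 5)
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (vmaf >= lo) & (vmaf < hi)
    print(f"VMAF [{lo:.0f}, {hi:.0f}): residual std = {resid[mask].std():.3f}")
```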

Thoughts?

christosbampis commented 5 years ago

Before closing this issue (it hasn't been active for some time now): the uncertainty portion of VMAF predictions has recently been addressed (to some extent) with the addition of confidence intervals. Please find more details about it in the references section.
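
For reference, the general idea is to train multiple models on resampled versions of the training data and report the spread of their predictions. A minimal illustration of that idea with synthetic data (not the actual VMAF implementation):

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Synthetic stand-in for (elementary features, subjective score) training pairs.
X = rng.uniform(0.0, 1.0, size=(200, 4))
y = X @ np.array([2.0, 1.0, 0.5, 0.25]) + rng.normal(0.0, 0.1, size=200)

def bootstrap_predict(X, y, x_new, n_models=20):
    """Train SVR models on bootstrap resamples and collect their predictions."""
    preds = []
    for _ in range(n_models):
        idx = rng.integers(0, len(y), size=len(y))
        model = SVR(kernel="rbf", C=10.0).fit(X[idx], y[idx])
        preds.append(model.predict(x_new)[0])
    return np.array(preds)

x_new = rng.uniform(0.0, 1.0, size=(1, 4))
preds = bootstrap_predict(X, y, x_new)
lo, hi = np.percentile(preds, [2.5, 97.5])
print(f"prediction {preds.mean():.2f}, 95% interval [{lo:.2f}, {hi:.2f}]")
```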