Netflix / vmaf

Perceptual video quality assessment based on multi-method fusion.
Other

Corner case is giving wrong VMAF and PSNR values! #371

Closed: waveletbeam closed this issue 3 years ago

waveletbeam commented 4 years ago

We tested two different VMAF versions under Windows. For the corner case where the reference video and the video under test (distorted video) are the same, we expected VMAF values of 100 for each frame and PSNR values around 116 dB (8-bit). But we got VMAF values between 97 and 100 and a constant PSNR value of 60 dB (for 10-bit files we got a PSNR value of 72 dB).

http://projekte.waveletbeam.com/VMAF_v2.jpg http://projekte.waveletbeam.com/test8Bit_v2.csv

For PSNR, maybe the formula 10Log(MAX/ROOT(MSE)) is used instead of 20Log(MAX/ROOT(MSE)).
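To make the two candidate formulas concrete, here is a minimal Python sketch (the function names are illustrative only, not part of VMAF). The standard form 20·log10(MAX/√MSE) is the same as 10·log10(MAX²/MSE); the suspected 10·log10(MAX/√MSE) variant always yields exactly half of it.

```python
import math

def psnr_standard(mse: float, max_val: float = 255.0) -> float:
    """Standard PSNR: 20*log10(MAX/sqrt(MSE)), i.e. 10*log10(MAX^2/MSE)."""
    return 20.0 * math.log10(max_val / math.sqrt(mse))

def psnr_suspected(mse: float, max_val: float = 255.0) -> float:
    """The variant suspected in the comment above: 10*log10(MAX/sqrt(MSE))."""
    return 10.0 * math.log10(max_val / math.sqrt(mse))

for mse in (0.05, 1.0, 10.0):
    print(f"MSE={mse}: standard={psnr_standard(mse):.2f} dB, "
          f"suspected={psnr_suspected(mse):.2f} dB")
```

Because the variant is exactly half the standard value for every MSE, comparing a few reported numbers against both formulas is a quick way to tell which one a tool uses.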

You also can double check the results using the Netflix clip checkerboard_1920-1080_10_3_0_0.yuv, which is located in the VMAF test folder.

Any help or opinions?

li-zhi commented 4 years ago

For VMAF, please refer to the FAQ below: https://github.com/Netflix/vmaf/blob/master/FAQ.md#q-when-i-compare-a-video-with-itself-as-reference-i-expect-to-get-a-perfect-score-of-vmaf-100-but-what-i-see-is-a-score-like-987-is-there-a-bug

For PSNR, we force the value to be capped at 60dB for 8-bit representation and 72dB for 10-bit representation. This can be justified by the PSNR formula and the mean squared error caused when an arbitrary signal is represented by quantizing to x bits.
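A minimal sketch of how such a cap could be applied, assuming the cap is 6·N + 12 dB for bit depth N (which reproduces the 60 dB and 72 dB values quoted above). The `capped_psnr` helper is hypothetical and is not libvmaf's implementation; it only illustrates that identical frames (MSE = 0, uncapped PSNR = ∞) land exactly on the cap.

```python
import math
from typing import Sequence

def capped_psnr(ref: Sequence[int], dis: Sequence[int], bit_depth: int = 8) -> float:
    """PSNR between two equally sized sample lists, clamped to an assumed 6*N + 12 dB cap."""
    if len(ref) != len(dis) or not ref:
        raise ValueError("inputs must be non-empty and the same length")
    max_val = (1 << bit_depth) - 1          # 255 for 8-bit, 1023 for 10-bit
    cap = 6.0 * bit_depth + 12.0            # assumed rule-of-thumb cap
    mse = sum((r - d) ** 2 for r, d in zip(ref, dis)) / len(ref)
    if mse == 0:
        return cap                          # uncapped PSNR would be infinite
    return min(10.0 * math.log10(max_val ** 2 / mse), cap)

frame = [16, 128, 235, 64]
print(capped_psnr(frame, frame))                # 60.0
print(capped_psnr(frame, frame, bit_depth=10))  # 72.0
```

With a cap like this, any two inputs whose true PSNR exceeds the cap report the same value, which is exactly the constant 60 dB / 72 dB behaviour described in the issue.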

waveletbeam commented 4 years ago

This gives us a pretty good understanding of the use cases for VMAF. This corner case tells us that the precision of VMAF is, at best, 3 VMAF points, which is already 50% of a visual difference (6 VMAF points).

This has to be taken into account especially for encoder comparisons!

Additionally, PSNR is capped at 60 dB. From a technical point of view, there is no reason why PSNR values above 60 dB (8-bit) couldn't be calculated. The downside is that VMAF in this version can only be used for the distribution of highly compressed video. But for overall video quality, the quality and frequency range of the ‘Intermediate Distribution Master’ should also be monitored.

A definition of ‘video resolution’ is also still missing. For distribution, the number of video lines and a VMAF (SSIM, PSNR) value give us confidence that the quality is acceptable for most viewing conditions. But video quality and resolution are described by a much richer range of parameters, which are also important for the encoding ladder. If the resolution of the Intermediate Distribution Master has already been reduced, e.g. by overly compressed video material, the VMAF values along the distribution path will be affected.

VMAF is tuned to current encoder technology. New encoder technologies like AV1 will provide mandatory tools like ‘Film Grain Synthesis’, which reflect the nature of video, the film look and the transmission channel much more closely.

li-zhi commented 4 years ago

For confidence of VMAF prediction, a better way would be to use bootstrapping to quantify the 95% confidence interval of each prediction. See this page for more information: https://github.com/Netflix/vmaf/blob/master/resource/doc/conf_interval.md

For PSNR, to see why we use 60 dB to cap 8 bit and 72 dB to cap 10 bit, the rule-of-thumb formula is (6 * N + 12), where N is the bit depth. To be more precise, the formula is: 10 * log10( (2^N-1)^2 / (1/12)), where (1/12) is the MSE of uniform noise within [0, 1].
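Both formulas from the comment above can be evaluated directly. A small Python sketch (function names are illustrative) comparing the exact quantization-noise limit with the 6·N + 12 rule of thumb:

```python
import math

def rule_of_thumb_cap(n_bits: int) -> float:
    """Rule of thumb quoted in the thread: 6*N + 12 dB."""
    return 6.0 * n_bits + 12.0

def quantization_limit_db(n_bits: int) -> float:
    """10*log10((2^N - 1)^2 / (1/12)): PSNR when the MSE equals uniform quantization noise."""
    max_val = (1 << n_bits) - 1
    return 10.0 * math.log10(max_val ** 2 / (1.0 / 12.0))

for n in (8, 10, 12):
    print(f"{n}-bit: exact {quantization_limit_db(n):.2f} dB, "
          f"rule of thumb {rule_of_thumb_cap(n):.0f} dB")
```

This prints 58.92, 70.99 and 83.04 dB for the exact formula against 60, 72 and 84 dB for the rule of thumb. The exact value is close to 6.02·N + 10.79 dB, so the rule of thumb sits about 1 dB above it.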

waveletbeam commented 4 years ago

PSNR: capped at 60dB:
But why is it necessary to cap at 60 dB based on this noise estimate? By the way, noise levels can be much higher. After denoising the Netflix Meridian demo content, we got a PSNR value below 50 dB, while the picture quality was enhanced.
http://projekte.waveletbeam.com/Meridian_1080p.jpg

li-zhi commented 4 years ago

Note that the uniform noise is the quantization noise resulting from the 8-bit representation. It has nothing to do with film grain or camera noise in the source. The 60 dB can be thought of as the fundamental limit of what an 8-bit representation can give you. Anything beyond 60 dB is not sensible.

st599 commented 4 years ago

60 seems a bit high for 8 bit video.

waveletbeam commented 4 years ago

Even for lossy video I got values above 60 dB in some cases. Today, IMF with lossy J2K is used for the Intermediate Distribution Master, but with AV1 this will change, because you will also need the uncompressed grain for the 'film grain synthesis' tool. So, as I see it today, there is no need for capping at 60 dB; moreover, this threshold limits the use cases of VMAF. Lossy J2K artefacts: http://projekte.waveletbeam.com/JPEG2000_Lossy.jpg

st599 commented 4 years ago

More than 60 is meaningless for 8 bit video.

8 bit quantisation means you can't have a PSNR of greater than 58.9 dB.

waveletbeam commented 4 years ago

The quantization already happened much earlier in the workflow, in the camera (14/16 bit), and later when we downsample from 12/10 bit to 8 bit. At the distribution (encoding) stage we have 8- or 10-bit values per channel, and there is an error based on this quantization, which is part of the noise and grain amplitudes, but you can definitely get values above 100 dB for 8 bpc images. Maybe you are only using a limited color range?

st599 commented 4 years ago

By definition, you can not improve the channel performance beyond the quantisation noise limit. Therefore the maximum valid PSNR is the quantisation noise limit. This is easily calculable.

For a value above that PSNR value to be valid, you'd need to have a non-uniform distribution of quantisation noise in the initial A/D conversion.

waveletbeam commented 4 years ago

The removal of video noise and grain will lead to PSNR values between 113dB and 58dB. Even with an additional lossy compression with high bitrates you will get PSNR values above 60 dB. For PSNR values below 60dB the noise and grain structure has been totally destroyed. For video distribution this matters!

st599 commented 4 years ago

I repeat: By definition, you can not improve the channel performance beyond the quantisation noise limit. Therefore the maximum valid PSNR is the quantisation noise limit. This is easily calculable.

Any number above this is not valid.

waveletbeam commented 4 years ago

I think at this point we can stop the conversation! Thank you for sharing your opinion.

st599 commented 4 years ago

It's not an opinion.

I've never seen a text discussing the derivation that doesn't mention this principle. It's an easy mathematical proof.

waveletbeam commented 4 years ago

In that case you have to provide the mathematical proof; otherwise it's only your opinion.

st599 commented 4 years ago

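The mean squared error of a single quantisation step of size $\Delta$, with the error $e$ uniformly distributed over the step, is given by:

$$\mathrm{MSE} = \frac{1}{\Delta}\int_{-\Delta/2}^{\Delta/2} e^{2}\,de$$

which simplifies to:

$$\mathrm{MSE} = \frac{\Delta^{2}}{12}$$

If you set the step size to 1 (as the equation is expecting 256 levels, the smallest step of which is 1) and substitute this into the PSNR equation:

$$\mathrm{PSNR} = 10\log_{10}\frac{(2^{N}-1)^{2}}{1/12}$$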

which for 8 bits solves to 58.92261 dB

If you get a number higher than this, then you're effectively saying the process under test is more accurate than the quantisation step size of the video. This is not possible as the information is lost at the stage it is quantised. You can have a perfectly accurate process, but this can not be more accurate than the quantisation allows.
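The 1/12 figure can be checked numerically. A small Python sketch, assuming the "analog" input is uniformly distributed over the 8-bit range and is quantized by rounding with a step size of 1; `quantization_mse` is an illustrative name. The empirical MSE comes out close to 1/12 ≈ 0.0833.

```python
import random

def quantization_mse(num_samples: int = 200_000, seed: int = 0) -> float:
    """Empirical MSE of rounding uniformly distributed real values to the nearest integer."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x = rng.uniform(0.0, 255.0)   # continuous value within the 8-bit range
        err = x - round(x)            # error introduced by quantizing with step size 1
        total += err * err
    return total / num_samples

print(f"empirical MSE = {quantization_mse():.5f}, 1/12 = {1 / 12:.5f}")
```

Note that this models the error of quantizing a continuous signal; it says nothing about the MSE between two signals that are both already quantized, which is the point of contention further down the thread.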

waveletbeam commented 4 years ago

Now I see the problem. It's your MSE calculation. The lowest mean squared error is not 1/12.


The lowest mean squared error can be calculated by changing the brightness of only one pixel by one code value:

For 8 bit (averaged over three 1920x1080 color planes): MSE = 1/(3*1920*1080) ≈ 0.00000016, ROOT(MSE) ≈ 0.000400938, PSNR = 20Log(255/ROOT(MSE)) ≈ 116.1 dB
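A sketch of this one-pixel corner case in Python; `psnr_one_code_value_change` is a hypothetical helper. It assumes the 116.1 dB figure comes from averaging the single squared error over three full-resolution 1920x1080 planes; for a single plane (e.g. luma only) the same change gives about 111.3 dB.

```python
import math

def psnr_one_code_value_change(width: int, height: int, planes: int = 1,
                               max_val: int = 255) -> float:
    """PSNR when exactly one sample in a width x height x planes image differs by 1."""
    num_samples = width * height * planes
    mse = 1.0 / num_samples   # a single squared error of 1, averaged over all samples
    return 20.0 * math.log10(max_val / math.sqrt(mse))

print(f"single plane:        {psnr_one_code_value_change(1920, 1080):.1f} dB")
print(f"three planes (4:4:4): {psnr_one_code_value_change(1920, 1080, planes=3):.1f} dB")
```

The result depends on the frame size and on how many planes the error is averaged over, so the value scales with resolution rather than being a fixed property of the bit depth.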

By the way, quantization noise is only one source of all the following possibilities:

Additionally, you have grain if you capture 35mm film.

Here are some real world examples:

Wavelet Beam noise management and lossless compression:

n:467 mse_avg:0.09 mse_y:0.05 mse_u:0.04 mse_v:0.30 psnr_avg:58.56 psnr_y:61.14 psnr_u:61.77 psnr_v:53.35
n:468 mse_avg:0.05 mse_y:0.07 mse_u:0.02 mse_v:0.01 psnr_avg:61.10 psnr_y:59.73 psnr_u:65.00 psnr_v:70.38
n:469 mse_avg:0.19 mse_y:0.28 mse_u:0.00 mse_v:0.00 psnr_avg:55.41 psnr_y:53.66 psnr_u:inf psnr_v:74.21
n:470 mse_avg:0.02 mse_y:0.03 mse_u:0.00 mse_v:0.00 psnr_avg:64.83 psnr_y:63.09 psnr_u:inf psnr_v:80.49
n:471 mse_avg:0.05 mse_y:0.08 mse_u:0.00 mse_v:0.00 psnr_avg:60.82 psnr_y:59.06 psnr_u:99.26 psnr_v:93.82
n:472 mse_avg:0.00 mse_y:0.00 mse_u:0.00 mse_v:0.00 psnr_avg:76.43 psnr_y:74.68 psnr_u:inf psnr_v:96.25

Wavelet Beam noise management and lossy compression (H264, 1920x1080 @ 7.4 Mbit/s):

n:120 mse_avg:0.11 mse_y:0.15 mse_u:0.04 mse_v:0.04 psnr_avg:57.69 psnr_y:56.50 psnr_u:61.87 psnr_v:62.11
n:121 mse_avg:0.11 mse_y:0.14 mse_u:0.04 mse_v:0.04 psnr_avg:57.86 psnr_y:56.70 psnr_u:61.78 psnr_v:62.05
n:122 mse_avg:0.11 mse_y:0.15 mse_u:0.04 mse_v:0.04 psnr_avg:57.68 psnr_y:56.52 psnr_u:61.80 psnr_v:61.80
n:123 mse_avg:0.12 mse_y:0.17 mse_u:0.04 mse_v:0.04 psnr_avg:57.17 psnr_y:55.93 psnr_u:61.78 psnr_v:61.88
n:124 mse_avg:0.12 mse_y:0.16 mse_u:0.04 mse_v:0.04 psnr_avg:57.45 psnr_y:56.21 psnr_u:61.99 psnr_v:62.44
n:125 mse_avg:0.11 mse_y:0.15 mse_u:0.05 mse_v:0.04 psnr_avg:57.61 psnr_y:56.45 psnr_u:61.43 psnr_v:62.03
n:126 mse_avg:0.11 mse_y:0.14 mse_u:0.05 mse_v:0.05 psnr_avg:57.71 psnr_y:56.60 psnr_u:61.46 psnr_v:61.52
n:127 mse_avg:0.11 mse_y:0.15 mse_u:0.05 mse_v:0.05 psnr_avg:57.55 psnr_y:56.44 psnr_u:61.20 psnr_v:61.50
n:128 mse_avg:0.11 mse_y:0.14 mse_u:0.05 mse_v:0.05 psnr_avg:57.63 psnr_y:56.54 psnr_u:61.25 psnr_v:61.39
n:129 mse_avg:0.11 mse_y:0.15 mse_u:0.05 mse_v:0.05 psnr_avg:57.54 psnr_y:56.43 psnr_u:61.14 psnr_v:61.54
n:130 mse_avg:0.11 mse_y:0.15 mse_u:0.05 mse_v:0.04 psnr_avg:57.62 psnr_y:56.49 psnr_u:61.34 psnr_v:61.75
n:131 mse_avg:0.13 mse_y:0.17 mse_u:0.05 mse_v:0.05 psnr_avg:57.04 psnr_y:55.87 psnr_u:61.08 psnr_v:61.33
n:132 mse_avg:0.12 mse_y:0.16 mse_u:0.05 mse_v:0.05 psnr_avg:57.31 psnr_y:56.17 psnr_u:61.08 psnr_v:61.45
n:133 mse_avg:0.11 mse_y:0.15 mse_u:0.05 mse_v:0.04 psnr_avg:57.59 psnr_y:56.50 psnr_u:60.95 psnr_v:61.63
n:134 mse_avg:0.11 mse_y:0.14 mse_u:0.05 mse_v:0.05 psnr_avg:57.60 psnr_y:56.52 psnr_u:61.19 psnr_v:61.31
n:135 mse_avg:0.11 mse_y:0.15 mse_u:0.05 mse_v:0.05 psnr_avg:57.57 psnr_y:56.47 psnr_u:61.02 psnr_v:61.50
n:136 mse_avg:0.11 mse_y:0.14 mse_u:0.05 mse_v:0.05 psnr_avg:57.78 psnr_y:56.73 psnr_u:60.88 psnr_v:61.51
n:137 mse_avg:0.12 mse_y:0.15 mse_u:0.05 mse_v:0.05 psnr_avg:57.44 psnr_y:56.33 psnr_u:61.20 psnr_v:61.25
n:138 mse_avg:0.12 mse_y:0.16 mse_u:0.05 mse_v:0.05 psnr_avg:57.26 psnr_y:56.16 psnr_u:60.98 psnr_v:60.94
n:139 mse_avg:0.12 mse_y:0.15 mse_u:0.05 mse_v:0.05 psnr_avg:57.33 psnr_y:56.24 psnr_u:60.77 psnr_v:61.21
n:140 mse_avg:0.13 mse_y:0.18 mse_u:0.05 mse_v:0.05 psnr_avg:56.88 psnr_y:55.70 psnr_u:60.93 psnr_v:61.32
n:141 mse_avg:0.23 mse_y:0.31 mse_u:0.06 mse_v:0.06 psnr_avg:54.59 psnr_y:53.21 psnr_u:60.59 psnr_v:60.58
n:142 mse_avg:1.09 mse_y:1.46 mse_u:0.52 mse_v:0.17 psnr_avg:47.76 psnr_y:46.49 psnr_u:50.99 psnr_v:55.79
n:143 mse_avg:1.07 mse_y:1.43 mse_u:0.52 mse_v:0.15 psnr_avg:47.85 psnr_y:46.57 psnr_u:50.93 psnr_v:56.31

Wavelet Beam noise management and lossy compression (ProRes, 1920x1080):

n:3375 mse_avg:0.19 mse_y:0.19 mse_u:0.19 mse_v:0.19 psnr_avg:55.42 psnr_y:55.45 psnr_u:55.39 psnr_v:55.34
n:3376 mse_avg:0.20 mse_y:0.21 mse_u:0.19 mse_v:0.19 psnr_avg:55.09 psnr_y:54.95 psnr_u:55.41 psnr_v:55.36
n:3377 mse_avg:0.20 mse_y:0.21 mse_u:0.19 mse_v:0.19 psnr_avg:55.07 psnr_y:54.91 psnr_u:55.42 psnr_v:55.41
n:3378 mse_avg:0.19 mse_y:0.20 mse_u:0.19 mse_v:0.19 psnr_avg:55.26 psnr_y:55.17 psnr_u:55.43 psnr_v:55.43
n:3379 mse_avg:0.08 mse_y:0.03 mse_u:0.19 mse_v:0.19 psnr_avg:58.93 psnr_y:63.15 psnr_u:55.43 psnr_v:55.42
n:3380 mse_avg:0.07 mse_y:0.01 mse_u:0.19 mse_v:0.19 psnr_avg:59.58 psnr_y:66.54 psnr_u:55.43 psnr_v:55.43
n:3381 mse_avg:0.06 mse_y:0.00 mse_u:0.19 mse_v:0.19 psnr_avg:60.02 psnr_y:72.07 psnr_u:55.43 psnr_v:55.43
n:3382 mse_avg:0.06 mse_y:0.00 mse_u:0.19 mse_v:0.19 psnr_avg:60.20 psnr_y:86.89 psnr_u:55.43 psnr_v:55.43
n:3383 mse_avg:0.06 mse_y:0.00 mse_u:0.19 mse_v:0.19 psnr_avg:60.20 psnr_y:108.29 psnr_u:55.43 psnr_v:55.43
n:3384 mse_avg:0.06 mse_y:0.00 mse_u:0.19 mse_v:0.19 psnr_avg:60.20 psnr_y:108.29 psnr_u:55.43 psnr_v:55.43
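The per-frame records above use a `key:value` layout that resembles the stats file written by FFmpeg's `psnr` filter (the thread does not say which tool produced them). Their PSNR values appear consistent with an uncapped 10·log10(255²/MSE); for example, mse_y 0.05 gives 61.14 dB. A minimal parsing sketch, with `parse_psnr_stats` as a hypothetical helper, that handles the `inf` entries and makes it easy to count frames above 60 dB:

```python
import math
from statistics import mean
from typing import Dict, List

def parse_psnr_stats(text: str) -> List[Dict[str, float]]:
    """Group whitespace-separated key:value tokens into per-frame records.

    A new record starts at every 'n' key, so records may be one per line or run
    together on one line. 'inf' parses to math.inf; non-numeric tokens are skipped.
    """
    records: List[Dict[str, float]] = []
    for token in text.split():
        key, sep, value = token.partition(":")
        if not sep:
            continue                       # not a key:value token
        try:
            number = float(value)          # float("inf") == math.inf
        except ValueError:
            continue                       # e.g. "7,4Mbit/s:" in surrounding prose
        if key == "n":
            records.append({})
        if records:
            records[-1][key] = number
    return records

sample = """
n:469 mse_avg:0.19 mse_y:0.28 mse_u:0.00 mse_v:0.00 psnr_avg:55.41 psnr_y:53.66 psnr_u:inf psnr_v:74.21
n:470 mse_avg:0.02 mse_y:0.03 mse_u:0.00 mse_v:0.00 psnr_avg:64.83 psnr_y:63.09 psnr_u:inf psnr_v:80.49
"""
frames = parse_psnr_stats(sample)
finite_y = [f["psnr_y"] for f in frames if math.isfinite(f["psnr_y"])]
print(f"{len(frames)} frames, mean psnr_y = {mean(finite_y):.2f} dB")
print("frames with psnr_y above 60 dB:", [int(f["n"]) for f in frames if f["psnr_y"] > 60])
```

Filtering out `inf` before averaging matters: a single identical plane would otherwise make the mean infinite.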

waveletbeam commented 4 years ago

Please help! PSNR values above 60 dB would be helpful for a lot of use cases where we would like to measure the overall video quality through the whole distribution chain. This includes the IMF files and production footage.

ruben-ar14-mons commented 4 years ago

First of all: what tools did you use for the part 'Wavelet Beam noise management and lossless compression'?

Secondly, most encoders (or encoder configs) use 10 bits internally, so by default the cap is at least 70.98-something dB. Thus it makes sense that you get values over 60 dB. But I've never encountered a higher PSNR in an encoder log file. You should double-check them. So how many bits does your encoder use internally?

I'm really confused by the formulas you each gave; they both make sense to me to a certain degree.

waveletbeam commented 4 years ago

Hello Ruben, you have to dig deeper into the math of PSNR calculation in conjunction with image processing. You can easily start by using Excel. It's fun, and once you have pushed different values into the matrix, this knowledge is gold dust :) I hope this helps. Best, Dirk

ruben-ar14-mons commented 4 years ago

Of course, with the formula you provided, you would get 0 dB if, for example, one frame is black and the other is white, and similarly inf dB for pictures that are identical. But one PSNR 'measurement' alone is meaningless. So could you elaborate on why you would need such high values (and which tools you used for the stated 'real world examples')?

waveletbeam commented 4 years ago

PSNR is still my first choice if I have to evaluate picture quality. PSNR values > 50 dB will appear if we are comparing the original picture versus e.g. the denoised picture, or if we are evaluating intermediate formats like IMF. Additionally, we are using VMAF through the whole workflow and not only for distribution (glass-to-glass). There are some tools out there that also calculate PSNR values, like FFmpeg. We have developed our own tools for Matlab, C++ and CUDA.

ruben-ar14-mons commented 4 years ago

Regarding "we are comparing the original picture verses e.g. the denoised picture": that is still only one PSNR value! What do you compare it to?

Did you encounter, with ffmpeg, higher values than the capped values (60 dB for 8 bit, 72 dB for 10 bit) given by @st599's formulas?

li-zhi commented 3 years ago

I have added a FAQ at: https://github.com/Netflix/vmaf/blob/master/FAQ.md#q-why-are-the-psnr-values-capped-at-60-db-for-8-bit-inputs-and-72-db-for-12-bit-inputs-in-the-packages-implementation

Hopefully this addresses the issue.