Netflix / vmaf

Perceptual video quality assessment based on multi-method fusion.

Darkening an image boosts the VMAF score #1102

Closed jonsneyers closed 1 year ago

jonsneyers commented 2 years ago

I tried this on this test image: https://jon-cld.s3.amazonaws.com/test_images/reference/011.png but it will probably work with any image.

convert test.ppm -quality 75 test-75.jpg
convert test.ppm -quality 60 test-60.jpg
convert test.ppm -quality 75 -gamma 0.97 test-75-darker.jpg
convert test.ppm -quality 60 -gamma 0.97 test-60-darker.jpg
convert test.ppm -gamma 0.97 test-darker.ppm
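For context: ImageMagick's -gamma G maps each normalized channel value u to u**(1/G), so G = 0.97 gives an exponent slightly above 1 and darkens every value strictly between pure black and pure white. A minimal Python sketch of that mapping (my reading of the convention, not code from this thread):

```python
# Sketch of ImageMagick-style gamma on a normalized channel value:
# out = in ** (1 / gamma). With gamma = 0.97 the exponent is ~1.031,
# so every value strictly between 0 and 1 gets slightly smaller (darker).

def apply_gamma(u: float, gamma: float) -> float:
    """Map a channel value u in [0, 1] through ImageMagick-style gamma."""
    return u ** (1.0 / gamma)

for u in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(f"{u:.2f} -> {apply_gamma(u, 0.97):.4f}")
```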

Computing VMAF scores with test.ppm as reference image gives the following results:

test-75.jpg: vmaf = 94.478783
test-60.jpg: vmaf = 93.050756
test-75-darker.jpg: vmaf = 96.275693
test-60-darker.jpg: vmaf = 94.705308
test.ppm: vmaf = 97.427999
test-darker.ppm: vmaf = 99.373649

It is hard for me to take a metric seriously if it behaves like this. First of all, I would expect the reference image when compared to itself to get a 'perfect' score of 100. More importantly, I would expect any distortion, including slightly darkening the image, to result in a score that is lower than when not applying this distortion.

This behavior implies that any encoder can very easily "optimize for VMAF" by simply making the input image darker before encoding it. As you can see, a q60 jpeg can get a higher score than a q75 jpeg this way, while it of course doesn't look any better.

It would be interesting to know what is causing this behavior and to see if there is a way to mitigate this.

li-zhi commented 2 years ago

Can you also provide the commands you used to compute VMAF?


jonsneyers commented 2 years ago

This was with the default model, and after converting the ppm files to 4:4:4 y4m using ffmpeg.
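A pipeline along these lines matches that description (a sketch with assumed flags, not the exact commands used; requires an ffmpeg build with libvmaf enabled):

```shell
# Hypothetical reconstruction of the scoring pipeline; filenames taken
# from the commands above, flags are assumptions.
ffmpeg -i test.ppm -pix_fmt yuv444p test.y4m
ffmpeg -i test-darker.ppm -pix_fmt yuv444p test-darker.y4m
# For the libvmaf filter the distorted input comes first, the reference
# second; with no options it uses the default VMAF model.
ffmpeg -i test-darker.y4m -i test.y4m -lavfi libvmaf -f null -
```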

With the NEG model things are somewhat better: the *-darker.jpg images still get a slightly higher score than their non-darkened counterparts, but test-darker.ppm no longer gets a higher score than the original.

I now understand that part of this behavior (in the default model) is intentional as VMAF apparently considers the darkening to be an "enhancement gain". It was not clear to me when opening this issue that the default model of VMAF is not aiming to be a full reference fidelity metric but rather some kind of appeal-oriented metric where "better than the original" is an actual possibility. Perhaps it would be good to point this out somewhat more prominently (e.g. in the README of this repo) so people don't accidentally assume it is a fidelity metric that can be used to evaluate encoders.

li-zhi commented 2 years ago

This is indeed what differentiates VMAF from a traditional "fidelity" metric such as PSNR or SSIM: it responds to "enhancement" operations like contrast enhancement and sharpening. A more detailed discussion can be found in this memo.

To make sure I fully understand the -gamma 0.97 operation: I assume it applies a gamma correction with exponent 0.97, which is concave. That means it stretches the darker pixels in the image and compresses the brighter ones. Since the image tested is a night shot, it has mostly dark pixels, so the operation amounts to a "stretch" (or "contrasting"), which would explain the "enhancement". Is my understanding correct?
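Under that interpretation (a direct exponent of 0.97, which is an assumption about the convention), the mapping is concave: its slope is largest near black, so differences between dark values are stretched while differences between bright values are compressed. A small sketch illustrating the slope difference:

```python
# Concave mapping u -> u ** 0.97 (the interpretation assumed above).
# Its derivative 0.97 * u**(-0.03) exceeds 1 near u = 0 and dips below 1
# near u = 1: shadow contrast is stretched, highlight contrast compressed.

def concave_gamma(u: float, exponent: float = 0.97) -> float:
    return u ** exponent

def local_slope(f, u: float, h: float = 1e-4) -> float:
    """Central-difference estimate of f'(u)."""
    return (f(u + h) - f(u - h)) / (2 * h)

dark = local_slope(concave_gamma, 0.05)    # slope in the shadows
bright = local_slope(concave_gamma, 0.95)  # slope in the highlights
print(f"slope near 0.05: {dark:.3f}, slope near 0.95: {bright:.3f}")
```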

jonsneyers commented 2 years ago

Yes, -gamma 0.97 causes a slight darkening. Testing on other images, it looks like a slight darkening usually improves the VMAF score relative to the original, while on some images a slight brightening improves it instead. It makes me wonder how VMAF decides what kind of adjustment counts as an enhancement: I am testing on pristine images that presumably already look exactly the way they are supposed to look, so it is surprising that any manipulation of their colors is considered an enhancement rather than an error.

In any case, color balance is presumably an artistic choice made by a photographer, and in some cases (e.g. e-commerce) it is important to reproduce colors as accurately as possible, rather than distorted in some way that VMAF considers an "enhancement". If you sell an off-white dress and the product photo gets shifted to a quite different "enhanced" color that VMAF deems more pleasing, you end up with returned items and dissatisfied customers because the photo on the website was misleading.

I discovered this behavior because I was accidentally comparing images that are in different colorspaces (different transfer functions), that information having been lost in the conversion to y4m. Doing the comparison in the correct colorspace (i.e. the same one for reference and distorted) caused scores to drop compared to doing it incorrectly, where the distorted images were effectively slightly darkened by being interpreted with the wrong transfer curve. This caused quite a bit of confusion on my end: I assumed the higher scores were the correct ones and the lower scores were caused by a colorspace issue, but digging deeper it turned out to be exactly the other way around, and VMAF just happened to like (at least on average) the darkening effect of applying an incorrect transfer curve.

igv commented 2 years ago

This is indeed what differentiates VMAF from a traditional "fidelity" metric such as PSNR or SSIM: it responds to "enhancement" operations like contrast enhancement and sharpening.

DLM especially overestimates "enhancement" (as does the wavelet version of VIF).

@jonsneyers Try VIFp alone (this is what is used in VMAF, with some slight adjustments/improvements). Btw, I found that its rate-distortion plot is almost identical to your SSIMULACRA.

igv commented 1 year ago

@jonsneyers a little off-topic: I just saw your "Hall of shame" image gallery and want to ask if you have tried PieAPP. In my opinion it is the most accurate metric for any type of distortion (lossy compression, super-resolution, denoising, ...).

nilfm commented 1 year ago

Closing this issue as part of a clean-up. Please feel free to re-open for further questions or discussion.