SmilingWolf / VQMTCrossVal

Cross validation for objective quality metric measurement tools on multiple public datasets

p-norm selection #1

Open jyrkialakuijala opened 4 years ago

jyrkialakuijala commented 4 years ago

Butteraugli's default error field aggregation is to take the max. For other metrics the default is the 2-norm (square root of the sum of squares). Max works better at the highest quality settings, but at lower qualities a lower norm is more favorable, since the total surface area of the spoilage relates to the user experience.

The butteraugli in the JPEG XL reference codec uses an equal-parts mixture of the 3rd, 6th and 12th norms, and might be more interesting to test against the reference corpora than the stock butteraugli in github.com/google/butteraugli.
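
For readers unfamiliar with the terminology, here is a minimal sketch of p-norm aggregation of a per-pixel error field, including an equal-parts mixture in the spirit of the 3/6/12 combination mentioned above (NumPy illustration only; the actual normalization and weighting in the JPEG XL codec may differ):

```python
import numpy as np

def pnorm_aggregate(error_field, p):
    """Collapse a per-pixel error field into one score with a (mean-normalized) p-norm.

    p = 2 is the usual root-mean-square style aggregation;
    as p grows the result approaches the max of the field.
    """
    flat = np.abs(np.asarray(error_field, dtype=np.float64)).ravel()
    if np.isinf(p):
        return float(flat.max())
    return float(np.mean(flat ** p) ** (1.0 / p))

def mixed_pnorm(error_field, ps=(3, 6, 12)):
    """Equal-parts mixture of several p-norms (illustrative only)."""
    return float(np.mean([pnorm_aggregate(error_field, p) for p in ps]))
```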

It might be equally interesting to optimize the p-norm for every metric. I have quite substantial non-published personal evidence that the 2-norm is just too low, and ramping the norm up a bit might be a rather non-controversial way to improve the psychovisual metrics.

SmilingWolf commented 4 years ago

Thanks for the heads up. I did run the new Butteraugli from the JPEG XL repo on the KADID10k dataset as soon as I got it to compile, and my results are the following:

I will soon edit in some numbers for the above claims.

KADID10k:

JPEG+JPEG2K:

| Metric | SROCC |
| --- | --- |
| DSSIM | -0.913147 |
| SSIMULACRA | -0.893521 |
| Butteraugli | -0.896065 |
| Butteraugli_XL | -0.902046 |
| Butteraugli_XL_3n | -0.929330 |

JPEG+JPEG2K+GBlur:

| Metric | SROCC |
| --- | --- |
| DSSIM | -0.910477 |
| SSIMULACRA | -0.891608 |
| Butteraugli | -0.910774 |
| Butteraugli_XL | -0.914425 |
| Butteraugli_XL_3n | -0.927194 |

Overall:

| Metric | SROCC |
| --- | --- |
| DSSIM | -0.856177 |
| SSIMULACRA | -0.699555 |
| Butteraugli | -0.735086 |
| Butteraugli_XL | -0.591036 |
| Butteraugli_XL_3n | -0.538231 |
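
For reference, a minimal sketch of how SROCC figures like these can be computed with SciPy (the variable names are hypothetical; the sign is negative because these are dissimilarity metrics, so a higher score means lower quality):

```python
from scipy.stats import spearmanr

# metric_scores: per-image outputs of e.g. DSSIM or Butteraugli (hypothetical array)
# mos: per-image subjective mean opinion scores from the dataset
srocc, _pvalue = spearmanr(metric_scores, mos)
```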

As for the non-published evidence, would you mind posting it somewhere, or even reaching out in private? I'm interested in the results and in eventually adding a few more figures to the tables in README.md.

jyrkialakuijala commented 4 years ago

Related to the non-published evidence: I have ~2500 small image pairs that I have ranked myself. They contain very slight compression artefacts and specially constructed artefacts that psychovisual distance metrics tend to handle badly. I have been hoping to release this corpus externally, but haven't gotten to it yet. On two other private ranked corpora, with larger images and at higher qualities, we optimized the p-norm for aggregation and found the best values to be between 6 and 14.

jyrkialakuijala commented 4 years ago

What if you remove images with MOS < 3.5, MOS < 3, or MOS < 2?

I have tuned butteraugli around MOS 4-5, since I didn't expect a lot of value from compressing completely spoiled images :-]

(assuming a MOS range of [1..5])
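
A minimal sketch of this kind of MOS cutoff, assuming hypothetical per-image `mos` and `metric_scores` arrays:

```python
import numpy as np
from scipy.stats import spearmanr

def srocc_above(mos, metric_scores, threshold):
    """SROCC restricted to images with MOS >= threshold."""
    mos = np.asarray(mos)
    metric_scores = np.asarray(metric_scores)
    keep = mos >= threshold
    return spearmanr(metric_scores[keep], mos[keep]).correlation

# e.g. compare srocc_above(mos, scores, 3.5), srocc_above(mos, scores, 3.0)
# and srocc_above(mos, scores, 2.0) against the full-dataset SROCC
```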

SmilingWolf commented 4 years ago

Now that sounds interesting. I can do that.

jyrkialakuijala commented 4 years ago

FYI: there is some more subtlety to the p-norm. My experience is that strong degradations benefit from a lower p-value (around 2-4). When degradations are only minor and careful, time-consuming inspection was needed to detect them, higher correlations can be obtained with a higher p-value (around 10-20, or even max).

SmilingWolf commented 4 years ago

This seems to open a few analysis possibilities. For example, I could plot the correlation of the various norms as a function of the distance from the highest MOS, starting from the upper 1-2 percentile and expanding all the way to 100% of the dataset, to get an idea of where they cross, whether trends exist, and possibly which norm is the most "stable" one.
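
A sketch of that expanding-cutoff analysis, using the same hypothetical `mos`/`metric_scores` arrays as above and growing the subset from the best-rated images toward the whole dataset:

```python
import numpy as np
from scipy.stats import spearmanr

def srocc_vs_cutoff(mos, metric_scores, steps=50):
    """SROCC on progressively larger subsets, starting from the images with
    the highest MOS and expanding until the whole dataset is included."""
    mos = np.asarray(mos)
    metric_scores = np.asarray(metric_scores)
    order = np.argsort(-mos)                   # best-rated images first
    fractions = np.linspace(0.02, 1.0, steps)  # from the top ~2% to 100%
    curve = []
    for frac in fractions:
        idx = order[: max(3, int(frac * len(mos)))]
        curve.append((frac, spearmanr(metric_scores[idx], mos[idx]).correlation))
    return curve
```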

As a simpler subset of the above, I could make 3 different groups per dataset, considering the lower 50%, the upper 50%, and the overall MOS scores. You're of course more than welcome to suggest any other thresholds you see fit; for example, based on the above it might be better to start with a 30/70% split.

Other ideas?

jyrkialakuijala commented 4 years ago

What you propose sounds just right. I'm waiting for these results like a small child waiting for Santa :-D

SmilingWolf commented 4 years ago

I pushed the files with Butteraugli XL max, 3p-norm and 2,3,6,12-norms to the NormAnalysis folder. In the plotting scripts I've taken the liberty of rescaling the MOS values to [1..5] with a min-max scaler so that they are uniform across all 6 datasets, for the sole purpose of making the plots easier to read and compare across datasets.
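
A minimal sketch of that rescaling, assuming one `mos` array per dataset (the actual plotting scripts in the repo may of course differ):

```python
import numpy as np

def rescale_mos(mos, lo=1.0, hi=5.0):
    """Min-max rescale a dataset's MOS values to a common [lo, hi] range."""
    mos = np.asarray(mos, dtype=np.float64)
    return lo + (mos - mos.min()) / (mos.max() - mos.min()) * (hi - lo)
```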

The plot that seems to best reflect your experience with p-norms should be the one generated from the IVC dataset, where at MOS 4 (slight degradation) the max- and 12-norms are the most precise, whereas the closer you get to MOS 1 (heavy degradation), the more accurate the 2-, 3- and 6-norms become.

JPEGXR sees the 6-norm as the constant leader, in the close company of the 3-norm, and of the 12-norm in the initial [4.25..5] range, until the latter is ultimately surpassed by the 2-norm.

The relationship between TID2013 and KADID-10k is interesting: one shows the exact opposite tendency of the other. TID2013 more or less consistently ranks them (from better to worse) 2-, 3-, 6-, 12-norm, with the max-norm hovering between the 12- and 6-norms in accuracy, while KADID-10k gives (from better to worse): max-, 12-, 6-, 3- and 2-norm.

About TID2013, the [4..5] range looks completely botched, not only for Butteraugli but for all the other metrics too, with some even exhibiting a positive correlation between MOS and the dissimilarity metrics. I haven't been able to come up with an explanation for this.

jyrkialakuijala commented 4 years ago

> I pushed the files with Butteraugli XL max, 3p-norm and 2,3,6,12-norms to the NormAnalysis folder.

Good stuff! I looked through these plots seven times during the weekend.

> The plot that seems to best reflect your experience with p-norms should be the one generated from the IVC dataset,

Agreed.

> JPEGXR sees the 6-norm as the constant leader,

I'm guessing that with expert viewers, or naive viewers with more time, a higher p-norm will correlate better.

> About TID2013, the [4..5] range looks completely botched

That is very interesting. Until now I had thought TID2013 was the gold standard in this field. I just was not able to get much out of it with butteraugli and was clueless about the reason... Some wild guesswork on why the low norms do better there (and on the reversal):

It could be because of the wording used to guide the testers. If the candidates are asked to look for the largest error and understand that to mean the errors covering the most surface area, then the 1-norm (or other small norms) is king. ...but that doesn't explain the positive correlation.

It could also be that the testers didn't have enough time (or interest) to study the high-quality range properly. When they didn't see any delta, they clicked a middle rating to get to the next image faster and avoid being off by much.

It could be that the demonstration images (used to teach the testers how to rate) had large surface-area differences that correlated with the requested MOS scores, i.e., the demonstration images had a small-norm bias.

Also, they could have been instructed to rate [3] if they were unsure, and they might well have become unsure when they didn't see anything.

Would you consider contacting the author nikolay@ponomarenko.info to get more info on the TID2013 reversal at high quality?

SmilingWolf commented 4 years ago

I've been reading the TID2013 paper [1] again; some notable points:

I'll try to contact Mr. Ponomarenko in the coming days to see if he has additional insights to offer.

[1] N. Ponomarenko, L. Jin, O. Ieremeiev, V. Lukin, K. Egiazarian, J. Astola, B. Vozel, K. Chehdi, M. Carli, F. Battisti, C.-C. Jay Kuo, "Image database TID2013: Peculiarities, results and perspectives," Signal Processing: Image Communication, vol. 30, pp. 57-77, Jan. 2015.