bjin / mpv-prescalers

prescalers for mpv, as user shaders
GNU Lesser General Public License v3.0

AR kernel shape? #57

Closed haasn closed 1 year ago

haasn commented 1 year ago

I noticed you added new AR options based on the 'gaussian AR' design in libplacebo. For various reasons, I'm unhappy with the way that anti-ringing filter works; in particular, on real-world content it tends to perform much more poorly than on the synthetic test patterns it was designed on (a tale as old as time...), because the ^8 exponent is not high enough to mask the effects of the gaussian kernel, which makes things very blurry when the source contrast is too low.

People have been saying your ravu-zoom-ar variants produce better output; do you have any insight into why? It seems to be based on a similar design overall, but how did you train the AR kernel? I noticed you use a fancy LUT instead of a simple kernel to weight the contributions of pixels. What shape does it have, in practice? (And what parameter does it depend on? I don't really understand why you divide subpix by vec2(2.0, 288.0), isn't it already in range [0,1]? And shouldn't it be symmetric?)

bjin commented 1 year ago

I noticed the blurry result produced by the gaussian kernel, and addressed it by clamping the ravu kernel weights (and dividing by the sum of the clamped weights later), see de47f3d14b7165756d6060ebc1b2ebaa0c40851b

-                GLSL("wg = vec4(%r,%r,%r,%r);" % tuple(w_gauss[i]))
+                GLSL("wg = max(vec4(0.0), w);");

I later trained a non-ringing ravu-zoom-r2 kernel (with all kernel weights being non-negative) for anti-ringing purposes, and used the same kernel for both ravu-zoom-ar-r2 and ravu-zoom-ar-r3, see 8c24271854b3594472d40d5579a4932fa2d10db3

Unlike the gaussian kernel, the trained non-ringing kernel is actually pretty sharp (but with some aliasing artifacts). I guess this could be the difference between ravu-zoom-ar and libplacebo's gaussian-based AR.

The following are visualizations of the new ravu-zoom kernels (angle from left to right, strength + coherence*4 from top to bottom):

non-ringing ravu-zoom-r2 kernel: ravu-zoom-ar-r2-model vis

ravu-zoom-r2 kernel: ravu-zoom-r2-model vis

ravu-zoom-r3 kernel: ravu-zoom-r3-model vis

haasn commented 1 year ago

So you basically use a forcibly clamped version of the normal kernel as the antiringing kernel? Doesn't that basically halve performance, since you now need to include all samples?

Maybe a better idea would be to cut off only the first lobe of the main kernel and use that as the antiringing kernel... actually, now I want to try this. As a bonus, we would get the weights for free.

bjin commented 1 year ago

So you basically use a forcibly clamped version of the normal kernel as the antiringing kernel?

Yes, the clamped kernel has basically the same shape as the non-ringing kernel.

Doesn't that basically halve performance, since you now need to include all samples?

Not really. In my testing, most of the rendering time was spent on loading samples from textures. Also, not all samples are required for the anti-ringing kernel; it's much smaller, as visualized above.

haasn commented 1 year ago

@bjin I played with the ideas from this thread: https://code.videolan.org/videolan/libplacebo/-/merge_requests/526

It improves performance significantly, and also works when downscaling. But it doesn't really fix the problem of the kernel making things very blurry, which is mostly a limitation of using a small-ish exponent (^8).

Maybe you have some clever maths ideas on how to improve the situation in other ways? To give a rough summary, the idea is to implement a "weighted max()" while preserving associativity/commutativity. What we do is effectively a weighted PowerMean with exponent k = 8. I noticed that antiringing improves for very large k, but we run into floating point precision limits (e.g. in the degenerate case of there being only a single non-zero weighted value, we end up computing pow(pow(x, k), 1/k), which truncates small x to zero). Ideally, I would like to use a very large k, like 64 or even 128, but this destroys all values sufficiently close to black/white.
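
To make the precision limit concrete, here is a minimal sketch of the degenerate single-sample case (illustrative only, emulating float32 with numpy; not the actual libplacebo code):

import numpy as np

def weighted_powermean(values, weights, k=8):
    # weighted power mean: ((sum w*c^k) / (sum w))^(1/k), all in float32
    values = np.asarray(values, dtype=np.float32)
    weights = np.asarray(weights, dtype=np.float32)
    num = np.sum(weights * values**np.float32(k), dtype=np.float32)
    den = np.sum(weights, dtype=np.float32)
    return (num / den) ** np.float32(1.0 / k)

# Degenerate case: one non-zero weight. Mathematically the result is exactly
# x, but for large k the inner pow(x, k) underflows float32 for small x.
x = 0.01
print(weighted_powermean([x], [1.0], k=8))    # ~0.01, as expected
print(weighted_powermean([x], [1.0], k=64))   # 0.0: x^64 underflows to zero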

Maybe there's a better operation we can use as the basis of "weighted max"?

haasn commented 1 year ago

Hmm, thinking about it again: instead of doing ((w₁C₁^k + w₂C₂^k + ...)/(w₁ + w₂ + ...))^(1/k), we could do (w₁C₁^k + w₂C₂^k + ...)/(w₁C₁^(k-1) + w₂C₂^(k-1) + ...), effectively moving the error to the weights - very small weights will just be rounded down to 0 and end up having no effective contribution. This keeps the error consistent with the wgsum.

I implemented this and it seems to give slightly sharper results even at the same exponent 8. (It also removes the need for the sqrt from the second half, a huge win.)
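
For intuition, a small plain-Python sketch (illustrative weights and values only, not the shader code) comparing the two forms at k = 8:

def powermean(weights, values, k=8):
    # ((w1*c1^k + w2*c2^k + ...) / (w1 + w2 + ...))^(1/k)
    num = sum(w * c**k for w, c in zip(weights, values))
    return (num / sum(weights)) ** (1.0 / k)

def ratio_form(weights, values, k=8):
    # (w1*c1^k + w2*c2^k + ...) / (w1*c1^(k-1) + w2*c2^(k-1) + ...)
    num = sum(w * c**k for w, c in zip(weights, values))
    den = sum(w * c**(k - 1) for w, c in zip(weights, values))
    return num / den

w = [0.1, 0.3, 0.4, 0.2]
c = [0.20, 0.25, 0.80, 0.75]
print(powermean(w, c))    # ~0.737
print(ratio_form(w, c))   # ~0.788, a bit closer to max(c) = 0.8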

bjin commented 1 year ago

we could do ((w₁C₁^k + w₂C₂^k)/(w₁C₁^(k-1) + w₂C₂^(k-1) + ...)

Nice idea!

haasn commented 1 year ago

Unfortunately, I still get some numeric issues when raising this to a nice exponent (k=32 at least is required to recover full sharpness in my testing). Lowering it down to 8 gets rid of the visible artifacts, but also reintroduces some blurriness. Probably the remaining glitches are a result of wgsum ending up as some very small (denormalized) number, so you end up dividing a denormalized number by another denormalized number, losing precision.

What we really need, I think, is some way to dynamically rescale lo/hi such that wgsum is always equal to 1. But this is just an exercise in floating point implementation.

bjin commented 1 year ago

k=32 at least is required to recover full sharpness in my testing

64-bit floats (double) are available since OpenGL 4.0, and on modern GPUs they're roughly 2 times slower (compared to 32-bit floats).

so you end up dividing a denormalized number by another denormalized number, losing precision.

I actually doubt this. Assuming the sum calculation (of small floating point numbers) doesn't lose precision, the final division of two small numbers won't lose precision either, since 24 of the 32 bits are mantissa digits in normalized scale ([1,2) or [0.5,1)).
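
A quick check of that claim (illustrative, emulating float32 with numpy):

import numpy as np

# dividing two small but normalized float32 numbers keeps full relative
# precision; the exponents cancel and only the 24-bit mantissas matter
a = np.float32(3e-30)
b = np.float32(7e-30)
print(a / b)       # ~0.42857143
print(3.0 / 7.0)   # same ratio, computed in float64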

bjin commented 1 year ago

Another idea is to scale all c_i by 1/max(c_i), so that max(c_i) becomes 1; this way precision can be retained as much as possible in high-exponent floating point arithmetic.
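
A rough sketch of that rescaling (illustrative, using numpy; a shader would do the equivalent in GLSL):

import numpy as np

def ratio_form_scaled(weights, values, k=32):
    # divide every c_i by max(c_i) so the largest term is exactly 1.0,
    # keeping v**k well inside float32 range for the values that matter most
    values = np.asarray(values, dtype=np.float32)
    weights = np.asarray(weights, dtype=np.float32)
    scale = values.max()               # assumes not all values are zero
    v = values / scale
    num = np.sum(weights * v**k, dtype=np.float32)
    den = np.sum(weights * v**(k - 1), dtype=np.float32)
    return scale * num / den           # undo the scaling afterwards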

bjin commented 1 year ago

Also, this trick could be useful, assuming the GLSL compiler won't over-optimize it.

haasn commented 1 year ago

I was thinking about it, but I'm not sure that's actually where the precision loss comes from. Incidentally, here's the code I'm using to brainstorm:

import math

def weight(x):
    return math.pow(x, 32)

def wsum_ref(values):
    accum = 0
    wsum = 0
    for x in values:
        w = weight(x)
        accum += w * x
        wsum += w

    return accum / wsum

if __name__ == '__main__':
    values = [math.pow(x / 100, 2.2) for x in range(100)]
    print(wsum_ref(values))

I found this implementation which I'm currently trying to get to work inside GLSL (but it produces strange artifacts for some reason):

def wsum_dyn(values):
    accum = 0
    wsum = 0
    for x in values:
        w = weight(x)
        if w:
            wsumw = wsum + w
            accum = accum * (wsum / wsumw) + (w / wsumw) * x
            wsum = wsumw

    return accum

The idea is to keep dynamically rescaling the accumulator.

Edit: Nvm, forgot to initialize wsum in GLSL

bjin commented 1 year ago

Maybe use numpy.float32 instead of the default 64-bit float in Python?
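
For example (a rough sketch of that suggestion, assuming numpy is available):

import math
import numpy as np

def wsum_ref_f32(values):
    # same as wsum_ref above, but with every intermediate forced to float32
    accum = np.float32(0)
    wsum = np.float32(0)
    for x in np.asarray(values, dtype=np.float32):
        w = x ** np.float32(32)
        accum += w * x
        wsum += w
    return accum / wsum

values = [math.pow(x / 100, 2.2) for x in range(100)]
print(wsum_ref_f32(values))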

haasn commented 1 year ago

Hmm, actually, even with float32 I get perfect accuracy even from the naive implementation. I think the problem is actually that it's possible for the sum of weights to be 0. Adding a simple w = max(w, 1e-37) works around the issue, too. Though it changes the subjective appearance, for reasons I don't yet entirely understand.

I'm still not sure what's actually going on in this AR kernel.

haasn commented 1 year ago

Oh, I figured out a very stupid trick:

We can 'fix' the tiny weights by simply multiplying by a big constant, e.g. w = 1e30 * pow(w, 32). I tried doing this for a while, but finally realized why it didn't help on its own. The solution is to split this up into iterated w = a * w * w (5 times to get an exponent of 32), with a picked to roughly preserve the value range (a value of 2.0 seems to work well).

This helps protect the weights and prevents them from shooting off to 0.
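
A small sketch of that trick (illustrative values, not the shader code):

def boosted_pow32(w, a=2.0):
    # applying w = a * w * w five times yields a**31 * w**32; with a = 2.0
    # every intermediate stays far away from the float32 underflow threshold
    for _ in range(5):
        w = a * w * w
    return w

print(boosted_pow32(0.05))    # ~5e-33
print(2.0**31 * 0.05**32)     # essentially the same value, computed directly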

bjin commented 1 year ago

I think the problem is actually that it's possible for the sum of weights to be 0.

Is it really possible?

From what I understand, the weight w_i should never be a problem, since it's multiplied only once. And the sum of weights should never be close to zero; otherwise I'd consider it a bug.

It's c_i that's worth taking care of, since c_i^32 can easily fall outside the range of float32: (2^(-8))^32 vs 2^(-127).

The solution I came up with is mapping c_i from [0, 1.0] to [0.5, 1.5]; then c_i^32 should be handled well by float32.
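
A quick sketch of the effect of that mapping (illustrative, float32 emulated with numpy):

import numpy as np

def shifted_pow(c, k=32):
    # map c from [0, 1] to [0.5, 1.5] before raising to the k-th power
    c = np.float32(c) + np.float32(0.5)
    return c ** np.float32(k)

print(shifted_pow(0.0))   # 0.5**32 ~= 2.3e-10, comfortably representable
print(shifted_pow(1.0))   # 1.5**32 ~= 4.3e5, nowhere near float32 overflow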

bjin commented 1 year ago

(w₁C₁^k + w₂C₂^k + ...)/(w₁C₁^(k-1) + w₂C₂^(k-1) + ...)

Also, with this formula the sum of weights doesn't have to be normalized to 1.0; it can be calculated as is.

haasn commented 1 year ago

Is it really possible?

To be clear, I'm talking about the sum of the w * c^32 values (i.e. the adjusted weights). For example, if c is black everywhere, this is trivially the case. But clamping to 1e-37 solves that edge case.

The solution I came up with is mapping c_i from [0, 1.0] to [0.5, 1.5], then c_i^32 should be handled by float32 well.

Oh, that's much cleverer than what I was doing (multiplying by a constant), and it solves the issue here. It also preserves the differentiation between high and low values (the ultimate goal), even if we race off to infinity in both directions. Maybe an even smarter idea would be to normalize relative to the current color value, so that all values exceeding the current pixel value are above 1.0, and all values below it are below 1.0. But it works fine as-is now.

With your trick I can even go up to exponents as high as 128 without issue.

Also, with this formula the sum of weights doesn't have to be normalized to 1.0; it can be calculated as is.

Not sure what you mean. I still need to normalize to the sum of adjusted weights at the end.

bjin commented 1 year ago

Not sure what you mean. I still need to normalize to the sum of adjusted weights at the end.

I mean wgsum doesn't have to be calculated: in (w₁C₁^k + w₂C₂^k + ...)/(w₁C₁^(k-1) + w₂C₂^(k-1) + ...), both the dividend and the divisor contain 1.0 / wgsum as a factor, so it cancels out.
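
A quick numeric check of that cancellation (illustrative values only):

def ratio_form(weights, values, k=8):
    num = sum(w * c**k for w, c in zip(weights, values))
    den = sum(w * c**(k - 1) for w, c in zip(weights, values))
    return num / den

w = [0.2, 0.5, 0.8, 0.5]
c = [0.1, 0.4, 0.9, 0.6]
wgsum = sum(w)
print(ratio_form(w, c))                        # unnormalized weights
print(ratio_form([x / wgsum for x in w], c))   # same result: wgsum cancels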

I have an initial implementation for ravu-zoom and ravu-lite for reference: ec0fd218fc02d7f6897c9a4a283af5a4db183e43

haasn commented 1 year ago

Oh, I see, you are actually doing it slightly differently from me - but you still need to divide at the end, which is basically what I'm doing with wwsum in my branch:

https://code.videolan.org/videolan/libplacebo/-/merge_requests/526/diffs?commit_id=0f37737bae34a3a2be47799c7eb6afa422125d18

haasn commented 1 year ago

I also changed the shifting from [0.5, 1.5] to [1.0, 2.0]; as a quick litmus test we can see that 0.5^32 is denormalized but 2^32 retains full precision. I'm not sure I can articulate why. But maybe it has something to do with the fact that very small (black) values don't change the number at all now, meaning we get the biggest change in exponent from the value we care about the most (the true max). So if you add a bunch of values of the shape pow(x+1, k) for very large k, this will degenerate into just representing max(x) as you run into precision limits - which is exactly the result we want.

Edit: This is not true, and I'm not sure why I thought it was - 0.5^128 and 2^128 both run into floating point exponent limits. Though I still suspect shifting by +1 will produce a value closer to the true max than shifting by +0.5 only. 🤷🏻

Edit 2: Actually, in testing, this is not true either. And I guess, intuitively, you want to make the extra added value as small as possible. So maybe going off this logic, we should solve offset^32 = minimum; to maintain a minimum value of 1e-37 we should set the offset to about 0.07. I'll do some testing.
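
For reference, the offset calculation (a trivial check, illustrative only):

minimum = 1e-37
offset = minimum ** (1.0 / 32)   # solve offset**32 = minimum
print(offset)                    # ~0.0699, i.e. roughly the 0.07 above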

bjin commented 1 year ago

thanks for pointing this out!