haasn closed this issue 1 year ago
I noticed the blurry result produced by the gaussian kernel, and addressed this by clamping the ravu kernel weights (and dividing by the sum of clamped weights later), see de47f3d14b7165756d6060ebc1b2ebaa0c40851b:
```diff
-GLSL("wg = vec4(%r,%r,%r,%r);" % tuple(w_gauss[i]))
+GLSL("wg = max(vec4(0.0), w);")
```
I later trained a non-ringing ravu-zoom-r2 kernel (with all kernel weights non-negative) for anti-ringing purposes, and used the same kernel for both ravu-zoom-ar-r2 and ravu-zoom-ar-r3, see 8c24271854b3594472d40d5579a4932fa2d10db3
Unlike the gaussian kernel, the trained non-ringing kernel is actually pretty sharp (but with some aliasing artifacts). I guess this could be the difference between ravu-zoom-ar and the libplacebo gaussian-based AR.
The following are visualizations of the new ravu-zoom kernels (angle from left to right, strength + coherence*4 from top to bottom),
non-ringing ravu-zoom-r2 kernel:
ravu-zoom-r2 kernel:
ravu-zoom-r3 kernel:
So you basically use a forcibly clamped version of the normal kernel as the antiringing kernel? Doesn't that basically halve performance, since you now need to include all samples?

Maybe a better idea would be to cut off only the first lobe of the main kernel and use that as the antiringing kernel... actually, now I want to try this. As a bonus, we would get the weight for free.
> So you basically use a forcibly clamped version of the normal kernel as the antiringing kernel?
Yes, the clamped kernel has basically the same shape as the non-ringing kernels.
> Doesn't that basically halve performance, since you now need to include all samples?
Not really. As I have tested, most of the rendering time was spent loading samples from textures. Also, not all samples are required for the anti-ringing kernel; it's much smaller, as visualized.
@bjin I played with the ideas from this thread: https://code.videolan.org/videolan/libplacebo/-/merge_requests/526
It improves performance significantly, and also works when downscaling. But it doesn't really fix the problem of the kernel making things very blurry, which is mostly a limitation of using a small-ish exponent (`^8`).

Maybe you have some clever maths ideas on how to improve the situation in other ways? To give a rough summary, the idea is to implement a "weighted max()", while preserving associativity/commutativity. What we do is effectively a weighted PowerMean with exponent `k = 8`. I noticed that antiringing improves for very large `k`, but we run into floating point precision limits (e.g. in the degenerate case of there being only a single non-zero weighted value, we end up computing `pow(pow(x, k), 1/k)`, which truncates small `x` to zero). Ideally, I would like to use a very large `k`, like 64 or even 128, but this destroys all values sufficiently close to black/white.

Maybe there's a better operation we can use as the basis of "weighted max"?
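For reference, here is a small numpy sketch (not the actual libplacebo code; the function name is made up) of the weighted power mean described above, illustrating how the degenerate single-value case truncates to zero once `k` gets large in float32:

```python
import numpy as np

def weighted_power_mean(values, weights, k):
    # ((w1*x1^k + w2*x2^k + ...) / (w1 + w2 + ...)) ^ (1/k), all in float32
    v = np.asarray(values, dtype=np.float32)
    w = np.asarray(weights, dtype=np.float32)
    mean = np.sum(w * v**k) / np.sum(w)
    return np.float32(mean) ** np.float32(1.0 / k)

# Degenerate case: a single non-zero weighted value x.
# Ideally the result should just be x itself.
x = 0.05
print(weighted_power_mean([x], [1.0], 8))   # recovers roughly 0.05
print(weighted_power_mean([x], [1.0], 64))  # x^64 underflows float32, yields 0.0
```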
Hmm, thinking about it again: instead of doing `((w₁C₁^k + w₂C₂^k + ...)/(w₁ + w₂ + ...))^(1/k)`, we could do `(w₁C₁^k + w₂C₂^k + ...)/(w₁C₁^(k-1) + w₂C₂^(k-1) + ...)`, effectively moving the error to the weights: very small weights will just be rounded down to 0 and end up having no effective contribution. This will maintain error consistent with the `wgsum`.

I implemented this and it seems to give slightly sharper results even at the same exponent 8. (It also removes the need for `sqrt` from the second half, a huge win.)
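A quick numpy comparison of the two forms (illustrative only, names invented) shows that moving the normalization into the weighted sums pulls the result noticeably closer to the true max at the same exponent, with no final `1/k` root needed:

```python
import numpy as np

def pm_old(values, weights, k):
    # ((w1*C1^k + ...) / (w1 + ...)) ^ (1/k)
    v = np.asarray(values, np.float32)
    w = np.asarray(weights, np.float32)
    return float(np.float32(np.sum(w * v**k) / np.sum(w)) ** np.float32(1.0 / k))

def pm_new(values, weights, k):
    # (w1*C1^k + ...) / (w1*C1^(k-1) + ...), no root needed
    v = np.asarray(values, np.float32)
    w = np.asarray(weights, np.float32)
    return float(np.sum(w * v**k) / np.sum(w * v**(k - 1)))

v, w = [0.2, 0.8], [1.0, 1.0]
print(pm_old(v, w, 8))  # ~0.73, noticeably below the true max 0.8
print(pm_new(v, w, 8))  # ~0.80, much closer to the true max
```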
> we could do `(w₁C₁^k + w₂C₂^k + ...)/(w₁C₁^(k-1) + w₂C₂^(k-1) + ...)`

Nice idea!
Unfortunately, I still get some numeric issues when raising this to a nice exponent (k=32 at least is required to recover full sharpness in my testing). Lowering it down to 8 gets rid of visible artifacts, but also reintroduces some blurriness. Probably the remaining glitches are a result of `wgsum` ending up as some very small (denormalized) number, so you end up dividing a denormalized number by another denormalized number, losing precision.

What we really need, I think, is some way to dynamically rescale `lo`/`hi` such that `wgsum` is always equal to 1. But this is just an exercise in floating point math implementation.
> k=32 at least is required to recover full sharpness in my testing

64-bit floats (double) are available from OpenGL 4.0, and on modern GPUs they're about 2 times slower (compared to 32-bit floats).
> so you end up dividing a denormalized number by another denormalized number, losing precision.

I actually doubt this. Assuming the summation (of small floating point values) doesn't lose precision, the final division of two small numbers won't lose precision either, since 24 of the 32 bits are significand digits at normalized scale ([1,2) or [0.5,1)).
Another idea is to scale all `c_i` by `1/max c_i`, so that `max c_i` becomes 1; this way precision can be retained as much as possible in high-exponent floating point arithmetic.

Also, this trick could be useful, assuming the GLSL compiler won't over-optimize it.
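A sketch of this suggestion (hypothetical helper, not the actual shader code): pre-dividing every `c_i` by `max c_i` keeps the largest power term at exactly 1, and the scale factor can be reapplied once at the end, since the `m^k` and `m^(k-1)` factors almost fully cancel:

```python
import numpy as np

def weighted_max_scaled(values, weights, k):
    v = np.asarray(values, np.float32)
    w = np.asarray(weights, np.float32)
    m = np.max(v)
    if m == 0.0:
        return 0.0
    s = v / m  # now max(s) == 1, so s**k stays in float32 range
    # (sum w*(m*s)^k) / (sum w*(m*s)^(k-1)) == m * (sum w*s^k) / (sum w*s^(k-1))
    return float(m * np.sum(w * s**k) / np.sum(w * s**(k - 1)))

# With k = 64, raw values this small would underflow float32 to 0/0,
# but the rescaled version stays well-behaved:
print(weighted_max_scaled([0.05, 0.02], [1.0, 1.0], 64))  # ~0.05
```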
I was thinking about it, but I'm not sure that's actually where the precision loss comes from. Incidentally, here's the code I'm using to brainstorm:
```python
import math

def weight(x):
    return math.pow(x, 32)

def wsum_ref(values):
    accum = 0
    wsum = 0
    for x in values:
        w = weight(x)
        accum += w * x
        wsum += w
    return accum / wsum

if __name__ == '__main__':
    values = [math.pow(x / 100, 2.2) for x in range(100)]
    print(wsum_ref(values))
```
I found this implementation which I'm currently trying to get to work inside GLSL (but it produces strange artifacts for some reason):
```python
def wsum_dyn(values):
    accum = 0
    wsum = 0
    for x in values:
        w = weight(x)
        if w:
            wsumw = wsum + w
            accum = accum * (wsum / wsumw) + (w / wsumw) * x
            wsum = wsumw
    return accum
```
Idea is to keep dynamically rescaling the accumulator.

Edit: Nvm, forgot to initialize `wsum` in GLSL.
Maybe use `numpy.float32` instead of the default 64-bit float in python?
Hmm, actually, even with float32 I get perfect accuracy, even from the naive implementation. I think the problem is actually that it's possible for the sum of weights to be 0. Adding a simple `w = max(w, 1e-37)` actually works around the issue, too. Though it changes the subjective appearance, for reasons I don't yet entirely understand.
I'm still not sure what's actually going on in this AR kernel.
Oh, I figured out a very stupid trick:

We can 'fix' the tiny weights by simply multiplying by a big constant, e.g. `w = 1e30 * pow(w, 32)`. I tried doing this for a while, but finally realized why it didn't help on its own. The solution is to split this up into iterated `w = a * w * w` (5 times to get an exponent of 32), with `a` picked to roughly preserve the value range (a value of 2.0 seems to work well).

This will help protect the weights and prevent them from shooting off to 0.
> I think the problem is actually that it's possible for the sum of weights to be 0.

Is it really possible?

From what I understand, the weight `w_i` should never be a problem, since it's multiplied only once. And the sum of weights should never be close to zero; otherwise I'd consider it a bug.
It's `c_i` that's worth taking care of, since `c_i^32` can easily exceed the range of float32 (`(2^(-8))^32` vs `2^(-127)`).

The solution I came up with is mapping `c_i` from `[0, 1.0]` to `[0.5, 1.5]`; then `c_i^32` should be handled well by float32.
> `(w₁C₁^k + w₂C₂^k + ...)/(w₁C₁^(k-1) + w₂C₂^(k-1) + ...)`

Also, with this formula the sum of weights doesn't have to be normalized to 1.0. It can be calculated as-is.
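A numpy sketch of the `[0.5, 1.5]` remap (hypothetical helper name, not the actual shader): shift every `c_i` up by 0.5 before taking powers, then shift the result back down, so `c^32` never leaves the comfortable float32 range even for `c` near black:

```python
import numpy as np

def weighted_max_shifted(values, weights, k=32, shift=np.float32(0.5)):
    c = np.asarray(values, np.float32) + shift  # [0,1] -> [0.5,1.5]
    w = np.asarray(weights, np.float32)
    # weighted max on the shifted values, then undo the shift
    return float(np.sum(w * c**k) / np.sum(w * c**(k - 1)) - shift)

print(weighted_max_shifted([0.02, 0.9], [1.0, 1.0]))  # ~0.9
print(weighted_max_shifted([0.0, 0.0], [1.0, 1.0]))   # ~0.0, no NaN for black
```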
> Is it really possible?

To be clear, I'm talking about the sum of the `w * c^32` values (i.e. the adjusted weights). For example, if `c` is black everywhere, this is trivially true. But clamping to `1e-37` solves that edge case.
> The solution I came up with is mapping `c_i` from `[0, 1.0]` to `[0.5, 1.5]`, then `c_i^32` should be handled by float32 well.
Oh, that's much cleverer than what I was doing (multiplying by constant), and solves the issue here. It also preserves differentiation between high and low values (the ultimate goal), even if we race off to infinity in both directions. Maybe an even smarter idea would be to normalize it relative to the current color value, so that all values exceeding the current pixel value are above 1.0, and all values below the current pixel value are below 1.0. But it works fine as-is now.
With your trick I can even go up to exponents as high as 128 without issue.
> Also with this formula, sum of weights doesn't have to be normalized to 1.0. It can be calculated as is.
Not sure what you mean. I still need to normalize to the sum of adjusted weights at the end.
> Not sure what you mean. I still need to normalize to the sum of adjusted weights at the end.

I mean `wgsum` doesn't have to be calculated, since in `(w₁C₁^k + w₂C₂^k + ...)/(w₁C₁^(k-1) + w₂C₂^(k-1) + ...)` both the dividend and the divisor contain `1.0 / wgsum` as a factor, so it cancels.
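This cancellation is easy to sanity-check numerically (illustrative snippet, invented names): scaling all weights by any common factor, normalized or not, leaves the result unchanged:

```python
import numpy as np

def weighted_max(values, weights, k=8):
    v = np.asarray(values, np.float64)
    w = np.asarray(weights, np.float64)
    return float(np.sum(w * v**k) / np.sum(w * v**(k - 1)))

v = [0.1, 0.6, 0.3]
w_norm = [0.2, 0.5, 0.3]  # weights normalized to sum to 1
w_raw = [2.0, 5.0, 3.0]   # same weights scaled by 10, left unnormalized
# the common 1/wgsum factor cancels between dividend and divisor:
print(weighted_max(v, w_norm), weighted_max(v, w_raw))  # identical
```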
I have an initial implementation for ravu-zoom and ravu-lite for reference: ec0fd218fc02d7f6897c9a4a283af5a4db183e43
Oh, I see, you are actually doing it slightly differently from me, but you still need to divide at the end, which is basically what I'm doing with `wwsum` in my branch:
I also changed the shifting from `[0.5, 1.5]` to `[1.0, 2.0]`, though I'm not sure I can articulate why. But maybe it has something to do with the fact that very small (black) values don't change the number at all now, meaning we get the biggest change in exponent from the value we care about the most (the true max). For a quick litmus test, we can see that `0.5^32` is denormalized but `2^32` retains full precision. So if you add a bunch of values of shape `pow(x+1, k)` for very large k, this will degenerate into just representing `max(x)` as you run into precision limits, which is exactly the result we want.

Edit: This is not true, and I'm not sure why I thought it was: `0.5^128` and `2^128` both run into floating point exponent limits. Though I still suspect shifting by `+1` will produce a value closer to the true max than shifting by `+0.5` only. 🤷🏻

Edit 2: Actually, in testing, this is not true. And I guess from intuition, you want to make the extra added value as small as possible. So maybe going off this logic, we should solve for `offset^32 = minimum`; to maintain a minimum value of `1e-37`, we should set the offset to `0.07`. I'll do some testing.
thanks for pointing this out!
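For what it's worth, the arithmetic behind the proposed offset checks out (quick verification, not from the thread itself):

```python
# solving offset^32 = 1e-37 for the offset:
offset = 1e-37 ** (1.0 / 32)
print(offset)       # ≈ 0.0698
print(0.07 ** 32)   # ≈ 1.1e-37, just above the chosen minimum
```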
I noticed you added new AR options based on the 'gaussian AR' design in libplacebo. For various reasons, I'm unhappy with the way that anti-ringing filter works; in particular, on real-world content it tends to perform much more poorly than on the synthetic test patterns it was designed on (a tale as old as time...), because the `^8` exponent is too low to mask the effects of the gaussian kernel making things very blurry if the source contrast is too low.

People have been saying your ravu-zoom-ar variants produce better output; do you have any insight into why? It seems to be based on a similar design overall, but how did you train the AR kernel? I noticed you use a fancy LUT instead of a simple kernel to weigh the contributions of pixels. What shape does it have, in practice? (And what parameter does it depend on? I don't really understand why you divide `subpix` by `vec2(2.0, 288.0)`, isn't it already in range `[0,1]`? And shouldn't it be symmetric?)