Artoriuz / glsl-chroma-from-luma-prediction

CfL as a GLSL shader
MIT License

Downscale step seems quite expensive at higher resolutions #3

Closed Jules-A closed 1 year ago

Jules-A commented 1 year ago

Cfl_Mix: cfl_mix

KrigBillateral (those %s don't seem accurate lol): krigbillateral

Cfl_Mix without downscaling: cfl_mix_no_ds

Even without downscaling it seems pretty good, especially for the cost, but there is still some oddness going on. Any chance of a version with faster downscaling, or a fast variant with no DS but better-tuned values?

Artoriuz commented 1 year ago

That looks a bit weird. I don't have access to my desktop right now, but the shader is generally very lightweight, even on my laptop with an Intel iGPU from years ago.

What's the resolution of this video? Also, are you prescaling luma before calling the CfL shader?

Artoriuz commented 1 year ago

I can reproduce this with RAVU, I guess it's the combination of having a larger resolution plus the scaling factor difference to match chroma (0.25x rather than 0.5x).

I'll see what I can do about this, thanks =)

Jules-A commented 1 year ago

What's the resolution of this video? Also, are you prescaling luma before calling the CfL shader?

I'm trying out the 4x fast photo glsl scaler from https://github.com/Alexkral/AviSynthAiUpscale in an attempt to remove aliasing so I believe it's scaling to 8k with my native res being 1440p.

When I removed the downscaling I was just binding and using LUMA. Looking at a closed issue, there appears to have been a version without downscaling before, but NATIVE was used instead? The 12tap variant had fewer issues than mixed like that, but I don't think it beat catmull_rom when testing ~20 sources subjectively. Actually, now that I've tested chroma scaling a lot more, it appears catmull_rom is not just better than lanczos/ewa_lanczos(sharp) but krigbillateral too, at least when playing back at a high resolution (in my case upscaled). Is it possible you could run catmull_rom through your tests?

EDIT: Turns out I edited the shader wrong; the old version from 9ad91ab has much better results without the heavy cost of downscaling. Looks like that makes it use catmull_rom chroma upscaling as well, which means it has a higher overhead :/

Artoriuz commented 1 year ago

The version you linked didn't have the downscaling step yet, and it didn't have its own spatial scaler either so it should just use whatever you're setting cscale to (maybe you had it set to catmull_rom?).

Anyway, I'm trying to simplify the downsampling code a little bit and I think I could make it roughly twice as fast without any noticeable quality loss:

//!HOOK CHROMA
//!BIND LUMA
//!BIND HOOKED
//!SAVE LUMA_LOWRES
//!WIDTH CHROMA.w
//!HEIGHT CHROMA.h
//!WHEN CHROMA.w LUMA.w <
//!DESC Chroma From Luma Prediction (Downscaling Luma)

vec4 hook() {
    vec2 factor = ceil(LUMA_size / CHROMA_size);
    vec2 centre = factor / 2.0;
    vec2 offset = 1.0 / factor;

    float output_luma = 0.0;
    float wt = 0.0;
    for (float dx = 0.0; dx <= factor.x; dx++) {
        for (float dy = 0.0; dy <= factor.y; dy++) {
            float luma_pix = LUMA_texOff(vec2(dx, dy) - vec2(centre)).x;
            float wd = exp(-2.0 * pow(length(vec2(dx, dy) - vec2(centre) + offset), 2.0));
            output_luma += luma_pix * wd;
            wt += wd;
        }
    }
    vec4 output_pix = vec4(output_luma / wt, 0.0, 0.0, 1.0);
    return output_pix; 
}

This should be a bit faster since it contains fewer operations to get to the same result. It's still obviously going to be much slower than not downsampling at all, but I guess I could write an orthogonal version of this and do it like Krig is doing (and in that case I suppose it would perform similarly). Could you help me test this in any case?
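For reference, here is a rough pure-Python model of the weighting loop above (the function name is hypothetical; this mirrors only the weight arithmetic, not the actual texture sampling):

```python
import math

def downsample_weights(factor, falloff=2.0):
    """Model of the shader's Gaussian weighting: each low-res output pixel
    is a weighted average of (factor + 1)^2 nearby high-res luma samples."""
    centre = factor / 2.0
    offset = 1.0 / factor
    weights = {}
    for dx in range(0, factor + 1):          # mirrors dx <= factor.x
        for dy in range(0, factor + 1):
            d = math.hypot(dx - centre + offset, dy - centre + offset)
            weights[(dx, dy)] = math.exp(-falloff * d * d)
    total = sum(weights.values())            # mirrors the wt accumulator
    return {k: w / total for k, w in weights.items()}

w = downsample_weights(2)                    # 2x factor, e.g. 4:2:0 chroma
assert len(w) == 9 and abs(sum(w.values()) - 1.0) < 1e-9
```

At a 2x factor this averages 9 taps per output pixel; at 4x it grows to 25, which is where the cost difference at higher resolutions comes from.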

Jules-A commented 1 year ago

The version you linked didn't have the downscaling step yet, and it didn't have its own spatial scaler either so it should just use whatever you're setting cscale to (maybe you had it set to catmull_rom?).

Yeah, what I was saying was that when I removed the downscaling code from latest version of master and replaced with LUMA it didn't produce very good results compared to the older version I linked.

It's still obviously going to be much slower than not downsampling at all, but I guess I could write an orthogonal version of this and do it like Krig is doing (and in that case I suppose it would perform similarly). Could you help me test this in any case?

Yeah, though I don't know how much longer I can continue testing as I literally spent most of yesterday testing chroma scaling :/

Jules-A commented 1 year ago

Anyway, I'm trying to simplify the downsampling code a little bit and I think I could make it roughly twice as fast without any noticeable quality loss:

Testing with a 4x luma upscale (not comparable to previous results).

12_Master: 12_master

12_Test: 12_test

12_9ad91ab (not same frame): 12_9ad91ab

The difference in quality is noticeable at such high resolutions. Master is sharper, although the new version seems to have slightly fewer artifacts (or a reduction in severity at least) in my limited testing, so I can't tell which is better.

Jules-A commented 1 year ago

Hmm actually just tested it with Mix and it causes some rather large artifacts, I assume 4tap would be the same: speed

vs master: master

Actually, I've noticed CfL (even the older version) has terrible stability in motion, so I'm concluding my 12+ hours of testing chroma upscaling and abandoning krigbillateral for catmull_rom (mainly for speed reasons)... I'll still test any revisions you throw at me though.

Artoriuz commented 1 year ago

Does this fix the problem?

//!HOOK CHROMA
//!BIND CHROMA
//!BIND LUMA
//!SAVE LUMA_LOWRES
//!WIDTH CHROMA.w
//!HEIGHT CHROMA.h
//!WHEN CHROMA.w LUMA.w <
//!DESC Chroma From Luma Prediction (Downscaling Luma)

vec4 hook() {
    vec2 factor = ceil(LUMA_size / CHROMA_size);
    vec2 centre = factor / 2.0;
    vec2 offset = 1.0 / factor;

    float output_luma = 0.0;
    float wt = 0.0;
    for (float dx = 0.0; dx < factor.x; dx++) {
        for (float dy = 0.0; dy < factor.y; dy++) {
            float luma_pix = LUMA_texOff(vec2(dx, dy) - vec2(centre)).x;
            float wd = exp(-2.0 * pow(length(vec2(dx, dy) - vec2(centre) + offset), 2.0));
            output_luma += luma_pix * wd;
            wt += wd;
        }
    }
    vec4 output_pix = vec4(output_luma / wt, 0.0, 0.0, 1.0);
    return output_pix; 
}

Jules-A commented 1 year ago

Not quite but it is a little less noticeable: new

vs previous: old

It's also slightly faster: ~37,000 ns vs ~39,000 ns (frame isn't the same as in previous results).

Artoriuz commented 1 year ago

Hmm, that's weird. I can't seem to reproduce this here so I suppose the conditions for it to happen aren't being completely met. The red outline in the logo makes me believe the 4-tap regression is the one at fault here, but if it doesn't happen with "master" then that's a bit peculiar.

Jules-A commented 1 year ago

It gets even less noticeable at 3x scaling so it's probably not anything to worry about: image

Artoriuz commented 1 year ago

No, I actually know what the problem is... I was doing a half-pixel shift by mistake in the new code 🥲...

This should look very similar to old master now, and it shouldn't have any issues with your red logo:

//!HOOK CHROMA
//!BIND CHROMA
//!BIND LUMA
//!SAVE LUMA_LOWRES
//!WIDTH CHROMA.w
//!HEIGHT CHROMA.h
//!WHEN CHROMA.w LUMA.w <
//!DESC Chroma From Luma Prediction (Downscaling Luma)

vec4 hook() {
    vec2 factor = ceil(LUMA_size / CHROMA_size);
    vec2 centre = factor / 2.0;
    vec2 offset = 1.0 / factor;

    float output_luma = 0.0;
    float wt = 0.0;
    for (float dx = -1.0; dx <= factor.x; dx++) {
        for (float dy = -1.0; dy <= factor.y; dy++) {
            float luma_pix = LUMA_texOff(vec2(dx, dy) - vec2(centre) + offset).x;
            float wd = exp(-0.2 * length(factor) * pow(length(vec2(dx, dy) - vec2(centre) + offset), 2.0));
            output_luma += luma_pix * wd;
            wt += wd;
        }
    }
    vec4 output_pix = vec4(output_luma / wt, 0.0, 0.0, 1.0);
    return output_pix; 
}

I'm not super convinced I'm settling on this though, and I think it won't be that much faster either (still a bit faster). I'll keep working on this, so we can keep the issue open.
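The half-pixel shift mentioned above can be checked numerically. A small 1D Python sketch (hypothetical helper; it uses the earlier exp(-2d²) falloff for the weights, though the conclusion doesn't depend on the exact falloff):

```python
import math

def centroid(samples_and_weights):
    """Weighted centroid of the positions actually sampled."""
    total_w = sum(w for _, w in samples_and_weights)
    return sum(s * w for s, w in samples_and_weights) / total_w

factor, centre, offset = 2, 1.0, 0.5

# Buggy pass: texture sampled at (dx - centre) while the Gaussian weight
# was evaluated at (dx - centre + offset) -> sample centroid is off-centre.
buggy = [(dx - centre, math.exp(-2.0 * (dx - centre + offset) ** 2))
         for dx in range(0, factor)]

# Fixed pass: sample position and weight position agree.
fixed = [(dx - centre + offset, math.exp(-2.0 * (dx - centre + offset) ** 2))
         for dx in range(-1, factor + 1)]

assert abs(centroid(buggy) + 0.5) < 1e-9   # shifted by half a pixel
assert abs(centroid(fixed)) < 1e-9         # centred
```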

Jules-A commented 1 year ago

I think that actually made it worse :/ (also slower): image

It seems to be breaking in more places now 🤔

Artoriuz commented 1 year ago

Are you 100% sure the problem isn't elsewhere? I'm testing the downsamplers as standalone shaders and they produce almost identical results for 0.5x and 0.25x scaling factors.

Jules-A commented 1 year ago

Are you 100% sure the problem isn't elsewhere? I'm testing the downsamplers as standalone shaders and they produce almost identical results for 0.5x and 0.25x scaling factors.

It may be a combination of shaders causing it; I haven't tested that, but in motion it's actually rather noticeable even at 2x. Master didn't have that problem, and neither did the 12tap version with the previous downscaling code you sent me (I didn't test 12tap with this one since I didn't see the issue there).

EDIT: It occurs even when just using fsrcnn as the other shader

Artoriuz commented 1 year ago

Nevermind, it was fine at 0.5x but not at 0.25x, my mistake.

The problem was that the new code becomes sharper as the scaling factor increases, which in turn makes the linear regressions less stable and more aggressive, causing that colour inversion near the edge of the logo.

The version below has the alarming issues fixed, but again it isn't that much faster and it's still a bit sharper at 0.25x:

//!HOOK CHROMA
//!BIND CHROMA
//!BIND LUMA
//!SAVE LUMA_LOWRES
//!WIDTH CHROMA.w
//!HEIGHT CHROMA.h
//!WHEN CHROMA.w LUMA.w <
//!DESC Chroma From Luma Prediction (Downscaling Luma)

vec4 hook() {
    vec2 factor = ceil(LUMA_size / CHROMA_size);
    vec2 start = ceil(-factor / 2.0 - 0.5);
    vec2 end = floor(factor / 2.0 - 0.5);

    float output_luma = 0.0;
    float wt = 0.0;
    for (float dx = start.x; dx <= end.x; dx++) {
        for (float dy = start.y; dy <= end.y; dy++) {
            float luma_pix = LUMA_texOff(vec2(dx, dy) + vec2(0.5)).x;
            float wd = exp(-1.0 / length(factor) * pow(length(vec2(dx, dy) + vec2(0.5)), 2.0));
            output_luma += luma_pix * wd;
            wt += wd;
        }
    }
    vec4 output_pix = vec4(output_luma / wt, 0.0, 0.0, 1.0);
    return output_pix; 
}

With that said, the old code is more convoluted but also more robust. It's probably a bit wasteful, as you've noticed, and the performance difference becomes really clear at higher factors, but at least it seems pretty trustworthy.

I guess the best solution here, assuming I can't make it faster without hurting quality, would be to make a simpler version of the shader that's faster to run. I have an idea of what to do to make it way faster without sacrificing quality too much (simpler downsampler paired with a "gaussian blur" to smooth things out rather than doing both things simultaneously).

I don't want to bother you too much so I'll only post an update here when I'm at least half sure it works.

Jules-A commented 1 year ago

Would it be possible to have a version that doesn't downscale but does a catmull pass first? I guess you could always make downscaling conditional on luma not being too far above native res?

Artoriuz commented 1 year ago

Downscaling luma is honestly pretty optional; it's only really done to increase the correlation between the two planes. What's really important is that it also serves as a denoising/smoothing step, which ends up helping the regression afterwards. You can use the shader without the downscaling step just fine, it just won't be as good.

You can just replace LUMA_LOWRES with LUMA after removing the first pass.

Jules-A commented 1 year ago

You can just replace LUMA_LOWRES with LUMA after removing the first pass.

That's what I originally did, but the results weren't as good as 9ad91ab, which didn't include a downscaling pass. While it was cheap cost-wise and looked better than catmull_rom in static shots, in motion catmull seemed quite a bit better.

Artoriuz commented 1 year ago

You can double check with this just to confirm the edited shader wasn't in a broken state: https://pastebin.com/uQ4tHgw5

Jules-A commented 1 year ago

You can double check with this just to confirm the edited shader wasn't in a broken state: https://pastebin.com/uQ4tHgw5

image

Looks like I did the exact same thing from the start

EDIT: Ah, it seems that without any upscalers (and no other shaders running) it does look better than the old version. Not sure why, but it just doesn't seem to combine well with upscalers.

Artoriuz commented 1 year ago

Ah, it seems without any upscalers (and no other shaders running) it does look better than the old version, not sure why but it just doesn't seem to combine well with upscalers.

Which one? What is the "new version" here?

Jules-A commented 1 year ago

Which one? What is the "new version" here?

I was referring to master without downscaling (basically https://pastebin.com/uQ4tHgw5) vs https://github.com/Artoriuz/glsl-chroma-from-luma-prediction/commit/9ad91ab1bc94dfe52a763755b15a02a7ecca4264

Artoriuz commented 1 year ago

I think it's because its built-in spatial filter is very sharp; it probably looks bad at 4x. I didn't really test anything at 4x, so it's 100% plausible that I made it worse while pursuing improvements at 2x.

Jules-A commented 1 year ago

Have you tried doing a 2nd pass? I seem to be getting comparable or often fewer artifacts when running a 2nd pass. I did quite a lot more experimenting today:

A whole bunch of tests.

Original (notice the reds have a dark outline): ![original](https://github.com/Artoriuz/glsl-chroma-from-luma-prediction/assets/1760158/6004d033-d1ef-4d4f-873c-f60379248a60)

fsrcnnx8+cfl_12_noDS+cfl_12_noDS_AR: ![fsrcnn+2xcfl12](https://github.com/Artoriuz/glsl-chroma-from-luma-prediction/assets/1760158/7ea1398c-1c24-4e49-8a57-a67474ae6105)

vs fsrcnnx8+cfl_12 (master): ![fsrcnn+cfl12](https://github.com/Artoriuz/glsl-chroma-from-luma-prediction/assets/1760158/5f4fffb2-eff9-4fe6-bc22-1326c12107fc)

vs fsrcnnx8+catmull: ![fsrcnn+catmull](https://github.com/Artoriuz/glsl-chroma-from-luma-prediction/assets/1760158/7f0162ea-8dfa-4761-ac3d-3498cd98db23)

vs fsrcnnx8+krigbillateral: ![fsrcnn+krigbillateral](https://github.com/Artoriuz/glsl-chroma-from-luma-prediction/assets/1760158/57a29e39-60b7-47dc-a338-8586784c38c6)

Enabling a 2nd pass makes it a little too sharp though, and ringing seems a little more noticeable, so I enabled AR in the 2nd pass. I tried to do a catmull first pass but couldn't work out how, though that's what I'd like. I tried fastbillateral+cfl12 (no DS) and the results were decent for the most part, but when there were artifacts they were very noticeable.

EDIT: Looks like I got it working (still based off the old code, but it should be catmull to native res then CfL to luma res): ![my attempt](https://github.com/Artoriuz/glsl-chroma-from-luma-prediction/assets/1760158/f3a2a50b-7c5e-415f-a22f-abb1d68bbb30)

Though combining it with the DS code makes it even better (but expensive): ![image](https://github.com/Artoriuz/glsl-chroma-from-luma-prediction/assets/1760158/8139348e-d4cb-45fe-99c4-b555128b0ecb)

Downscaling from native rather than source looks just as good (from quick tests) while costing quite a lot less: ![image](https://github.com/Artoriuz/glsl-chroma-from-luma-prediction/assets/1760158/d768dc16-0492-4d6d-a75d-41f53d09ff60)

DS from NATIVE with 4x scaler: ![downscaling from native](https://github.com/Artoriuz/glsl-chroma-from-luma-prediction/assets/1760158/db5c40f2-d704-47da-bfd1-315ff59dd920)

DS from CHROMA with 4x scaler: ![from chroma](https://github.com/Artoriuz/glsl-chroma-from-luma-prediction/assets/1760158/2db5f813-cd16-4b3d-b7f8-275aefbd34e0)

I didn't test 4tap or mixed since they have too many artifacts; it would have taken too long to test them properly.

So based on my tests I seem to have solved my problem with https://pastebin.com/raw/p6YQsrcT and will be switching to it, but it would be nice if you could check it and weigh in (of course, that will probably only work well with cscale=catmull_rom).

EDIT: Also works with master; it fixes the artifacts with reds on dark backgrounds as well as significantly dropping the amount of processing required. No idea what's going on though.

EDIT: Haha, never mind, it wasn't actually using CfL. I just assumed it was because it looked better, but it was just the downscaling + catmull, awkward...

EDIT: Actually, it turns out the cost of the shader I linked was still a bit too high (not because of the downscaler, I just don't have much headroom after a 4x scale). Ironically, the downscaler + catmull is pretty good and super cheap, so I'll use that instead for 1080p.

Artoriuz commented 1 year ago

Just to summarise what I was trying to do here: the downscaler currently in master uses a window sized at twice the scaling factor, which gets way too fucking big at 4x (a weighted average of 64 pixels). I tried reducing it to one times the scaling factor (while also rewriting it into an easier-to-understand/maintain, relative-position version), but that reduces its quality a little, and I'm not sure it's worth it since performance at 2x (the normal scenario) seems fine with the bigger window (16 pixels rather than 4).

Admittedly, the correct thing to do here would be orthogonal downscaling since it's much faster, but I'm also getting slightly worse results with it... I'm leaving this open because I still have some hope I'll find a way to make it faster without sacrificing quality too much.
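As a back-of-the-envelope sketch of the tap counts above (assuming a window side of 2x the scaling factor, as in master, and that orthogonal/separable filtering splits the 2D window into two 1D passes):

```python
def taps_2d(factor, window_scale=2):
    # A single-pass 2D downscaler reads a (window_scale * factor)^2
    # neighbourhood for each output pixel.
    side = window_scale * factor
    return side * side

def taps_separable(factor, window_scale=2):
    # Orthogonal downscaling: a horizontal pass then a vertical pass,
    # each reading only `side` samples per output pixel.
    side = window_scale * factor
    return 2 * side

assert taps_2d(2) == 16        # the "16 pixels" case at 2x
assert taps_2d(4) == 64        # the "64 pixels" case at 4x
assert taps_separable(4) == 16 # why orthogonal is much faster at 4x
```

This is why the 2D window explodes quadratically with the factor while the separable version only grows linearly.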

As for https://pastebin.com/raw/p6YQsrcT, you can safely use it with any cscale, as the cscale acts as a fallback of sorts when the regression is known to be bad. I was running it with lanczos but I don't see why it wouldn't work with catrom. The change of hooking chroma instead of native and doing the spatial resampling myself was done mainly for performance reasons as that avoids having to load these pixels twice in different passes. I don't have access to my desktop right now, so I can't benchmark this on a real GPU, but it does make a decent difference on my laptop (Whiskey Lake iGPU, very very weak).

I could perhaps make a different variant aimed at 4x entirely for people who want to use it alongside luma doublers, but I guess that would be a different request.

Jules-A commented 1 year ago

The change of hooking chroma instead of native and doing the spatial resampling myself was done mainly for performance reasons as that avoids having to load these pixels twice in different passes.

Umm... the variant I linked, when testing with no luma doubler, is 1020 μs average frametime vs 985... with fsrcnnx8 it's faster, 3450 vs 4420... Though that is with catmull_rom, which is like 2x faster than lanczos and also looks better than lanczos.

I actually made a really weird discovery: combining Krigbillateral's downscaling part (a bit too soft on its own) with the one you're using, plus catmull_rom chroma scaling, gives very good results and is fast, at just 3070 with fsrcnnx8. EDIT: catmull doesn't play nice with those shaders and already-sharp content, so I switched to kaiser, which is actually slightly cheaper too.

Artoriuz commented 1 year ago

@Jules-A Final request, can you test this? https://pastebin.com/AfpT3KHe

It should be much faster now with orthogonal downsampling, but I'm interested in seeing whether you can reproduce your artifacts when you use it alongside your other shaders.

Jules-A commented 1 year ago

It should be much faster now with orthogonal downsampling, but I'm interested in seeing whether you can reproduce your artifacts when you use it alongside your other shaders.

Yeah it's a lot faster than master now: newtest

vs Master: master

It pretty much makes the issue with artifacts that look like object z-fighting almost unnoticeable (they appear for only 1 or 2 frames), so if you aren't looking for it you probably won't notice. However, it does nothing to improve the reds on dark surfaces (which actually don't seem to occur in the older versions without the downscaling).

The mashup of the old version is still a lot faster: oldmixup

and this is with my current shader set: myshaders

Artoriuz commented 1 year ago

Alright, thanks!

One problem solved then. I think quality in general took a small hit but it shouldn't be noticeable.

Jules-A commented 1 year ago

One problem solved then. I think quality in general took a small hit but it shouldn't be noticeable.

Can't decide if it's better or worse; it does some stuff better, but it appears to be maybe a tad too sharp and is thinning edges a bit.

Jules-A commented 1 year ago

Looks pretty good with the older code, scaling from native with kaiser instead, which is also quite a bit faster than the newest version in master (3300 vs 3650 at 2x): mpv-shot0001

vs cc15498: mpv-shot0002

Both of those versions happen to have artifacts on the credits (red in the old one, blue in the mix).

It's yellow and blue in current master, but it doesn't seem to be the downscaling code exactly, since I don't see it when only using the downscaling code.

Artoriuz commented 1 year ago

Don't worry, I'll probably keep refining it as time passes. I closed the issue because the performance issue is gone.

Artoriuz commented 1 year ago

Also, the artifacts in the white characters are interesting. I think the black outline is screwing up the linear regression but maybe there's something I can do about this.

Edit: Downloaded the source and I'm 99% confident this is an issue in the source itself, as it contains these coloured pixels where it's supposed to be white. The shader makes it worse though. In any case, I'm not really seeing anything to the extent of your screenshots, it's all very subtle and mostly invisible with the moving credits.

mpv-shot0001

Gonna try with the exact same episode later (SauceNAO tells me this is episode 11, can you confirm?)

Jules-A commented 1 year ago

Edit: Downloaded the source and I'm 99% confident this is an issue in the source itself

Surprised you could tell the source from just the images. It was AMAIM Warrior at the Borderline, S1E18, but I downloaded it a while ago from CR and am only getting around to watching it now. It's possible they re-uploaded the source, or the dub (the version I'm watching) is different.

It doesn't happen with my current shaders, which have changed again to using Krig's downscaling only and your latest downscaling code (set to hook NATIVE instead of CHROMA), with hamming for cscale. It also doesn't happen when using any of the native cscalers or krigbillateral.

Jules-A commented 1 year ago

This is what it looks like with my current shaders:

mpv-shot0001

Kaiser was way better in this title, but with your latest downscaling code it was thinning too much in others.

Artoriuz commented 1 year ago

Surprised you could tell the source from just the images

I couldn't, I got the answer from SauceNAO =p

Edit: I downloaded episode 18 in better quality and I can reproduce. I think fixing this is going to be a bit tricky, since theoretically speaking the regression is doing what it should. After downsampling, some of the black pixels surrounding the white pixels will end up with bluish chroma (which doesn't matter since the pixel is black anyway). This, however, can make the regression go in the opposite direction for gray/white pixels (white should have chroma around ~0.5 for both planes, so if black is blue and gray is around 0.5, that tells the regression it should go yellow for white).

Artifacts of other colours happen due to different colours in the background that aren't coming from the blue sky.

Without downsampling you still get this, for the same reason, albeit in a much more random fashion (random chroma in the black pixels? This could also be floating-point shenanigans, so who knows really).
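The failure mode described above can be reproduced with a toy least-squares fit (all values below are hypothetical: two dark outline pixels assumed to have picked up bluish chroma, Cb > 0.5, plus neutral gray pixels):

```python
def linfit(xs, ys):
    """Ordinary least-squares slope and intercept (the same shape of
    regression CfL uses to predict chroma from luma)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical local window: black outline pixels with bluish Cb,
# gray pixels at the neutral 0.5.
luma = [0.05, 0.05, 0.5, 0.6]
cb   = [0.60, 0.62, 0.5, 0.5]

slope, intercept = linfit(luma, cb)
pred_white = slope * 1.0 + intercept  # predicted Cb for a white pixel

assert slope < 0         # blue-at-black drags the slope negative
assert pred_white < 0.5  # white gets pushed below neutral, i.e. toward yellow
```

With blue chroma at low luma and neutral chroma at mid luma, the fitted line slopes downward, so extrapolating to white luma lands below 0.5 Cb, exactly the yellow tint described.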

I suggest opening a different issue for this as it's a different problem. The last commit fixes the issue (it's still a bit visible if you zoom in, but increasing the limit any more than that hurts quality elsewhere).