Use gather for better performance

deus0ww commented 1 year ago

Possibly something like this:

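    // Each gather returns a 2x2 texel footprint in textureGather order:
    // x = (i0, j1), y = (i1, j1), z = (i1, j0), w = (i0, j0).
    const ivec2 gatherOffsets[4] = ivec2[](ivec2( 0, 0), ivec2( 2, 0), ivec2( 0, 2), ivec2( 2, 2));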
    vec4 gatherUA = HOOKED_gather(vec2((fp + gatherOffsets[0]) * HOOKED_pt), 0);
    vec4 gatherUB = HOOKED_gather(vec2((fp + gatherOffsets[1]) * HOOKED_pt), 0);
    vec4 gatherUC = HOOKED_gather(vec2((fp + gatherOffsets[2]) * HOOKED_pt), 0);
    vec4 gatherUD = HOOKED_gather(vec2((fp + gatherOffsets[3]) * HOOKED_pt), 0);
    vec4 gatherVA = HOOKED_gather(vec2((fp + gatherOffsets[0]) * HOOKED_pt), 1);
    vec4 gatherVB = HOOKED_gather(vec2((fp + gatherOffsets[1]) * HOOKED_pt), 1);
    vec4 gatherVC = HOOKED_gather(vec2((fp + gatherOffsets[2]) * HOOKED_pt), 1);
    vec4 gatherVD = HOOKED_gather(vec2((fp + gatherOffsets[3]) * HOOKED_pt), 1);
    vec4 gatherYA = LUMA_LOWRES_gather(vec2((fp + gatherOffsets[0]) * HOOKED_pt), 0);
    vec4 gatherYB = LUMA_LOWRES_gather(vec2((fp + gatherOffsets[1]) * HOOKED_pt), 0);
    vec4 gatherYC = LUMA_LOWRES_gather(vec2((fp + gatherOffsets[2]) * HOOKED_pt), 0);
    vec4 gatherYD = LUMA_LOWRES_gather(vec2((fp + gatherOffsets[3]) * HOOKED_pt), 0);

    // Keep 12 of the 16 gathered texels; the four outer corners
    // (A.w, B.z, C.x, D.y) are unused.
    vec2 chroma_pixels[12];
    chroma_pixels[0]  = vec2(gatherUA.z, gatherVA.z);
    chroma_pixels[1]  = vec2(gatherUB.w, gatherVB.w);
    chroma_pixels[2]  = vec2(gatherUA.x, gatherVA.x);
    chroma_pixels[3]  = vec2(gatherUA.y, gatherVA.y);
    chroma_pixels[4]  = vec2(gatherUB.x, gatherVB.x);
    chroma_pixels[5]  = vec2(gatherUB.y, gatherVB.y);
    chroma_pixels[6]  = vec2(gatherUC.w, gatherVC.w);
    chroma_pixels[7]  = vec2(gatherUC.z, gatherVC.z);
    chroma_pixels[8]  = vec2(gatherUD.w, gatherVD.w);
    chroma_pixels[9]  = vec2(gatherUD.z, gatherVD.z);
    chroma_pixels[10] = vec2(gatherUC.y, gatherVC.y);
    chroma_pixels[11] = vec2(gatherUD.x, gatherVD.x);

    float luma_pixels[12];
    luma_pixels[0]    = gatherYA.z;
    luma_pixels[1]    = gatherYB.w;
    luma_pixels[2]    = gatherYA.x;
    luma_pixels[3]    = gatherYA.y;
    luma_pixels[4]    = gatherYB.x;
    luma_pixels[5]    = gatherYB.y;
    luma_pixels[6]    = gatherYC.w;
    luma_pixels[7]    = gatherYC.z;
    luma_pixels[8]    = gatherYD.w;
    luma_pixels[9]    = gatherYD.z;
    luma_pixels[10]   = gatherYC.y;
    luma_pixels[11]   = gatherYD.x;
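
For GPUs without gather support, a fallback along these lines might work. This is only a sketch: gather_or_fetch is a hypothetical helper that is not part of the shader, and it assumes HOOKED_gather is a preprocessor macro that mpv defines only when gather is available, so that #ifdef can detect it.

    // Hypothetical helper (not in the original shader): returns the 2x2
    // footprint around pos in textureGather order, with or without gather.
    vec4 gather_or_fetch(vec2 pos, int c) {
    #ifdef HOOKED_gather
        return HOOKED_gather(pos, c);
    #else
        // pos sits on a texel corner, so +/- half a texel lands on texel centres.
        vec2 h = 0.5 * HOOKED_pt;
        vec4 t01 = HOOKED_tex(pos + vec2(-h.x,  h.y)); // (i0, j1)
        vec4 t11 = HOOKED_tex(pos + vec2( h.x,  h.y)); // (i1, j1)
        vec4 t10 = HOOKED_tex(pos + vec2( h.x, -h.y)); // (i1, j0)
        vec4 t00 = HOOKED_tex(pos + vec2(-h.x, -h.y)); // (i0, j0)
        return vec4(t01[c], t11[c], t10[c], t00[c]);   // x, y, z, w
    #endif
    }

With it, gatherUA would become gather_or_fetch(vec2((fp + gatherOffsets[0]) * HOOKED_pt), 0). The LUMA_LOWRES gathers would need an equivalent helper built on LUMA_LOWRES_tex.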

deus0ww commented 1 year ago

Edit 1: I cleaned up the code a bit. On an M1 Mac, this version speeds up the last shader pass by about 20-25%. It also works well with MimeBilateral.

Edit 2: switched CHROMA* to HOOKED*.

Jules-A commented 1 year ago

Can gather or compute be used for the downscaling passes also?

Currently I'm doing this for speed: https://pastebin.com/raw/si8RdrED. Basically, it only calculates every 2nd pixel but scales to Luma. For whatever reason, scaling to Luma on the first pass seems to make the biggest difference, even for width (which is currently set to just Chroma in master). NOTE: I'm not actually using the upscaling code at all, since there are still too many issues with reds and it's still heavy enough for my use that I need to lower luma scaling. This causes the inbuilt chroma scalers to perform better, but I'm still unsure why.

EDIT: Okay nvm, looks like interpolation wouldn't really help in this situation, it was just a coincidence... Lol okay... Even doing something like this is way better than not doing any: https://pastebin.com/raw/gSSBnkaJ

Artoriuz / glsl-chroma-from-luma-prediction

Use gather for better performance #6