LizardByte / Sunshine

Self-hosted game stream host for Moonlight.
http://app.lizardbyte.dev/Sunshine/
GNU General Public License v3.0

Support recombined YUV 4:4:4 encoding (Prototype, Windows-only for now) #2760

Open ns6089 opened 1 week ago

ns6089 commented 1 week ago

Description

The continuation of https://github.com/LizardByte/Sunshine/pull/2533. It's possible to emulate YUV 4:4:4 on GPUs that don't support it natively by doubling the YUV 4:2:0 pixel count and running custom recombination shaders on both the encoding and decoding sides, like Microsoft does in MS-RDPEGFX.

Prototype stage. Requires changes on Moonlight's side: I currently have a custom libplacebo mpv shader implemented for the plvk backend; in the future it should be possible to add Direct3D11 and OpenGL shaders.

https://github.com/ns6089/Sunshine/compare/yuv444..yuv444in420

moonlight-common-c pull request: TBD
moonlight-qt pull request: TBD, testing branch https://github.com/ns6089/moonlight-qt/tree/yuv444in420
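To make the size tradeoff concrete (my own back-of-the-envelope arithmetic, not something stated in the PR): a 4:2:0 frame with one dimension doubled carries exactly as many samples as the original 4:4:4 frame, so the emulation doesn't inflate the raw data the encoder has to process.

```python
# Illustrative sample-count arithmetic for the 4:4:4-in-4:2:0 scheme.
# Function names are mine, not from the PR.

def samples_444(w, h):
    # Y, U, V all at full resolution.
    return 3 * w * h

def samples_420(w, h):
    # Full-res Y plus two quarter-res chroma planes.
    return w * h * 3 // 2

# Doubling the height of the 4:2:0 frame recovers the 4:4:4 budget.
w, h = 1920, 1080
assert samples_420(w, 2 * h) == samples_444(w, h)
```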

What works and what doesn't

  1. First prototype: left half of the U_src and V_src planes in Y_out. Good DCT, bad motion compensation.
  2. Second prototype: U_src in Y_out. V_src is spread across U_out and V_out in a pattern that is spatially consistent with Y_out. Good motion compensation, relatively fat DCT on U_out and V_out due to high frequencies.
  3. Third prototype (TBD): might slightly improve the DCT by running 1/4 of V through an averaging low-pass filter.


mirh commented 6 days ago

Awesomely crazy. Could there be anything worth doing with an emulated 4:2:2 stream then? Like, I don't know, slightly lower recombination overhead, or lower bandwidth requirements?

Or perhaps not hitting encoding limits at higher resolutions. Like, is my understanding correct that this pixel doubling would not allow for 1440p on (say) older VCE versions that max out at 4K?

ns6089 commented 5 days ago

I don't think anyone but Intel supports 4:2:2 encoding. About the 4K limit: 1440p might still work depending on how exactly said limit is implemented, since the overall pixel count stays within the 4K range.

mirh commented 5 days ago

I don't think anyone but Intel supports 4:2:2.

To be honest, I was thinking more of TVs than computers here. It's a mixed bag even there, but still, it's not that rare. But now that you mention PCs: decoding is much lighter on the CPU than encoding, so I don't think that would usually be a deal breaker. Or couldn't the client-side recombination just pretend to be 4:4:4 then? Or would whatever empty padding you add ruin the image more than plain 4:2:0 would?

About 4K limit, 1440p might still work depending on how exactly said limit is implemented, the overall pixel count stays within 4K range.

You mean if the limit is actually implemented as 4096x2160 (usual on older AMD) vs 4096x4096 (usual on older NVIDIA)? Or can you really call it a day as long as the supported total pixel count, whatever the "shape", is 7,372,800 (2560x1440x2) or more?
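The arithmetic behind that question can be spelled out (illustrative numbers only; whether a given encoder enforces a per-dimension cap or a total-pixel cap is exactly what's being asked):

```python
# 4:4:4-in-4:2:0 doubles one dimension, so 1440p becomes a 2560x2880 frame.
def pixels(w, h):
    return w * h

doubled_1440p = pixels(2560, 1440 * 2)
assert doubled_1440p == 7_372_800

# Under a total-pixel cap of 4096*2160 = 8,847,360 it fits...
assert doubled_1440p <= pixels(4096, 2160)

# ...but a hard per-dimension limit of 2160 rows would reject
# the 2880-row frame regardless of total pixel count.
assert 1440 * 2 > 2160
```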

ns6089 commented 5 days ago

You can't encode 4:2:2 on NVIDIA GPUs; implementing a path exclusively for Intel would be too expensive.

Or can you really call it a day just as long as the supported total pixel count, whatever the "shape", is 7.372.800 (2560x1440x2) or more?

I'm already calling it a day :sunglasses: Doubling one dimension makes it possible to minimize discontinuities in motion estimation, in contrast to tiling. The current half-naive implementation, for example, has a single vertical motion estimation "seam" in the U and V planes.

    //     Y       U     V
    // +-------+ +---+ +---+
    // |       | |   | |   |
    // |   Y   | |UR | |VR |
    // |       | |   | |   |
    // +---+---+ +---+ +---+
    // |   |   |
    // |UL |VL |
    // |   |   |
    // +---+---+
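The layout in the diagram above can be sketched as an index mapping. This is a hedged plain-Python illustration of the packing, not the actual shader code; the function names are mine:

```python
# Sketch of the "half-naive" packing from the diagram: a w x h 4:4:4
# frame becomes a w x 2h 4:2:0 frame. Y_out holds Y on top and the left
# halves of U and V (UL | VL) below; U_out and V_out (each w/2 x h,
# the chroma planes of the doubled frame) hold the right halves UR, VR.

def pack_444_into_420(y, u, v):
    """Pack full-resolution Y, U, V planes (h rows x w cols, w even)."""
    h, w = len(y), len(y[0])
    half = w // 2
    y_out = [row[:] for row in y]                 # top: original Y
    for r in range(h):
        y_out.append(u[r][:half] + v[r][:half])   # bottom: UL | VL
    u_out = [u[r][half:] for r in range(h)]       # UR
    v_out = [v[r][half:] for r in range(h)]       # VR
    return y_out, u_out, v_out

def unpack_420_into_444(y_out, u_out, v_out):
    """Inverse mapping, i.e. the client-side recombination step."""
    h = len(y_out) // 2
    half = len(u_out[0])
    y = [row[:] for row in y_out[:h]]
    u = [y_out[h + r][:half] + u_out[r] for r in range(h)]
    v = [y_out[h + r][half:] + v_out[r] for r in range(h)]
    return y, u, v
```

The mapping is lossless by construction: packing followed by unpacking returns the original planes, and the single horizontal split of U and V is what produces the one vertical seam mentioned above.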

mirh commented 5 days ago

You can't encode 4:2:2 on nvidia gpus, implementing a path exclusively for intel will be too expensive.

You can't encode 4:4:4 on AMD GPUs either, and yet that is what this PR is about, isn't it?

ns6089 commented 5 days ago

Personally, I don't see a point in supporting recombination into 4:2:2. It would still have visible artifacts while having computational overhead close to 4:4:4, plus a significant amount of additional development time. And that development time would be multiplied by the number of distinct clients.

mirh commented 4 days ago

I mean, sure, of course this is already miraculous. I was just trying to think outside the box (4:2:2 is still subpar, but even its worst-case scenario starts to be bearable).

If anything, I guess the improvement isn't that clear-cut, because unlike with a direct cable connection there are already compression artifacts anyway. So if 4:4:4 couldn't fit in some doubled-4:2:0 4K scenario, just lowering the resolution could also be a possible (if not exactly immediate) alternative?