bjin / mpv-prescalers

prescalers for mpv, as user shaders
GNU Lesser General Public License v3.0

Compute shader version of NNEDI3? #11

Closed haasn closed 6 years ago

haasn commented 7 years ago

I forgot whether we tested this or not

bjin commented 7 years ago

The performance improvement from a compute shader based nnedi3 is going to be minimal, even smaller than what the textureGatherOffset change brought. What nnedi3 does is basically fetch 32 samples (or 48 for the 8x6 variant) and feed them into a huge neural network. With the textureGatherOffset change, the number of texture sampling calls was reduced from 32 to 8. With a compute shader, it would be further reduced from 8 to about 4 (the largest sampling count across all threads in a workgroup), plus barrier overhead. Not very impressive, especially for luma upscaling (the most commonly used mode, since it's much faster to run nnedi3 on only one plane).
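As a rough sketch of that arithmetic (the workgroup and window sizes below are illustrative assumptions, not mpv's actual configuration), the per-thread fetch count for a cooperatively loaded shared-memory tile can be estimated like this — the exact numbers depend on workgroup shape and load scheduling, so this only shows the shape of the saving, not the exact figures from the thread:

```python
import math

def fetches_per_thread(wg_w, wg_h, win_w, win_h, gather=False):
    """Estimate texture fetches per thread when a wg_w x wg_h workgroup
    cooperatively loads the input tile (window plus halo) that all of
    its threads need into shared memory."""
    tile = (wg_w + win_w - 1) * (wg_h + win_h - 1)  # texels in the tile
    if gather:
        tile = math.ceil(tile / 4)  # textureGather returns a 2x2 quad per call
    return math.ceil(tile / (wg_w * wg_h))

# nnedi3's 8x4 sampling window, with a hypothetical 8x8 workgroup:
print(fetches_per_thread(8, 8, 8, 4))               # → 3
print(fetches_per_thread(8, 8, 8, 4, gather=True))  # → 1
```

The point is only that cooperative loading amortizes the overlapping windows across the workgroup; the neural-network arithmetic per pixel is untouched, which is why it remains the bottleneck.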

With a compute shader, we could also use just two passes and save the overhead of intermediate textures (as the RAVU compute shader did), but the benefit is also going to be minimal. The bottleneck is the neural network evaluation.

So, I don't plan to port nnedi3 to a compute shader, at least not in the near future.

haasn commented 7 years ago

@bjin But doesn't NNEDI3 benefit a lot from the four-pass change? And couldn't we merge those passes into one with CS?

Edit: Ah, never mind, you already mentioned that.

haasn commented 7 years ago

Incidentally, I realized why NNEDI3 benefits so much from the four-pass variant. It's stupidly simple: GPUs are SIMD. That means there's no such thing as an “early return” unless the entire warp (32/64 threads) returns early at the same time.

NNEDI3's decision logic was highly regular - every second pixel passed the test. So in the end, none of the threads could actually return early: you did the same amount of work but got only 16 of 32 threads' worth of usable results out of it. This was a fantastically disgusting waste of resources.

The 4-pass version benefits so much because it calculates 64 non-interpolating threads (fast) at the same time followed by 64 interpolating threads (efficient) at the same time. A branchless compute shader version would have the same benefits, but additionally the potential for some work sharing.
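The divergence argument above can be sketched with a toy cost model (warp size and per-path costs here are made-up numbers): a warp runs as long as its slowest lane, so interleaving fast and slow pixels makes every warp pay the slow cost, while grouping them into separate batches lets half the warps finish fast.

```python
def cost(pixel_is_slow, warp=32, slow_cost=10, fast_cost=1):
    # SIMD model: each warp pays the cost of its slowest lane, because
    # no lane can return early unless the whole warp does.
    total = 0
    for i in range(0, len(pixel_is_slow), warp):
        lanes = pixel_is_slow[i:i + warp]
        total += slow_cost if any(lanes) else fast_cost
    return total

n = 1024
interleaved = [i % 2 == 0 for i in range(n)]  # every second pixel is slow
separated = sorted(interleaved)               # fast pixels first, then slow

print(cost(interleaved))  # → 320 (all 32 warps hit the slow path)
print(cost(separated))    # → 176 (16 fast warps + 16 slow warps)
```

This mirrors the 4-pass restructuring: it doesn't reduce the total interpolation work, it just sorts the work so that fast and slow lanes no longer share a warp.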

bjin commented 6 years ago

> Incidentally, I realized why NNEDI3 benefits so much from the four-pass variant. It's stupidly simple: GPUs are SIMD. That means, there's no such thing as an “early return” unless the entire warp (32/64 threads) returns early at the same time.

Yes, I realized this when I was improving the first iteration of RAVU. The two passes took the same amount of time, while in "theory" the second pass should have been two times slower (like superxbr, the first iteration of RAVU upscaled the texture over two passes, but interpolated only 1/4 of all pixels in the first pass and 1/2 in the second). This was also around the time mpv's stats page first got per-pass rendering times, so I hadn't noticed it before.

I also later changed NNEDI3 prescaling to be separated, just to make a compute shader easier to implement. NNEDI3 still uses four passes, but instead of interpolating three textures and merging them in the last pass, it now interpolates in each direction separately: interpolate in one pass, merge in the next. Incidentally, this also makes sampling a little faster, presumably because of data locality (after the change, sampling is done only from HOOKED).
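A toy model of that separated pass layout (plain neighbour averaging stands in for the real NNEDI3 neural-network predictor, and the function names here are illustrative, not mpv hook names): each direction gets its own interpolate pass followed by a merge pass that interleaves known and predicted lines.

```python
import numpy as np

def interpolate_rows(img):
    # Predict the missing in-between rows. Toy stand-in: average each
    # row with its lower neighbour (the real predictor is a neural net).
    below = np.vstack([img[1:], img[-1:]])
    return (img + below) / 2

def merge_rows(known, predicted):
    # Interleave known and predicted rows -> doubled height.
    h, w = known.shape
    out = np.empty((2 * h, w), dtype=known.dtype)
    out[0::2] = known
    out[1::2] = predicted
    return out

def prescale(img):
    img = merge_rows(img, interpolate_rows(img))          # passes 1+2: vertical
    img = merge_rows(img.T, interpolate_rows(img.T)).T    # passes 3+4: horizontal
    return img

x = np.arange(16.0).reshape(4, 4)
y = prescale(x)
print(y.shape)  # → (8, 8)
```

The original samples survive untouched at the even coordinates (`y[0::2, 0::2] == x`), which is the invariant the merge passes maintain; only the odd rows/columns are predicted.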