Closed: haasn closed this issue 6 years ago
The performance improvement that a compute-shader-based `nnedi3` brings is going to be minimal, even less than what the `textureGatherOffset` change brought. What `nnedi3` does is basically fetch 32 samples (or 48 for the 8x6 variant) and feed them into a huge neural network. With the `textureGatherOffset` change, the number of texture sampling calls was reduced from 32 to 8. With a compute shader, it would be further reduced from 8 to about 4 (the largest sampling count among all threads in a workgroup), plus barrier overhead. Not so impressive, especially for luma upscaling (the most commonly used mode; it's much faster to process only one plane with `nnedi3`).
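As a rough illustration of the fetch counts discussed above, here is a back-of-envelope sketch in Python. The window size comes from the comment (8x4 = 32 samples); the 2x2 quad per `textureGather` call is standard GLSL behavior, while the per-workgroup estimate of ~4 is bjin's figure, not derived here:

```python
# Fetch-count arithmetic for the nnedi3 8x4 sampling window.
# textureGather-style fetches return a 2x2 quad per call.

window_samples = 8 * 4           # 32 samples needed per output pixel
gather_quad = 4                  # one textureGather call yields 4 texels

plain_fetches = window_samples                  # one texture() call per sample
gather_fetches = window_samples // gather_quad  # 32 / 4 = 8 gather calls

print(plain_fetches, gather_fetches)  # 32 8
```

With a compute shader, neighboring threads in a workgroup could additionally share fetched texels through shared memory, which is where the further reduction to ~4 calls per thread would come from.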
With a compute shader, we could also use just two passes and save the overhead of intermediate textures (like the RAVU compute shader did), but the benefit is also going to be minimal. The bottleneck is the neural network part.
So, I don't plan to port `nnedi3` to a compute shader, at least not in the near future.
@bjin But doesn't NNEDI3 benefit a lot from the four-pass change? And couldn't we merge those passes into one with CS?
Edit: Ah, never mind, you already mentioned that.
Incidentally, I realized why NNEDI3 benefits so much from the four-pass variant. It's stupidly simple: GPUs are SIMD. That means, there's no such thing as an “early return” unless the entire warp (32/64 threads) returns early at the same time.
NNEDI3's decision logic was highly regular - every second pixel passed the test. So in the end, none of the threads were able to actually return early: you still did the same amount of work, but only got 16/32 usable threads' worth of results out of it. This was a fantastically disgusting waste of resources.
The 4-pass version benefits so much because it calculates 64 non-interpolating threads (fast) at the same time followed by 64 interpolating threads (efficient) at the same time. A branchless compute shader version would have the same benefits, but additionally the potential for some work sharing.
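This lockstep effect can be modeled with a toy simulation, assuming a warp's cost is the cost of its slowest lane. The cost numbers (1 for the cheap non-interpolating path, 100 for the neural network) are made up for illustration; only the scheduling logic mirrors the argument above:

```python
# Toy lockstep-execution model: SIMD lanes cannot retire a branch
# independently, so a warp pays the cost of its slowest lane.

WARP = 32
CHEAP, EXPENSIVE = 1, 100  # arbitrary relative costs

def warp_cost(lane_costs):
    """All lanes run in lockstep; the warp pays the max lane cost."""
    return max(lane_costs)

# Mixed warps: checkerboard pattern, every second lane takes the NN path.
# 32 such warps produce 1024 results, each warp paying the expensive path.
mixed = [CHEAP if i % 2 == 0 else EXPENSIVE for i in range(WARP)]
mixed_total = 32 * warp_cost(mixed)

# Split scheduling (the 4-pass idea): 16 all-cheap warps plus
# 16 all-expensive warps produce the same 1024 results.
split_total = 16 * warp_cost([CHEAP] * WARP) + 16 * warp_cost([EXPENSIVE] * WARP)

print(mixed_total, split_total)  # 3200 1616
```

Under these assumed costs the split schedule does roughly half the work, which matches the observed speedup pattern: the expensive path dominates, and the mixed schedule runs it for every warp.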
> Incidentally, I realized why NNEDI3 benefits so much from the four-pass variant. It's stupidly simple: GPUs are SIMD. That means, there's no such thing as an “early return” unless the entire warp (32/64 threads) returns early at the same time.
Yes, I realized this when I was improving the first iteration of RAVU. The two passes took the same amount of time, while in "theory" the second pass should have been two times slower (like `superxbr`, the first iteration of RAVU upscaled the texture by interpolating only 1/4 of all pixels in the first pass and 1/2 in the second pass). This was also about the first time `mpv-stats` got per-pass rendering stats, so I didn't know that before.
I also changed NNEDI3 prescaling to be separated later, just to make a compute shader easier to implement. NNEDI3 still uses four passes, but instead of interpolating three textures and merging them in the last pass, it now interpolates in the two directions separately: interpolate in one pass, merge in the second pass. Incidentally, this also makes sampling a little bit faster, presumably because of data locality (sampling is done from `HOOKED` only after the change).
I forgot whether we tested this or not.
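The direction-separated pass structure described above can be sketched loosely in Python. Plain averaging stands in for the NNEDI3 network, and the function names are made up; only the pass layout (double one dimension per pass instead of producing offset textures and merging at the end) mirrors the description:

```python
# Sketch of direction-separated 2x upscaling: pass 1 doubles the height,
# pass 2 doubles the width. Averaging is a stand-in for the NN predictor.

def interp_rows(img):
    """Pass 1: insert an interpolated row after each source row."""
    out = []
    for a, b in zip(img, img[1:] + [img[-1]]):  # clamp at the bottom edge
        out.append(a)
        out.append([(x + y) / 2 for x, y in zip(a, b)])
    return out

def interp_cols(img):
    """Pass 2: insert an interpolated column after each source column."""
    out = []
    for row in img:
        new_row = []
        for x, y in zip(row, row[1:] + [row[-1]]):  # clamp at the right edge
            new_row.extend([x, (x + y) / 2])
        out.append(new_row)
    return out

src = [[0.0, 1.0], [1.0, 0.0]]
dst = interp_cols(interp_rows(src))
print(len(dst), len(dst[0]))  # 4 4
```

The data-locality point follows from this layout: each pass reads only one texture (the hooked input, or the previous pass's output) rather than gathering from three separately offset textures in a final merge.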