[dxvk] Optimize for the d3d9 Strict float emulation path

Blisto91 commented 6 months ago

The d3d9 Strict float emulation path in dxvk (see links below for technical description) is not enabled by default for all drivers, even though it is more correct, because it has a performance penalty compared to the default True. Radv and now also nvk have code to optimize for this and so will both use Strict out of the box without any performance penalty and with more games functioning out of the box without visual issues.

Amdvlk currently doesn't do this and so will either have a performance penalty in any games where dxvk sets Strict by default or risk visual issues in any games where such builtin configs don't exist yet. A couple of examples illustrating the performance dip can be seen below. Note that these games were chosen at random and are not meant to be worst-case scenarios. Also note that my test setup is pretty high end to begin with (RX6800 and 7950x) and so does not represent a typical one.

Risen

`d3d9.floatEmulation = True` ![emulation-true](https://github.com/GPUOpen-Drivers/AMDVLK/assets/47954800/ff37d4c5-3909-4ec7-8048-9129cdafcefc) `d3d9.floatEmulation = Strict` ![emulation-strict](https://github.com/GPUOpen-Drivers/AMDVLK/assets/47954800/140f1c57-a3c2-4915-9df9-62b18ef8b3e6)

Dragon's Dogma

`d3d9.floatEmulation = True` ![dragon-emulation-true](https://github.com/GPUOpen-Drivers/AMDVLK/assets/47954800/c8c80aa1-4f03-4ff6-a3b8-200fb5d81afc) `d3d9.floatEmulation = Strict` ![dragon-emulation-strict](https://github.com/GPUOpen-Drivers/AMDVLK/assets/47954800/9b0832f0-9ed0-49a0-9f09-1df751fbbc70)

See original dxvk PR https://github.com/doitsujin/dxvk/pull/2294

See also radv MR https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13436

ruiminzhao commented 4 months ago

@Blisto91 Thanks for your comments. Now I'm investigating this issue on Amdvlk. A question here: Do you have any SPIRV generated when "d3d9.floatEmulation = True" or "d3d9.floatEmulation = Strict" ? I want to know if this setting will be reflected in the shader. Then I can do the optimization according to the related flag in SPIRV. Thanks.

Blisto91 commented 4 months ago

Hi there and thank you for the response. I am not personally skilled in this area but I have asked the dxvk devs for assistance when they have a bit of time.

DadSchoorse commented 3 months ago

I see https://github.com/GPUOpen-Drivers/llpc/commit/e91a935d9e3ae526f4cd8044659609ba8daa858b added an optimization for `((b==0.0 ? 0.0 : a) * (a==0.0 ? 0.0 : b))`. But dxvk also emits `fma((b==0.0 ? 0.0 : a), (a==0.0 ? 0.0 : b), c)`. So unless llpc lowers fma to mul+add, you should also add a pattern that optimizes the fma version to `v_fma_legacy_f32`/`v_mad_legacy_f32`. And depending on whether you run constant folding before the optimizations, you also want to handle the case where the comparison+select was optimized away for one mul operand, `a * (a==0.0 ? 0.0 : b)` / `fma(a, (a==0.0 ? 0.0 : b), c)`, if b is not constant zero.
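The equivalence behind this pattern can be checked numerically. Below is a minimal Python sketch, assuming that the legacy multiply follows the D3D9 rule that zero times anything (including infinity and NaN) is zero. The names `mul_legacy`, `strict_mul`, `strict_fma` and `agree_everywhere` are made up for illustration and are not dxvk or llpc APIs, and the single rounding of a real fused multiply-add is ignored.

```python
import math

# Inputs that exercise the interesting corner cases.
SAMPLES = [0.0, -0.0, 1.5, -2.0, math.inf, -math.inf, math.nan]


def mul_legacy(a: float, b: float) -> float:
    """Assumed D3D9 rule: if either operand is zero, the product is zero."""
    if a == 0.0 or b == 0.0:
        return 0.0
    return a * b


def strict_mul(a: float, b: float) -> float:
    """The select-based form: (b==0.0 ? 0.0 : a) * (a==0.0 ? 0.0 : b)."""
    lhs = 0.0 if b == 0.0 else a
    rhs = 0.0 if a == 0.0 else b
    return lhs * rhs


def strict_fma(a: float, b: float, c: float) -> float:
    """Same selects feeding a multiply-add (rounding of a real fma ignored)."""
    return strict_mul(a, b) + c


def same(x: float, y: float) -> bool:
    """Equality that treats NaN as equal to NaN."""
    return (math.isnan(x) and math.isnan(y)) or x == y


def agree_everywhere() -> bool:
    """Compare the select form against the legacy form on all sample pairs."""
    for a in SAMPLES:
        for b in SAMPLES:
            if not same(strict_mul(a, b), mul_legacy(a, b)):
                return False
            if not same(strict_fma(a, b, 1.0), mul_legacy(a, b) + 1.0):
                return False
    return True
```

Under that assumption, `agree_everywhere()` returns `True`, while a plain IEEE multiply such as `math.inf * 0.0` yields NaN. That difference is why the selects are needed when no legacy instruction is used.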

ruiminzhao commented 3 months ago

@DadSchoorse Thanks for your comment. I have now added more patterns as you suggested. The supported patterns are listed below:

`((b==0.0 ? 0.0 : a) * (a==0.0 ? 0.0 : b))` ==> `fmul_legacy(a, b)`
`a * (a==0.0 ? 0.0 : b)` or `(b==0.0 ? 0.0 : a) * b` ==> `fmul_legacy(a, b)`
`fma((b==0.0 ? 0.0 : a), (a==0.0 ? 0.0 : b), c)` ==> `fma_legacy(a, b, c)`
`fma(a, (a==0.0 ? 0.0 : b), c)` or `fma((b==0.0 ? 0.0 : a), b, c)` ==> `fma_legacy(a, b, c)`

For 2.3, one more condition is the single operand(a or b) should not be constant zero here.

Please check whether anything is missing here. My fix is now in CI, and I'm looking forward to merging and delivering it ASAP. Thanks.

DadSchoorse commented 3 months ago

> For 2.3, one more condition is the single operand(a or b) should not be constant zero here.

What I've said before may have been a bit ambiguous, so just to make sure: For `a * (a==0.0?0.0:b)` it's important that b is not zero. So `if (b.isConstant() && b.constantValue() != 0.0) { apply_opt(); }`, not `if (!b.isConstant() || b.constantValue() != 0.0)`.

Otherwise, your list matches what radv optimizes.
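To illustrate why the single-select form is only safe when b is known to be non-zero, here is a small Python sketch. It assumes the same D3D9 zero-times-anything rule for the legacy multiply. The helper names (`mul_legacy`, `half_folded`, `can_use_mul_legacy`) are hypothetical and only mirror the condition written above.

```python
import math
from typing import Optional


def mul_legacy(a: float, b: float) -> float:
    """Assumed D3D9 rule: if either operand is zero, the product is zero."""
    if a == 0.0 or b == 0.0:
        return 0.0
    return a * b


def half_folded(a: float, b: float) -> float:
    """a * (a==0.0 ? 0.0 : b): only one select is left on the operands."""
    return a * (0.0 if a == 0.0 else b)


def can_use_mul_legacy(b_is_constant: bool, b_value: Optional[float]) -> bool:
    """Mirror of the condition above: b must be a constant AND non-zero."""
    return b_is_constant and b_value != 0.0


def same(x: float, y: float) -> bool:
    """Equality that treats NaN as equal to NaN."""
    return (math.isnan(x) and math.isnan(y)) or x == y


# With b == 0.0 the two forms disagree for a = inf: NaN versus 0.0.
mismatch_at_zero = not same(half_folded(math.inf, 0.0), mul_legacy(math.inf, 0.0))

# With a non-zero constant b they agree for every sampled a.
samples = [0.0, -0.0, 1.5, math.inf, -math.inf, math.nan]
agree_for_nonzero_b = all(
    same(half_folded(a, 2.0), mul_legacy(a, 2.0)) for a in samples
)
```

If b is not a compile-time constant it could be zero at runtime, so replacing `a * (a==0.0 ? 0.0 : b)` with the legacy multiply would change the result for inputs like a = inf, b = 0.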

DadSchoorse commented 3 months ago

Oh, another thing I just thought of: I don't see a bit size check in https://github.com/GPUOpen-Drivers/llpc/commit/e91a935d9e3ae526f4cd8044659609ba8daa858b . `v_mul_legacy_f32`/`v_fma_legacy_f32` are 32-bit only.

Blisto91 commented 1 month ago

Was this work supposed to be enabled in the 2024.Q2.1 release? I tried a quick test with my iGPU in Risen 1 and still get a big performance drop when setting d3d9.floatEmulation = Strict

AMDVLK 2024.Q2.1

`d3d9.floatEmulation = True` ![Q2 1-true](https://github.com/GPUOpen-Drivers/AMDVLK/assets/47954800/d4029273-2598-45f5-a521-15b161b2a488) `d3d9.floatEmulation = Strict` ![Q2 1-strict](https://github.com/GPUOpen-Drivers/AMDVLK/assets/47954800/d793b2b3-bc24-4981-9d96-b8161d5716b5)

RADV for comparison

`d3d9.floatEmulation = True` ![radv-true](https://github.com/GPUOpen-Drivers/AMDVLK/assets/47954800/4b162cd5-d075-4e3c-b591-ee78de7ef60f) `d3d9.floatEmulation = Strict` ![radv-strict](https://github.com/GPUOpen-Drivers/AMDVLK/assets/47954800/33a7dd5e-ed2a-4114-9fac-94e4781d8c87)

ruiminzhao commented 1 month ago

@Blisto91 Thanks for your feedback. I have two suspected causes for this issue:

My fix doesn't match the IR pattern generated by the game.
Another fix, related to the fastmath flags, has broken the pattern that my optimization relies on.

To confirm which one causes this issue, could you please (or ask the dxvk devs for assistance) dump the pipeline so that I can check whether my optimization has taken effect?

Thanks.

K0bin commented 1 month ago

You can dump the shaders by setting the environment variable `DXVK_SHADER_DUMP_PATH=/your/path` and then running the game with DXVK.

That will export the generated SPIR-V among other things.

Any D3D9 game will work; you just need to also set the environment variable `DXVK_CONFIG="d3d9.floatEmulation = Strict;"` to enable the accurate float behavior.
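For anyone scripting this, the two variables can be put into a process environment as in the following Python sketch. Only the variable names and values come from the comment above. The helper name `dxvk_dump_env` and the launch command in the usage note are hypothetical.

```python
import os
from typing import Dict, Optional


def dxvk_dump_env(dump_path: str,
                  base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the two DXVK variables set."""
    env = dict(os.environ if base_env is None else base_env)
    # Directory that DXVK writes the dumped shaders to.
    env["DXVK_SHADER_DUMP_PATH"] = dump_path
    # Inline config string that forces the Strict float emulation path.
    env["DXVK_CONFIG"] = "d3d9.floatEmulation = Strict;"
    return env
```

The result can be passed as the `env` argument of `subprocess.run`, for example `subprocess.run(["wine", "game.exe"], env=dxvk_dump_env("/your/path"))`, where the command line is a placeholder for however the game is normally started.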

Blisto91 commented 1 month ago

Linked is a dxvk shader dump from Risen 1 running on my 7950x iGPU with amdvlk 2024.Q2.1 and d3d9.floatEmulation = strict

https://drive.proton.me/urls/SF8RPVZ6CG#Rk7KIIG4d480

ruiminzhao commented 1 month ago

@Blisto91 Thanks, but unfortunately I can't access this link. Could you attach the related files to this page instead?

Blisto91 commented 1 month ago

@ruiminzhao Hi there. I hope a 7zip wrapped in a zip is fine, as most formats GitHub allows don't compress enough on their own. Risen-amdvlk-strict-float.zip

ruiminzhao commented 1 month ago

> @ruiminzhao Hi there. I hope a 7zip wrapped in a zip is fine, as most formats GitHub allows don't compress enough on their own. Risen-amdvlk-strict-float.zip

Thanks. I can get the log now and will have a look later.

ruiminzhao commented 1 month ago

The root cause has been found. The pattern that is widely used looks like this:

    %2272 = select reassoc nnan nsz arcp contract afn i1 %2270, float 0.000000e+00, float %2271
    %2273 = insertelement <3 x float> poison, float %2272, i64 0
    …
    %2276 = select reassoc nnan nsz arcp contract afn i1 %2274, float 0.000000e+00, float %2275
    %2277 = insertelement <3 x float> %2273, float %2276, i64 1
    …
    %2280 = select reassoc nnan nsz arcp contract afn i1 %2278, float 0.000000e+00, float %2279
    %2281 = insertelement <3 x float> %2277, float %2280, i64 2
    …

    %2293 = select reassoc nnan nsz arcp contract afn i1 %2291, float 0.000000e+00, float %2292
    %2294 = insertelement <3 x float> poison, float %2293, i64 0
    …
    %2297 = select reassoc nnan nsz arcp contract afn i1 %2295, float 0.000000e+00, float %2296
    %2298 = insertelement <3 x float> %2294, float %2297, i64 1
    …
    %2301 = select reassoc nnan nsz arcp contract afn i1 %2299, float 0.000000e+00, float %2300
    %2302 = insertelement <3 x float> %2298, float %2301, i64 2
    …
    %2303 = fmul reassoc nnan nsz arcp contract afn <3 x float> %2281, %2302

It hasn't been caught; the process needs to be reordered like this:

Many other transforms
Scalarizer pass
fmul_legacy / fma_legacy matching

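The effect of the pass order can be sketched with a toy expression tree in Python. This is not llpc or LLVM code. All class and function names are invented for illustration, and it assumes that each select in the dump tests the matching lane of the other operand against zero, which the elided parts of the IR do not show. The point is that a matcher written for scalar multiplies sees nothing to rewrite while the multiply is still a `<3 x float>` operation, and matches every lane once the multiply has been split up.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class ZeroSelect:
    """Models (tested == 0.0 ? 0.0 : value)."""
    tested: object
    value: object


@dataclass(frozen=True)
class Vec:
    """A vector assembled lane by lane (the insertelement chain)."""
    lanes: Tuple[object, ...]


@dataclass(frozen=True)
class FMul:
    lhs: object
    rhs: object


@dataclass(frozen=True)
class FMulLegacy:
    lhs: object
    rhs: object


def match_legacy(e):
    """Scalar-only matcher: (b==0 ? 0 : a) * (a==0 ? 0 : b) -> fmul_legacy(a, b)."""
    if isinstance(e, Vec):
        return Vec(tuple(match_legacy(lane) for lane in e.lanes))
    if (isinstance(e, FMul)
            and isinstance(e.lhs, ZeroSelect) and isinstance(e.rhs, ZeroSelect)
            and e.lhs.tested == e.rhs.value and e.rhs.tested == e.lhs.value):
        return FMulLegacy(e.lhs.value, e.rhs.value)
    return e


def scalarize(e):
    """Split a vector fmul into one scalar fmul per lane."""
    if isinstance(e, FMul) and isinstance(e.lhs, Vec) and isinstance(e.rhs, Vec):
        return Vec(tuple(FMul(l, r) for l, r in zip(e.lhs.lanes, e.rhs.lanes)))
    return e


def strict_vec_mul(n: int = 3) -> FMul:
    """Build the vector form of the Strict multiply for n lanes."""
    a = [Var(f"a{i}") for i in range(n)]
    b = [Var(f"b{i}") for i in range(n)]
    lhs = Vec(tuple(ZeroSelect(b[i], a[i]) for i in range(n)))
    rhs = Vec(tuple(ZeroSelect(a[i], b[i]) for i in range(n)))
    return FMul(lhs, rhs)


expr = strict_vec_mul()
match_only = match_legacy(expr)             # vector fmul: nothing is rewritten
reordered = match_legacy(scalarize(expr))   # scalarize first, then match
```

Here `match_only` is still the original vector `FMul`, while every lane of `reordered` is an `FMulLegacy`.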
GPUOpen-Drivers / AMDVLK

[dxvk] Optimize for the d3d9 Strict float emulation path #346