dartraiden / NVIDIA-patcher

Adds 3D acceleration support for P106-090 / P106-100 / P104-100 / P104-101 / P102-100 / CMP 30HX / CMP 40HX / CMP 50HX / CMP 70HX / CMP 90HX / CMP 170HX mining cards as well as RTX 3060 3840SP and RTX 3080 Ti 20GB.

170hx(any cmp hx card)can run higher and higher fp32 flops than before #73

Open jetcat8848 opened 12 months ago

jetcat8848 commented 12 months ago

I tried modifying an OpenCL benchmark to disable FMA so that the GPU cannot use it. The code change looks like this:

```diff
diff --git a/src/lbm.cpp b/src/lbm.cpp
index d99202f..28aeb25 100644
--- a/src/lbm.cpp
+++ b/src/lbm.cpp
@@ -286,6 +286,8 @@ void LBM_Domain::enqueue_unvoxelize_mesh_on_device(const Mesh* mesh, const uchar
 }
 
 string LBM_Domain::device_defines() const { return
+ "\n #pragma OPENCL FP_CONTRACT OFF" // prevents implicit FMA optimizations
+ "\n #define fma(a, b, c) ((a) * (b) + (c))" // shadows OpenCL explicit function fma()
  "\n #define def_Nx "+to_string(Nx)+"u"
  "\n #define def_Ny "+to_string(Ny)+"u"
  "\n #define def_Nz "+to_string(Nz)+"u"
```

OK, the modified OpenCL benchmark runs, and the 170HX's FP32 throughput increased to 6.285 TFLOPS; the original figure was only 0.395 TFLOPS. 6.285 / 0.395 ≈ 16, so I think the NVIDIA driver prevents the GPU from running FMA at full speed!

[Three attached images]

jetcat8848 commented 12 months ago

The device ID 10DE 20C2 is a CMP 170HX mining card. I installed an NVIDIA GRID A100-20C driver to run it!
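The change from the opening comment can be tried outside the benchmark. This is only a minimal sketch: `no_fma_prelude()`, `with_no_fma()` and `speedup()` are hypothetical helper names, not part of the benchmark. They return the same two directives as a standalone prelude for any OpenCL C kernel source and reproduce the ratio quoted above. Whether a given driver honors the pragma is not something this sketch can show.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not from the benchmark): the two directives from the
// diff, as a prelude that can be prepended to OpenCL C kernel source.
std::string no_fma_prelude() {
    return
        "#pragma OPENCL FP_CONTRACT OFF\n"           // ask the compiler not to contract a*b+c into an FMA
        "#define fma(a, b, c) ((a) * (b) + (c))\n";  // turn explicit fma() calls into a multiply and an add
}

// Prepends the prelude to a kernel source string.
std::string with_no_fma(const std::string& kernel_source) {
    return no_fma_prelude() + kernel_source;
}

// Ratio of two measured throughputs, both in the same unit.
double speedup(double patched_tflops, double original_tflops) {
    return patched_tflops / original_tflops;
}
```

With the figures reported in the opening comment, `speedup(6.285, 0.395)` comes out at roughly 15.9, which is where the "about 16 times" estimate comes from.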

jetcat8848 commented 12 months ago

[Five attached images]

bah86 commented 11 months ago

Do you have any idea how to disable fma in the driver?

astronautduckpc commented 11 months ago

Is this the same problem that causes the reduced performance of the CMP 70HX and CMP 90HX? It is described here in open sources.

jetcat8848 commented 11 months ago

> Do you have any idea how to disable fma in the driver?

Sorry! I have no idea.

jetcat8848 commented 11 months ago

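> Is this the same problem as the reduced performance of the cmp70hx and cmp90hx? It is described here in open sources.

Yes, it is the same! NVIDIA uses eFuses to tag the FMA speed (reduced to 1/8, 1/16, 1/32 ... 1/2^n, n = 1, 2 ... 5), and the NVIDIA driver knows how to run accordingly!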
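As a rough check of the 1/2^n scheme against the numbers in the opening comment, here is a small sketch. The function names are hypothetical, and it assumes the non-FMA result (6.285 TFLOPS) can stand in for the uncapped rate, which the thread does not confirm.

```cpp
#include <cassert>
#include <cmath>

// Fraction of full speed left by a cap of 1/2^n.
double fma_fraction(int n) {
    return 1.0 / std::pow(2.0, n);
}

// Nearest integer n for which capped / full is about 1/2^n.
// Assumption: `full_tflops` is a fair stand-in for the uncapped rate.
int infer_cap_exponent(double full_tflops, double capped_tflops) {
    return static_cast<int>(std::lround(std::log2(full_tflops / capped_tflops)));
}
```

For 6.285 and 0.395 TFLOPS this gives n = 4, i.e. a cap of 1/16, consistent with the ratio of about 16 reported earlier.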

astronautduckpc commented 11 months ago

> Is this the same as with the cmp70hx and cmp90hx?

> Yes! It is the same! NVIDIA uses eFuses to tag the FMA speed (reduced to 1/8, 1/16, 1/32 ... 1/2^n, n = 1, 2 ... 5), and the NVIDIA driver knows how to run accordingly!

And how can this be fixed or worked around?

Skylord4321 commented 9 months ago

This is incredible information! So they used an eFuse within the driver to hinder the mining cards' performance!

InnovativeOSS117 commented 4 months ago

For information: the NVIDIA drivers contain fuse definitions related to FMA speed reduction. Interestingly, there are also fuses that override this speed reduction.

I see two approaches. The first is to find where the driver reads those fuses and patch the driver; the problem is locating that code. The second is to blow the fuse that overrides the SM speed reduction, but good luck finding a tool to fuse the chip.

InnovativeOSS117 commented 4 months ago

Flashing an A100 BIOS results in the card not booting in secured mode, making it unusable. But:

Flashing a 10 GB board with an 8 GB BIOS changes:

InnovativeOSS117 commented 3 months ago

All my tweaking leads me to conclude that the patch works well: it lifts the memory bandwidth limitation, but the one thing missing is WDDM (graphics) activation in the driver. The card stays in TCC (compute) mode. The previous method of changing the registry value "AdapterType" from 2 to 0 crashes the driver. The only thing I could find about it is a guy on Chinese eBay who managed to enable WDDM on an A100, which has the same core.

[Four attached screenshots]

Demianvan commented 1 month ago

@jetcat8848 Very interesting. I ran some AIDA64 benchmarks on my 90HX and compared them with a 3070; the results are very similar, or even superior, including against the 170HX. Can you share the AIDA64 GPGPU results for your 170HX?

As you can see, the 90HX (3080) and the 3070 have almost the same results (except FP32/FP16). The "very interesting thing" is FP64, which is untouched, as are all INT capabilities (mainly used for mining algorithms).

BM_CMP90XH

BM_3070_OC

cmp90hx_OC_gpgpu.txt

3070_OC_gpgpu.txt

It looks like FP32 throughput is tied to FP64 throughput by a fixed factor:

RTX gaming: FP32 = FP64 × 64

CMP mining: FP32 = FP64 × 2
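Taken at face value, those two factors imply a fixed FP32 penalty on the CMP card. The sketch below only restates that arithmetic; the constant and function names are hypothetical, and the multipliers are the pattern observed in this comment, not a documented specification.

```cpp
#include <cassert>

// Multipliers as observed in the comment above (a pattern, not a spec).
constexpr double kRtxFp32PerFp64 = 64.0;
constexpr double kCmpFp32PerFp64 = 2.0;

// FP32 throughput predicted from a measured FP64 throughput (same unit).
double predicted_fp32(double fp64_throughput, double multiplier) {
    return fp64_throughput * multiplier;
}

// How much slower CMP FP32 would be than RTX FP32 at equal FP64 throughput.
double implied_slowdown() {
    return kRtxFp32PerFp64 / kCmpFp32PerFp64;
}
```

Under this model, two cards with identical FP64 results would differ by a factor of 32 in FP32.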

This really looks like a driver matter, and it supports the approach of patching a math factor in a fuse definition. Unlock this, and these cards become the most reliable GPUs for training and running LLMs/SD generators. And maybe we are approaching a taboo: the FP16 1:1 limitation used to artificially segment the market.

My full tests and overclock settings are detailed in this thread: https://github.com/dartraiden/NVIDIA-patcher/issues/45#issuecomment-2387835771

tech-qroll commented 1 day ago

This is really interesting. Thank you for the work you have done! It would be cool to write a big article with a detailed explanation of all this for beginners; then everyone could buy mining cards and enjoy the performance)