Open jetcat8848 opened 12 months ago
Device ID 10DE 20C2 is a CMP 170HX mining card. I installed an NVIDIA GRID A100-20C driver to run it!
Do you have any idea how to disable fma in the driver?
Is this the same problem as the reduced performance of the CMP 70HX and CMP 90HX? It is described here in open sources.
Sorry, I have no idea...
Yes, it is the same! NVIDIA uses eFuses to tag the FMA speed (reduced to 1/8, 1/16, 1/32 ... 1/2^n, n = 1, 2 ... 5), and the NVIDIA driver knows how to run with it!
And how can we fix or work around this?)
This is incredible information! So they used an eFuse, which the driver reads, to hinder the mining card's performance!
For info: in the NVIDIA drivers there are fuse definitions related to FMA speed reduction. Interestingly, there are also fuses that override this speed reduction.
I see two approaches. The first is to find where the driver reads those fuses and patch the driver, but the problem is locating that code. The second is to blow the fuse that overrides the SM speed reduction, but good luck finding a tool to fuse the chip.
Flashing an A100 BIOS results in the card not booting in secured mode, making it unusable. But:
Flashing a 10gb board with an 8gb bios changes :
All my tweaking leads me to conclude that the patch works well: it lifts the memory bandwidth limitation. The one thing missing is WDDM (graphics) activation of the driver; the card stays in TCC (compute) mode. The previous method of changing the registry value "AdapterType" from 2 to 0 crashes the driver. The only thing I could find about it is a guy on Chinese eBay who managed to enable WDDM on an A100, which has the same core.
@jetcat8848 Very interesting. I ran some AIDA64 benchmarks on my 90HX and compared it with a 3070; the results are very similar or better, even against the 170HX. Can you share the AIDA64 GPGPU results for your 170HX?
As you can see, the 90HX (3080) and the 3070 have almost the same results (except FP32/FP16). The very interesting thing is FP64, which is untouched, as are all the INT capabilities (mainly used by mining algorithms).
It looks like FP32 throughput is set directly as a fixed multiple of the FP64 rate:
RTX gaming: FP32 = FP64 x 64
CMP mining: FP32 = FP64 x 2
This really looks like a driver thing, and it supports the idea of patching a math factor in a fuse definition. Unlock it, and these cards become the most reliable GPUs for training and running LLMs/SD generators. And maybe we are touching a taboo here: the FP16 1:1 limitation used to artificially segment the market.
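To make the arithmetic behind those two ratios explicit, here is a minimal C++ sketch. It assumes FP64 really is untouched on the CMP card, so the drop in the FP32:FP64 ratio is entirely an FP32 slowdown. The struct and function names are hypothetical, made up for illustration; nothing here comes from the driver.

```cpp
#include <cassert>
#include <cmath>

// FP32:FP64 throughput ratios as read off the benchmark results above.
// Hypothetical helper type, for illustration only.
struct Fp32ToFp64Ratio {
    double gaming;  // RTX gaming card: FP32 = FP64 x 64
    double mining;  // CMP mining card: FP32 = FP64 x 2
};

// How many times slower FP32 is on the mining card,
// ASSUMING the FP64 rate is identical on both cards.
double implied_fp32_slowdown(const Fp32ToFp64Ratio& r) {
    assert(r.gaming > 0.0 && r.mining > 0.0);
    return r.gaming / r.mining;
}

// Nearest n such that slowdown is about 2^n, i.e. a 1/2^n speed tag.
int implied_fuse_exponent(double slowdown) {
    assert(slowdown >= 1.0);
    return static_cast<int>(std::lround(std::log2(slowdown)));
}
```

With the ratios quoted above, `implied_fp32_slowdown({64.0, 2.0})` gives 32, and `implied_fuse_exponent(32.0)` gives 5, which would correspond to a 1/2^5 tag if the eFuse description earlier in the thread is right.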
My full tests and overclocking results are detailed in this thread: https://github.com/dartraiden/NVIDIA-patcher/issues/45#issuecomment-2387835771
This is really interesting. Thanks for the work you've done! It would be cool to write a big article explaining all of this in detail for newcomers, then everyone could buy mining cards and enjoy the performance)
I tried modding an OpenCL benchmark to disable FMA so the GPU cannot use it. The code looks like this:
    diff --git a/src/lbm.cpp b/src/lbm.cpp
    index d99202f..28aeb25 100644
    --- a/src/lbm.cpp
    +++ b/src/lbm.cpp
    @@ -286,6 +286,8 @@ void LBM_Domain::enqueue_unvoxelize_mesh_on_device(const Mesh* mesh, const uchar
     }
     string LBM_Domain::device_defines() const { return
    +"\n #pragma OPENCL FP_CONTRACT OFF" // prevents implicit FMA optimizations
    +"\n #define fma(a, b, c) ((a) * (b) + (c))" // shadows OpenCL explicit function fma()
     "\n #define def_Nx "+to_string(Nx)+"u"
     "\n #define def_Ny "+to_string(Ny)+"u"
     "\n #define def_Nz "+to_string(Nz)+"u"

OK, the modded OpenCL benchmark runs, and the 170HX FP32 throughput increased to 6.285 TFLOPS; the original FP32 throughput was only 0.395 TFLOPS. 6.285 / 0.395 ≈ 16, so I think the NVIDIA driver prevents the GPU from running FMA at full speed!