sha256d/t optimizations slower with AVX

JayDDee commented 2 years ago

This issue documents a problem discovered while implementing new optimizations of the sha256 hash function for sha256d and sha256t in cpuminer-opt-3.18.3. These optimizations are described in detail in sections 6.2 & 6.3 of the following document.

http://www.nicolascourtois.com/bitcoin/Optimising%20the%20SHA256%20Hashing%20Algorithm%20for%20Faster%20and%20More%20Efficient%20Bitcoin%20Mining_Rahul_Naik.pdf

In effect they reduce the number of instructions required to perform a transform in certain situations. In the case of early rejection, the optimization is modified and expanded from that described in the document, but it tested correctly on all architectures with no false positives or false negatives. In other words, a valid hash was never falsely rejected and an invalid hash was always rejected early.

It was found that implementing these optimizations on a CPU with AVX resulted in a severe performance loss. This loss was not observed with AVX512 or AVX2, where the optimizations improved performance as expected. The AVX build also performed better when run on AVX512 or AVX2 capable CPUs. This suggests an issue specific to the CPU architecture or the CPU itself.

The issue will not be pursued at this time, so this will serve as a record for future reference.

Test environment:

CPU: i5-2400, OS: Ubuntu 20.04, Build: GCC 9.3

Through trial and error I discovered the problem was related to the order in which the functions were defined. The full transform was defined first, followed by 3 rounds prehash, final 61 rounds, & early rejection. When the final 61 rounds optimization was substituted for a full transform, performance dropped around 20% for sha256t. Substituting the early rejection optimization for the final transform resulted in a similar degradation in performance. When both were used the hash rate was cut nearly in half.

For troubleshooting purposes the optimizations were removed from the customized functions, resulting in identical copies of the full transform with only the names being different. This also resulted in dramatic losses. After a few more trials it was determined that the performance was dependent on the order in which the functions were defined; declaration order in the header file had no effect. The function defined first, regardless of its name and intended use, was always much faster.

The obvious questions:

Why does this only affect AVX (and SSE2) but not AVX2 or AVX512?
Why does it affect a CPU limited to AVX but not a more capable CPU using an AVX build?
How could the definition order affect performance, and why so much?

JayDDee commented 10 months ago

The root cause may have been identified. There is a problem with the optimization itself when used with low difficulty mining. If difficulty is low enough, target[7], needed by H, may not be zero. The test currently assumes it's always zero. This results in valid hashes being discarded and a low effective hash rate. The odd behaviour previously observed may have been due to random instances where target[7] was zero and the early exit test worked as intended. This would be more likely with higher level optimizations like AVX2 & AVX512, which would support higher diff. A fix will require factoring the target into the early exit test.
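The difference between the zero-assuming test and a target-aware test can be sketched in scalar C. This is a minimal illustration, not the cpuminer-opt code: both function names are hypothetical, and it assumes `hash7` and `target7` are the most significant 32-bit words of the hash and the target, already in the same byte order.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: early rejection that assumes target[7] is always zero.
 * Any nonzero top word is rejected, even if the target would allow it. */
bool early_reject_assume_zero(uint32_t hash7)
{
    return hash7 != 0;
}

/* Hypothetical: early rejection that factors in the target.
 * Reject only when the top word of the hash already exceeds the top
 * word of the target; otherwise the full comparison must still be run. */
bool early_reject_with_target(uint32_t hash7, uint32_t target7)
{
    return hash7 > target7;
}
```

With a low difficulty target whose top word is 0x0001ffff, a hash with top word 0x0000ffff is discarded by the first test but kept by the second. A hash whose top word equals `target7` is not rejected early and still has to pass the full comparison against the whole target.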

JayDDee commented 10 months ago

There were two problems with my implementation of the optimization. One was assuming target[7] was always going to be zero, which caused good hashes to be discarded instead of being submitted. This was the cause of the low effective hash rate. Another bug caused low difficulty rejected shares that mysteriously passed the pre-submission difficulty test. The fix slightly reduces the gain from the optimization on AVX512. Limitations of the AVX2 instruction set, including the lack of unsigned compare and of mask registers, make it ineffective. It will be fixed for AVX512 but removed for AVX2.

JayDDee commented 10 months ago

There will be a fix for AVX2. The unsigned arithmetic issues have all been resolved relatively efficiently, with one exception. When the target is the minimum signed integer, i.e. 0x80000000, the test is not reliable. In such cases the early exit is not attempted, with no ill effects. This issue will be closed upon the next release.
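For readers unfamiliar with the unsigned compare limitation, one common workaround is to flip the sign bit of both operands and then use a signed compare. The scalar C sketch below emulates that idea across 8 lanes. It is an assumption for illustration only and is not claimed to be the method used in cpuminer-opt; the names `bias`, `ugt_via_signed`, and `lanes_passing` are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Flip the sign bit and reinterpret the bits as a signed value.
 * memcpy avoids an implementation-defined unsigned-to-signed cast. */
int32_t bias(uint32_t x)
{
    uint32_t flipped = x ^ 0x80000000u;
    int32_t result;
    memcpy(&result, &flipped, sizeof result);
    return result;
}

/* Unsigned a > b, computed using only a signed comparison. */
bool ugt_via_signed(uint32_t a, uint32_t b)
{
    return bias(a) > bias(b);
}

/* Emulates an 8-lane early exit test: bit i of the result is set when
 * lane i is NOT rejected, i.e. hash7[i] <= target7 as unsigned values.
 * Without mask registers, a vector version has to extract such a mask
 * from the compare result with additional instructions. */
unsigned lanes_passing(const uint32_t hash7[8], uint32_t target7)
{
    unsigned mask = 0;
    for (int i = 0; i < 8; i++) {
        if (!ugt_via_signed(hash7[i], target7))
            mask |= 1u << i;
    }
    return mask;
}
```

If the returned mask is zero, every lane can be abandoned early; any set bit means that lane still needs the remaining work and the full target comparison.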

Edit: the INT_MIN issue was resolved to give 100% accuracy, but the performance was still not improved. The title issue remains unresolved with little hope or motivation to pursue it. There's no need to wait for the next release because it will do nothing for this issue.

Closing now.

JayDDee / cpuminer-opt

sha256d/t optimizations slower with AVX #344