Unfinished work - Githubissues

JayDDee commented 4 years ago

This issue is opened to document architectural changes that require changes to the scanhash function of each algo. These changes may not have been propagated to all algos for various reasons.

The reason for most of the changes is to streamline the code by reducing instructions. Stale share reduction is the goal of one change, and the generic scanhash will reduce the work of propagating other changes.

Defining a series of generic scanhash functions that can be used by multiple algos to replace their individual custom scanhash functions for specific cases:
- one way linear hashing
- N way hashing for each N (4, 8, 16) and the format of the hash to be tested: 64 bit interleaved, 32 bit interleaved, or de-interleaved.
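
As an illustration of the one way linear case, here is a minimal sketch of a scan loop shared by any algo that exposes the same hash signature. The names `algo_hash_fn`, `scanhash_generic_1way` and `toy_hash` are hypothetical and are not the cpuminer-opt API, and the target test is reduced to comparing the top 32 bits of the digest.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical hash signature: 80 byte block header in, 32 byte digest out. */
typedef void (*algo_hash_fn)(uint8_t digest[32], const uint8_t header[80]);

/* One way (linear) scan usable by any algo that fits the signature above.
   Returns true and stores the nonce when the top 32 bits of the digest are
   at or below target_hi, a simplified stand-in for the full target test.
   The caller is assumed to pass first_nonce <= last_nonce. */
bool scanhash_generic_1way(algo_hash_fn hash, uint8_t header[80],
                           uint32_t first_nonce, uint32_t last_nonce,
                           uint32_t target_hi, uint32_t *found_nonce)
{
    uint8_t digest[32];
    uint32_t n = first_nonce;

    for (;;) {
        uint32_t top;
        memcpy(header + 76, &n, sizeof n);      /* nonce occupies bytes 76..79 */
        hash(digest, header);
        memcpy(&top, digest + 28, sizeof top);  /* last 32 bit word of digest */
        if (top <= target_hi) {
            *found_nonce = n;
            return true;
        }
        if (n == last_nonce)
            return false;
        n++;
    }
}

/* Toy "algo" used only to exercise the loop: last digest word = nonce ^ 0xFF. */
void toy_hash(uint8_t digest[32], const uint8_t header[80])
{
    uint32_t nonce;
    memcpy(&nonce, header + 76, sizeof nonce);
    nonce ^= 0xFFu;
    memset(digest, 0, 32);
    memcpy(digest + 28, &nonce, sizeof nonce);
}
```

Passing the hash as a function pointer is only one way to share the loop; the point is that the nonce iteration and target test are written once rather than per algo.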

The remaining scanhash changes are automatically implemented for algos that can use a generic scanhash function.

- vectored byte swap and interleaving of input data, for various N ways, for 32 and 64 bit data.
- byte-swap the nonce only when necessary, when a valid share is found, instead of byte-swapping every nonce tested.
- implement new hash for test including pre-test before de-interleaving N way hash.
- submit shares in scanhash loop then continue hashing instead of returning to the main thread loop to submit shares.
- thread id argument added to hash call to enable restart flag checking.
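
The pre-test before de-interleaving can be sketched with scalar code. This assumes a 4 way, 32 bit interleaved 256 bit hash where word w of lane l sits at index w*4 + l; the `sketch_` functions are hypothetical stand-ins, not the project's SIMD routines. Only lanes that pass the cheap check on the last hash word need to be extracted for the full test.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES      4
#define HASH_WORDS 8   /* 256 bit hash = 8 x 32 bit words */

/* Scalar stand-in for a 4x32 interleave: word w of lane l lands at dst[w*4 + l]. */
void sketch_intrlv_4x32(uint32_t *dst, const uint32_t *s0, const uint32_t *s1,
                        const uint32_t *s2, const uint32_t *s3, size_t words)
{
    for (size_t w = 0; w < words; w++) {
        dst[w * LANES + 0] = s0[w];
        dst[w * LANES + 1] = s1[w];
        dst[w * LANES + 2] = s2[w];
        dst[w * LANES + 3] = s3[w];
    }
}

/* Pull a single lane back out of the interleaved buffer. */
void sketch_extr_lane_4x32(uint32_t *dst, const uint32_t *src,
                           size_t lane, size_t words)
{
    for (size_t w = 0; w < words; w++)
        dst[w] = src[w * LANES + lane];
}

/* Pre-test: inspect only the last word of each lane, which sits contiguously
   at vhash[7*4 .. 7*4+3]. Returns a bitmask of lanes worth extracting. */
unsigned sketch_pretest_4x32(const uint32_t *vhash, uint32_t target_hi)
{
    unsigned mask = 0;
    for (size_t l = 0; l < LANES; l++)
        if (vhash[(HASH_WORDS - 1) * LANES + l] <= target_hi)
            mask |= 1u << l;
    return mask;
}
```

Since almost every hash fails the pre-test, the de-interleave cost is paid only for rare candidate lanes.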

There are also changes to the hash functions of each algo:

- use union overlay instead of struct for the context holder for algos that use a lot of contexts,
- implement midstate prehash when the first function uses a block size of 64 bytes or less,
- use full versions of chained hash functions instead of the 3 step init, update & close,
- write final hash directly to output buffer instead of using an intermediate buffer and memcpy,
- implement intermediate stale work detection for low hash rate algos to reduce stale shares,
- use rintrlv instead of 2 step dintrlv, intrlv when interleaved data needs to be interleaved in a different format,
- ensure hash function returns a default 1 if thread restart checking is not used.
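
The midstate prehash item can be illustrated with a toy example. This is a sketch under stated assumptions: `toy_compress` is a made-up 64 byte block function, not a real hash, and the header is assumed to be 80 bytes with the nonce in bytes 76..79. Because the first 64 byte block never contains the nonce, its state can be computed once and reused for every nonce.

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint32_t h[4]; } toy_state;

/* Toy 64 byte block compression, NOT a real hash. It only has the shape
   (state + one block -> new state) needed to show the idea. */
static void toy_compress(toy_state *s, const uint8_t block[64])
{
    for (int i = 0; i < 64; i++) {
        uint32_t *h = &s->h[i & 3];
        *h = (*h ^ block[i]) * 16777619u;
        *h = (*h << 5) | (*h >> 27);
    }
}

static void toy_init(toy_state *s)
{
    s->h[0] = 0x811C9DC5u; s->h[1] = 0x01234567u;
    s->h[2] = 0x89ABCDEFu; s->h[3] = 0xDEADBEEFu;
}

/* Full hash of an 80 byte header: block 0 is bytes 0..63,
   block 1 is bytes 64..79 zero padded to 64 bytes. */
void toy_hash_full(toy_state *out, const uint8_t header[80])
{
    uint8_t tail[64] = {0};
    toy_init(out);
    toy_compress(out, header);
    memcpy(tail, header + 64, 16);
    toy_compress(out, tail);
}

/* Prehash: done once per work unit, bytes 0..63 do not depend on the nonce. */
void toy_prehash(toy_state *midstate, const uint8_t header[80])
{
    toy_init(midstate);
    toy_compress(midstate, header);
}

/* Per nonce work: resume from a copy of the midstate, hash only the tail. */
void toy_hash_from_midstate(toy_state *out, const toy_state *midstate,
                            const uint8_t header[80])
{
    uint8_t tail[64] = {0};
    *out = *midstate;
    memcpy(tail, header + 64, 16);
    toy_compress(out, tail);
}
```

In this toy layout the per nonce cost drops from two block compressions to one, while the result stays identical to the full hash.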

JayDDee commented 4 years ago

The implementation of a generic scanhash is complicated with n-way parallel hashing with chained algorithms. Each function in the chain may be interleaved 64 bit words, interleaved 32 bit words or not interleaved. The first and last functions may have different interleaving which must be handled differently by scanhash.

This results in up to 9 different generic scanhash functions to handle each situation for each architecture. The full requirement is 22 individual scanhash functions. Calling them generic may seem a bit ambitious but they can be used by most of the chained algorithms and still represent a significant reduction in code duplication.

- SSE2: 4 way 32 bit words (4 cases)
- AVX2: 8 way 32 bit words, 4 way 32 bit words, 4 way 64 bit words (9 cases)
- AVX512: 16 way 32 bit words, 8 way 32 bit words, 8 way 64 bit words (9 cases)

Algorithms that perform a midstate prehash are not considered at this time. Support would require a gate function for prehash as each algo has its own custom prehash.

JayDDee commented 4 years ago

x17, xevan and sonoa algorithms are currently up to date with all mods, including generic scanhash.

JayDDee commented 2 years ago

Allium & Lyra2Z AVX512 & AVX2 are up to date with 2 stage blake256 prehash optimization using linear SIMD for the first stage and Nway parallel for the second. X17 AVX512 & AVX2 have blake512 second stage prehash, first stage not possible. Generic scanhash is not used with prehashing.

JayDDee commented 1 year ago

Many chained algorithms have redundant endian byte swaps that can be eliminated. Blake is often the first hash function in a chain and it either performs a bswap32 (blake256) or bswap64 (blake512). Prior to calling blake a bswap32 is done on the block header.

In the case of blake256 it's fully redundant and both can be eliminated. In the case of blake512 it results in a simple swapping of 32 bits in each 64 bit word which also results in the nonce shifting.

An "LE" version of the blake transform functions is added to implement this optimization as well as associated changes to scanhash.
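
The redundancy can be checked with plain integer arithmetic. The sketch below uses hypothetical helper names (`sk_bswap32`, `sk_bswap64` and so on), not the project's functions, and works on values rather than memory: two successive 32 bit swaps cancel, while a 32 bit swap of each half followed by a 64 bit swap leaves the two halves exchanged.

```c
#include <stdint.h>

static inline uint32_t sk_bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

static inline uint64_t sk_bswap64(uint64_t x)
{
    return ((uint64_t)sk_bswap32((uint32_t)x) << 32) |
           sk_bswap32((uint32_t)(x >> 32));
}

/* blake256 case: header bswap32 followed by the hash's own bswap32. */
uint32_t blake256_double_swap(uint32_t w)
{
    return sk_bswap32(sk_bswap32(w));      /* identity */
}

/* blake512 case: bswap32 on each 32 bit half, then the hash's bswap64. */
uint64_t blake512_double_swap(uint64_t w)
{
    uint64_t halves = ((uint64_t)sk_bswap32((uint32_t)(w >> 32)) << 32)
                    | sk_bswap32((uint32_t)w);
    return sk_bswap64(halves);             /* w with its 32 bit halves exchanged */
}

/* The same result with no byte swaps at all: a 32 bit rotate. */
uint64_t swap_halves(uint64_t w)
{
    return (w << 32) | (w >> 32);
}
```

For example, `blake512_double_swap(0x1122334455667788)` gives `0x5566778811223344`, the same as `swap_halves`.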

JayDDee commented 1 year ago

The blake family of core hash functions can be optimized with linear vectoring (one way). Blake256 & blake2s can use SSE2 while blake512 & blake2b can use SSE2 or AVX2. For practical reasons only blake256 and blake2b have been so optimized at this time. With the exception of midstate prehashing, only possible with small blakes, parallel N-way is usually preferable.

Edit: blake2s is included in v3.21.3

EDIT: No, blake2s won't be included. Testing has shown a negative impact from prehashing blake2s using serial SIMD over parallel hashing. Other algos have not had this problem. blake2s was also slower with centralized prehash, serial and parallel, so that won't be implemented for blake2s either.

JayDDee commented 1 year ago

Another midstate optimization.

Centralize midstate prehash by doing it in stratum thread or when a miner thread returns from getwork and sharing the result with all miner threads. Previously each miner thread would do the prehash for itself.

JayDDee commented 1 year ago

Some old algos have been found not to have proper stats reporting when using an old CPU (#392). Some will be fixed in v3.21.3 but there may be more remaining. They will be fixed as discovered if they can be tested. Testing these algos is difficult, pun intended.

YetAnotherRussian commented 1 year ago

There's a good candidate to add (pufferfish2bmb) https://github.com/De-Crypted/dcrptd-miner/tree/master/Algorithms if there are any plans on adding new algos. I see some new (not really) sha algos in the latest release.

JayDDee commented 1 month ago

The use of Nway notation in hash functions is being changed to Nx64 or Nx32 where appropriate. This notation is already used for interleave functions. This is needed for algos that have implementations using different data sizes. For example Hamsi can be implemented using Nx32 or Nx64. Cubehash can be implemented as pure parallel using Nx32 or a hybrid serial-parallel using Nx128. Nx64 requires larger vectors, and therefore higher features, than Nx32.
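
To make the notation concrete, here is a scalar sketch of the two layouts for N = 4. It assumes element w of lane l sits at index w*4 + l, and that a 64 bit word is built little-endian from two consecutive 32 bit words. The function names are hypothetical. One row of 4x32 is 128 bits wide while one row of 4x64 is 256 bits, which is why Nx64 needs the larger vector for the same N.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 4

/* Index of 32 bit word w of lane l in a 4x32 buffer (uint32_t array). */
static inline size_t idx_4x32(size_t w, size_t l) { return w * LANES + l; }

/* Index of 64 bit word w of lane l in a 4x64 buffer (uint64_t array). */
static inline size_t idx_4x64(size_t w, size_t l) { return w * LANES + l; }

/* Convert a 4x32 buffer to 4x64 in a single pass, without going through a
   de-interleaved copy. words64 is the number of 64 bit words per lane. */
void sketch_rintrlv_4x32_to_4x64(uint64_t *dst, const uint32_t *src,
                                 size_t words64)
{
    for (size_t w = 0; w < words64; w++)
        for (size_t l = 0; l < LANES; l++) {
            uint64_t lo = src[idx_4x32(2 * w, l)];      /* even 32 bit word */
            uint64_t hi = src[idx_4x32(2 * w + 1, l)];  /* odd 32 bit word  */
            dst[idx_4x64(w, l)] = (hi << 32) | lo;
        }
}
```

The two index functions are identical on purpose: only the element width differs between the layouts, and that width is what the x32 or x64 suffix names.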

JayDDee / cpuminer-opt

Unfinished work #266