EthereumCommonwealth / Proposals

BSD 4-Clause "Original" or "Old" License
26 stars 7 forks source link

Switch to a ProgPoW Ethash Algorithm #47

Closed dwarner5522 closed 5 years ago

dwarner5522 commented 6 years ago

Proposal

Summary The following is a proposal for an alternate Ethash proof-of-work algorithm - “ProgPoW” - tuned for commodity hardware in order to close the efficiency gap available to specialized ASICs.

Abstract The security of proof-of-work is built on a fair, randomized lottery where miners with similar resources have a similar chance of generating the next block. For Ethash networks – the release of Bitmain’s Antminer E3 Ethash ASIC enable’s certain participants to gain a much greater chance of generating the next block, and undermine the distributed security. ASIC-resistance is a misunderstood problem. FPGAs, GPUs and CPUs can themselves be considered ASICs. Any algorithm that executes on a commodity ASIC can have a specialized ASIC made for it; most existing algorithms provide opportunities that reduce power usage and cost. Thus, the proper question to ask when solving ASIC-resistance is “how much more efficient will a specialized ASIC be, in comparison with commodity hardware?”

A ProgPoW based Ethash algorithm can be tuned for commodity GPUs where there is minimal opportunity for ASIC specialization. This prevents specialized ASICs without resorting to a game of whack-a-mole where the network changes algorithms every few months.

Motivation Several factors have contributed to the need for Callisto to change to a ProgPoW algorithm. Some of these factors include:

• Bitmain’s release of the Antminer E3 Ethash ASIC miner

• The coming Exodus of Ethereum Miners With the recent decision made at the Ethereum Core Devs Meeting Constantinople Session #1 [08/31/18] (http://bit.ly/2O3hhpJ) to reduce their block reward for ethereum from 3 Ether down to 2 Ether in October 2018 and their committed migration to PoS, small mining farms will no longer be profitable mining Ethereum. Therefore, a large percentage of these miners will be looking to other Ethash based networks to mine. This can be a very good thing for Callisto so long as the playing field is equal among miners. For Callisto - A large mining community will lead to greater popularity and also force many top crypto exchanges to trade CLO on their books.

• A similar small cap coin called Bitcoin Interest (BCI) (https://www.bitcoininterest.io/) has already implemented ProgPoW with great success. This coin has some similarities to the proposed Callisto Cold Staking solution.

Source Info and Credits: Official Code - Github page here: https://github.com/ifdefelse/ProgPOW Article written by OhGodAGirl - https://medium.com/@OhGodAGirl/the-problem-with-proof-of-work-da9f0512dad9 Credit for the majority of this write-up goes to: RadixPi & ifdefelse posted on Ethereum Improvement Proposal 1057 - https://eips.ethereum.org/EIPS/eip-1057

Description

ProgPoW utilizes almost all parts of a commodity GPU, excluding: • The graphics pipeline (displays, geometry engines, texturing, etc); • Floating point math.

Making use of either of these would have significant portability issues between commodity hardware vendors, and across programming languages. Since the GPU is almost fully utilized, there’s little opportunity for specialized ASICs to gain efficiency. Removing both the graphics pipeline and floating point math could provide up to 1.2x gains in efficiency, compared to the 2x gains possible in standard Ethash,

Backwards Compatibility This algorithm is not backwards compatible with the existing Ethash, and will require a fork for adoption. Furthermore, the network hashrate will halve as the time spent in the core is now balanced with time spent in memory.

Test Cases This PoW algorithm was tested against six different models from two different manufacturers. Selected models span two different chips and memory types from each manufacturer (Polaris20-GDDR5 and Vega10-HBM2 for AMD; GP104-GDDR5 and GP102-GDDR5X for NVIDIA). The average hashrate results are listed below. Additional tests are ongoing. As the algorithm nearly fully utilizes GPU functions in a natural way, the results reflect relative GPU performance that is similar to other gaming and graphics applications.


| Model | Hashrate (MH/s) | | ——— | ————— | | RX580 | 9.4 | | Vega56 | 16.6 | | Vega64 | 18.7 | | GTX1070Ti | 13.1 | | GTX1080 | 14.9 | | GTX1080Ti | 21.8 | ——————————-

Implementation ProgPoW is based on Ethash and follows the same general structure. The algorithm has five main changes from Ethash, each tuned for commodity GPUs while minimizing the possible advantage of a specialized ASIC.

The name of the algorithm comes from the fact that the inner loop between global memory accesses is a randomly generated program based on the block number. The random program is designed to both run efficiently on commodity GPUs and also cover most of the GPU’s functionality. The random program sequence prevents the creation of a fixed pipeline implementation as seen in a specialized ASIC. The access size has also been tweaked to match contemporary GPUs.

In contrast to Ethash, the changes detailed below make ProgPoW dependent on the core compute capabilities in addition to memory bandwidth and size.

Changes keccak_f1600 (with 64-bit words) to keccak_f800 (with 32-bit words).

On 64-bit architectures f1600 processes twice as many bits as f800 in roughly the same time. As GPUs are natively 32-bit architectures, f1600 takes twice as long as f800. ProgPow doesn’t require all the bits f1600 can consume, thus reducing the size reduces the optimization opportunity for a specialized ASIC.

Increases mix state.

A significant part of a GPU’s area, power, and complexity is the large register file. A large mix state ensures that a specialized ASIC would need to implement similar state storage, limiting any advantage.

Adds a random sequence of math in the main loop.

The random math changes every 50 blocks to amortize compilation overhead. Having a random sequence of math that reads and writes random locations within the state ensures that the ASIC executing the algorithm is fully programmable. There is no possibility to create an ASIC with a fixed pipeline that is much faster or lower power.

Adds reads from a small, low-latency cache that supports random addresses.

Another significant part of a GPU’s area, power, and complexity is the memory hierarchy. Adding cached reads makes use of this hierarchy and ensures that a specialized ASIC also implements a similar hierarchy, preventing power or area savings.

Increases the DRAM read from 128 bytes to 256 bytes.

The DRAM read from the DAG is the same as Ethash’s, but with the size increased to 256 bytes. This better matches the workloads seen on commodity GPUs, preventing a specialized ASIC from being able to gain performance by optimizing the memory controller for abnormally small accesses. The DAG file is generated according to traditional Ethash specifications, with an additional PROGPOW_SIZE_CACHE bytes generated that will be cached in the L1.

ProgPoW can be tuned using the following parameters. The proposed settings have been tuned for a range of existing, commodity GPUs:

• PROGPOW_LANES: The number of parallel lanes that coordinate to calculate a single hash instance; default is 32. • PROGPOW_REGS: The register file usage size; default is 16. • PROGPOW_CACHE_BYTES: The size of the cache; default is 16 x 1024. • PROGPOW_CNT_MEM: The number of frame buffer accesses, defined as the outer loop of the algorithm; default is 64 (same as Ethash). • PROGPOW_CNT_CACHE: The number of cache accesses per loop; default is 8. • PROGPOW_CNT_MATH: The number of math operations per loop; default is 8.

ProgPoW uses FNV1a for merging data. The existing Ethash uses FNV1 for merging, but FNV1a provides better distribution properties.

ProgPow uses KISS99 for random number generation. This is the simplest (fewest instruction) random generator that passes the TestU01 statistical test suite. A more complex random number generator like Mersenne Twister can be efficiently implemented on a specialized ASIC, providing an opportunity for efficiency gains.

uint32_t fnv1a(uint32_t &h, uint32_t d) { return h = (h ^ d) * 0x1000193; }

typedef struct { uint32_t z, w, jsr, jcong; } kiss99_t;

// KISS99 is simple, fast, and passes the TestU01 suite // https://en.wikipedia.org/wiki/KISS_(algorithm) // http://www.cse.yorku.ca/~oz/marsaglia-rng.html uint32_t kiss99(kiss99_t &st) { uint32_t znew = (st.z = 36969 (st.z & 65535) + (st.z >> 16)); uint32_t wnew = (st.w = 18000 (st.w & 65535) + (st.w >> 16)); uint32_t MWC = ((znew << 16) + wnew); uint32_t SHR3 = (st.jsr ^= (st.jsr << 17), st.jsr ^= (st.jsr >> 13), st.jsr ^= (st.jsr << 5)); uint32_t CONG = (st.jcong = 69069 * st.jcong + 1234567); return ((MWC^CONG) + SHR3); }

The LANES*REGS of mix data is initialized from the hash’s seed.

void fill_mix( uint64_t hash_seed, uint32_t lane_id, uint32_t mix[PROGPOW_REGS] ) { // Use FNV to expand the per-warp seed to per-lane // Use KISS to expand the per-lane seed to fill mix uint32_t fnv_hash = 0x811c9dc5; kiss99_t st; st.z = fnv1a(fnv_hash, seed); st.w = fnv1a(fnv_hash, seed >> 32); st.jsr = fnv1a(fnv_hash, lane_id); st.jcong = fnv1a(fnv_hash, lane_id); for (int i = 0; i < PROGPOW_REGS; i++) mix[i] = kiss99(st); }

The main search algorithm uses the Keccak sponge function (a width of 800 bits, with a bitrate of 448, and a capacity of 352) to generate a seed, expands the seed, does a sequence of loads and random math on the mix data, and then compresses the result into a final Keccak permutation (with the same parameters as the first) for target comparison.

bool progpow_search( const uint64_t prog_seed, const uint64_t nonce, const hash32_t header, const uint64_t target, const uint64_t g_dag, // gigabyte DAG located in framebuffer const uint64_t c_dag // kilobyte DAG located in l1 cache ) { uint32_t mix[PROGPOW_LANES][PROGPOW_REGS]; uint32_t result[4]; for (int i = 0; i < 4; i++) result[i] = 0;

// keccak(header..nonce)
uint64_t seed = keccak_f800(header, nonce, result);

// initialize mix for all lanes
for (int l = 0; l < PROGPOW_LANES; l++)
    fill_mix(seed, l, mix);

// execute the randomly generated inner loop
for (int i = 0; i < PROGPOW_CNT_MEM; i++)
    progPowLoop(prog_seed, i, mix, g_dag, c_dag);

// Reduce mix data to a single per-lane result
uint32_t lane_hash[PROGPOW_LANES];
for (int l = 0; l < PROGPOW_LANES; l++)
{
    lane_hash[l] = 0x811c9dc5
    for (int i = 0; i < PROGPOW_REGS; i++)
        fnv1a(lane_hash[l], mix[l][i]);
}
// Reduce all lanes to a single 128-bit result
for (int i = 0; i < 4; i++)
    result[i] = 0x811c9dc5;
for (int l = 0; l < PROGPOW_LANES; l++)
    fnv1a(result[l%4], lane_hash[l])

// keccak(header .. keccak(header..nonce) .. result);
return (keccak_f800(header, seed, result) <= target);

}

The inner loop uses FNV and KISS99 to generate a random sequence from the prog_seed. This random sequence determines which mix state is accessed and what random math is performed. Since the prog_seed changes relatively infrequently it is expected that progPowLoop will be compiled while mining instead of interpreted on the fly.

kiss99_t progPowInit(uint64_t prog_seed, int mix_seq[PROGPOW_REGS]) { kiss99_t prog_rnd; uint32_t fnv_hash = 0x811c9dc5; prog_rnd.z = fnv1a(fnv_hash, prog_seed); prog_rnd.w = fnv1a(fnv_hash, prog_seed >> 32); prog_rnd.jsr = fnv1a(fnv_hash, prog_seed); prog_rnd.jcong = fnv1a(fnv_hash, prog_seed >> 32); // Create a random sequence of mix destinations for merge() // guaranteeing every location is touched once // Uses Fisher–Yates shuffle for (int i = 0; i < PROGPOW_REGS; i++) mix_seq[i] = i; for (int i = PROGPOW_REGS - 1; i > 0; i--) { int j = kiss99(prog_rnd) % (i + 1); swap(mix_seq[i], mix_seq[j]); } return prog_rnd; }

The math operations that merge values into the mix data are ones chosen to maintain entropy. // Merge new data from b into the value in a // Assuming A has high entropy only do ops that retain entropy // even if B is low entropy // (IE don't do A&B) void merge(uint32_t &a, uint32_t b, uint32_t r) { switch (r % 4) { case 0: a = (a 33) + b; break; case 1: a = (a ^ b) 33; break; case 2: a = ROTL32(a, ((r >> 16) % 32)) ^ b; break; case 3: a = ROTR32(a, ((r >> 16) % 32)) ^ b; break; } }

The math operations chosen for the random math are ones that are easy to implement in CUDA and OpenCL, the two main programming languages for commodity GPUs. // Random math between two input values uint32_t math(uint32_t a, uint32_t b, uint32_t r) { switch (r % 11) { case 0: return a + b; case 1: return a * b; case 2: return mul_hi(a, b); case 3: return min(a, b); case 4: return ROTL32(a, b); case 5: return ROTR32(a, b); case 6: return a & b; case 7: return a | b; case 8: return a ^ b; case 9: return clz(a) + clz(b); case 10: return popcount(a) + popcount(b); } }

The main loop: // Helper to get the next value in the per-program random sequence

define rnd() (kiss99(prog_rnd))

// Helper to pick a random mix location

define mix_src() (rnd() % PROGPOW_REGS)

// Helper to access the sequence of mix destinations

define mix_dst() (mix_seq[(mix_seq_cnt++)%PROGPOW_REGS])

void progPowLoop( const uint64_t prog_seed, const uint32_t loop, uint32_t mix[PROGPOW_LANES][PROGPOW_REGS], const uint64_t g_dag, const uint32_t c_dag) { // All lanes share a base address for the global load // Global offset uses mix[0] to guarantee it depends on the load result uint32_t offset_g = mix[loop%PROGPOW_LANES][0] % DAG_SIZE; // Lanes can execute in parallel and will be convergent for (int l = 0; l < PROGPOW_LANES; l++) { // global load to sequential locations uint64_t data64 = g_dag[offset_g + l];

    // initialize the seed and mix destination sequence
    int mix_seq[PROGPOW_REGS];
    int mix_seq_cnt = 0;
    kiss99_t prog_rnd = progPowInit(prog_seed, mix_seq);

    uint32_t offset, data32;
    int max_i = max(PROGPOW_CNT_CACHE, PROGPOW_CNT_MATH);
    for (int i = 0; i < max_i; i++)
    {
        if (i < PROGPOW_CNT_CACHE)
        {
            // Cached memory access
            // lanes access random location
            offset = mix[l][mix_src()] % PROGPOW_CACHE_WORDS;
            data32 = c_dag[offset];
            merge(mix[l][mix_dst()], data32, rnd());
        }
        if (i < PROGPOW_CNT_MATH)
        {
            // Random Math
            data32 = math(mix[l][mix_src()], mix[l][mix_src()], rnd());
            merge(mix[l][mix_dst()], data32, rnd());
        }
    }
    // Consume the global load data at the very end of the loop
    // Allows full latency hiding
    merge(mix[l][0], data64, rnd());
    merge(mix[l][mix_dst()], data64>>32, rnd());
}

}

Please refer to the official code located at ProgPOW for the full code, implemented in the standard ethminer.

Source Info and Credits:

Official Code - Github page here: https://github.com/ifdefelse/ProgPOW Article written by OhGodAGirl - https://medium.com/@OhGodAGirl/the-problem-with-proof-of-work-da9f0512dad9 Credit for the majority of this write-up goes to: RadixPi & ifdefelse posted on Ethereum Improvement Proposal 1057 - https://eips.ethereum.org/EIPS/eip-1057

Funding goal

None

ghost commented 6 years ago

There is no working implementation for ProgPOW yet, maybe we can discuss once there is a working coins with ProgPOW

Neofito89 commented 5 years ago

There is no working implementation for ProgPOW yet, maybe we can discuss once there is a working coins with ProgPOW

BCI now is working with ProgPOW. I think is an interesting topic. Next fork in eth has been moved, but still is gonna happen soon or later.

cypheron commented 5 years ago

Implementations are readily available: https://github.com/ifdefelse/go-ethereum https://github.com/ethereum/go-ethereum/pull/17731

Callisto sets one of his main goals research and development. Having a reference implementation of ProgPOW ready on a ETH/ETC-like chain would be a great achievement. This would set a positive signal to other teams.

However, I expect that ETH (and likely ETC) team is going to commit to ProgPOW. Algo changes can be merged over to Callisto subsequently.

ayman-alawneh commented 5 years ago

Any coin looking to increase its value should fork out from an asic friendly algorithm simply because asics are ran by big wearhouses that flood the market with coins to cover their expenses and profit as fast as they can Miners tend to hold coins for long term profits