ifdefelse commented 6 years ago

Abstract

We propose an alternate proof-of-work algorithm - “ProgPoW” - tuned for commodity hardware in order to close the efficiency gap available to specialized ASICs.

The security of proof-of-work is built on a fair, randomized lottery where miners with similar resources have a similar chance of generating the next block.

For Zcash - a community based on widely distributed commodity hardware - specialized ASICs enable certain participants to gain a much greater chance of generating the next block and undermine the distributed security.

ASIC-resistance is a misunderstood problem. FPGAs, GPUs and CPUs can themselves be considered ASICs. Any algorithm that executes on a commodity ASIC can have a specialized ASIC made for it; most existing algorithms provide opportunities that reduce power usage and cost. Thus, the proper question to ask when solving ASIC-resistance is “how much more efficient will a specialized ASIC be, in comparison with commodity hardware?”

This proposal presents an algorithm that is tuned for commodity GPUs where there is minimal opportunity for ASIC specialization. This prevents specialized ASICs without resorting to a game of whack-a-mole where the network changes algorithms every few months.

Motivation

Since proof-of-work will continue to underpin the security of the Zcash network, it is important that this consensus isn’t dominated by a single party in control of a large portion of the compute power. The existence of the Z9 ASIC miner proves that the current Equihash algorithm allows significant efficiency gains from specialized hardware. The ~150mb of state in Equihash is large but possible on an ASIC. Furthermore, the binning, sorting, and comparing of bit strings could be implemented on an ASIC at extremely high speed.

Thus, a new hash algorithm is needed to replace the current Equihash. This algorithm must be permanently resistant to domination by specialized hardware and avoid further PoW code forks.

Design Specification

ProgPoW preserves the 256-bit nonce size for Zcash and thus does not require any changes to light clients. However, it entirely replaces the rest of the Equihash PoW algorithm.

ProgPoW is based on Dagger Hashimoto and follows the same general structure. Dagger Hashimoto was selected as a base due to its simple algorithm and memory-hard nature. Dagger Hashimoto has also withstood a significant amount of real-world “battle testing.” Finally, starting from a simple algorithm makes the changes easy to understand and thus easy assess for strengths and weakness. Complexity tends to imply a larger attack surface.

The name of the algorithm comes from the fact that the inner loop between global memory accesses is a randomly generated program based on the block number. The random program is designed to both run efficiently on commodity GPUs and also cover most of the GPU's functionality. The random program sequence prevents the creation of a fixed pipeline implementation as seen in a specialized ASIC. The access size has also been tweaked to match contemporary GPUs.

ProgPoW utilizes almost all parts of a commodity GPU, excluding:

The graphics pipeline (displays, geometry engines, texturing, etc);
Floating point math. Making use of either of these would have significant portability issues between commodity hardware vendors, and across programming languages.

Since the GPU is almost fully utilized, there’s little opportunity for specialized ASICs to gain efficiency. Removing both the graphics pipeline and floating point math could provide up to 1.2x gains in efficiency, compared to the 2x gains possible in Dagger Hashimoto, 50x gains possible for CryptoNight, and ~100x gains possible in Equihash.

The algorithm has five main changes from Dagger Hashimoto, each tuned for commodity GPUs while minimizing the possible advantage of a specialized ASIC. In contrast to Dagger Hashimoto, the changes detailed below make Prog-PoW dependent on the core compute capabilities in addition to memory bandwidth and size.

Changes keccak to blake2s. Zcash already uses BLAKE2 in various locations, so it makes sense to continue using BLAKE2. GPUs are natively 32-bit architectures so blake2s is used. Both blake2b and blake2s provide the same security but are tuned for 64-bit and 32-bit architectures respectively. Blake2s runs roughly twice as fast on a 32-bit architecture as blake2b.
Increases mix state. A significant part of a GPU’s area, power, and complexity is the large register file. A large mix state ensures that a specialized ASIC would need to implement similar state storage, limiting any advantage.
Adds a random sequence of math in the main loop. The random math changes every 50 blocks to amortize compilation overhead. Having a random sequence of math that reads and writes random locations within the state ensures that the ASIC executing the algorithm is fully programmable. There is no possibility to create an ASIC with a fixed pipeline that is much faster or lower power.
Adds reads from a small, low-latency cache that supports random addresses. Another significant part of a GPU’s area, power, and complexity is the memory hierarchy. Adding cached reads makes use of this hierarchy and ensures that a specialized ASIC also implements a similar hierarchy, preventing power or area savings.
Increases the DRAM read from 128 bytes to 256 bytes. The DRAM read from the DAG is the same as Dagger Hashimoto’s, but with the size increased to 256 bytes. This better matches the workloads seen on commodity GPUs, preventing a specialized ASIC from being able to gain performance by optimizing the memory controller for abnormally small accesses. The DAG file is generated according to traditional Dagger Hashimoto specifications, with an additional ProgPoW_SIZE_CACHE bytes generated that will be cached in the L1.

ProgPoW can be tuned using the following parameters. The proposed settings have been tuned for a range of existing, commodity GPUs:

ProgPoW_LANES: The number of parallel lanes that coordinate to calculate a single hash instance; default is 32.
ProgPoW_REGS: The register file usage size; default is 16.
ProgPoW_CACHE_BYTES: The size of the cache; default is 16 x 1024.
ProgPoW_CNT_MEM: The number of frame buffer accesses, defined as the outer loop of the algorithm; default is 64 (same as Dagger Hashimoto).
ProgPoW_CNT_CACHE: The number of cache accesses per loop; default is 8.
ProgPoW_CNT_MATH: The number of math operations per loop; default is 8.

ProgPoW uses FNV1a for merging data. The existing Dagger Hashimoto uses FNV1 for merging, but FNV1a provides better distribution properties.

ProgPoW uses KISS99 for random number generation. This is the simplest (fewest instructions) random generator that passes the TestU01 statistical test suite. A more complex random number generator like Mersenne Twister can be efficiently implemented on a specialized ASIC, providing an opportunity for efficiency gains.

Backwards Compatibility

This algorithm is not backward compatible with the existing Equihash implementation and will require a fork for adoption. Furthermore, the network hash rate will change dramatically as the units of compute are re-normalized to this new algorithm.

Testing

This PoW algorithm was implemented and tested against six different models from two different manufacturers. Selected models span two different chips and memory types from each manufacturer (Polaris20-GDDR5 and Vega10-HBM2 for AMD; GP104-GDDR5 and GP102-GDDR5X for NVIDIA). The average hashrate results are listed below. Additional tests are ongoing.

As the algorithm nearly fully utilizes GPU functions in a natural way, the results reflect relative GPU performance that is similar to other gaming and graphics applications.

Model	Hashrate (MH/s)
RX580	9.4
Vega56	16.6
Vega64	18.7
GTX1070Ti	13.1
GTX1080	14.9
GTX1080Ti	21.8

Sample Code

Sample code is available here.

Schedule and Funding Request

This proposal is submitted to the foundation to review and adopt as a priority. We ask that any funding goes toward any outside development, review, testing, and implementation efforts as the foundation deems appropriate. We humbly suggest that this proposal be integrated to the soonest planned fork (as feasible) to minimize disruption to the network.

T3N2aWsK commented 6 years ago

Here are my observations after going through the code provided in the proposal:

There are no definitions of DAG_SIZE or ProgPoW_CACHE_WORDS.
The code in the proposal could use some tuning of the order in which functions are presented so that it's easier to understand its structure for a first-time reader.
The algorithm isn't actually well defined, as the evaluation order of function parameters is unspecified, and there are e.g. 3 calls to kiss99() in the parameter list for the call to math().
The inner loop uses the functions merge() and kiss99() a lot, and the most complex parts of those are multiplications with constants. The Hamming weights of those constants are 2, 5, 6 and 9, allowing an ASIC to save energy by computing them as 2-, 5-, 6- and 9-way additions of less than 32 bits.
The proof-of-capacity part (ref. the data64 variable) can easily be supported with a relatively low-cost RAM. This is read-only (as in very infrequently updated) and can be shared between many staggered instances of ProgPoWLoop loading 256 consecutive bytes at a time.

ifdefelse commented 6 years ago

@T3N2aWsK , @PrometheusXDE, @solardiz, @donamelo We thank you for your feedback. We encourage folks to continue to review this proposal from a technical perspective. We hope reviewers can start with a deeper technical understanding of what is being proposed, especially from the hardware side of things.

re: DAG_SIZE. We should have been clearer with this. However, we assumed that Dagger Hashimoto was sufficiently documented so that it didn't need to be repeated here. This is the size of the DAG and the number of 256 byte words in it. The size is based on Dagger Hashimoto's formula for determining DAG size from block number.
re: code ordering. We always welcome feedback on improving clarity. How would you like to see the code presented?
re: function parameters. We believe the ordering is well defined in C. Parameters are evaluated from left to right. Would specifying that this is C code help clear up the confusion?
re: ASIC on inner loop. We can see how you might be confused by the concept of programmatic workload generation. The inner loop you are referencing is part of the programmatic generation of the workload algorithm and not actually part of the PoW workload itself. The calls to kiss99 only happen during generation to determine which ops and registers to execute. For the code executed during mining, that part is compiled away.
re: proof-of-capacity. The size of the DAG is currently 2-3 GB and can be easily increased. However, we didn't want to disadvantage miners with smaller memory sizes. We are willing to consider increasing the memory requirement. Note that the bandwidth (throughput) is just as important. Dagger Hashimoto requires both significant memory bandwidth and capacity to support for mining. Again Dagger Hashimoto's memory hardness is well battle-tested: it requires not just capacity, but also high throughput and predictable latency-INsensitivity. In fact, we've made it even more latency-insensitive by increasing the read size. We have done an extensive cost-efficiently analysis based on RAM pricing from the trough 1+ year ago and the recent RAM price spike for various types of RAM and the corresponding silicon area the interface would demand. Do this analysis for any silicon and you'll see it's very hard to beat GPUs for price per throughput-capacity.

MoneroCrusher commented 6 years ago

@PrometheusXDE was just writing this response and wanted to say wait a couple days and you'll see @ifdefelse refute every single point proficiently (just like they did in their reddit thread https://www.reddit.com/r/EtherMining/comments/8k4zc7/comment/dz5v2r3?st=JIO8LL6E&sh=54bb7b39) but they beat me to it!

@T3N2aWsK 's account is just 2 days old (edit: not 5) and clearly he/she misunderstood the algo (or maybe even wanted to spread FUD, as there are big economical consequences for some people (ppl that bought ASICs + manufacturers)).

About the author (from my perspective):

best open source cryptonight miner (xmr-stak) is based on their work.
ETHLargementPill, which increased Nvidia GPU Ethash hashrate by over 50%. They gave it out for free.
various tools by OhGodACompany.
helping out folks in various forums, chats etc

So in my eyes there's no reason to distrust them. But ProgPOW is open source anyway so.. I don't care either if they do that in their self-interest (ie running mining GPUs).

What I (and presumably every cryptocurrency dev) care about is decentralisation, and blockchain promises it. It is up to the software devs to keep that promise and ProgPOW is the best way to achieve that for now.

It makes GPUs king in town and luckily AMD and NVIDIA have a public face to lose in case of shady business while Bitmain et al don't (or they already did? Re: Antbleed).

There's never perfect decentralisation but for now it's better to go the way of most possible alignment with decentralisation (AMD/Nvidia/Intel/Samsung) than to go full retard and let ASIC manufacturers control 80% of every single algo. I promise you: they do pre-mine. They won't sell 50 day break-even miners to consumers, their investors wouldnt let them. The consumer will always be the ones chasing the carrot.

So, I'm all for ProgPOW! Implement it in a testnet and see how it goes. Then mainnet.

One critique: DAG size shouldn't be 2-3GB (or even higher). It should be way less, so everyone can participate. Even integrated graphics. As I remember DAG-size increase was implemented to defeat ASICs, right? So this will be moot once ProgPOW is implemented, as ASICs would always be as fast as GPUs. Correct me if I'm wrong :)

PrometheusXDE commented 6 years ago

I dont agree, because the hacker who hackd my account is an dickhead

solardiz commented 6 years ago

@MoneroCrusher @PrometheusXDE Folks, please don't treat reviews pointing out (potential) issues as attacks on the algorithm or on its authors, and please don't treat the existence of issues (even if confirmed) as indicative of incompetence or whatever. It is perfectly normal for a reviewer (especially a volunteer) to be confused at first and find non-issues. If e.g. 1 in 10 issues is for real, that's worth it. And some of the rest might point to a need for better documentation. It is also perfectly normal for there to be issues in an algorithm as initially proposed. That's why independent review is needed. BTW, thank you for linking to that Reddit thread - I wasn't aware of it - it's very good news that someone intends to try implementing this on FPGA, and again it's no problem they misunderstood things at first.

@ifdefelse No, the evaluation order of function arguments in C is in fact unspecified, and may/will vary between compilers, etc. I haven't checked whether/where in the code you have that issue, though - I'm just commenting in general.

T3N2aWsK commented 6 years ago

@ifdefelse Thanks for your feedback.

The intention of my comments is neither to shoot down nor endorse this project, but to help with the analysis of its merits and making it better. The developers are obviously free to use my comments to improve their code and algorithm.

Thanks for the clarification.
As a general rule I would suggest presenting it either top-down or bottom-up. Bottom-up would be the natural choice in C, defining each function before its first use.
I haven't verified by checking the latest standard documents, but I have never seen the parameter evaluation order being specified for C or C++. For a discussion on this topic see e.g. https://stackoverflow.com/questions/9566187/function-parameter-evaluation-order
Ok, I missed the relative const-ness of prog_seed when focusing on ProgPoWLoop(). mix_dst(), mix_src() and rnd() all use kiss99() in a different way. mix_dst() has its values generated and stored in a lookup table, and then rnd() and mix_src() share state while generating the pseudorandom load and operation sequence. This is not a very clear and readable way of defining the algorithm. With the calls to kiss99() being part of precomputation and compilation, we are left with merge() which uses multiplication by 33. This will of course be implemented with a 27-bit adder in hardware, but any good coder or compiler will also replace it with a shift and an add in software. Hence this won't be a big win for ASIC.
Ack. With the pseudorandom sequence precomputed, there is not so much left in the inner loop, making this consume significant bandwidth.

ifdefelse commented 6 years ago

@T3N2aWsK on point 3, we stand corrected. Thanks for pointing this out. We will update the code. We will also update the documentation based on your feedback on point 2 when we get a chance.

alexcryptan commented 6 years ago

This was a useful read, as well as the comments: https://medium.com/@OhGodAGirl/the-problem-with-proof-of-work-da9f0512dad9

I am wondering if it is possible to do ProgPow style algo tailored to CPU. Seems a better thing than tuning for GPU.

MoneroCrusher commented 6 years ago

@alexcryptan Youd probably have problems with botnets then. But I also think a mix between GPU and CPU would be best, slightly favouring GPU.

alexcryptan commented 6 years ago

@MoneroCrusher with 2-3GB or required RAM it would be "botnet resistant" - very noticeable to the end user.

bitcartel commented 6 years ago

@ifdefelse ProgPoW was discussed and mentioned a few times at the Zcon0 mining workshop: https://forum.z.cash/t/what-happened-at-the-zcon0-mining-workshop-and-more/30062/

tromer commented 6 years ago

@ifdefelse, we are finalizing decisions on the grants. Are there updates about the ProgPoW design, progress, plans or funding requirements that we should be aware of?

greerso commented 6 years ago

Ethereum Core developers get answers to questions on ProgPow from IfDef~~Else~~. https://youtu.be/z2mefVnZHpw?t=48m45s

OhGodAGirl commented 6 years ago

@ifdefelse, we are finalizing decisions on the grants. Are there updates about the ProgPoW design, progress, plans or funding requirements that we should be aware of?

Hey Eran!

There are no main changes to the ProgPoW design, and progress is going well - we're working closely with the ETH team for integration. Nothing has changed on the funding, but I will be updating the EIP to simply add more information/transparency and more clarity to a lot of areas, and that will translate to the ZIP, naturally.

So there's more on the education front coming.

ifdefelse commented 6 years ago

Hello team,

We are unable to update our official ZGP proposal at this time, but I would like to make some brief statements in order to clarify some concerns of the public.

We are aware that significant work needs to be done on the education level to have ProgPoW make sense to the general public. When we originally wrote this EIP and spec, we positioned it as a piece just for hardware designers and engineers, or people with experience like ourselves - not the general public. We're now learning the chaos that causes, and we're going to adjust accordingly.
We seek to release open source prototype implementations of an Equihash variant of ProgPoW, as well as the Ethash variant of ProgPoW (currently being worked on for enterprise level use by various members of the Ethereum community), in both the mining and node space. We'd hope that the engineers at the Zcash Foundation could assist us on this part, because we are not node developers ourselves, and as the entire premise of our Medium article is simply 'do what you do best, and leave the others to the rest', we'd hope that the node development could be left to experienced node developers, the miner development left to experienced miner developers, and the hardware design left to us.
There is a public implementation of ProgPoW called Bitcoin Interest that gives real world testing data for the power consumption and hashrate of ProgPoW. Note that it is unique code, and doesn't necessarily follow the spec, but the premise is the same (utilizing all parts of the GPU- the difference is in the node and verification times).
AMD and NVIDIA have independently reviewed ProgPoW, and so have multiple engineers.
We are seeking to make a modified request of 10k, maximum, for the research, development and design of ProgPoW for Equihash (the open source implementation). This will simply cover the burn rate of one developer, and unused funds would be returned back to the ZCash Foundation. We are not able to do the development, due to the nature of where we are now in October- back in May, things were a lot different.
The justification for Dagger Hashimoto was simply due to the natural memory hardness of the Ethash algorithm. ProgPoW acts as an extension of this, but there is no reason that the variants of Equihash can't be adapted to be naturally tuned to all existing parts of a GPU card, thus creating a ProgPoW Equihash variant.

The official ZGP proposal will be updated to match these posts accordingly.

Oh, and on our identities, and our backgrounds, we'll be tying our skillset more into the medium post that is an extension of what was already posted. But please keep in mind that an algorithm should be reviewed based on it's own merits- not on the identities of the people who created it. That is why Satoshi Nakamoto chose to remain anonymous. The technology needs to speak for itself.

solardiz commented 6 years ago

Thanks for the update @ifdefelse! It would also be helpful to have some pointers on a few things you referenced. I searched for Bitcoin Interest and found that it's a cryptocurrency that recently hard-forked to use ProgPoW (or something similar, as you say). But I can't similarly search for AMD's and NVIDIA's reviews of ProgPoW, for lack of sufficiently focused keywords to search for. Are their reviews public?

"Equihash variant of ProgPoW" aka "ProgPoW for Equihash" sounds exciting at first, but starts to fall apart the more I think of it. I assume ProgPoW would be used for the memory filling phase (in place of the sequential writes with BLAKE2b, which Zcash's Equihash currently uses) and would be followed by the usual Equihash processing? If so, the Equihash phase would be as (moderately) ASIC-friendly as Zcash's Equihash currently is, and we'd need to ensure that either the memory filling phase contributes to a larger portion of the total processing cost or the frequent and large data transfer from a GPU-like to a more specialized Equihash ASIC would be prohibitively expensive. With this added requirement, having Equihash at the end of the processing feels superfluous - we're simply designing a new sequential memory-hard scheme (with the usual memory usage vs. verification time trade-off), and could use its computed hash value directly. Having Equihash's collision search step at the end makes sense if the collision-producing hashes can be recomputed quickly, thereby providing fast verification, but this weakens the memory filling (is inconsistent with it being sequential memory-hard). Is this correct, or do you have something different in mind?

A potentially better "Equihash variant of ProgPoW" aka "ProgPoW for Equihash" would be to enhance Equihash's collision search phase, applying it to a ProgPoW'ish function of the memory contents rather than to the actual memory contents directly. This would need to be done in a way requiring that the computation be done dynamically on every ex-memory lookup rather than in advance (which would essentially replace the memory contents, to be followed by classic Equihash, thereby defeating our attempt at requiring GPU-like hardware), and that's tricky (perhaps achievable through inclusion of frequent memory writes, so that the collision search would be over changing data, or/and through doing the collision search over a much larger amount of expanded data, where each lookup of the virtual, say, 1 TB of data would be based on a function of multiple lookups from the real, say, 3 GB of data). Do you possibly have specific ideas on it? Overall, I think this is a research project requiring multiple iterations of (re-)design and attacks and tweaks. It's not just an engineering effort, and thus I see no way how it'd reasonably fit in the moderate budget you suggest, unless the research phase has somehow already been done?

sonyamann commented 6 years ago

I'm thrilled to inform you that the Grant Review Committee and the Zcash Foundation Board of Directors have approved your proposal, pending a final compliance review. Congratulations, and thank you for the excellent submission!

Next steps: Please email josh@z.cash.foundation from an email address that will be a suitable point of contact going forward. We plan to proceed with disbursements following a final confirmation that your grant is within the strictures of our 501(c)(3) status, and that our payment to you will comply with the relevant United States regulations.

We also wish to remind you of the requirement for monthly progress updates to the Foundation’s general mailing list, as noted in the call for proposals.

Before the end of this week, the Zcash Foundation plans to publish a blog post announcing grant winners to the public at large, including a lightly edited version of the Grant Review Committee’s comments on your project. The verbatim original text of the comments can be found below.

Congratulations again!

Grant Review Committee comments:

Proposed by the pseudonymous team "ifdefelse". This team developed ProgPoW, an alternative proof of work algorithm aiming to be ASIC-resistant but GPU-friendly. Under this proposal (in its revised form), the team will integrate ProgPoW with Zcash, in the sense of releasing a proof-of-concept fork of the Zcash node software that uses ProgPoW instead of Equihash, as well as compatible open-source mining software and operating testnet. This will facilitate empirically evaluation of the difficulty and network behavior of such integration.

While not all committee members believe the notion of ASIC-resistance is an important one to cryptocurrencies, the committee gave submissions related to proof-of-work and ASIC-resistance the benefit of the doubt, judging these submissions under the assumption that the Zcash community is interested in ASIC-resistant developments.

Within this space, the committee sees the ProgPoW as having a high potential merit. We observe that ProgPoW is considered by many in the community as a promising alternative to classic proof of work, and an interesting variation of GPU-friendly ASIC resistance. We recognize the large body of work contributed by ifdefelse in developing an implementation of ProgPoW. We also observe that ProgPoW is being considered for use in Ethereum, with experimental integration underway.

We believe that the suggested prototype integration of ProgPoW with the Zcash code base will significantly further the ongoing discussion of potential PoW changes in ZCash, by demonstrating the technical feasibility, identifying difficulties, enabling experimentation with dynamics of hardness adjustment under various loads, and also -- should ProgPoW be chosen for future Zcash network upgrades -- reducing the time for developing that upgrade.

Thus, funding of this proposal is recommended. Moreover, we encourage the Zcash Foundation to support complementary efforts to independently evaluate the security properties of ProgPoW, within or outside the Grants program.

mms710 commented 5 years ago

@ifdefelse Hi! I'd love to help provide any engineering support you might need from the engineers at Zcash Company. Would you be interested in getting a more regular line of communication set up between you and Zcash Company engineers so you can ask questions or get help? If so, would you be okay using a Rocketchat channel or do you have another preferred method of communication?

acityinohio commented 5 years ago

@ifdefelse please check your email, we are awaiting a response from you

acityinohio commented 5 years ago

@ifdefelse Due to several months of radio silence from the grant winners, we are closing this issue and not funding the grant. We have updated the 2018-Q2 blog entry to reflect this decision. We sincerely hope the ProgPoW team continues its research.

ZcashFoundation / GrantProposals-2018Q2

ProgPoW: A programmatic (dynamic) Proof-of-Work algorithm tuned for commodity hardware #15

Motivation

Design Specification

Backwards Compatibility

Testing

Sample Code

Schedule and Funding Request