This looks like a solid pre-proposal to me. I'm not familiar with the original Dagger Hashimoto and thus can't fully evaluate it yet; nevertheless, the proposed changes all make sense to me from the perspective of using GPUs fully and optimally.
My (limited) understanding is that Dagger Hashimoto is an asymmetric PoW just like Equihash is - that is, validating a proof is lightweight and nearly instant, whereas computing a proof is heavy and slow (but not too slow as to make the PoW non-progress-free - the hashrates on the order of 10 to 20 per second given in the proposal are OK'ish in that respect, meaning there are lots of PoW computations per block interval).
However, can't we achieve a similar effect with a simpler PoW? I think an instantiation of yescrypt/pwxform with low-level parameters chosen to fit and maximally use a GPU can have similar properties, except for not generating "a random sequence of math" and technically being symmetric. The latter can be offset by setting yescrypt's tunable costs to be low and using a large ROM. A major drawback is that it means only moderately fast validation (not nearly instant) - like taking a millisecond per CPU core - and that the machine doing the validation needs to have enough RAM to accommodate yescrypt's ROM.
As to "a random sequence of math", how would it be generated so that it utilizes most of a GPU's computing resources yet lacks optimization potential (or has obvious optimization potential and no more, so that the GPU miners can utilize it easily)? I think it'd be tough to make use of more than a handful of different instructions in there while also meeting our other desirable properties. So target the different execution units (perhaps integer MADs and local/shared memory lookups/writes, like yescrypt's pwxform does? anything else we can easily use?) while keeping the instruction set use rather limited? Would this be very different from a non-random sequence then, in terms of relative resource utilization on GPUs vs. more specialized ASICs? That's unclear to me.
A "random number generator" may be unnecessary as a separate component given that per this proposal we'd effectively add something similar to yescrypt pwxform's S-boxes. A very simple transformation like a pwxform round and subsequent S-box writes results in changing S-box contents that also pass randomness tests.
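To illustrate the kind of transformation I mean, here is a rough sketch loosely modeled on a pwxform round with S-box reads and writes (illustrative only - the names, widths, and sizes are mine, not yescrypt's actual code):

#include <stddef.h>
#include <stdint.h>

/* Illustrative pwxform-style round: one wide integer MAD plus two S-box
 * lookups, then a write back into a third S-box, so the table contents
 * keep changing and also pass randomness tests as noted above. */
static uint64_t S0[256], S1[256], S2[256];  /* hypothetical small S-boxes */
static size_t wpos;                         /* rotating write position */

static uint64_t pwx_round(uint64_t x)
{
    uint64_t lo = (uint32_t)x, hi = x >> 32;
    size_t i0 = lo & 255, i1 = hi & 255;    /* indices derived from the data */

    x = (hi * lo + S0[i0]) ^ S1[i1];        /* integer MAD + table lookups */
    S2[wpos] = x;                           /* S-box write: contents evolve */
    wpos = (wpos + 1) & 255;
    return x;
}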
The stated expected limitations of specialized ASICs' "up to 1.2x gains in efficiency, compared to the 2x gains possible in Dagger Hashimoto" look overly optimistic to me, although not necessarily impossible. When we're talking very small gains like 1.2x, it becomes especially important to consider what's included in GPU prices available to miners - and it's by far not just the production costs. The inefficiencies in gaming GPUs' (components) markets when those GPUs are purchased by/for miners are probably in excess of 20% compared to a market for a company's specialized mining ASICs (let alone that company's own use of the ASICs).
It's also important to consider whether our PoW algorithm would be somewhat tolerant to hardware errors (which is another way to improve usable yield and lower production costs) or not at all. Equihash is somewhat tolerant. It sounds to me like this new PoW won't be as tolerant, which is a good thing.
Please read the above as a friendly review and challenging of the proposal to help make it better.
I am not familiar with FNV1a, but FWIW per the table at https://github.com/leo-yuriev/t1ha variations of t1ha (including those optimized for 32-bit) are the fastest non-cryptographic hashes that pass SMHasher tests, whereas FNV1a_YoshimitsuTRIAD is slightly faster yet but fails tests.
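(For readers who, like me, had to look it up: FNV1a is just a byte-at-a-time XOR-then-multiply hash. A minimal 32-bit version, with the standard offset basis and prime:)

#include <stddef.h>
#include <stdint.h>

/* 32-bit FNV-1a: XOR the data in first, then multiply by the FNV prime. */
static uint32_t fnv1a_32(const uint8_t *data, size_t len)
{
    uint32_t h = 0x811c9dc5u;               /* FNV offset basis */
    for (size_t i = 0; i < len; i++)
        h = (h ^ data[i]) * 0x01000193u;    /* FNV prime */
    return h;
}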
I just found that the code and more documentation are already available at https://github.com/ifdefelse/ProgPOW
This answers my question on what kind of random math is being used, but doesn't address the concerns associated with it: optimization potential that miners will compete in exploring, only a few instructions being used, and this not necessarily being better than optimally-pre-weighted use of MADs - in fact, introducing relatively lightweight ops to the mix might leave a GPU's MAD units under-utilized and thus let an ASIC get away with implementing fewer MAD units.
Quoting from the author's documentation linked above:
void merge(uint32_t &a, uint32_t b, uint32_t r)
{
    switch (r % 4)
    {
    case 0: a = (a * 33) + b; break;
    case 1: a = (a ^ b) * 33; break;
    case 2: a = ROTL32(a, ((r >> 16) % 32)) ^ b; break;
    case 3: a = ROTR32(a, ((r >> 16) % 32)) ^ b; break;
    }
}
uint32_t math(uint32_t a, uint32_t b, uint32_t r)
{
    switch (r % 11)
    {
    case 0: return a + b;
    case 1: return a * b;
    case 2: return mul_hi(a, b);
    case 3: return min(a, b);
    case 4: return ROTL32(a, b);
    case 5: return ROTR32(a, b);
    case 6: return a & b;
    case 7: return a | b;
    case 8: return a ^ b;
    case 9: return clz(a) + clz(b);
    case 10: return popcount(a) + popcount(b);
    }
}
(In the implementation, this choice is compile-time (kernel source code is produced to be compiled for the target GPU), so the maybe-inefficiency of modulo division is not an issue.)
Out of these many operations, only the two MULs are heavy (and they miss on the opportunity to be MADs, to more fully utilize that hardware).
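For illustration only (this is not in ProgPoW's actual code), a hypothetical 3-input variant that would turn those multiplies into MADs might look like:

#include <stdint.h>

static uint32_t mul_hi32(uint32_t a, uint32_t b)    /* portable mul_hi() */
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Hypothetical sketch, not part of ProgPoW: a 3-input multiply-add step
 * that would keep both the multiplier and the adder busy in one operation. */
static uint32_t math_mad(uint32_t a, uint32_t b, uint32_t c, uint32_t r)
{
    switch (r % 2)
    {
    case 0:  return a * b + c;              /* low 32 bits of a*b, plus c */
    default: return mul_hi32(a, b) + c;     /* high 32 bits of a*b, plus c */
    }
}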
BTW, the focus on staying 32-bit for GPU friendliness may be excessive, especially if we also have a goal to maintain some register pressure so that the large register files are used nearly-fully. Modern GPUs do have some "64-bit" instructions operating on 32-bit elements in pairs of registers for each operand, and in a sense it's wasteful not to use those. For example, at least at NVIDIA PTX asm level (and probably also at ISA level and in hardware, but I didn't confirm) there's 64-bit addition (that is, with carry between the two 32-bit halves) as part of the high 32-bit MAD instruction. From a Japanese yescrypt-0.5'ish coin miner:
__device__ __forceinline__ uint2 mad64(const uint32_t a, const uint32_t b, uint2 c)
{
#if 0
    return vectorize((uint64_t)a * (uint64_t)b) + c;
#else
    asm("{\n\t"
        "mad.lo.cc.u32 %0, %2, %3, %0; \n\t"
        "madc.hi.u32 %1, %2, %3, %1; \n\t"
        "}\n\t" : "+r"(c.x), "+r"(c.y) : "r"(a), "r"(b)
    );
    return c;
#endif
}
It feels wasteful not to use powerful instructions like this, which probably utilize the MAD units' circuitry to the max achievable with integer math.
Do we really achieve anything good by sometimes no-op'ing those heavy MADs in favor of something trivial like a bitwise-OR, which might also be redundant? I doubt it.
Oh, also the funnel shifts probably consume a non-trivial amount of die area. But then it's better to use them as such, not reduce them to the simpler bit rotates. So perhaps just use a fixed mix of 64-bit MADs, funnel shifts (of 64-bit values by variable counts - but then we could run into differences in how shift counts in excess of bit counts are processed by different GPUs/CPUs - some will require explicit masking to be compatible with others, which would be wasteful), and shared/local memory lookups and writes.
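(For reference, a funnel shift concatenates two 32-bit words and extracts a 32-bit window. A portable sketch, with the shift count masked to sidestep the compatibility issue just mentioned:)

#include <stdint.h>

/* Funnel shift right: treat hi:lo as one 64-bit value and extract 32 bits
 * starting at bit 'shift'. Masking the count keeps behavior identical across
 * GPUs/CPUs that differ on shifts of >= the operand width. */
static uint32_t funnel_shift_r(uint32_t hi, uint32_t lo, uint32_t shift)
{
    uint64_t concat = ((uint64_t)hi << 32) | lo;
    return (uint32_t)(concat >> (shift & 31));
}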
The GPUs' instruction processing probably introduces little overhead (in die area and energy consumption) since it's amortized due to wide SIMD. What's avoided in a pre-defined circuit is the need to store intermediate results in the large global register file with indexed access, rather than in temporary registers placed right in between the units. So what we might want is non-pre-defined routing of data between a few relatively large units (like those I mentioned, in comparison to e.g. a mere bitwise-OR). It isn't important to mix different instructions in arbitrary order (in fact, we might end up under-utilizing our execution units if we do), but rather to re-route the outputs/inputs in between fixed groups of instructions (e.g., in one code revision a MAD could take one of its inputs from an immediately preceding S-box lookup, and in another it'd take that input from an S-box lookup done 5 instruction groups earlier, etc.).
Unlike CPUs, modern GPUs also tend to allow indexed access to general-purpose registers (but for tiny indices, and we're limited to 255/N registers anyway). We could also use that, and then it's even runtime variability in the routing, which another ASIC would have to duplicate.
Skimming over the code, I now see that FNV1a and KISS99 are only used in host code to generate a randomized kernel for GPU, and are not used on GPU. This wasn't clear to me from the proposal (perhaps in part because I am not familiar with the original Dagger Hashimoto). Then my earlier suggestion that we don't need an RNG because we already have enough randomness from S-box lookups doesn't apply here. OTOH, this also means that neither the hash nor the RNG need to be fast.
@solardiz Thanks for the extensive feedback! This is wonderful. We are digesting. Please reach out to ifdefelse at protonmail if you would like to chat with the team directly.
Thank you for the detailed review and comments. A bunch of answers below.
the hashrates on the order of 10 to 20 per second
Those should have been labeled as mega-hash/sec. Sorry for the confusion.
However, can't we achieve a similar effect with a simpler PoW?
No. Any fixed hashing sequence can be implemented on an ASIC (or FPGA) with an order of magnitude better efficiency than a CPU-based implementation. The random sequence of math is what forces this algorithm to run on a GPU instead of on a specialized, fixed-sequence ASIC.
t1ha vs fnv1a and KISS99
I wasn't familiar with t1ha, but it looks very interesting.
KISS99/FNV1a are used both in generating the random math sequence for the inner loop and outside the loop.
In the sequence generation fnv1a is used to seed KISS99, and then KISS99 is used for producing the instruction/register sequence.
If you look at the non-loop kernel code here: https://github.com/ifdefelse/ProgPOW/blob/master/libethash-cuda/CUDAMiner_kernel.cu
FNV1a is again used to seed KISS99. KISS99 is used to initialize the values of mix[] (the per-lane registers).
FNV1a is also used for reducing mix[] to a single per-lane value, and then reducing the per-lane values to a single 128-bit result.
I think FNV1a is fine for seeding KISS99. The reduction could replace FNV1a with T1HA0_32le, though it probably won't make much of a difference.
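(For reference, KISS99 is just three tiny generators combined - a multiply-with-carry pair, a 3-shift xorshift, and a linear congruential generator. A plain-C sketch roughly matching the textbook algorithm:)

#include <stdint.h>

typedef struct { uint32_t z, w, jsr, jcong; } kiss99_t;

/* KISS99 = (MWC ^ CONG) + SHR3; passes TestU01 as noted in the proposal. */
static uint32_t kiss99(kiss99_t *st)
{
    st->z = 36969 * (st->z & 65535) + (st->z >> 16);    /* MWC, upper half */
    st->w = 18000 * (st->w & 65535) + (st->w >> 16);    /* MWC, lower half */
    uint32_t mwc = (st->z << 16) + st->w;

    st->jsr ^= st->jsr << 17;                           /* SHR3: 3-shift xorshift */
    st->jsr ^= st->jsr >> 13;
    st->jsr ^= st->jsr << 5;

    st->jcong = 69069 * st->jcong + 1234567;            /* CONG: LCG */

    return (mwc ^ st->jcong) + st->jsr;
}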
64-bit ops
All GPU 64-bit ops are implemented using 32-bit ops, so doing a single 64-bit op or two 32-bit ops makes effectively no difference. For someone implementing a CPU ALU, having carry-out to support a 64-bit op via two 32-bit ops is essentially free, so this doesn't differentiate in any interesting way.
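To illustrate: the carry chain that glues two 32-bit adds into a 64-bit add is a single extra bit, e.g.:

#include <stdint.h>

/* 64-bit addition expressed as two 32-bit adds plus a carry bit - i.e. the
 * add part of what the mad.lo.cc / madc.hi pair in mad64() above does. */
static void add64(uint32_t a_lo, uint32_t a_hi,
                  uint32_t b_lo, uint32_t b_hi,
                  uint32_t *r_lo, uint32_t *r_hi)
{
    uint32_t lo = a_lo + b_lo;
    uint32_t carry = lo < a_lo;     /* carry-out of the low half */
    *r_lo = lo;
    *r_hi = a_hi + b_hi + carry;    /* carry-in to the high half */
}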
register pressure
The ProgPoW_REGS define directly modifies the register pressure, and is a better way of controlling it than occasional 64-bit ops.
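As an illustrative sketch only (not a verbatim quote of the kernel), the per-lane state is an array sized by that define, so raising it directly raises per-thread register usage:

#include <stdint.h>

#define ProgPoW_REGS 16   /* example value only; see the repository for the actual setting */

/* Each lane keeps ProgPoW_REGS words of mix state; the random math reads
 * and writes entries of this array, which the compiler keeps in registers. */
static void init_mix(uint32_t mix[ProgPoW_REGS], uint32_t seed)
{
    for (int i = 0; i < ProgPoW_REGS; i++)
        mix[i] = seed ^ (uint32_t)i;    /* stand-in for the KISS99-based init */
}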
MAD
I had originally supported explicit 3-input MAD, and also some 3-input logical ops, but the added complexity didn't seem worth it. The random sequence will naturally produce some MADs and 3-input LOPs anyway, which the compiler can make use of.
funnel shift vs bit rotate
The hardware implementation of either is effectively the same, again not a significant differentiator.
indexed register file
AMD supports an indexed RF, though I think it's only accessible via assembly. Nvidia does not. I intentionally picked ops that are common between both architectures. Focusing on just CUDA, I would have also made use of shfl, prmt, and lop3.
fixed instruction order
If there was a fixed instruction order where only the inputs/outputs locations changed an ASIC could directly build that pipeline. Only by changing both the instruction order and the input/output locations does it guarantee a real instruction processing system, not a fixed (and much more efficient) pipeline.
Thank you for your answers.
Regarding the speeds, what are the validation speeds on CPU? And proof size?
I agree with many of your points: there isn't a lot of difference between using "64-bit" ops vs. using 32-bit ones and doubling the register pressure via your algorithm's parameters (the maybe-improvement from making use of the carry-out/carry-in is very slight), yes the hardware implementation of 32-bit rotates is probably of almost the same complexity as a funnel shift's(*), and you seem to be correct that NVIDIA doesn't allow runtime indexing into the register file (unlike AMD GCN).
(*) But I still think it'd take extra resources to route two separate 32-bit inputs to the circuit, instead of one such input. AMD has funnel shift, too: it's amd_bitalign() in their OpenCL.
Regarding bitwise ops (even 3-input ones), I continue to think they're too simple to be worth using (except where we might have to e.g. for non-linearity with other ops). On every such op, we'd waste more time and energy on register file access than on the op itself.
I also continue to see the unclear optimization potential of those instruction sequences as a concern, and to that I add that your clz() and popcount() based steps produce tiny results (most result bits at zero), so any subsequent e.g. integer multiplication could be routed to a smaller and quicker 32x6-bit or even 6x6-bit circuit that an ASIC might provide for such cases.
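To make the "tiny results" point concrete (using portable stand-ins for the OpenCL builtins):

#include <stdint.h>

/* Portable stand-ins for the clz() and popcount() builtins used in math(). */
static uint32_t clz32(uint32_t x)
{
    uint32_t n = 0;
    while (n < 32 && !(x & 0x80000000u)) { x <<= 1; n++; }
    return n;
}

static uint32_t popcount32(uint32_t x)
{
    uint32_t n = 0;
    while (x) { n += x & 1; x >>= 1; }
    return n;
}

/* Both return at most 32, so cases 9 and 10 of math() produce values in
 * 0..64 - at most 7 significant bits out of 32. */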
Regarding pipelining in an ASIC, I continue to think that it may become impractical with varying locations of inputs to each algorithm step (would-be pipeline stage) and/or too much state to carry along the pipeline, even if each step's operations are fixed and known at ASIC design time.
For example, consider implementing bcrypt in an ASIC: each Blowfish round has a fixed set of ADDs and XORs, but the input locations keep changing. Indeed, our and others' implementations of bcrypt on FPGA don't pipeline it - instead, they include multiple iterated bcrypt instances per core, which is similar to what's done in software when cracking bcrypt hashes on a CPU core.
For another example, a person on our team is now working on sha512crypt on FPGA. SHA-512 seems like a perfect fit for pipelining, and almost all of the higher-level logic of sha512crypt is pre-determined (aside from variance by password and salt length, which affect how many times SHA-512's compression function is invoked, but we can group candidate passwords by same length and crack one salt at a time to overcome this runtime variance). So he started with exploring fully-pipelined designs for SHA-512 (thus, with ~80 or ~160 in-flight computations), but after quite some trial and error realized that the size of SHA-512 inputs to carry along the pipeline and the complexity of the high-level algorithm make this prohibitively wasteful. For the current design, he switched to modules of four iterated SHA-512 cores (each computing two SHA-512 hashes at a time, and with slight overlap between previous and new sets of two-hash computations, so at times with up to four in-flight computations) controlled with specialized soft CPU cores (with 16-way SMT, thus one soft CPU core controls the four SHA-512 cores times four in-flights from its different hardware threads). The CPU cores are small in comparison to their groups of four SHA-512 cores. This may sound crazy, but it proved more efficient than a fully-pipelined design so far. Indeed, this design still uses a little bit of pipelining, but to an extent similar to what we see in CPUs and GPUs.
Sorry for this "off-topic" detail, but I think it illustrates that data dependencies can dictate optimal hardware design. I recognize that these examples I gave are for FPGAs rather than ASICs, but I expect similar considerations in ASIC design. I don't readily have an example to illustrate my other sub-point, namely that variability in instruction stream alone (without complex data dependencies on intermediate computation results from much earlier than the previous clock cycle) isn't as good at reducing potential ASIC advantage. Is use of both approaches (dependencies on older intermediate results and variable instruction stream) even better, and how much better? My current understanding is that it's potentially only slightly better, but it carries additional risks and pitfalls to avoid (redundancy, small intermediate results, other optimization potential, and lower utilization of the largest and naturally highest-latency execution units that we have).
Thanks again for all the comments.
Regarding the speeds, what are the validation speeds on CPU? And proof size?
Proof size is just the 64-bit nonce + 128-bit intermediate state (that can be used for DoS protection). This is a small change from ethash's 64-bit nonce + 256-bit intermediate. It's fairly easy to change.
tiny results
The results of the math get passed as the B argument to merge(), which either adds or XORs B into a modified version of A. Making a specialized 6-bit adder or XOR is probably more overhead than simply using a normal 32-bit ALU for everything. merge() is also set up so that even if the B input is 0 the result will be a version of A with high entropy.
bitwise ops power is dominated by register file
That's very true: actually executing a 32-bit logical op takes drastically less power than reading/writing the register file. This is where specialized ASICs get their efficiency gains compared to a CPU/GPU. I like including as many ops as possible to force the execution hardware to be as general purpose as possible, but keeping or removing LOPs won't make too much of a difference either way.
pipelining
Your bcrypt and sha512crypt examples are about an outer loop. Having a pipeline of 64 bcrypt cores in series or a single bcrypt core that's iterated on 64 times is going to have very similar efficiency. The gains mostly come from pipelining the inner loop. A small CPU sequencing the outer loop makes a lot of sense.
I could build an ASIC pipeline to handle a fixed sequence of math where only the inputs changed. The fixed math sequence would have a bit of control logic on the side programmed with which registers to read/write at each step. The ASIC would still need a large register file to store all the state like a GPU, but it could be much higher performance. Say the pipeline is 32 logical stages; you could have 32 different hash calculations in flight at once. If the register file is 32-way banked, they'd never conflict with each other. This pipeline would be 32x higher throughput than a GPU that has to process a single instruction at a time.
You seem to be right about the (non-)tiny results - I missed the fact that you merge() so often. I think it's this code (in the code generator):
ret << math("data32", mix_src(), mix_src(), rnd());
ret << merge(mix_dst(), "data32", rnd());
However, the operation based on clz() usually ignores most of its input bits, so maybe a preceding e.g. MUL + merge() could have its multiplication circuit optimized. OTOH, I can see how it'd be difficult to do when targeting fixed timings and having to fall back to the full, slower circuit once in a rare while. I think more importantly those bit-counting operations are simply too cheap in hardware, much like bitwise logical ones. So their use is more like a defense against other/weird GPU/CPU archs that might lack such instructions than a defense against ASICs.
What you say about bcrypt doesn't make sense to me. (Maybe you're unfamiliar with it? That's OK; I'm also unfamiliar with a few things you referenced previously. But these non-overlaps make it harder for us to get on the same page.) This is about the inner loop. Assuming you'd try to build a pipeline of 64 Blowfish rounds (and then build 64 bcrypt cores by iterating around that pipeline), you'd have only the 64-bit Blowfish blocks propagating along the pipeline efficiently, yet you'd have to manage 64 x 4 KiB = 256 KiB worth of S-boxes near that pipeline, with routing from each 4 KiB set to each pipeline stage, which will waste die area (on the complex addressing and routing) and introduce latency at best on par with a CPU's L2 cache's - and I don't see a way to hide that latency since each round's lookup indices are only known after computation of the previous round (previous pipeline stage) and each round's results are similarly needed by the next round (next pipeline stage). Thus, you'd have a larger circuit that runs slower (per hash computed) than a fully-iterated circuit would run (with only the control logic shared and thus amortized across several such iterated bcrypt instances). [What does make sense, though, is interleaving two bcrypt instances' uses of the same S-box RAMs (of double capacity) to partially hide the access latencies - this is similar to what's done when cracking bcrypt on CPUs and it should also fit e.g. BlockRAM in Xilinx's 7-series FPGAs.]
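For reference, here is the shape of one Blowfish round as used in bcrypt (standard algorithm, simplified), showing why each would-be pipeline stage's S-box lookups depend on the immediately preceding stage:

#include <stdint.h>

/* One Blowfish round: the S-box indices for this round come from L, which
 * was just produced by the previous round, so a pipeline stage cannot start
 * its lookups early. P_i is the round subkey; S0..S3 are the four 1 KiB
 * S-boxes (4 KiB per instance). The caller swaps L and R between rounds. */
static void blowfish_round(uint32_t *L, uint32_t *R, uint32_t P_i,
                           const uint32_t S0[256], const uint32_t S1[256],
                           const uint32_t S2[256], const uint32_t S3[256])
{
    *L ^= P_i;
    *R ^= ((S0[*L >> 24] + S1[(*L >> 16) & 0xff])
           ^ S2[(*L >> 8) & 0xff]) + S3[*L & 0xff];
}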
I think you actually try to achieve a defense similar to bcrypt's with your "cache accesses", which you currently make as numerous as your random math steps.
For sha512crypt, it's about both inner and outer loops. Neither uses variable indexing like we have for bcrypt, and the inner loop (SHA-512 compression function) is a reasonably good fit for pipelining - but the amount of state+inputs to carry along a fully-pipelined implementation is prohibitively large. As I understand, Bitcoin mining FPGA designs and ASICs do use fully-pipelined double-SHA-256 and thus do carry that state+inputs, but there must be optimizations due to similarity of inputs (only the nonce changes), and for SHA-256 the inputs are half the size. So we may have a borderline case with SHA-512, and with the added complexity of needing to interface a non-trivial higher-level algorithm to SHA-512 (would need to manage 80+ in-flight sha512crypt's in the specialized CPU) we may have just passed the point where fully-pipelined SHA-512 made sense.
As I understood your example, the 32-stage pipeline + 32-way register file would consume at least 32x the area of a fully-iterated design (excluding control logic, which could be amortized across several such fully-iterated instances - somewhat like we have with wide SIMD in GPUs), and will likely incur higher delays (thus, would have to run at a lower clock rate) to access registers in the larger register file. Yes, you will have 32x higher throughput per clock cycle but at >32x the area and at a lower clock rate, defeating the purpose of this design.
Pipelining is great, but it has its limitations. For example, for descrypt a 400-stage pipeline (25 iterations x 16 DES rounds) can work great (and it's been done). DES and descrypt fit pipelining perfectly. Even then, going for a mixed e.g. 25 iterations of 16 stages design is only a slight slowdown (which has also been done and made sense as part of a more complex design). As soon as there's any extra complication for pipelining, fewer stages or even a non-pipelined design could work better.
MUL instructions are good in that they're already optimally pipelined on a GPU (across 4+ clock cycles). Any instruction that runs on a CPU (yes, CPU) in one clock cycle (of latency) is not a good fit since this indicates that its latency on a GPU is mostly part of the GPU design rather than inherent to the computation.
I vote For implementation, For changing algo to ProgPOW. Equal playground is ideal. If this would work, ifdefelse and Radix should be the new ZEC lead developers and zooko should be fired.
I vote For implementation, For changing algo to ProgPOW.
I vote for Implementation for Changing algo to ProgPOW
I vote for Implementation for Changing algo to ProgPOW
I vote for Implementation for Changing algo to ProgPOW
I vote for Implementation for Changing algo to ProgPOW
Gentlemen, I appreciate the "Votes" of confidence in this proposal. Keep in mind this is just the first step in a process.
The funding of this and other proposals will be reviewed by the Foundations Grant Review Committee as outlined here:
https://github.com/ZcashFoundation/GrantProposals-2018Q2/blob/master/README.md
And will be subject to ratification by the Zcash Foundation Board if approved for funding.
After that, the implementation of a change of PoW could be considered, as outlined by Josh Cincinnati @acityinohio in this post:
Thanks for letting us know, but is it any problem that we continue voting here?
@mineZcash Thanks for the clarification, Shawn. @Miner356 OhGodAGirl said you could voice support here.
and don't forget, according to zooko's interview with CoinDesk... if the Foundation tries to go through with this there will more than likely be a chain split and he will continue with the ASIC chain... at least until Sapling... or until he gets hung for treason
Shawn can clarify on that :+1:
I don't see a problem with voting here, but I will leave that call to @amiller or @acityinohio . The only issue that could occur is if the conversation about specific details and solutions regarding the actual PoW is overwhelmed by users voting.
Perhaps the Ballot that Andrew is working on would be able to provide more accurate accounting of Votes cast: https://github.com/ZcashFoundation/Elections/pull/1
@mineZcash when will the ballot go live? will we all be able to vote?
@mineZcash The purpose of Zcash was ASIC resistance; what do you guys think we miners want? It should not be necessary to vote, we're just wasting time. Please save resources and audit the code and say what you think. Thank you
@cryptomined That's a good question for @amiller. @Miner356 The process that we are following has been clearly outlined by the Zcash Foundation's Executive Director: https://z.cash.foundation//blog/statement-on-asics/ . Please read through it when you get a chance.
@mineZcash ok I will follow you to keep me informed about what is going on. Thank you
@mineZcash I had to research the general trends of computing, and I came to the conclusion that ASICs are the fastest at the moment, but not for long; I think that within 3 to 4 years GPUs and CPUs will compete with ASICs. https://www.youtube.com/watch?time_continue=31&v=55JK5bDLoAg https://www.youtube.com/watch?v=rgbnv1UpAcE Intel, AMD, and Nvidia will use the same technology in the gaming industry; they will be forced to use this technology because Moore's law does not allow much further advancement in hardware technology, and of course Moore's law also applies to ASICs. This technology is cheaply manufactured and efficient.
I vote For implementation too and thanks for that
We appreciate the votes of confidence, but more peer review is needed; if you, or anyone you know, has time to review and audit the code, please feel free to reach out to the team via email: ifdefelse@protonmail.com, or through Twitter, with @OhGodAGirl, or via Gitter.
@ifdefelse The proposal doesn't specify if you are seeking funding to implement code, or to cover the expense of expert analysis (although @solardiz seems to be doing that already!). Perhaps funding isn't required and you are submitting the proposal for the foundation to review and adopt as a priority (at which point the foundation would implement). I think clarifying this would help the review committee.
@Miner356 @mineZcash I'm happy to see informal voting/polling/protesting occurring somewhere in this repo, even though I think there are better places for it
@ifdefelse I was also gonna ask what bitcartel said... I'm excited this proposal has generated a technical discussion! What do you have in mind as the role of a grant? Maybe during the discussion phase we can get a list of suggested review or validation tasks and see if you or someone else would want to carry them out. Is there an open source GPU implementation of ProgPow yet?
I vote for implementation asap too!
I vote For implementation too
@bitcartel, @amiller, We ask that any funding go toward the outside development, review, testing, and implementation efforts as the foundation deems appropriate. This proposal is submitted to the foundation to review and adopt as a priority. We humbly suggest that this proposal be integrated to the soonest planned fork (as feasible) to minimize disruption to the network.
In the meantime, we are working on the miner code here. We would certainly benefit from advice from the community in this effort and will continue to integrate feedback (thanks @solardiz). For example, how should the block header be treated now that the solution is no longer needed? (leave it zeroed?)
Next, we are looking to start work on the client code in about two to three weeks. Once that is complete, we will do some preliminary benchmarking as a sanity check before handing off the rest of the work to the community.
Thanks again to everyone for the votes of confidence.
I've added the funding details to the proposal statement at the top.
I vote for implementation.
Just a heads up, Ethereum devs will be discussing ProgPOW in their Friday meeting: https://twitter.com/OhGodAGirl/status/997064114701787136?s=19
Seems that this is not the place for this discussion, but I 2nd the motion. I look forward to the ETH meeting outcome. Not sure if this is a dumb question, but can we set up a donation address for funding the implementation of ProgPoW into Zcash?
I vote For implementation to change the algo to ProgPOW.
If this would work, I ALSO agree ifdefelse and Radix should be the new ZEC lead developers and zooko should be fired.
An image of a quote of Zooko as to my reasoning for him being fired [Below]:
Here's an image to "support" the FACT Zooko has a conflict of interest in this debate and should be fired:
Even Daira admits he has a conflict of interest; as seen in the image below:
@ProwdClown this is a Zcash Foundation Grant issue, it has nothing to do with Zcash Company management or Zooko so arguments about who should be fired or whatever are irrelevant.
The Foundation has no power to fire or replace anyone from the Company, and no one from the Company can fire someone from the Foundation. The Zcash Foundation can fund Grant proposals as it sees fit, it doesn't rely on Zcash Company's opinions.
If users feel this is a good PoW proposal, constructive feedback would be to help ifdefelse as they mentioned above:
"We appreciate the votes of confidence, but more peer review is needed; if you, or anyone you know, has time to review and audit the code, please feel free to reach out to the team via email: ifdefelse@protonmail.com, or through Twitter, with @OhGodAGirl, or via Gitter"
@ifdefelse, you wrote:
We ask that any funding go toward the outside development, review, testing, and implementation efforts as the foundation deems appropriate
Do you have specific suggestions of such work items, and who would do them?
The proposals are meant to be for funds awarded to the proposer, who will then use them (perhaps using subcontractors/employees/consultants) to execute the plan.
Excellent work offered on a silver platter by excellent programmers. What more could Zcash devs ask for? Implement it and let's be done with ASICs for good.
@ifdefelse, the above information is crucially needed by the grant review committee in order to evaluate your proposal.
@ifdefelse ping!
@tromer @MoneroCrusher Sorry about the late response to this. This grant proposal was submitted for review priority and not for funding. While we do not need any funding for the work on our side, we are a very small team of part-time collaborators and would welcome any other contributors to the code and its review, funded or not. Would the foundation be willing to help source the following roles as sub-grants at time of completion?
1. project/community manager
2. client implementer/maintainer
3. miner implementer/maintainer
4. hardware optimization reviewer
@ifdefelse, while the proposal is technically very interesting, the Foundation itself doesn't have personnel who can fulfill these review, dev and communication roles for your PoW. One way to proceed is for you to team up with suitable people and have them funded by this grant program; but you would need to be proactive in reaching out to them.
The Zcash Foundation Grant Review committee has reviewed your pre-proposal, including the above discussion, to evaluate its potential and competitiveness relative to other proposals. Every pre-proposal was evaluated by at least 3 (and typically more than 4) committee members.
The committee's opinion is that your pre-proposal is a promising candidate for funding in this round, and the committee therefore invites you to submit a full proposal. Please submit a full proposal by June 15th, following the detailed structure described in the Call for Proposals. We encourage you to submit a draft as early as possible, to allow for community feedback.
Is there a CPU-only (preferably, plain C) implementation of ProgPoW? I understand the OpenCL kernel can be run on a CPU if it doesn't use vendor-specific extensions, but I'd prefer to also have something in plain C or at worst C++, thus not needing an OpenCL runtime. This would be useful for review, experiments, as well as to serve as a reference implementation for correctness testing of the OpenCL and CUDA implementations against this reference. I understand that C(++) isn't a very good fit for dynamically adjusted code, but the same applies to CUDA, and this isn't literally a show-stopper - it only means that development tools need to be installed.
If this exists, it'd help my potential contributions. If it doesn't exist yet, then maybe I should include creation of such an implementation in my proposal.
(I continue to think that dynamically generated code might be overkill, and carefully tuned dynamically generated data dependencies could be sufficient to achieve similar "ASIC resistance". But I realize that current ProgPoW does use dynamically generated code.)
Posting the full Grant Proposal: zgp-15.pdf
Thank you for posting the proposal, @ifdefelse.
This is a very aggressive schedule you propose, which is great given Zcash's immediate need for a PoW scheme change, and I think you'll deliver, but it raises concerns of the extent of review and testing and tweaking that may be done on ProgPoW in this limited time. Part of the problem we have with Equihash now is that its choice of parameters was also a rushed decision with no time for adjustment even when much more efficient implementations appeared and the parameters were already seen as non-optimal shortly before launch. I'm afraid it'll be similar in that respect.
Do you already have specific candidates in mind for the "2 more people"? Are they really available to start work on the project full-time(?) on such short notice?
(I might also post a proposal today. I think the research, review, tweaking, testing needed here is quite separate from client and miner development, so I wouldn't expect the same people to do both kinds of work.)
FWIW, I contacted @eXtremal-ik7 about his possible help with miner development, and he's potentially interested, but obviously needs specifics to start any work. eXtremal previously contributed to Zcash miners: https://github.com/mbevand/silentarmy/commit/b879b795141b95d0878e93bd9ae5cab120149891 https://bitcointalk.org/index.php?topic=1660023.0
Abstract
We propose an alternate proof-of-work algorithm - “ProgPoW” - tuned for commodity hardware in order to close the efficiency gap available to specialized ASICs.
The security of proof-of-work is built on a fair, randomized lottery where miners with similar resources have a similar chance of generating the next block.
For Zcash - a community based on widely distributed commodity hardware - specialized ASICs enable certain participants to gain a much greater chance of generating the next block and undermine the distributed security.
ASIC-resistance is a misunderstood problem. FPGAs, GPUs and CPUs can themselves be considered ASICs. Any algorithm that executes on a commodity ASIC can have a specialized ASIC made for it; most existing algorithms provide opportunities that reduce power usage and cost. Thus, the proper question to ask when solving ASIC-resistance is “how much more efficient will a specialized ASIC be, in comparison with commodity hardware?”
This proposal presents an algorithm that is tuned for commodity GPUs where there is minimal opportunity for ASIC specialization. This prevents specialized ASICs without resorting to a game of whack-a-mole where the network changes algorithms every few months.
Motivation
Since proof-of-work will continue to underpin the security of the Zcash network, it is important that this consensus isn't dominated by a single party in control of a large portion of the compute power. The existence of the Z9 ASIC miner proves that the current Equihash algorithm allows significant efficiency gains from specialized hardware. The ~150 MB of state in Equihash is large but possible on an ASIC. Furthermore, the binning, sorting, and comparing of bit strings could be implemented on an ASIC at extremely high speed.
Thus, a new hash algorithm is needed to replace the current Equihash. This algorithm must be permanently resistant to domination by specialized hardware and avoid further PoW code forks.
Design Specification
ProgPoW preserves the 256-bit nonce size for Zcash and thus does not require any changes to light clients. However, it entirely replaces the rest of the Equihash PoW algorithm.
ProgPoW is based on Dagger Hashimoto and follows the same general structure. Dagger Hashimoto was selected as a base due to its simple algorithm and memory-hard nature. Dagger Hashimoto has also withstood a significant amount of real-world "battle testing." Finally, starting from a simple algorithm makes the changes easy to understand and thus easy to assess for strengths and weaknesses. Complexity tends to imply a larger attack surface.
The name of the algorithm comes from the fact that the inner loop between global memory accesses is a randomly generated program based on the block number. The random program is designed to both run efficiently on commodity GPUs and also cover most of the GPU's functionality. The random program sequence prevents the creation of a fixed pipeline implementation as seen in a specialized ASIC. The access size has also been tweaked to match contemporary GPUs.
ProgPoW utilizes almost all parts of a commodity GPU, excluding the graphics pipeline and floating-point math.
Since the GPU is almost fully utilized, there’s little opportunity for specialized ASICs to gain efficiency. Removing both the graphics pipeline and floating point math could provide up to 1.2x gains in efficiency, compared to the 2x gains possible in Dagger Hashimoto, 50x gains possible for CryptoNight, and ~100x gains possible in Equihash.
The algorithm has five main changes from Dagger Hashimoto, each tuned for commodity GPUs while minimizing the possible advantage of a specialized ASIC. In contrast to Dagger Hashimoto, the changes detailed below make ProgPoW dependent on the core compute capabilities in addition to memory bandwidth and size.
Changes Keccak to BLAKE2s. Zcash already uses BLAKE2 in various locations, so it makes sense to continue using BLAKE2. GPUs are natively 32-bit architectures, so BLAKE2s is used. Both BLAKE2b and BLAKE2s provide the same security but are tuned for 64-bit and 32-bit architectures respectively. BLAKE2s runs roughly twice as fast on a 32-bit architecture as BLAKE2b.
Increases mix state. A significant part of a GPU’s area, power, and complexity is the large register file. A large mix state ensures that a specialized ASIC would need to implement similar state storage, limiting any advantage.
Adds a random sequence of math in the main loop. The random math changes every 50 blocks to amortize compilation overhead (a sketch of this period calculation appears after this list of changes). Having a random sequence of math that reads and writes random locations within the state ensures that the ASIC executing the algorithm is fully programmable. There is no possibility to create an ASIC with a fixed pipeline that is much faster or lower power.
Adds reads from a small, low-latency cache that supports random addresses. Another significant part of a GPU’s area, power, and complexity is the memory hierarchy. Adding cached reads makes use of this hierarchy and ensures that a specialized ASIC also implements a similar hierarchy, preventing power or area savings.
Increases the DRAM read from 128 bytes to 256 bytes. The DRAM read from the DAG is the same as Dagger Hashimoto’s, but with the size increased to 256 bytes. This better matches the workloads seen on commodity GPUs, preventing a specialized ASIC from being able to gain performance by optimizing the memory controller for abnormally small accesses. The DAG file is generated according to traditional Dagger Hashimoto specifications, with an additional ProgPoW_SIZE_CACHE bytes generated that will be cached in the L1.
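A minimal sketch of the 50-block program period from the third change above (the parameter name and the division-based derivation are illustrative; see the repository for the actual code):

#include <stdint.h>

#define ProgPoW_PERIOD 50   /* blocks per random program, per the text above */

/* All blocks within one period share the same randomly generated inner loop,
 * so a compiled kernel can be reused and compilation cost is amortized. */
static uint64_t prog_seed(uint64_t block_number)
{
    return block_number / ProgPoW_PERIOD;   /* seeds the FNV1a/KISS99 program generation */
}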
ProgPoW can be tuned using the following parameters. The proposed settings have been tuned for a range of existing, commodity GPUs:
ProgPoW uses FNV1a for merging data. The existing Dagger Hashimoto uses FNV1 for merging, but FNV1a provides better distribution properties.
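The difference is only the order of operations when merging each 32-bit word (sketch, using the standard 32-bit FNV prime):

#include <stdint.h>

#define FNV_PRIME_32 0x01000193u

/* FNV1 multiplies first and then XORs; FNV1a XORs first and then multiplies,
 * so the incoming word is diffused through the multiply, giving the better
 * distribution mentioned above. */
static uint32_t fnv1_merge(uint32_t h, uint32_t d)  { return (h * FNV_PRIME_32) ^ d; }
static uint32_t fnv1a_merge(uint32_t h, uint32_t d) { return (h ^ d) * FNV_PRIME_32; }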
ProgPoW uses KISS99 for random number generation. This is the simplest (fewest instructions) random generator that passes the TestU01 statistical test suite. A more complex random number generator like Mersenne Twister can be efficiently implemented on a specialized ASIC, providing an opportunity for efficiency gains.
Backwards Compatibility
This algorithm is not backward compatible with the existing Equihash implementation and will require a fork for adoption. Furthermore, the network hash rate will change dramatically as the units of compute are re-normalized to this new algorithm.
Testing
This PoW algorithm was implemented and tested against six different models from two different manufacturers. Selected models span two different chips and memory types from each manufacturer (Polaris20-GDDR5 and Vega10-HBM2 for AMD; GP104-GDDR5 and GP102-GDDR5X for NVIDIA). The average hashrate results are listed below. Additional tests are ongoing.
As the algorithm nearly fully utilizes GPU functions in a natural way, the results reflect relative GPU performance that is similar to other gaming and graphics applications.
Sample Code
Sample code is available here.
Schedule and Funding Request
This proposal is submitted to the foundation to review and adopt as a priority. We ask that any funding go toward outside development, review, testing, and implementation efforts as the foundation deems appropriate. We humbly suggest that this proposal be integrated into the soonest planned fork (as feasible) to minimize disruption to the network.