SChernykh / CryptonightR

CryptonightV2 with random math proof of concept
GNU General Public License v3.0

Cryptonight + random math discussion #1

SChernykh opened this issue 5 years ago

SChernykh commented 5 years ago

Let's discuss my approach to random math in Cryptonight. You can post test results in this thread: https://github.com/SChernykh/CryptonightR/issues/2

Basic algorithm description:

Reference implementation (variant4 code in the following files):

The same implementation in Monero code base:

Optimized CPU miner:

Optimized GPU miner:

Pool software:

Test pools:

@moneromooo-monero this is what I was talking about for the next fork. The only difference from CNv2 is the part where it does math.

@tevador @hyc your comments, suggestions?

tevador commented 5 years ago

The random sequence changes every block. It depends either on block height or previous block hash (TBD).

That probably means that GPU kernels will need to be recompiled every block. Have you tested the impact on GPUs?

SChernykh commented 5 years ago

That probably means that GPU kernels will need to be recompiled every block

Yes. If we choose block height as the seed for the random math, miners can precompile it in the background, so there will be no performance hit. It's a bit trickier with the previous block hash, but if we take the hash from 2 blocks back, it can also be precompiled.

P.S. I don't have a dynamic GPU implementation yet. I've only done static tests with different random sequences, but it shouldn't be hard to implement.

SChernykh commented 5 years ago

I think it's also safer to use block height as the seed because it also makes it possible to use auto-generated optimized code in daemon/wallet software. Updated readme.md:

The random sequence changes every block. Block height is used as the seed for the random number generator. This allows CPU/GPU miners to precompile optimized code for each block. It also allows verifying optimized code for all future blocks against the reference implementation, so it'll be guaranteed safe to use in Monero daemon/wallet software.
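
To illustrate why height-based seeding enables both precompilation and verification, here is a minimal sketch. Everything in it (the opcode set, the PRNG, the seeding constant, `generate_program`) is hypothetical, not the actual CryptonightR generator; the point is only that a program derived purely from block height can be regenerated identically by any party, ahead of time.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical opcode set, roughly mirroring the CryptonightR instruction mix.
enum class Op : uint8_t { MUL, ADD, SUB, ROR, ROL, XOR };

struct Instruction { Op op; uint8_t dst, src; };

// Minimal xorshift-style PRNG; the real generator derives its state differently.
static uint32_t next_rand(uint64_t& state) {
    state ^= state << 13; state ^= state >> 7; state ^= state << 17;
    return static_cast<uint32_t>(state);
}

// Illustrative program generator seeded only by block height: every node that
// knows the height regenerates the exact same random program, so optimized
// code for future blocks can be checked against the reference in advance.
std::vector<Instruction> generate_program(uint64_t block_height, size_t length = 60) {
    uint64_t state = block_height * 0x9E3779B97F4A7C15ull + 1; // hypothetical seeding
    std::vector<Instruction> prog;
    for (size_t i = 0; i < length; ++i) {
        uint32_t r = next_rand(state);
        prog.push_back({static_cast<Op>(r % 6),
                        static_cast<uint8_t>((r >> 8) % 4),
                        static_cast<uint8_t>((r >> 16) % 8)});
    }
    return prog;
}
```

Since the next block's height is known as soon as the current one arrives, a miner can generate and compile the program for height N+1 in the background while mining height N.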

hyc commented 5 years ago

Allowing precompilation also allows searching for easy nonces.

SChernykh commented 5 years ago

There are no easy nonces - code changes only once per block and miners don't control it.

SChernykh commented 5 years ago

Updated readme.md:

Further development plans:

- Reference implementation in Monero's code base (slow-hash.c, dependent files and tests): December 9th, 2018
- Optimized CPU miner (xmrig): December 15th, 2018
- Optimized GPU miner (xmrig-amd): December 20th, 2018
- Pool software: December 24th, 2018
- Public testing: January 2019

tevador commented 5 years ago

There are no easy nonces - code changes only once per block and miners don't control it.

Profit-switching miners can still take advantage of it, especially if this tweak is adopted by multiple currencies.

SChernykh commented 5 years ago

@tevador Hashrate shouldn't change between different blocks: the code generator doesn't just emit random code, it emulates the CPU's pipeline and produces just enough code to account for the required latency. This is the main idea, and it's already working very well - hashrate changes only 1-2% in my tests with different seeds, and I plan to fix the remaining differences.
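
The pipeline-emulation idea can be sketched as a toy scheduler. All specifics here are assumptions for illustration (two ALUs, four registers, MUL taking 3 cycles, the LCG): the generator keeps emitting instructions until the dependency chain fills a fixed cycle budget, so every generated program costs roughly the same on a real CPU.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative latencies (cycles); real values come from profiling x86 CPUs.
// MUL is the slow "latency anchor"; the bit ops are single-cycle.
struct OpInfo { const char* name; int latency; };
static const OpInfo OPS[] = {{"MUL",3},{"ADD",1},{"SUB",1},{"ROR",1},{"XOR",1}};

// Toy model of the generator: emulate an abstract CPU with two ALUs and keep
// emitting instructions until the critical path reaches target_cycles, so all
// programs take the same number of clock cycles regardless of their op mix.
int schedule_until(int target_cycles, uint64_t& rng_state,
                   std::vector<int>& out_program) {
    int reg_ready[4] = {0,0,0,0};   // cycle when each register's value is ready
    int alu_free[2]  = {0,0};       // cycle when each ALU can issue again
    int critical_path = 0;
    while (critical_path < target_cycles) {
        rng_state = rng_state * 6364136223846793005ull + 1442695040888963407ull;
        int op  = static_cast<int>((rng_state >> 33) % 5);
        int dst = static_cast<int>((rng_state >> 40) % 4);
        int alu = alu_free[0] <= alu_free[1] ? 0 : 1;   // pick the idler ALU
        int start = std::max(alu_free[alu], reg_ready[dst]); // wait for operand
        int done  = start + OPS[op].latency;
        alu_free[alu]  = start + 1; // each ALU issues one instruction per cycle
        reg_ready[dst] = done;
        critical_path = std::max(critical_path, done);
        out_program.push_back(op);
    }
    return critical_path;
}
```

Because generation stops at a cycle budget rather than an instruction count, MUL-heavy programs come out shorter than ADD/XOR-heavy ones, which is what keeps the measured hashrate within a narrow band across seeds.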

SChernykh commented 5 years ago

Test pool is up and running, you can now test CryptonightR with xmrig/xmrig-amd (see links above).

Update (2019-02-18): this test pool was for the original CryptonightR without tweaks, it's incompatible with the final version.

SChernykh commented 5 years ago

Please test with different CPUs/GPUs and compare hashrate/power consumption with CryptonightV2. I need data for further CryptonightR tweaking.

SChernykh commented 5 years ago

I've created a separate issue for test results: https://github.com/SChernykh/CryptonightR/issues/2

SChernykh commented 5 years ago

Here's a dump of today's discussion in IRC:

<sech1> moneromooo are you here? What do you say about CryptonightR - is it good enough for the next
PoW tweak? Testing in Wownero network will iron out the kinks pretty quickly by the end of January.
<moneromooo> Is it a tweak or a large change ?
<hyc> adding random ops in the middle of cryptonight? more than just a tweak ;)
<moneromooo> AFAIK sech1 has a tweak that will be published in the next couple months. And that had
better be just a tweak :)
<moneromooo> And that random instructions change can maybe be used for the next fork, if we have
people competent to review that kind of code.
<sech1> moneromooo this is a tweak, the other one which is not published yet (without randomness)
is also a tweak
<moneromooo> Where are the diffs ?
<sech1> https://github.com/SChernykh/monero/commit/d756eca751afc9febd941708c8671155f08a6129#diff-7000dc02c792439471da62856f839d62
<sech1> This is slow_hash.c
<sech1> variant4_random_math.h is also small file
<sech1> Everything with links to code is here: https://github.com/SChernykh/CryptonightR
<sech1> I thought we had enough qualified people in this channel?
<sech1> And enough time to review/test it. It's already being tested in Wownero testnet.
<moneromooo> I think I'm going to have to avoid the word "tweak".
<sech1> On the scale from 0 to RandomX where would you put this change?
<moneromooo> You run blake from a buffer to that same buffer. Is that 100% safe ?
<moneromooo> I don't know, I did not look at RandomX.
<sech1> I looked at blake code, it writes to output in the end and uses internal buffers
<sech1> RandomX is total change from Cryptonight
<sech1> Like 100% replacement
<sech1> So when blake starts writing to output, it doesn't read from input anymore.
<sech1> At least this is what Monero's version does.
<moneromooo> Why the height as extra input ? The hashed blob does indirectly contain the height.
<sech1> To allow precompilation for GPUs - they need to know the code a bit before new block hits.
<sech1> Or they'll suffer from 0.5-1 second pause every time new block appears
<sech1> ProgPOW uses height as well, it's safe to use as a seed for RNG
<moneromooo> While I don't know much about hardware, I'd expect the main loop on an ASIC to be
"run all these 6 ops in parallel, and mux-select the right path".
<moneromooo> But I suppose that's kinda naive maybe.
<sech1> Then it'll be limited to the slowest of 6 operations
<cjd> are they not dependent ?
<sech1> and there are 60-80 operations generated on average
<moneromooo> Not necessarily.
<sech1> all inter-dependent
<cjd> oh nvm i get it
<moneromooo> Oh, I didn't mean the whole loop, but one loop run.
<moneromooo> The switch thing. It means the CPU gets to run the branches.
<sech1> ASIC can do it of course, but this random math has the same property as div+sqrt in
CNv2 - high latency
<moneromooo> Maybe inconsequential ?
<sech1> Switch thing is only in the reference code
<sech1> miner code is auto-generated and linear
<sech1> one operation = 1 x86 instruction
<sech1> well, 2 instructions most of the time for ROR/ROL
<sech1> mov ecx, counter/ROR reg, cl
<sech1> but modern CPUs can chew it as fast as like it's 1 instruction
<cjd> hmm, if you have a 20% chance that an op is a multiply, then I would be thinking to design
an ASIC which can perform 20% multiplies in parallel and then give the thing enough threads
that each circuit is kept busy (for the most part)
<sech1> We need ASIC experts who can tell us how they would implement it
<cjd> indeed, I'm not one
<sech1> cjd 3/8 operations are MUL, and this is done for a purpose
<cjd> AFAICT the worst thing should be when you have one block which is almost all addition, and
then the next block is almost all multiplication, etc
<sech1> to achieve high computation latency - even higher than div+sqrt in CNv2
<cjd> because then my adder circuits are all busy, and next block all of my multiply circuits
are busy and my adders are idling
<moneromooo> OK I see how it's fixed for the search.
<sech1> Code generator accounts for this - it generates code that runs with the same speed on CPUs
<sech1> it emulates an abstract CPU with 2 ALUs and generates code to fill these 2 ALUs
<sech1> for specified amount of clock cycles
<sech1> so it'll never generate only MUL instructions, it'll always be a mix of different
operations with different registers
<moneromooo> Looks ok at first glance. Definitely not a tweak though. I might be using that term
incorrectly.
<moneromooo> I mean small change.
<moneromooo> Did you get people with relevant expertise to look at it ?
<sech1> if we want to move to ASIC resistant algo eventually, we have to move in bigger steps
than CNv1 tweak
<sech1> Not yet, ASIC designers haven't looked at it yet
<sech1> It's in very early stage, we barely made it up and running for now
<moneromooo> Hashing/crypto expertise I mean.
<sech1> No. I only did some tests like "run random sequence 1000000 times it didn't degrade to all
zeroes or small loop"
<sech1> *and check it didn't degrade
<sech1> This is why ADD is 3-way with random constant
<sech1> and SUB/XOR are always done with different registers
<sech1> Without these 2 changes, it degraded to 0 in this test
<sech1> moneromooo We do have people with hashing/crypto expertise in MRL, don't we?
<moneromooo> Crypto, yes. Hashing, I don't think so.
<sech1> The main property required from random math in the main loop is to be random aka not
reducing entropy. My tests show it's ok in this aspect.
<cjd> if you do    Blake(input || Program(input))  you will tend to stay out of too much trouble
<cjd> This is what Percival did in the original scrypt, to bound its potential badness
<cjd> WRT hashing, too much entropy erasure leads to entropy starvation, loops, degradation to zero,
etc...   Too little entropy erasure allows someone to run the hash in reverse for preimaging it (I
don't know if this is a concern here)
<cjd> you'll notice salsa20 does   x[ 4] ^= R(x[ 0]+x[12], 7);    there's only XOR and ROTL except
there's one ADD, the ADD is erasing 1 bit of entropy because of the possibility of rolling over
<cjd> so you have 20*32 additions, that's 640 bits of entropy destroyed, plenty to cause a state
explosion if anyone tries to backtrack it
<sech1> If you're talking about generated random code, it blends in 4*32 new bits of entropy on each
iteration
<sech1> It certainly can't lose more than that
<sech1> Even with 1000000 iterations without new bits of entropy, it didn't degrade to 0
<cjd> I would tend not to worry about it as long as there's a Hash(input || Program(input))
<sech1> so this generated random code is actually kind of RNG with rather long period
<sech1> "Hash(input || Program(input))" - no, that's not how it works in CryptonightR
<tevador> I don't think you need to worry about entropy much since all output will be processed by
AES afterwards
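
The switch-based reference code vs. generated linear code distinction from the log above can be sketched as follows. The encoding and semantics here are hypothetical (the real ones live in variant4_random_math.h); note how the sketch mirrors two safeguards mentioned in the discussion: ADD is 3-way with a random constant, and SUB/XOR are applied between different registers.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical instruction encoding for the reference-style interpreter sketch.
enum Op : uint8_t { MUL, ADD, SUB, ROR, ROL, XOR };
struct Insn { Op op; uint8_t dst, src; uint32_t imm; };

static uint32_t rotr32(uint32_t v, uint32_t c) {
    c &= 31; return c ? (v >> c) | (v << (32 - c)) : v;
}
static uint32_t rotl32(uint32_t v, uint32_t c) {
    c &= 31; return c ? (v << c) | (v >> (32 - c)) : v;
}

// Reference-style interpreter: the per-instruction switch is what would cause
// divergence on GPUs. Miner code instead unrolls the same sequence into
// straight-line machine code: one x86 instruction per op, two for ROR/ROL.
void execute(const Insn* prog, size_t n, uint32_t r[4]) {
    for (size_t i = 0; i < n; ++i) {
        const Insn& in = prog[i];
        uint32_t s = r[in.src & 3];
        switch (in.op) {
            case MUL: r[in.dst] *= s; break;
            case ADD: r[in.dst] += s + in.imm; break; // 3-way ADD with random constant
            case SUB: r[in.dst] -= s; break;          // src and dst differ by construction
            case ROR: r[in.dst] = rotr32(r[in.dst], s); break;
            case ROL: r[in.dst] = rotl32(r[in.dst], s); break;
            case XOR: r[in.dst] ^= s; break;          // src and dst differ by construction
        }
    }
}
```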
Gingeropolous commented 5 years ago

@timolson , have you weighed in on this re: asic design?

Gingeropolous commented 5 years ago

also, lets ping @dave-andersen .... hey hows it going? Its your lovable friends over at monero again :)

timolson commented 5 years ago

I left a brief review in the RandomX project... is CryptonightR a parallel effort or similar? I could give an hour or two if this is something different to look at, but my schedule's very tight through mid-to-late January. I'd be glad to do a deeper investigation after about 4 weeks from now.

SChernykh commented 5 years ago

CryptonightR is a modification of Cryptonight, whereas RandomX is done completely from scratch. The main purpose of CryptonightR is to be the next PoW for Monero until RandomX is ready. The only difference from CNv2 is the integer math part - it's randomly generated now, just like in ProgPoW. It's also supposed to be more computationally expensive (+ higher computation latency) than div+sqrt in CNv2.

SChernykh commented 5 years ago

An example of generated random math: https://github.com/SChernykh/CryptonightR/blob/master/CryptonightR/random_math.inl

timolson commented 5 years ago

I’ll assume you care about modifying existing ASIC’s not ASIC-from-scratch which is a different question.

Time for a reveal: when we heard you were going to tweak the PoW, we added a bespoke VLIW processor to our design which could intercept various points along the CryptoNight datapath, perform programmable calculations, then inject the result back into the standard CryptoNight datapath. It potentially (subject to some details I don’t have time to look up) could have handled this new random math without changing our silicon at all! The VLIW could do shift/rotate/xor, then a “primary” operation like add or divide, then a trailing set of bitops before injection. In effect, any bitops would have been “free” because of the VLIW design. And yes there was enough register space. Not sure if our instruction buffer was long enough... I think so. What it couldn’t do is generate the random code itself. It needed to be programmed ahead of time. We could probably have used our SoC controller to do the program generation, but programming the chip involves extra IO which could be slow.

Also, each VLIW instruction cost one cycle (three for division). The length of your programs would have absolutely crushed our speed, since our inner loop was only 4-6 cycles.

Anyone with a Monero ASIC from before the announcement of the tweak threat would not have such a coprocessor, but someone developing an ASIC after your tweaking announcement could very well have planned like we did for the tweaks.

However, v2 added new datapaths to memory which was a real blow. An ASIC designer would have needed to go through a completely new physical layout for v2, and also added some kind of coprocessor like we did in anticipation of future tweaks. I find this unlikely, but possible.

If you are changing the program every nonce, it would be a big problem for such a coprocessor design, since it would need to be reprogrammed every nonce. If you are changing the program slowly like ProgPoW, your random math may not be safe. Slow-changing programs also open the door to FPGA implementations.

In terms of an ECO on a chip that doesn’t have a coprocessor, it is probably too much to change, even if they left plenty of room.

I would recommend one or more of the following:

  1. If the program changes slowly, change it every nonce instead, which forces reprogramming and external IO
  2. Find a new strange operation from CPU’s that’s not something obvious like add or mul or aesround. We didn’t implement any float ops, for example.
  3. Repeat the idea of v2 and create new memory channels. This kind of thing hurts a lot, but going from a new physical layout to packaged chip can still be done in maybe 4 months if they are good.
  4. Make the programs even longer. This reduces any help from the CryptoNight part of the ASIC and emphasizes the new math part. Coprocessors will definitely be slower than production CPU’s and if the program is very long at all, the processor’s speed on the math will outweigh the ASIC’s speed on CryptoNight.

Overall, I think the threat of modifying an existing ASIC to handle this tweak is low, and if it did handle the tweak, the chip would be slow. However, you may consider the changes above out of an abundance of caution. If our chip had gone to production, and we had redone layouts to keep up with v2, I would be giggling about this tweak. I think we could have handled it with maybe zero changes, but only because we had this on-chip VLIW coprocessor in anticipation of your tweaks.

This was a super-quick review and I could be missing something really important in your design. LMK if that’s the case and I’ll jump back into the convo.

SChernykh commented 5 years ago

1 - This will kill GPU performance too, so not an option for now.

2 - Possible for further tweaking, it's not hard to add something new to current algorithm design as long as it's a single x86 instruction and fast enough on GPU.

3 - I tried a lot of things, but couldn't find "memory path" tweak that doesn't hurt CPU performance (yet).

4 - They're already as long as they can be without slowing down CPUs. They could maybe be a bit longer if they had fewer MUL and more simple ADD/ROTATE/XOR operations, but that would also change the current performance ratio between CPU/GPU.

The length of your programs would have absolutely crushed our speed, since our inner loop was only 4-6 cycles.

That's the whole point of this tweak, explicitly mentioned in readme:

Code generator ensures that minimal required latency for ASIC to execute random math is at least 3 times higher than what was needed for DIV+SQRT in CryptonightV2: current settings ensure latency equivalent to a chain of 18 multiplications while optimal ASIC implementation of DIV+SQRT has latency equivalent to a chain of 6 multiplications.

It makes a single ASIC core limited both by memory access and by the computation part - whichever is slower, hopefully making a single ASIC core no faster, or even slower, than a single CPU core. The performance/power ratio is still much better for an ASIC, but RandomX will get there eventually.
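
The readme's "3 times higher" claim can be restated numerically. Assuming a dependent 32-bit multiply costs about 3 cycles (an assumption here, typical for modern x86):

```cpp
// Back-of-envelope restatement of the readme's latency claim (assumed costs).
constexpr int mul_latency = 3;                 // cycles per dependent multiply
constexpr int cnv2_chain = 6 * mul_latency;    // DIV+SQRT ~ chain of 6 MULs
constexpr int cnr_chain  = 18 * mul_latency;   // random math ~ chain of 18 MULs
static_assert(cnr_chain == 3 * cnv2_chain, "3x higher latency, as stated");
```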

If our chip had gone to production, and we had redone layouts to keep up with v2, I would be giggling about this tweak. I think we could have handled it with maybe zero changes, but only because we had this on-chip VLIW coprocessor in anticipation of your tweaks.

Zero changes, but still with huge drop in performance?

timolson commented 5 years ago

I like #2 best if you can find the right operation. v1 was annoying because we didn’t plan for sqrt. Add/mul/bitops are too predictable.

tevador commented 5 years ago

That's why RandomX uses a lot of floating point math. If you want a good FPU in your chip, you will probably need to license some IP core.

timolson commented 5 years ago

It was not a matter of licensing... Synopsys will synthesize IEEE floats just fine out of the box, and licensing something else isn’t a problem. It was simply a matter of choice: we didn’t think floats were worth the die area, since it’s unusual to use floats in a PoW.

timolson commented 5 years ago

Zero changes, but still with huge drop in performance?

Yep. I didn’t say RandomX wasn’t effective against tweaked ASIC’s... just pointing out some ways to make it stronger.

Just curious, have you seen any evidence of ASIC’s returning after v1/v2 tweaks? PM me if you’d like to keep it quiet.

Also let me say: WhatsMiner is kicking BitMain’s butt, and BitMain is rumored to be divesting their mining operations, with Jihan Wu being demoted as well. The whole “ASIC’s are monopolized and scary” argument is certainly weakening.

SChernykh commented 5 years ago

have you seen any evidence of ASIC’s returning after v1/v2 tweaks?

v1 ASICs - yes, including CN/xtl variant. All CNv1 coins are 3-4 times less profitable now per 1 KH/s. CNv2 and CN/heavy coins - no signs of ASICs.

timolson commented 5 years ago

Excellent, that would align with expectations. v2 was the real killer with new datapaths to memory.

I’d say your risk of ASICs existing that have both a coprocessor and v2 capability is very low. RandomX should be fine and is probably even overkill at this point.

OH just to be sure... the new math is in the inner loop near the AES and MUL, right? Not in the initialization or finalization loops?

SChernykh commented 5 years ago

the new math is in the inner loop near the AES and MUL, right?

It replaces div+sqrt, so it's in the inner loop.

SChernykh commented 5 years ago

Testing showed that hashrate fluctuations on CPUs are bigger than we want, especially on AMD Bulldozer, so code generator will be revised.

ga10b commented 5 years ago

A small concern is that there is no native bit rotation instruction on GPU. Other operations look good. They may achieve ~0.5 instructions per clock on NVIDIA GPUs.

In fact, a carefully designed ASIC could still outperform GPUs by spending more resources/area on the bottlenecks. The memory bandwidth can be greatly improved using more, smaller DRAM partitions and parallel memory controllers with address interleaving. The random math cannot utilize the GPU's floating-point ALUs, tensor cores and certain on-chip memory, which occupy much more area than the tiny integer ALUs. An ASIC implementation could just build more simplified integer ALUs and multi-bank register files with a very simple decoder for better TLP. It is also possible to achieve chained operations with a reconfigurable ALU array.

SChernykh commented 5 years ago

A small concern is that there is no native bit rotation instruction on GPU

You're wrong. It's called funnel shift in NVIDIA PTX: https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#logic-and-shift-instructions-shf And V_ALIGNBIT_B32 in AMD GCN which runs just as fast as other bitwise logic operations.

In fact, a carefully designed ASIC could still outperform GPU by spending more resource/area on the bottlenecks

Read the description - I'm not saying that an ASIC is impossible. This code further reduces the ASIC advantage. I estimate that an ASIC can't be more than 2-3 times more efficient per watt than a GPU with this algorithm.

timolson commented 5 years ago

In fact, a carefully designed ASIC could still outperform GPUs by spending more resources/area on the bottlenecks. The memory bandwidth can be greatly improved using more, smaller DRAM partitions and parallel memory controllers with address interleaving. The random math cannot utilize the GPU's floating-point ALUs, tensor cores and certain on-chip memory, which occupy much more area than the tiny integer ALUs. An ASIC implementation could just build more simplified integer ALUs and multi-bank register files with a very simple decoder for better TLP. It is also possible to achieve chained operations with a reconfigurable ALU array.

👍👍 Listen to this guy.

This is why ProgPoW will also fail to be ASIC-resistant.

timolson commented 5 years ago

I estimate that an ASIC can't be more than 2-3 times more efficient per watt than a GPU with this algorithm.

IMO 2-3x is nowhere near good enough. That’s an enormous advantage compared to the thin-margin economics of mining. GPU/CPU mining will lose money just on the electricity without considering capex.

ga10b commented 5 years ago

@SChernykh CUDA PTX is not a native instruction set. It is good to see how this instruction is translated to native code on different architectures using the CUDA binary utilities. I guess we probably need to call a special intrinsic function to generate the desired instruction. Here is a table of instruction throughput: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#arithmetic-instructions

hyc commented 5 years ago

@timolson do you think ASICs will be less than 2x more expensive than mass-market GPUs?

ga10b commented 5 years ago

Given the same process node, there is no way a general-purpose processor could hold an ASIC's advantage below ~2x in terms of perf/$ or perf/watt, unless we make the program do every possible task that the processor is designed for, such as graphics and AI. It seems random math is aiming in the right direction but is far from the goal. Even some AI chips can beat GPUs by 10x in terms of power efficiency.

SChernykh commented 5 years ago

@ga10b This is a temporary solution for the next 6 months of course - RandomX is much more advanced, check tevador's repository.

NVIDIA GPUs don't have performance issues with cn/r, so I think rotation instructions are natively supported.

tevador commented 5 years ago

@ga10b Can you have a look at this? https://github.com/tevador/RandomX

The documentation is a bit outdated (memory access is being reworked at the moment), but it should be enough for a brief review.

SChernykh commented 5 years ago

It is good to see how this instruction is translated to the native code on different architectures using CUDA binary utils. I guess we probably need to call special intrinsic function to generate the desired instruction. Here is a table of instruction throughput

@ga10b I use the __funnelshift_l, __funnelshift_r intrinsics for rotations in CUDA code; the table you linked shows that bit shifts are only 2 times slower than additions, so it's fast enough.
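
For reference, the rotation each generated ROR/ROL performs is the standard two-shift idiom below. As a hedged note on lowering: compilers typically turn this pattern into a single ror/rol on x86, while on GPUs the same operation maps to a funnel shift (SHF / __funnelshift_r on NVIDIA PTX, V_ALIGNBIT_B32 on AMD GCN), as discussed above.

```cpp
#include <cstdint>

// Portable 32-bit rotates. The count is masked to 0-31, and the c == 0 case
// is handled separately to avoid the undefined shift by 32.
static inline uint32_t rotr32(uint32_t v, uint32_t c) {
    c &= 31;
    return c ? (v >> c) | (v << (32 - c)) : v;
}
static inline uint32_t rotl32(uint32_t v, uint32_t c) {
    c &= 31;
    return c ? (v << c) | (v >> (32 - c)) : v;
}
```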

SChernykh commented 5 years ago

I've updated random code generator. The first version had a number of issues (high hashrate fluctuations on some CPUs) and also one small coding error that resulted in lower minimal theoretical latency (parameter L in the table) of generated programs than expected. Here is a comparison of 1st and 2nd versions of the code generator:

Parameter           Version 1                   Version 2
Nominal L (cycles)  54                          45
Actual L (cycles)   28-40, 35.56 on average     45-47, 45.86 on average
Program length      25-108, 62.458 on average   60-69, 63.1 on average

Testnet pool should be up and running with the updated version next week (maybe even this weekend).

MoneroChan commented 5 years ago

Hi @SChernykh, just wondering about timolson's 2 suggestions:
1) 'Changing the program every nonce' affects GPU speed, and 4) 'Making programs even longer' affects CPU speed.

What if we used both 1) and 4) at varying levels, as control levers to adjust the CPU/GPU performance ratio, and at the same time reduce FPGA and ASIC performance?

I'm thinking if it hurts FPGAs and ASICs more than CPU/GPUs, it may outweigh the drop in performance. Are there any estimates available for the % performance drops?

Thanks,

SChernykh commented 5 years ago

@MoneroChan There is no way to estimate this; I need to actually implement it to see the impact on CPU/GPU. GPUs would have to either use something very close to the reference code, with a switch-case for every instruction, so they would be hit very hard by code divergence, or run all 6 possible instructions at every step and save/load results to/from LDS, which would require ~18 times more GPU instructions to execute. Either way, GPUs will be hit hard if the random program changes every nonce.
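
The ~18x figure can be reconstructed from the comment's own numbers; the per-op cost split below is my assumption, not stated in the thread:

```cpp
// Back-of-envelope reconstruction of the ~18x estimate above (assumed costs).
constexpr int candidate_ops = 6;  // possible instructions per program step
constexpr int insns_per_op  = 3;  // execute + LDS store + LDS load (rough)
constexpr int blowup = candidate_ops * insns_per_op;
static_assert(blowup == 18, "matches the ~18x estimate in the comment");
```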

MoneroChan commented 5 years ago

Thanks @SChernykh. It looks like the situation is suddenly changing very quickly for the worse. Any thoughts on the 70%+ new XMR hashrate that suddenly came online in the past 2 weeks?

Based on the ongoing reddit discussion, FPGAs or ASICs with FPGA controllers are strongly suspected.

Your option of an 18x slower but still competitive GPU with per-nonce program changes is starting to look better than having no GPUs left mining at all.

Any thoughts? :-/

SChernykh commented 5 years ago

@MoneroChan GPUs won't get 18x slower, they still have plenty of computing power available. They will get 3-4, maybe 5 times slower, but the problem is that an ASIC won't get slower if the program changes every nonce instead of every 2 minutes - the program generator is simple enough to integrate into the ASIC pipeline. GPUs will get much slower, but future ASICs won't, so the network as a whole will be more vulnerable.

tevador commented 5 years ago

Sooner or later, we will have to choose a CPU-only or a GPU-only PoW. Keeping both brings too many limitations in fighting ASICs.

jorgealonso108 commented 5 years ago

My Thoughts...and just thoughts from a WOW "CPU" miners perspective.

I see little difference in the comparison of GPU to ASIC or FPGA

With that said, and being fairly new at mining (since summer of 2018): I have mined cn/2 since Sept 2018 and was not aware that people could rent hash power in such magnitude. So I did a little research this morning and looked at what cn/2 hash power was available to rent... at any given moment people can rent up to 20 MH/s of cn/2 with a few clicks of a mouse. Being a small miner and an advocate of PoW and the crypto ecosystem, a question came to mind. Small miners have less chance of getting rewards than large miners, small pools have less chance of getting rewards than large pools, small miners on small pools vs large miners on small pools, and so on. The larger your budget, the larger your hash... So my mind pondered a few questions: how does PoW have any fairness scale at all? How can it grow an ecosystem? How do you get people involved and keep them involved in the current infrastructure? I know PoW spends a fair amount of time on dev because I see some of it on Git.

I come to the conclusion that you would only feel this way because you are invested in GPU's.

Now, as a CPU advocate, if purpose-built hardware becomes available that is economical and faster than GPUs, I would buy it or rent it. Further review and study would have to be made to create or evolve a different environment.

Are we really just creating or supporting an existing GPU environment? I will continue to mine WOW with CPU's

ASIC and FPGA resistance seems to be a state of mind. I program in Verilog

All GPU miners have a CPU.

MoneroChan commented 5 years ago

I agree with @tevador, and we can strategically use the 'math' for the next hardfork to slowly start shifting the performance ratio to our target hardware (CPU or GPU) to allow a 'smooth gradual transition', people will be less likely to complain, so earlier the better.

Personally, I'm invested in decentralization, either CPU or GPU is fine by me.

hyc commented 5 years ago

The program generator must run on the GPU, to avoid the compile/download overhead per nonce.

I've stated my opinion before though - our primary focus should be CPU; GPU optimization can come later if someone feels compelled to do it.

ifdefelse commented 5 years ago

This is why ProgPoW will also fail to be ASIC-resistant.

I was just pointed at this thread. Most of the above quoted response doesn't apply to ProgPoW. I've left a response in relation to ProgPoW here: https://github.com/ifdefelse/ProgPOW/issues/24#issuecomment-455929288

The program generator must run on the GPU, to avoid the compile/download overhead per nonce.

Why? We plan to decrease the ProgPoW period to change the random program about every 2 minutes, similar to the Monero block time. It only takes ~1 second to compile a kernel and it won't be hard to have the CPU compile GPU program N+1 while program N is executing. There's no overhead on having the miner call a different program when the block switches.

Btw feel free to use our miner as a reference for how to do on-line compilation of randomly generated programs, for both OpenCL and CUDA. Note that to on-line compile CUDA programs you'll need to distribute nvrtc. See this issue for details: https://github.com/AndreaLanfranchi/ethminer/issues/39

Now a quick review of CryptonightR. The high-level idea is fairly similar to ProgPoW in that a random sequence is generated every few minutes and compiled offline. This random sequence will almost certainly brick, or drastically reduce performance on, any existing or late-production ASIC. However, it does not fundamentally prevent ASICs from being manufactured. As Tim pointed out, ASIC manufacturers are building more flexibility into their ASICs so they can handle tweaks exactly like these.

I would not be surprised if in the near future an ASIC is produced that is simply a bunch of ARM A53s, or similar highly efficient ARM core, attached to a large pool of on-die memory. Adapting this system for a new algorithm variant would be as easy as recompiling some new miner software.

To get an idea of the gains possible from this consider that an iPhone A12 scores around 11,000 multi-core Geekbench while consuming <10 watts (it's hard to find precise power numbers). An AMD Threadripper 2950X scores 35,000 but consumes 180 watts. That's a 3x perf difference but a 20x power difference, making the ARM 6x as efficient.

On top of the CPU core efficiency differences, there are 2 fundamental points that make Cryptonight (all variants) attractive to ASICs:

Given the above I would not be surprised if an FPGA bitstream was developed fairly quickly. There'd be one small part to generate the random sequence for the current nonce. The rest of the logic would be simple in-order CPU cores that would have the AES and general math needed to execute the random sequence. The overall performance on a VU9P should be around that of a 16-core x86 CPU (since 16 2mb buffers could fit) at significantly lower power.

I think any algorithm that is designed to be "CPU-friendly" can have at least a 2x, and sometimes a 10x, ASIC made for it, since a sea of low-power ARM cores could run it efficiently.

All that said, I think CryptonightR is an improvement over CryptonightV2, so there's no reason not to go to it. However, it's not an end-point.

I'll also do a review of RandomX in the near future. I think having a random program per-nonce actually makes the algorithm more ASIC-friendly, but I need some time to work out the details.

SChernykh commented 5 years ago

@ifdefelse

Btw feel free to use our miner as a reference for how to do on-line compilation of randomly generated programs, for both OpenCL and CUDA

I think you misunderstood something. We've had this since the beginning, including precompilation for block N+1 while block N is running. Feel free to use our xmrig CryptonightR branch as a reference for how to do on-line compilation of randomly generated programs, for both OpenCL and CUDA.

SChernykh commented 5 years ago

To get an idea of the gains possible from this consider that an iPhone A12 scores around 11,000 multi-core Geekbench while consuming <10 watts (it's hard to find precise power numbers). An AMD Threadripper 2950X scores 35,000 but consumes 180 watts. That's a 3x perf difference but a 20x power difference, making the ARM 6x as efficient.

Geekbench != good reference. It's not 100% parallelizable, unlike PoW in general. It's better to compare server ARM processors like Cavium ThunderX with their x86 counterparts in 100% parallelizable server tasks. They are not much more efficient.