Open CaptainDero opened 2 years ago
What is the problem with the current AstroBWTv2? There is a chance that the recent hashrate spikes are more related to some network instability.
Improving ASIC/FPGA resistance.
Anti-FPGA/ASIC POW Mining ALGO
===============================
FPGAs rely on doing more work per clock by using a custom circuit design.
The plan to defeat this relies on loops and certain branches taking varying amounts of time, leading to ineffective usage of resources and pipeline stalls.
NOTE: The above alone will not defeat FPGAs, as an FPGA can do everything in a single cycle. In fact we would need 32 different operators, but we will use only the following 12; the rest is defeated by the branch behavior of the large switch. The instructions to defeat SIMD will be +, -, *, XOR, NOT, AND, shift left, shift right, reverse bits, popcnt(1s), rotate right, and rotate left [a rotate with a different rotate amount is considered a different instruction, so the rotates expand to 6 instructions]:
```go
step_3[i] += step_3[i]                                   // +
step_3[i] -= (step_3[i] ^ 91)                            // XOR and -
step_3[i] *= step_3[i]                                   // *
step_3[i] = step_3[i] ^ step_3[pos2]                     // XOR
step_3[i] = ^step_3[i]                                   // binary NOT operator
step_3[i] = step_3[i] & step_3[pos2]                     // AND
step_3[i] = step_3[i] << (step_3[i] & 3)                 // shift left
step_3[i] = step_3[i] >> (step_3[i] & 3)                 // shift right
step_3[i] = bits.Reverse8(step_3[i])                     // reverse bits
step_3[i] = step_3[i] ^ byte(bits.OnesCount8(step_3[i])) // ones count bits
step_3[i] = bits.RotateLeft8(step_3[i], int(step_3[i]))  // rotate bits by random amount
step_3[i] = bits.RotateLeft8(step_3[i], 1)               // rotate bits by 1
step_3[i] = step_3[i] ^ bits.RotateLeft8(step_3[i], 2)   // rotate bits by 2
step_3[i] = bits.RotateLeft8(step_3[i], 3)               // rotate bits by 3
step_3[i] = step_3[i] ^ bits.RotateLeft8(step_3[i], 4)   // rotate bits by 4
step_3[i] = bits.RotateLeft8(step_3[i], 5)               // rotate bits by 5
```
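To make the branch dispatch concrete, here is a minimal runnable sketch of how such a branchy inner loop could be driven. The opcode derivation (`op := step_3[i] & 7`) and the reduced 8-op table are illustrative assumptions modelled on the instruction list above, not the released source:

```go
package main

import (
	"fmt"
	"math/bits"
)

// mixBuffer applies one data-dependent operation per byte in [pos1, pos2).
// Sketch only: the opcode derivation and the reduced operation table are
// assumptions based on the instruction list above, not the released code.
func mixBuffer(step_3 []byte, pos1, pos2 int) {
	for i := pos1; i < pos2; i++ {
		op := step_3[i] & 7 // opcode taken from the data itself (assumption)
		switch op {
		case 0:
			step_3[i] += step_3[i] // +
		case 1:
			step_3[i] -= step_3[i] ^ 91 // XOR and -
		case 2:
			step_3[i] *= step_3[i] // *
		case 3:
			step_3[i] = ^step_3[i] // binary NOT
		case 4:
			step_3[i] = step_3[i] << (step_3[i] & 3) // shift left
		case 5:
			step_3[i] = step_3[i] >> (step_3[i] & 3) // shift right
		case 6:
			step_3[i] = bits.Reverse8(step_3[i]) // reverse bits
		default:
			step_3[i] = bits.RotateLeft8(step_3[i], int(step_3[i])) // rotate by data
		}
	}
}

func main() {
	buf := []byte{0x13, 0x9c, 0x41, 0xde, 0x07, 0x88}
	mixBuffer(buf, 0, len(buf))
	fmt.Printf("%x\n", buf)
}
```

Because the opcode is taken from the evolving buffer itself, the branch pattern changes with the data on every iteration, which is what makes fixed-pipeline hardware waste resources.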
A few high-end FPGAs on the market are:
the Virtex UltraScale+ VU19P, with 35 billion transistors and 9 million logic cells;
the Stratix 10 GX 10M, with 10.2 million logic elements.
So we plan to smartly place a million if-then-else branches with some random code sprinkled in (this may mean constants, prime numbers, etc.); this will be far above the emulation capacity of such devices.
Some of the steps are nonlinear for a number of reasons, for example the buffer length is different between each run.
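As a purely hypothetical illustration of what "random code with constants, prime numbers etc." could look like, each generated branch below mixes in its own constant, so an FPGA cannot collapse the branches into shared logic and must lay out each arm separately. The selector, constants, and operations are invented for illustration:

```go
package main

import (
	"fmt"
	"math/bits"
)

// mutate shows three of what would be a huge number of generated,
// non-identical arms. Every value here is a hypothetical example.
func mutate(x, selector byte) byte {
	switch selector % 3 {
	case 0:
		x = x*167 + 13 // unique prime multiplier for this arm
	case 1:
		x = bits.RotateLeft8(x^131, 5) // different constant and operation
	default:
		x = (x + 197) & (x | 61) // shares no logic with the other arms
	}
	return x
}

func main() {
	fmt.Printf("%x %x %x\n", mutate(0x5a, 0), mutate(0x5a, 1), mutate(0x5a, 2))
}
```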
NOTE: Please provide any suggestions to improve the ALGO & increase FPGA resistance.
Update 1: Source code released
As a newbie my opinions don't matter that much, but I'm reading Blockchain 2035. DigiByte, an older chain, uses a multi-algo approach to securing and distributing the chain.
Jared has a solid entrepreneurial background, from what I gathered from the book. I think it was that influence that helped solidify the concept as a way to offer incentives to a variety of different player levels, from enterprise down to the single-CPU, single-vote concept talked about in other, more famous white papers.
I wonder if Dero might consider this: instead of fighting the system, why not give enterprise-level operations an equal opportunity at turning a profit? Surely enterprises and individuals alike will see the value in securing a private smart contract platform like Dero.
Again, not sure if this is hugely relevant to the topic, but it seems like a sweet system, from my perspective.
https://dgbwiki.com/index.php?title=MultiAlgo
Activated in September 2014 from the Myriadcoin source code, this hard fork allowed for multi-algorithm mining. Its purpose was to create a number of different proof of work (PoW) mining methods to accommodate the different types of mining capabilities that exist, such as dedicated ASIC mining, GPU and CPU mining. This allows a larger number of people to access DigiByte mining, and therefore it creates a more decentralized blockchain, with the coins reaching groups who were unable to mine the coin on its original single-algorithm (Scrypt) fork.
Additional Source: https://github.com/myriadteam/myriadcoin
> So we plan to smartly place a million if-then-else branches with some random code sprinkled in (this may mean constants, prime numbers, etc.); this will be far above the emulation capacity of such devices.

What does this mean? Is there a more detailed explanation?
The source is updated, please check here.
I just wanted to point out that this branchy thing is not only very much anti-FPGA/ASIC, but also very anti-GPU. Doing a GPU implementation of that loop is maybe not impossible, but not efficient at all. I would rather love to see a move towards more memory usage. This is also a problem for many FPGAs and limits their efficiency down to what the memory can deliver, usually putting their speed in the same ballpark as consumer-grade GPUs, but with GPUs being much more affordable and multi-use. For example, increasing the number of elements to sort in the BWT phase would do pretty much exactly that, as it had been with AstroBWT v1.
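A minimal sketch of that direction, assuming the BWT element count is derived from the input itself so every run sorts a different amount of data; the constants and the SHA-256 derivation are illustrative assumptions, not AstroBWT's actual specification:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// sortLen derives a data-dependent number of elements for the BWT stage,
// in the spirit of AstroBWT v1's variable-size scratchpad. Hypothetical.
func sortLen(input []byte) int {
	h := sha256.Sum256(input)
	const min = 1 << 18  // assumed floor so memory use stays substantial
	const span = 1 << 20 // assumed range of variation between runs
	return min + int(binary.LittleEndian.Uint32(h[:4])%span)
}

func main() {
	fmt.Println(sortLen([]byte("block header bytes")))
}
```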
That said: from my experience, I heavily doubt that v2 has a problem with FPGAs at the moment; implementing this algorithm on those devices is already very hard and will not give the same rewards as other FPGA-dominated algorithms. I guess the current secret miners rather use an improved GPU implementation. Note that if the algo stays on v2, we will have something like that in public very soon :)
Additionally: I have not made a space calculation of this thing on an FPGA yet (but maybe the devs should), but I would assume that, unlike GPUs, an FPGA can do all the 256 branches in parallel (so only a few clock cycles), write them all to an on-chip scratch memory, and then read back just the right one to put back into the dataset. In order to save SRAM on the chip, this would be done by pulling the inner loop outside. I will sketch how it could work. Note that "thread" here means an independent instance that runs in parallel on the chip:
```go
for i := pos1; i < pos2; i++ {
	sram[thread] = thread_function(step_3[i]) // every parallel instance computes its own branch
	step_3[i] = sram[op]                      // read back only the branch selected by the opcode
}
```
This would allow an FPGA to run this loop approx. 256 times faster than a GPU, which needs to have all threads running in lock step and where the threads of a work group cannot do independent ops, given that there is enough space on the FPGA grid (this should be calculated).
But if there is enough space and the algo ships like this, it will likely be CPU-only for a while until FPGAs take over, without the intermediate step of GPUs, because they have no chance on this loop.
I only understand the concepts at a high level, so forgive me if this is a silly question. I didn't see the logic for validating that a hash is valid. How do you ensure someone doesn't implement a variation that just uses a fixed branch without all of the possible random code case variations, i.e. just forcing a predicted branch that would still produce a valid hash?
> I just wanted to point out that this branchy thing is not only very much anti-FPGA/ASIC, but also very anti-GPU. [...] I guess the current secret miners rather use an improved GPU implementation.
I don't quite understand your concern. If you think the current secret miners are GPUs, shouldn't our new algorithm also limit GPUs, to avoid the current mining mess and give CPUs some profit?
Well, most algorithms that are supposed to be CPU-only mineable are vulnerable to botnets or are potentially well suited for FPGAs, or sometimes even both. With GPU mining this is usually not the case, provided that the publicly available mining software is close to the optimum possible. The latter is not the case here; public GPU miners for v2 are far from ideal, but a solution is in the making.
MinotaurX (LCC) or the YesPoWer algo: both are mature algorithms, why not consider them? Why always stay within the Monero ecosystem? The Monero ecosystem has inherent flaws.
Let's hypothesize for a moment: DERO adopts LCC's MinotaurX algo. Litecoin's popularity plus DERO's anonymous privacy would, in my opinion, make the market hot.
Using the ALGO of other blockchains brings another series of issues.
Yeah, it's REALLY not hard to implement their algo without branching... in addition, pretty much all the branches would take the same amount of time on an FPGA, making their claim of pipeline stalls irrelevant (also, you can actually just stuff the results in a damn FIFO, the hell is wrong with you?)
If I was gonna do it, I'd do it for GPU, though, simply cause it's somewhat more fun, a lot less work, and... just amusing. I saw the algo and giggled like a lunatic.
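For what it's worth, a minimal illustration of the branchless trick: evaluate every candidate operation unconditionally and select one by index, so GPU threads never diverge. The four-op table here is a reduced stand-in for the full branch set:

```go
package main

import (
	"fmt"
	"math/bits"
)

// applyBranchless computes all candidate results and picks one by index,
// replacing the divergent switch with a uniform table lookup.
func applyBranchless(x, op byte) byte {
	results := [4]byte{
		x + x,                  // +
		x * x,                  // *
		bits.Reverse8(x),       // reverse bits
		bits.RotateLeft8(x, 3), // rotate by 3
	}
	return results[op&3] // uniform select, no divergent control flow
}

func main() {
	fmt.Printf("%x\n", applyBranchless(0x9d, 2))
}
```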
Suggestions are invited for the following: a POW mining ALGO with CPU & GPU support and ASIC/FPGA resistance.
Please share your suggestions, expertise, papers, etc. to develop the above algorithm.
Update 1: Source code released