RFv2 speed - Githubissues

itwysgsl commented 5 years ago

Hello again @bschn2, I have concerns about v2 speed. Don't you think that rainforest became way to slow? My MacBook's Intel i7 produce ~30 H/s and it's not enough to mine at least one block on lowest diff on test network 😅

wtarreau commented 5 years ago

Strange about these 30 H/s, it's 40 times slower than my i7 (mine does 1200). Did you build with -march=native -O3 ? How did you test, using rf_test or did you simply iterate over rh_hash() with a NULL rambox ? This last one will solely depend on memory speed since it needs to initialize it. That's what I figured when implementing the bench mode in rf_test. You have to pre-allocate a rambox and pass it to rf_hash(), which restores it at the end. At first I didn't like the principle but thinking about the reasons (making it very hard for FPGAs and ASICs to try to emulate the RAM) I finally found it very smart, as the cost is only incurred to those trying to massively scale it ;-)

wtarreau commented 5 years ago

My old ATOM D510 does 55 H/s :-)

itwysgsl commented 5 years ago

@wtarreau oh, I guess there is little missunderstanding. It gives me ~30 H/s per thread using cpu miner source code example from rfv2 branch. I just tried recompile test example with -march=native -O3 option and now it give me ~ 130 H/s per thread :D

itwysgsl commented 5 years ago

Interesting, what purpose of this -march=native -O3 option?

wtarreau commented 5 years ago

On Thu, Apr 04, 2019 at 11:48:04PM -0700, iamstenman wrote:

@wtarreau oh, I guess there is little missunderstanding. It gives me ~30 H/s per thread using cpu miner source code example from rfv2 branch. I just tried recompile test example with -march=native -O3 option and now it give me ~ 130 H/s per thread :D

Much better!

Interesting, what purpose of this -march=native -O3 option?

-O3 is the optimisation level. It's the highest on most compilers. Recently gcc introduced -Ofast which is more or less the same. -march=native indicates that certain features present on the CPU you're building on will be enabled. This is recommended when you don't want to manually enable/disable each and every feature and you know you're building to run on your local machine.

You can check the output of these :

$ gcc -dM -xc -E -< /dev/null | less

$ gcc -march=native -dM -xc -E -< /dev/null | less

-dM dumps all known macros -xc says the source is a C file -E says just emit the preprocessed output So you end up with a lot of defines.

For example you'll find some AES, SSE, AVX and whatever is available on this CPU that is not available on another one. This is why my server's Atom was dying in "illegal instruction" previously when I tried to run it from the executable made for the i7 :-)

Willy

itwysgsl commented 5 years ago

Hello again @wtarreau. After some tests (~20 minutes mining on testnet on lowest difficulty without any blocks) I still think that RFv2 in current state way to slow :(

wtarreau commented 5 years ago

On Fri, Apr 05, 2019 at 06:04:18PM +0000, iamstenman wrote:

Hello again @wtarreau. After some tests (~20 minutes mining on testnet on lowest difficulty without any blocks) I still think that RFv2 in current state way to slow :(

It's very possible, I don't know much what all this implies in fact :-) Probably that lowering the number of rounds would help a lot, though maybe some other parts of the algo are very expensive as well. What would be an correct hashrate in your opinion ? I'm interested by pure curiosity in checking on various devices (arm, x86) the impacts of various knobs.

Willy

wtarreau commented 5 years ago

For example, I just checked and just changing this multiples the perf by 3 :

if (__builtin_clrsbl(old) > 5) {
if (__builtin_clrsbl(old) > 4) {

(...)

loops = sin_scaled(msgh) * 3;
loops = sin_scaled(msgh) + 1;

Maybe that's enough in your case ?

Willy

wtarreau commented 5 years ago

On Fri, Apr 05, 2019 at 09:34:08PM +0200, Willy Tarreau wrote:

For example, I just checked and just changing this multiples the perf by 3 :

if (__builtin_clrsbl(old) > 5) {

if (__builtin_clrsbl(old) > 4) {

(...)

loops = sin_scaled(msgh) * 3;

loops = sin_scaled(msgh) + 1;

Maybe that's enough in your case ?

I can even go further by removing two sets of divbox+scramble in the round. My i7 reaches 4500 H/s there.

Willy

wtarreau commented 5 years ago

And I can reach 6k hashes/s on the i7 by flattening the curve giving the number of loops to reduce the extremes :

static uint8_t sin_scaled(unsigned int x) {

return pow(sin(x / 16.0), 5) * 127.0 + 128.0;
return sqrt(sqrt(pow(sin(x / 16.0), 5) + 1.0)) * 100.0; }

(rpi does 3900 here and npi 10300).

Willy

LinuXperia commented 5 years ago

I was never a fan of this difficulty target calculation idea.

The whole Bitcoin difficulty target calculation is a second class solution and does not really works.

Imagine the hashrate goes very high and the difficulty increases to maximum. Then only 1 Hash or a very small number of hashes can be below such a extreme set Target and producing such a Hash may even not be possible as no data in the block header can produce such a hash.

It is highly questionable even if it will be possible to create the needed hashes for MicroBitcoin as 21 trilions of coins need to be mined instead the 21 million coins bitcoin has.

We may very well get into problems with not having enogh of hashes to be mined in MicroBitcoin as trillions of Coins are planned to be mined.

Becouse of this lot of wasted hash calculation are produced and energy is wasted.

Reducing the Loops as i see benefits big Power Hungry CPUs over small power efficient CPUs and is against the Idea of the RainForest Hash Algorithm.

I suggest to go the path that EquiHash uses and is implemented as a example in Bitcoin Gold or ZCash and instead of finding a hash that is below a target finding instead half hashes colusions inside the rambox.

I really recommend to drop this whole target calculation and instead use the EquiHash way to find colusions inside the Rambox like Equihash does it.

If my memory serves me right EquiHash creates 1 RamBox and does populate the RamBox for every Nonce using 1024 Loops with Salsa20 calculated Hashes.

After this it looks for Hash Colusions inside this RamBox and if it finds it append it after the Nonce in the Blockheader and submit it to the Network.

Becouse of this the Block Header of EquiHash has 80 Bytes plus the 32 Byte Hash colusion hash.

Maybe something simillar can be done also with the RainForest Hash Algorithm ?

bschn2 commented 5 years ago

Gentlemen,

please be careful if you start to change the number of loops, iterations per round, or the loop curve!

@wtarreau at least make sure you round the pow up by adding 1.5 and not 1.0. This aside, your suggestion looks reasonable to me as it gives more a body-like shape which stays longer in the higher range hence reduces the peaks. In any case you must check the average number of history buffer entries and double it in the define (well maybe less than double now with the curve change but it must at least be 20-25% larger to avoid repopulating the rambox from sratch at the end). If the number of entries is lower than 512 then you need to decrease the clrsbl threshold so that it writes more often.

The divbox+scramble calls you suggested to remove were indeed here only to balance the power between low-end CPUs and high-end ones. There are exactly 3 which are doubled and could safely be reduced to 1 each (the original smhasher tests were run with both configurations).

And yes, please keep an eye on raspberry-pi type of devices as it seems capital to me that such machines are about as fast as regular PCs if we really want to incentive to energy savings. Your numbers are fine to me as I initially targetted 1k to 10k H/s/core so we're pretty much in this area here.

jdelorme3 commented 5 years ago

And how many times more faster it is on large devices and small devices for equihash ? Are you sure we don't favour the large ones only there ?

jdelorme3 commented 5 years ago

Also the hash verifycation time count alot. I think that rainforest is great for this, it cost but not way two much.

bschn2 commented 5 years ago

Dear @LinuXperia,

rfv2 's rambox is not far from what you describe since the rambox is modified by every lookup based on the hashed message (thus includes the nonce). However keep in mind that Salsa20 was designed to be extremely fast on x86 processors (typically less than 4 cycles per byte), and that this can hardly be considered fair for emerging countries where such hardware simply is not available, all people have is a previous generation smartphone to do everything. There the power often comes from local solar panels so maintaining a decent capacity to such devices is very important for the overall fossil energy consumption.

wtarreau commented 5 years ago

@bschn2 good catch for 1.5, I'll try this to stay safe. Thanks for confirming the divbox that could be reduced. Regarding the write ratio if you notice I already adjusted it, but granted, i didn't check the values. I'll do and will prepare a patch will all this soon.

LinuXperia commented 5 years ago

@bschn2 I am sorry it looks like i did not express my self very good how to improve the actual situation with Rainforest so that we instead look to find a rare Hash with leading Zeros we look to find a collusion of each calculated Nonce Hash in the Rambox.

here is my easy to implement improvement suggestion:

The Problem is the part that require us to check for a specific rare hash with leading zeros in front of the Hash. Such a rare hash occurence requires a lot of brute force hash calculation which is against what rainforest algorithm stands for.

My working approach that solves this problem without changing a lot of the code is that instead we look to find a rare end hash with leading zeros in front and be under a target hash we use the amount of the leading zeros in the nBits field value as amount of leading bytes of the calculated Nonce endhash to be matched in the rambox.

Lets say the lowest dificlulty require us to have a hash with two leading zeros. Instead that we now do hash bruteforce till a rare hash is found with two leading zeros what we do now with rainforest is we match two bytes of the calculated end hash and look if such a byte variation collusion exist in the rambox using memcmp().

The Value of how many leading Zeros are required is stored in the 80 Bytes pdata Blockheader that each thread has. So implementing this is very easy and should work like a charme.

If somebody finds such a Byte End Hash Collusion very fast then the difficulty adjust automaticly the nbits field and require us to find a hash with let say 8 leading zeros now which is more harder than before.

We again instead looking to find a endhash with 8 leading zeros with rainforest just match 8 bytes of each nonce end hash and look in the the rambox if such a Byte combination exist.

If yes and this was found again faster than the 1 Minute requirement that Microbitcoin has for mining a block the difficulty Algorithm will adjust the nbits field value to find a hash with lets say 16 leading Zeros which for us and the Rainforest algoirthm means we need to map 16 Bytes of each calculated Nonce endhash in the rambox.

If finding such a 16 Byte Hash Combination needs now 2 Minutes instead the required 1 Minute then the dificulty algotithm will drop the difficulty in the nbits field to look for 14 leading zeros aka 14 Bytes in the rambox and make it easier than before.

By this it automaticly adjust everything so we stay inside the 1 Minute Time frame for mining a Block without that we need to bruteforce Hash calculations to find a rare hash with leadin zeros.

itwysgsl commented 5 years ago

Hello again @bschn2 @wtarreau. I just tested this https://github.com/bschn2/rainforest/commit/3b35a37990a546856953566cd967b35b90d27733 commit and it's actually ~3-4 times faster on my i7. But it's still not fast enough. I tried to mine around 10 minutes on same testnet with low diff and etc, but it still the same. After that I started experimenting by dividing amout of loops at this line for 3, 6, 12 and so on (don't ask why, I just wanted to test speed 😅): https://github.com/bschn2/rainforest/blob/3b35a37990a546856953566cd967b35b90d27733/rfv2_core.c#L726 And here is what I got:

Divided by 3: 1111.15 H/s/thread without any blocks in a long time
Divided by 6: 2533.53 H/s/thread
Divided by 12: 5446.98 H/s/thread
Divided by 24: 11606.02 H/s/thread at this point I start getting blocks with more or less decent speed

Maybe this test would be helpful in some way :)

bschn2 commented 5 years ago

@itwysgsl do you mean you don't find shares or you don't find blocks ? Not finding shares is indeed problematic but should simply be a matter of difficulty. Indeed, at 1.11kH/s you scan a full 16-bit range every 60 seconds so the pool's difficulty is low enough, you must find these shares. What is your difficulty in this case ?

If what you don't find is a block, this sounds normal as the purpose is that the chances to find a block are equally shared among miners so if you have 1000 miners one will mine the block while the 999 other ones will not. So on average if you emit a new block every minute, for 1000 miners each of them would on average find a block every 1000 minutes. But again even then it's a matter of adjusting the difficulty as if the target is 0x0000FFFF...FFFF then at 1kH/s you will find it in 60 seconds.

Last point, I'm a bit surprised by your i7's performance here, did you enable correct build options ? This is roughly 5 times slower than mine without dividing, hence 15 times slower. Did you enable -O3 and -march=native ?

bschn2 commented 5 years ago

Oh and by the way, many thanks for sharing your observations!

bschn2 commented 5 years ago

@LinuXperia I'm still unsure I really understand the principle you're describing. I think it is very similar to hashing except that you look up some bits in the rambox. What I don't understand is how you populate it and how you validate a share or a block afterwards. Also in any case the computation time spent is required as a proof of work. Whether you find the bits in the rambox or anywhere else inside the hash algorithm, it's the same, you have to iterate over nonces so that most participants find shares to be paid and that one of them finds the block. So it's unclear to me what your method brings at this point.

wtarreau commented 5 years ago

@itwysgsl I am also surprised by your numbers, how do you test ? Is it with the patched cpuminer maybe ? Have you tried "rfv2_test -b -t $(nproc)" ? I must confess I have not checked how or when it initializes the rambox. I hope it does it only once once the first call, but I don't know. This could explain your low performance if it does a full rambox for each hash.

LinuXperia commented 5 years ago

@LinuXperia Whether you find the bits in the rambox or anywhere else inside the hash algorithm, it's the same, you have to iterate over nonces so that most participants find shares to be paid and that one of them finds the block. So it's unclear to me what your method brings at this point.

@bschn2 The way i understand it is that the 110 H/s/Thread are way to low to find a Block in about 60 Seconds as a Single Miner at the lowest difficulty.

Becouse of this Problem improvements are needed so a Single Miner using a Single Board Computer running Microbitcoin Rainforest Miner is able to mine a Block as Solo Miner in about 60 Seconds. The test of just a single miner runing just one node and one Miner failed as he was not able to Mine any blcoks in the given time.

So the minimal requirement to test on one Block Chain Node using one Miner to mine one Block in the given period failed.

His hashing Numbers looks okey as he has the same Hash Speed on his i7 as i have on my i7

Becouse of this i suggested to abdon the bitcoin aproach looking to find a rare hash with leading zeros and instead use the equihash apraoch in finding bit collusions.

Finding bit collusions make it easier to mine blocks as we dont need to bruteforce a huge amount of hashes and loose time till we found such a rare hash.

wtarreau commented 5 years ago

@LinuXperia how do you measure this performance, and how is this lowest difficulty calculated or configured (sorry I'm not much aware of all this, I'm only using cpuminer to validate the thermal robustness of my build farm). With rfv2_test I'm seeing numbers 8 times larger than yours: $ gcc -march=native -O3 -o rfv2_test rfv2_test.c -pthread -lm $ ./rfv2_test -b -t 1 847 hashes, 1.021 sec, 1 thread, 829.374 H/s, 829.374 H/s/thread 860 hashes, 1.000 sec, 1 thread, 859.975 H/s, 859.975 H/s/thread 849 hashes, 1.000 sec, 1 thread, 848.988 H/s, 848.988 H/s/thread 858 hashes, 1.000 sec, 1 thread, 857.988 H/s, 857.988 H/s/thread ^C And on ARM: $ ./rfv2_test -b -t 1 1334 hashes, 1.071 sec, 1 thread, 1245.832 H/s, 1245.832 H/s/thread 1336 hashes, 1.000 sec, 1 thread, 1335.768 H/s, 1335.768 H/s/thread 1333 hashes, 1.000 sec, 1 thread, 1332.841 H/s, 1332.841 H/s/thread ^C

bschn2 commented 5 years ago

@LinuXperia well, I really don't understand the method you're trying to explain, I'm sorry. I don't understand why you say "rare hash with leading zeroes", the number of zeroes is log2(1/frequency) so if a matching hash is rare it's because it has been made so by the difficulty. I will have a look at equihash to try to understand how it differs regarding this, but I still fail to see how that would change anything given that we want a miner to spend time to prove his work.

bschn2 commented 5 years ago

Well after having read a bit about equihash I think I get it a bit better, but in my humble opinion it focuses solely on the memory-bound aspect and as a result it has already been ported to an ASIC (bitmain's Z9 which is 10 times faster for the price than a GPU) : https://www.heise.de/newsticker/meldung/Ende-der-Grafikkarten-Aera-8000-ASIC-Miner-fuer-Zcash-Bitcoin-Gold-Co-4091821.html This even resulted in a 51% attack on Bitcoin Gold and a loss of $18M. This is exactly the type of things I want to avoid.

Also looking at the numbers, it's said that an Nvidia 1080Ti does only 650 sol/s (=hashes/s) so it's even way lower than what we're doing on rfv2. The main challenge we have to address is to make sure that MBC's short-lived blocks can be mined in the block's life, and the solution above apparently makes this situation worse from what I'm reading.

endlessloop2 commented 5 years ago

Hello y'all, I've been following RF since last year and implemented V1 on my unreleased coin. I'd like to know what is the purpose of making the algorithm "faster", considering coins can change starting difficulty. What kind of tests are you doing that you are finding it "slow to find blocks"? I believe the algorithm is fine at this point except for any bugs that may be unsolved, but it shouldn't be changed to make it "faster".

wtarreau commented 5 years ago

Interesting. It's important to keep in mind that memory speed varies with the device's price. The DRAM access times I've measured so far: http://git.1wt.eu/web?p=ramspeed.git;a=blob;f=data/results.txt So a cheap board has 2-3 times the access time of a PC, and it's said (though I cannot verify) that GPUs are even faster with GDDR5. Also the PC's memory controller allows to initiate multiple accesses at once while the cheap devices don't, resulting in almost a 10 times difference in multi-core tests. It's not unreasonable to imagine someone plugging SRAM to an FPGA or ASIC and get 12ns access time where a PC needs 60. The cost of 96 MB of SRAM would certainly be prohibitively high though. I think that for what you guys are looking for, the algo mixes a lot of expensive features and makes it prohibitive to implement on hardware. I do have ideas how to help your algos be memory bound but they will be extremely slow and from what I'm reading the speed seems to be an issue for your use case.

jdelorme3 commented 5 years ago

Also looking at the numbers, it's said that an Nvidia 1080Ti does only 650 sol/s (=hashes/s) so it's even way lower than what we're doing on rfv2.

So this is what I was afraind of, equihash mostly target high performance hardware. I doubt I can run it on my Raspberry Pi!

LinuXperia commented 5 years ago

Ohh here it is already 2 AM need to go to sleep.

Look what it need to be done from my side of view is first calculate how many leading zeros are needed so we know how many bytes from the nonce endhash we need to map in the rambox on each rfv2_scanhash function call.

Here is a hint how to get the leading zeros from the nbits value that is in the pdata block header: https://bitcoin.org/en/developer-reference#target-nbits

I will post a code example in next 24 hours.

All what is needed to do is extract the amount of the leading zeros from the nbits filed value and after each RainForest Nonce endhash memcompare this amount of bytes in the rambox using the command memcmp();

to avoid to much memory scan cost als a random DAG like ethereum has it can be precomputed for x amount of Blocks and copied to the L1 or L2 cache and used it instead of the ram Box.

here you have more information about the DAG Ethereum uses. we could also create a 10 Killobytes DAG load it after each endHash into L1 or L2 CPU Cash and scan the nonce Byte part there instead in the rambox.

https://www.sanfoundry.com/cpp-program-construct-random-directed-acyclic-graph/

So what we do after nBytes from the Nonce Endash was found in the specific memory is we submit it.

Okey have to go to sleep. very late here. Will write you back in 24 hours but i guess you will code it allready til then.

As said extract the amount of the leading zeros from the nbits then you know how many bytes from each Nonce endhash you need to map.

Then after each Nonce Endhash compare n Amount of leading bytes from the Nonce endhash and check if it exist in a Memory area.

If such a byte collusion was found then submit nonce to Network.

that is it.

In most case with this aproach we will find right Nonce afte Milli seconds of calculation so that way a Single Miner can Easy mine Millions of Blocks alone every 1 Minute as required.

the difficulty algorithm then will adjust the nbits vaule automaticly and by this require us to use more Bytes to check for collusion and by this slow us down to 1 Minute on average for 1 Block to Mine.

bschn2 commented 5 years ago

But the nBits field is what defines the difficulty and the target. I'm not seeing how you want to mine faster when this difficulty is not changed. By the way @itwysgsl what's your lowest difficulty or nBits ? Bitcoin's minimal one is 0x1d00ffff which means a target of 0x00000000ffff00...00. It is chosen for very fast ASIC-compatible hashes (sha256) and long block life as it requires 4 billion hashes to scan entirely. Maybe you're still using the same ? The article pointed to above indicates that in regtest mode Bitcoin uses 0x207fffff which is 0x7fffff00...00, so if you forked off Bitcoin you also have access to this possible difficulty. At 1.1kH/s and a block per minute I'd suggest starting with nBits=0x1f00ffff for a difficulty of 1, which will give a target of 0x0000ffff00...00, hence exactly one block per minute.

itwysgsl commented 5 years ago

Hello @bschn2 I set min difficulty to 0x000000ffffffffffffffffffffffffffffffffffffffffffffffffffffffffff on test network for sake of easier testing.

bschn2 commented 5 years ago

So that's the issue, it takes 256 minutes for a block at 1kH/s, thus it is only compatible with very fast hashes. Could you please change the difficulty to 0x0000ffff... and retest ? (no more than 4 zero digits on the left). It must work there. I'm asking this because I'm really concerned with weakening the hash. If we hash too fast, we take the risk of seeing strong optimizations come back, which is something we don't want to face anymore!

itwysgsl commented 5 years ago

@bschn2 I'm not sure if it's good idea to change min difficulty. It's fine for testnet but not for main network I think.

MikeMurdo commented 5 years ago

I've been following this thread for a while and am seeing something wrong here. Rfv2 has sensibly the same hash rate as scrypt, and requires a comparable difficulty. Plus if MBC emits a block each and every minute this must be taken into account as well. With a target 0x00000000ffff... you need 4 billion hashes to find the block. At 1kH/s/core this means 4.3 million secondscore or 49 dayscores. If you want to find them in less than one minute, you need 71000 cores. This is possibly fine when the coin is popular but it can significantly hinder its growth in its early days.

Even with your test's min difficulty of 0x000000ffff... you need 279 cores to find a block in one minute, which is verified by your experiments showing you didn't find a single block in 10 minutes with your i7.

Note that I'm not saying these numbers are not reasonable, and I don't buy the stupid paradigm of the solo miner who must be able to find a block even if he's alone : a coin having a single miner is deemed to die anyway. But I find it important to adapt the configuration to the algorithm's performance and not the opposite, or history will repeat itself.

We've seen the effect of the optimization of the last round in v1. I'm with @endlessloop2 here, please don't reduce the number of rounds! If it were just me I'd even add more to discourage implementations from even trying to unroll part of a loop (but I can admit that when there's less than 1% to shave it's unlikely they'll go through this pain).

Based on the numbers shown here, I think that 0x000000ff... for mainnet and 0x0000ff... for testnet are way more suitable targets to maintain a safe algorithm that people can still mine.

itwysgsl commented 5 years ago

@bschn2 I've made little research (checked code of other coins) and realized that many of them changed min diff during algo switching. I beleive it's one of possible soultions of current speed issue. Going to give it a shot right now.

itwysgsl commented 5 years ago

Also, can you guys check this https://github.com/bschn2/rainforest/issues/16 issue?

MikeMurdo commented 5 years ago

Great news for the difficulty @itwysgsl ! Keep up the good work, many of us would like to see you rule the crypto world!

wtarreau commented 5 years ago

I've replied to #16 (it's about CRC32 implementation).

LinuXperia commented 5 years ago

In Case a much better solution is still required here is how i would improve the code without to change the Rainforest Algorithm

First as said extract how many leading zeros aka bytes are needed from the nBits Feild Value

@@ -44,6 +46,24 @@ int scanhash_rfv2(int thr_id, struct work *work, uint32_t max_nonce, uint64_t *h
        for (int k=0; k < 19; k++)
                be32enc(&endiandata[k], pdata[k]);

+       // EXTRACT HOW MANY BYTES FOR COLLUSION FROM THE nBITS FIELD VALUE
+       // Subtract from 0x20(32) minus the actual Diff aka 0x1f = to get how many bytes to check for collusion
+       uint32_t nBytesCollusion = 0x20 - (endiandata[18] >> 24);

Then what you do is instead to check if we are below a specific target we just look for byte collision of the Hash in a Memory Space be it in the RamBox or a DAG that is created same on all devices

@@ -55,13 +75,19 @@ int scanhash_rfv2(int thr_id, struct work *work, uint32_t max_nonce, uint64_t *h
                be32enc(&endiandata[19], nonce);
                rfv2_hash(hash, endiandata, 80, rambox, NULL);

-               if (hash[7] <= Htarg && fulltest(hash, ptarget)) {
+               // SCAN TROUGH 4096 MEMORY TO CHECK FOR A BYTE COLLUSION WITH nBYtes FROM THE HASH
+               for(uint32_t nBytePos=0; nBytePos<(4096-nBytesCollusion); nBytePos++) {
+                       //CHECK BYTE FOR BYTE IF A nBYTE COLLUSION EXIST IN THE MEM SPACE
+                       // IF YES SUBMIT SHARE / BLOCK TO NETWORK AS SOLUTION WAS FOUND
+                       if( !memcmp(ByteCollusionSpace+nBytePos, hash, nBytesCollusion) ) {
                work_set_target_ratio(work, hash);
                pdata[19] = nonce;
                *hashes_done = pdata[19] - first_nonce;
                ret = 1;
                goto out;
+           }
+       }

LinuXperia commented 5 years ago

@bschn2 I've made little research (checked code of other coins) and realized that many of them changed min diff during algo switching. I beleive it's one of possible soultions of current speed issue. Going to give it a shot right now.

@itwysgsl The problem is what happens after lets say some mined blocks a big miner jumps in and increases the difficulty and after this goes quickly away ?

another possibility would be that a miner finds a vulnerbility and by this gains a huge advance, mine some blocks, increases the difficulty and goes away.

Then Coin will be blocked as no hashpower exist for such a difficulty.

You are stuck with a high difficulty and the coin will be not able to process any other blocks!

We are not going away from the problem that bruteforcing hashes is not what RainForest stands for.

Rainforest need another algorithm for finding a Solution without this stupid brutefoce hash target meeting.

itwysgsl commented 5 years ago

@LinuXperia this is well known issue, and very smart guy @zawy12 (author of LWMA algo) came up with nice solution of this issue - TSA difficulty ajustment algo. Difficulty is slowly going down during mining of block itself.

itwysgsl commented 5 years ago

P.s. @LinuXperia your idea with scanning rambox is also interesting, if you can make some proof of concept code, we can give it a shot :D

itwysgsl commented 5 years ago

@wtarreau can you also check this issue https://github.com/bschn2/rainforest/issues/17 ?

itwysgsl commented 5 years ago

I finally finished implementation of RFv2 in cpuminer https://github.com/MicroBitcoinOrg/Cpuminer/commit/8897a982b462f1fa66bb25559431e14913da1eb5 :D

wtarreau commented 5 years ago

I see that you enabled it side-by-side with rfv1. In the patch @MikeMurdo did, he removed support for v1 and replaced it with v2. I don't think the difference is important though :-)

MikeMurdo commented 5 years ago

So are your performance issues solved ?

wtarreau commented 5 years ago

At least on my i7-6700k, I'm getting this, which is pretty close to rf_test: [2019-04-08 00:13:27] Total: 5.64 kH/s Note that I lowered the benchmark target a little bit because it was scrolling too fast.

bschn2 commented 5 years ago

I was about to close this one, but figured that if any participant wants to share new measures on various devices, it might still be the best place, thus I'm leaving it open. Please don't forget that the updated code is in the master branch now.

wtarreau commented 5 years ago

Good idea. rfv2_test gives:

4200/s on my i7-6700k at 4.4 GHz (4 cores, 8 threads)
1400/s on my J4105 at 2.4 GHz (4 cores, 4 threads)
5600/s on my NanoPI-Neo4 at 2x2.1+4x1.7 GHz (2xA72+4xA53 cores, 6 threads)
6600/s on my NanoPI-Fire3 at 1.6 GHz (8 A53 cores, 8 threads)
780/s on an Atom Z8350 at 1.68 GHz (4 cores, 4 threads)
330/s on a rk3288 (armv7, 4 cores at 1.9 GHz).

bschn2 / rainforest

RFv2 speed #15