bschn2 / rainforest

The rainforest cryptocurrency algorithm
MIT License

RFv2 speed #15

Open itwysgsl opened 5 years ago

itwysgsl commented 5 years ago

Hello again @bschn2, I have concerns about v2 speed. Don't you think rainforest became way too slow? My MacBook's Intel i7 produces ~30 H/s, and that's not enough to mine even one block at the lowest diff on the test network 😅

itwysgsl commented 5 years ago

@MikeMurdo even if so, we now have >50 hashrate coming from public pools :) Better than nothing.

bschn2 commented 5 years ago

Gentlemen,

I must say I have been quite concerned these last few days about these reports of high hash rates coming out of nowhere on MBC, especially from this experimental closed-source miner showing impressive values, as if it had something to hide.

So I decided to investigate. Today I retrieved 9486 MBC block headers using mbc.wiki's API and decided to hash them again to see if a pattern would stand out. I can at least say my research was fruitful! Below I graphed the number of loops required to find each hash, from 2 to 257 (graph attached). In a uniformly distributed space, loops are above 200 about 95% of the time. As you can all see, that only happened at the beginning and then it stopped: most hashes now require only 2 loops and are roughly 50 times faster to compute that way, even though such hashes should represent only 1/256 of the total. And the graph is becoming even sparser over time! This was made possible by the late drop of the floating-point operations, which turned the transcendental functions into their discrete equivalents, which are thus easy to iterate over. Fortunately, our last-minute change adjusting the rambox write ratio to be inversely proportional to the loop count was there to save us!

My biggest concern is that this started, with 2-loop hashes, from the very first 85 blocks! It means someone figured out it was faster to skip all the other nonces while we were still adjusting the code. This makes me very angry, because we designed this openly as a community, and some people watch us work hard, take note of issues and silently keep them to exploit later instead of reporting them before it's too late. Oh, they are certainly reading us right now...

This stopped at block 915085, started again at 915328, and is still going on. A few blocks are still being found by honest miners, however, which is important.

So I have a solution which will not require changing the server code. I modified the benchmark code of rfv2_test.c to behave like a scan function and to skip the nonces we're not interested in. At first I saw that 50% of the time was spent computing CRCs on messages just to skip them 255 times out of 256. By changing only the last word (i.e. the nonce) and recalculating the CRC over just that final word, I can double the scan speed and spend 99% of the time in the hashing function computing 2-loop hashes. This is how my Skylake at 4 GHz now achieves 188 kH/s, up from about 3600 H/s without this. And yes, it will be faster for everyone this time, not just for a small bunch of cheaters.

The next step is to retrofit this into the cpuminer's scan function. I'm now uploading the updated rfv2_test.c code to illustrate what is needed.
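To give an idea of the shape of the change, here is a minimal sketch of the incremental-CRC scan idea (not the actual rfv2_test.c code): zlib's crc32() stands in for rfv2's internal CRC, and loops_from_seed() / rfv2_hash_seeded() are hypothetical placeholder names for the seed-to-loop-count mapping and the seeded hash call.

```c
#include <stdint.h>
#include <string.h>
#include <zlib.h>   /* crc32() used as a stand-in for rfv2's internal CRC */

/* Hypothetical placeholders, not the real rfv2 API: */
extern unsigned loops_from_seed(uint32_t seed);
extern void rfv2_hash_seeded(uint8_t *out, const uint8_t *msg, size_t len,
                             uint32_t seed);

/* Returns 1 and fills *found_nonce/out when a kept nonce hashes under the
 * target, 0 when the range is exhausted without a match. */
int scan_header(uint8_t hdr[80], uint32_t start, uint32_t count,
                const uint8_t target[32], uint8_t out[32],
                uint32_t *found_nonce)
{
    /* CRC of the fixed 76-byte prefix, computed once for the whole range. */
    uLong crc_prefix = crc32(0L, hdr, 76);

    for (uint32_t i = 0; i < count; i++) {
        uint32_t nonce = start + i;
        uint8_t tail[4];
        memcpy(tail, &nonce, sizeof(tail));     /* host byte order, for brevity */

        /* Fold only the changing last word into the prefix CRC. */
        uint32_t seed = (uint32_t)crc32(crc_prefix, tail, sizeof(tail));

        if (loops_from_seed(seed) != 2)
            continue;                           /* skip the ~255/256 slow nonces */

        memcpy(hdr + 76, tail, sizeof(tail));
        rfv2_hash_seeded(out, hdr, 80, seed);
        if (memcmp(out, target, 32) <= 0) {     /* crude target check, illustrative only */
            *found_nonce = nonce;
            return 1;
        }
    }
    return 0;
}
```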

Please accept my apologies for overlooking this design issue. In the end we're more or less back to the first step towards v2, which consisted in doubling the number of rounds and adding extra padding to be secure. We've still increased the rambox usage as well, but the extra protections like the floats and the variable number of loops are gone. This is the benefit of having multiple protections, though!

bschn2 commented 5 years ago

I have now also uploaded the updated rfv2_cpuminer.c file. On my Skylake it shows 46 MH/s, but you have to take into account that 255 out of 256 such hashes are skipped, so the real effective rate is 1/256 of what is displayed, i.e. about 180 kH/s.

jdelorme3 commented 5 years ago

Many thanks Mr Schneider for your openness. Don't be sad, all coins and algos get attacked. Your algo is resistant and it maintains fairness between all users, and that is much more important than what other algos like SHA-256 do, which is mined only by big corporations using tons of ASICs.

I just quickly tested rfv2_test on my Raspberry Pi 3B+ and it now says 47000 H/s. So I must divide by 256, is that it?

bschn2 commented 5 years ago

@jdelorme3 thank you for the kind words. No, rfv2_test's numbers are real! It's only cpuminer that reports the hash interval as a hashrate, because it expects nonces to be explored linearly. Pools, however, will report a performance level approximately matching rfv2_test.

47000 for such a device is not bad at all; our performance remains roughly flat across platforms under various conditions, which is reassuring. I'm now doing roughly 4 times better than you, while in the past I was only doing twice as well. I would have been worried if we'd gone to a 1:10 or 1:100 factor, but here it's perfectly fine. I'd be curious about @wtarreau's farm with his more modern ARM CPUs.

jdelorme3 commented 5 years ago

Wow, 47000 real for my RPi, this is nearly what some GPUs were doing on skypool a few days earlier!

wtarreau commented 5 years ago

OK guys, just two quick tests before going to bed

NanoPI-M4$ ./rfv2_test -b -t 6
91423 hashes, 1.000 sec, 6 threads, 91415.961 H/s, 15235.993 H/s/thread
91510 hashes, 1.000 sec, 6 threads, 91494.446 H/s, 15249.074 H/s/thread
91100 hashes, 1.000 sec, 6 threads, 91087.521 H/s, 15181.254 H/s/thread
91229 hashes, 1.000 sec, 6 threads, 91219.422 H/s, 15203.237 H/s/thread
91381 hashes, 1.000 sec, 6 threads, 91371.041 H/s, 15228.507 H/s/thread
91489 hashes, 1.000 sec, 6 threads, 91479.669 H/s, 15246.612 H/s/thread
91385 hashes, 1.000 sec, 6 threads, 91374.583 H/s, 15229.097 H/s/thread
NanoPI-Fire3$ ./rfv2_test -b -t 8
123646 hashes, 1.004 sec, 8 threads, 123146.150 H/s, 15393.269 H/s/thread
127629 hashes, 1.004 sec, 8 threads, 127123.050 H/s, 15890.381 H/s/thread
127779 hashes, 1.004 sec, 8 threads, 127272.582 H/s, 15909.073 H/s/thread
127883 hashes, 1.004 sec, 8 threads, 127375.663 H/s, 15921.958 H/s/thread
127838 hashes, 1.004 sec, 8 threads, 127330.334 H/s, 15916.292 H/s/thread
127931 hashes, 1.004 sec, 8 threads, 127424.615 H/s, 15928.077 H/s/thread
mcbin$ ./rfv2_test -b -t 4
81888 hashes, 1.000 sec, 4 threads, 81882.596 H/s, 20470.649 H/s/thread
89242 hashes, 1.000 sec, 4 threads, 89222.371 H/s, 22305.593 H/s/thread
89153 hashes, 1.000 sec, 4 threads, 89137.134 H/s, 22284.283 H/s/thread
89221 hashes, 1.000 sec, 4 threads, 89204.140 H/s, 22301.035 H/s/thread
89214 hashes, 1.000 sec, 4 threads, 89197.588 H/s, 22299.397 H/s/thread
89171 hashes, 1.000 sec, 4 threads, 89155.309 H/s, 22288.827 H/s/thread

So these were three in the end :-)

The ARM boards are now slower than PCs, however, likely because the rambox now dominates over the expensive instructions. The results are still good.

However, there is something I don't understand. If you skip 255 nonces out of every 256, how can you be sure you're not skipping the solution and that someone will actually find it?

bschn2 commented 5 years ago

Thanks for the numbers. @wtarreau there isn't a single solution but many; the probability that one is present among these nonces is the same as among any others. We're looking for hash collisions, in fact.

djm34 commented 5 years ago

Yes indeed, keeping only the loops = 2 case allows many simplifications, such as not needing to update the rambox at all. Applying these simplifications gives a hashrate around 200-300 kH/s (without any real optimization).

wtarreau commented 5 years ago

On Tue, May 14, 2019 at 07:00:51PM -0700, djm34 wrote:

Yes indeed, keeping only the loops = 2 case allows many simplifications, such as not needing to update the rambox at all.

Why would you not update the rambox in this case? It's used inside each round, so if you don't update it you will definitely produce bad hashes from time to time whenever the same location is visited multiple times.

Applying these simplifications gives a hashrate around 200-300 kH/s (without any real optimization).

Pretty good!

Willy

wtarreau commented 5 years ago

Just for fun I tried to optimize the code a little bit. My nanopi is now as fast as my laptop (~+30%) and uses much less memory. I measured that the number of changes made to the rambox stays low (about 180 per hash on average), which gives roughly 1 chance in 70000 of reusing a modified location. So I bypassed the changes and used a shared rambox to please my small cache. It now shows ~40 MH/s, which really means ~160 kH/s effective. I initially tried to compare the changes by hand but figured it's much more convenient to connect to a pool and see a share accepted!

There is something very interesting about the variable rambox boundaries, however: it's pointless to try to save an intermediate state, because the whole hash changes completely depending on the nonce thanks to this. So in practice there are always two full rounds.

I currently have the patches in a local development branch applied directly on top of cpuminer, I'll rebase my rfv2 repo and will re-integrate them there later today.

bschn2 commented 5 years ago

Excellent, Sir! I'm eager to integrate your work!

wtarreau commented 5 years ago

It's sent now (PR #35).

bschn2 commented 5 years ago

Yes we were talking past each other :-) Now merged, thank you!

itwysgsl commented 5 years ago

Hello @bschn2, big thanks for your work on this project, and thanks to @wtarreau for his support. I just built cpuminer and keep getting "Warning: rfv2_scan_hdr() returned invalid solution" error messages. Since I'm using macOS I ran into some compilation issues related to the lack of the htobe32 and le32toh functions (which I successfully copied from the cpuminer code, here is all the code).

Any thoughts on this issue? I suspect my macOS build fixes may break something, so I'm going to test a build on my Ubuntu VM.

itwysgsl commented 5 years ago

Ok, my bad. It now works like a charm with this implementation: https://gist.github.com/yinyin/2027912.

bschn2 commented 5 years ago

Great news Sir! Ideally we should rebuild the cpuminer patch set and try to get it merged upstream so that you don't need to maintain this fork anymore.

djm34 commented 5 years ago

On Tue, May 14, 2019 at 07:00:51PM -0700, djm34 wrote: Yes indeed, keeping only the loops = 2 case allows many simplifications, such as not needing to update the rambox at all.

Why would you not update the rambox in this case? It's used inside each round, so if you don't update it you will definitely produce bad hashes from time to time whenever the same location is visited multiple times.

The probability of multiple reads of the same location is very low, especially in the 2-loop case. You have roughly 50 rounds over a huge rambox...

itwysgsl commented 5 years ago

@bschn2 since we added some platform-specific functions like be32toh which are present only on Linux, I think we should include this header file in the repository to make it compatible with all systems. I tested it with cpuminer and it works like a charm :)

wtarreau commented 5 years ago

@itwysgsl we can do even simpler. I normally never use these macros; it's just that I lazily did it that way inside cpuminer, which already uses them, and didn't realize that by moving the code back to rfv2 I'd introduce this dependency. I'll replace them with generic functions. They're not even on a fast path, so there's no reason to bother anyone with such dependencies.
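For illustration, generic byte-order helpers of the kind meant here could look roughly like this (a sketch only, not necessarily what the PR will contain; the rf_ names are placeholders). Byte-by-byte assembly is endian-neutral and needs no platform header:

```c
#include <stdint.h>

static inline uint32_t rf_be32dec(const void *p)
{
    const uint8_t *b = (const uint8_t *)p;
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

static inline void rf_be32enc(void *p, uint32_t v)
{
    uint8_t *b = (uint8_t *)p;
    b[0] = v >> 24; b[1] = v >> 16; b[2] = v >> 8; b[3] = v;
}

static inline uint32_t rf_le32dec(const void *p)
{
    const uint8_t *b = (const uint8_t *)p;
    return  (uint32_t)b[0]        | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}
```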

wtarreau commented 5 years ago

PR #36 fixes this now.

MikeMurdo commented 5 years ago

Ideally we should rebuild the cpuminer patch set and try to get it merged upstream so that you don't need to maintain this fork anymore.

Later today I should have some time to work on this and update my pending PR.

MikeMurdo commented 5 years ago

Still late on my work, I'm afraid it won't be for today after all. Just wanted to let you know.

itwysgsl commented 5 years ago

Btw, yesterday @jareso (author of optimized private miner for rfv2) showed up in our discord and left this message:

Right now his miner produces, on average, 18 MH/s per miner. Just wanted to let you know :)

itwysgsl commented 5 years ago

Also, I tested RFv2 with the recent patches on my android phone, and it produces around 10 kH/s. https://github.com/itwysgsl/RainforestAndroidMiner/commit/ac744b3af58f239ea121d76157517b8538c5a0d2

bschn2 commented 5 years ago

@itwysgsl his response is totally valid regarding the fact that other algorithms get attacked as well; that is not what I'm contesting. It's the fact that we're all working on a way to better decentralize mining, and that this person has the skills to spot weaknesses before the algorithm is released, yet silently keeps them secret, causing lots of work for you and the pools at each release, when disclosing such issues early would result in a more robust initial design. So I stand by my words: his primary motive is to grab most of the shares, not to help with fairness.

With this said, I really doubt he achieves 18 MH/s per card; I suspect there are quite a number of cards per miner. The algorithm involves operations that are extremely fast on ARM, very fast on x86, and not natively implemented on many other platforms, where they thus cost more, so the base performance per core will be lower there. Anyway, this helps fuel my ideas for a v3 :)

I'm seeing on miningpoolstats that our latest update has done lots of good, with most miners going back to the public pools. Also there are mostly GPUs on zergpool and mostly CPUs on skypool. So far, so good.

Regarding your phone, it's ARMv7, right? If so it's pretty decent for such a platform, which has neither AES, IDIV, CRC nor native 64-bit operations! Let's wait for some feedback from users of more recent phones using ARMv8 (with CPUs like Snapdragon or Kryo).

MikeMurdo commented 5 years ago

@bschn2 the CPU vs GPU specialization of pools is natural, expected and nothing new: if you have low power (say, a CPU) you'd rather join a pool where users are mostly like you, so that you get a respectable share every time a block is found, even if that's not frequent. But if you have 40 GPUs and can expect to help a pool mine a block 3 times more frequently and take 50% of the shares, you'd rather join a pool already showing a high hash rate, because you'll get a big share frequently. And if you know you're concentrating most of the power, you'd rather join the pool with the lowest fees or mine solo.

And with few miners at the moment, MBC is extremely appealing to mine: even with moderate power you can expect a good share of the 12.5k reward every 2 minutes or so. I watched yesterday and estimated that with only 1 MH/s it was possible to make roughly $25/day. This is why those with enough power immediately jump onto such coins: they are extremely profitable at opening. Once more miners join, the revenue is spread more evenly and such high revenues can no longer be expected. At the moment it seems to go mostly to Jareso, though.

@bschn2: most android phones with armv8 seem to enable armv7a only: https://i.stack.imgur.com/7EF24.jpg https://i.stack.imgur.com/XCGxi.jpg However, as you can see, aes, idiv and crc are present there, and I think you may be able to enable them even in armv7 mode. I remember seeing a raspi somewhere running in armv7 mode and using CRC32 instructions. Possibly building with "-march=armv7-a+idiv+crypto+crc" would enable everything. Or maybe "-m32 -mcpu=cortex-a53+crypto+crc".

itwysgsl commented 5 years ago

@bschn2 I believe my phone has a Snapdragon 835 inside. Also, to make the android miner work I commented out this part, otherwise an error related to __builtin_clrsbll pops up.

wtarreau commented 5 years ago

It could indeed indicate an arch mismatch. Have you tried passing "-march=armv8-a+crypto+crc" for example when building, instead of "-march=native"?

itwysgsl commented 5 years ago

@wtarreau nope, but I can try right now.

itwysgsl commented 5 years ago

Here is the error message itself: error: definition of builtin function '__builtin_clrsbll' static. Probably this check doesn't have the appropriate condition to handle the android build 🤔

wtarreau commented 5 years ago

Can you please report the output of "gcc -v"? I guess the version check is indeed not correct. I wrote this part myself to try to build on older compilers. In the worst case you can simply disable the block and see if your perf is better with the new build options.

itwysgsl commented 5 years ago

@wtarreau I think Android Studio uses clang for compilation, but still, here is the output of gcc -v:

Apple LLVM version 10.0.0 (clang-1000.10.44.2)
Target: x86_64-apple-darwin18.2.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

wtarreau commented 5 years ago

Ah, got it! It's clang indeed, so it advertises v4.2.1. We'd need to check for the clang version then, as your version appears to include this builtin while the one I tested last time didn't. In the meantime it's harmless; feel free to comment out the whole block so you can retest the performance in armv8 mode with all extensions enabled.

itwysgsl commented 5 years ago

@wtarreau so, with the -march=armv8-a+crypto+crc flag the miner loses around 20-40% of performance.

wtarreau commented 5 years ago

Oh, that's really funny; probably some architectural optimizations are lost. From what I've found, the SD835 is an octa-A73 at up to 2.45 GHz, it must be awesome! I have a hard time imagining it's only a compiler issue. Please try "-mcpu=cortex-a73+crypto+crc" instead, and add -m64 to be certain it builds in 64-bit mode. Also double-check that you didn't lose the optimization level (typically -O3/-Ofast) when forcing the other flags.

MikeMurdo commented 5 years ago

I've updated the cpuminer PR with the latest speedups and fixes.

itwysgsl commented 5 years ago

@wtarreau with -mcpu=cortex-a73+crypto+crc I'm getting around 14 kH/s :) I can't compile with the -m64 flag yet since it causes some trouble with the sha256 implementation in cpuminer. Working on a fix now.

itwysgsl commented 5 years ago

So, I tried it, but with the -m64 flag the miner loses speed and doesn't produce any shares.

wtarreau commented 5 years ago

Ah so at least it proves it's currently running in 32-bit which explains the lower performance. If it's cpuminer you're using, you should build it with -DNOASM. Last time I updated the code there I even added a script "build-linux-arm" or something like this because the default options didn't work for me.

djm34 commented 5 years ago

Btw, yesterday @jareso (author of optimized private miner for rfv2) showed up in our discord and left this message:

Right now his miner produces, on average, 18 MH/s per miner. Just wanted to let you know :)

LOL, GPU dev on the brink of burnout

wtarreau commented 5 years ago

Guys, I've reviewed some details of the algo and also the performance ratios mentioned above. I think the main issue is that the whole rambox should be used, to make sure there is no way to bypass it or precompute it. The problem is that memory controllers nowadays are extremely fast and prefetch tons of data on large systems (high-end CPUs, GPUs), but are dumb and unable to prefetch much on cheap hardware. On the other hand, the cheap hardware has architectural optimizations (CRC32, AES) that large systems only partially have (AES on x86) and that GPUs don't have at all.

If we consider the instruction-equivalent latency, by carefully chaining all this it should be possible to make everyone run at a bounded speed. Let's take my Neo4's 180 ns RAM access time as a reference; this device has fast CPU cores with all extensions enabled. If we set a 200 ns target as the time for a single operation, this machine must spend 20 more ns doing things that other machines will be slower at. My Skylake has a 65 ns access time; it does have AES but lacks CRC32. Fine, let's pick the fastest CRC32 implementation and repeat it as needed to fill the budget. A raspi 3B+ shows 160 ns, and it does have CRC32 but lacks AES, so we can add maybe one or two AES calls so that AES+RAM+CRC still comes to 200 ns.

Note that the PC will be able to prefetch multiple accesses at once, so multiple threads will benefit from this. But we don't need to read lots of data; the bandwidth is not what matters here. Let's say we use a 64 MB work area (26 address bits). This can be reduced to 24 bits by addressing 32-bit words. An AES operation returns 128 bits, which is enough to produce 5 memory addresses to fetch data from. You also want to always write, so that there is no efficient way to keep a shared read-only copy of the work area.
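To illustrate the serialization idea (a sketch only, not rainforest code; the 64 MB size and the mixing constant are just for illustration): each address depends on the value just read, so the prefetcher cannot run ahead, and every visited slot is written back so a shared read-only copy is useless.

```c
#include <stdint.h>

#define WORK_BITS 24                          /* 16M 32-bit words = 64 MB */
#define WORK_SIZE (1u << WORK_BITS)

/* Caller allocates and fills a WORK_SIZE-word box; every step costs about
 * one full RAM round-trip because the next index depends on the last load. */
uint32_t dependent_walk(uint32_t *box, uint32_t seed, int steps)
{
    uint32_t v = seed;

    for (int i = 0; i < steps; i++) {
        uint32_t idx = v & (WORK_SIZE - 1);   /* address derived from previous value */
        uint32_t cur = box[idx];              /* serialized load: ~RAM latency each */
        v = (v ^ cur) * 2654435761u + 1;      /* cheap mix; AES/CRC would slot in here */
        box[idx] = v;                         /* force a write to the visited slot */
    }
    return v;
}
```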

From what I'm reading in a few places, GPUs like the 1080 Ti mentioned above seem to feature many memory channels (11, or 12 minus one to be more precise): https://www.anandtech.com/show/11180/the-nvidia-geforce-gtx-1080-ti-review https://www.anandtech.com/show/11180/the-nvidia-geforce-gtx-1080-ti-review/3 https://www.quora.com/Why-is-the-size-of-RAM-in-my-GPU-GTX-1080-Ti-not-in-the-power-of-2

The prefetch is a full 64-byte cache line, like on many other CPUs, so a GPU could be 12 times faster than a PC or an SBC thanks to this. Not much can be done by adding extra instructions to the loop, as these would be amortized over the huge number of cores. However, the memory size can limit the number of active cores: a GPU with 12 GB of RAM could only run 192 active threads with a 64 MB work area each. But even with only 192 cores at work, I expect they would still perform well on AES or CRC. In that case we're at 12 times the performance, though, not 100 times or so.

Am I overlooking anything?

itwysgsl commented 5 years ago

@wtarreau nice research, what do you think about creating a separate issue to discuss Rainforest improvements?

wtarreau commented 5 years ago

you're right, we're not that much in the speed area anymore :-)

bschn2 commented 5 years ago

Hello! Indeed, we'd rather move such a discussion somewhere else. A quick note though, as this /is/ related to this issue: manipulating large amounts of RAM does have an impact on the hash rate. The initialization time for your 64 MB rambox alone can take a while. Even ignoring that, say you want to modify your 64 MB one 32-bit word at a time, with a 200 ns access time: that's 16M x 200 ns = 3.2 seconds of access time alone, i.e. 0.3 H/s in the best case, which is not practical. If you only write 1M times, a later read has only a 6% chance of hitting an already-modified location. By treating them as 128-bit locations instead, you'd have 4M locations; at 1 million changes that's a 22% chance of collision, for 5 H/s. We could also decide to write full cache lines (64 bytes) so that one million writes do touch the whole rambox, but that is going to be expensive; even 512k writes would give a 40% chance of collision, for only 10 H/s. And at a 40% chance of collision, you can still use shared memory and scale a lot: with 3000 threads, even if only 60% of them provide usable shares, that's still 1800 threads, so the large memory doesn't protect us here.
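For the record, here is a small sketch that reproduces these back-of-the-envelope figures. The modelling assumption is mine: writes land on uniformly random slots, so the dirty fraction of an n-slot rambox after k writes is about 1 - exp(-k/n); the 200 ns access time is the figure quoted above.

```c
#include <math.h>
#include <stdio.h>

static void estimate(const char *label, double slots, double writes)
{
    double dirty = 1.0 - exp(-writes / slots); /* chance a later read hits a modified slot */
    double secs  = writes * 200e-9;            /* write time alone, at 200 ns per access */
    printf("%-36s ~%2.0f%% dirty, %.2f s -> %4.1f H/s max\n",
           label, dirty * 100.0, secs, 1.0 / secs);
}

int main(void)
{
    /* Rewriting all 16M 32-bit words would already cost 16M * 200 ns = 3.2 s,
     * i.e. ~0.3 H/s, before even counting the reads. */
    estimate("1M writes over 16M 32-bit slots", 16e6, 1e6);    /* ~6%,  5 H/s  */
    estimate("1M writes over 4M 128-bit slots",  4e6, 1e6);    /* ~22%, 5 H/s  */
    estimate("512k writes over 1M cache lines",  1e6, 512e3);  /* ~40%, 10 H/s */
    return 0;
}
```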

But let's open a new issue for this discussion.

MikeMurdo commented 5 years ago

There was a report of incorrect speed in cpuminer with my PR here: https://github.com/tpruvot/cpuminer-multi/pull/39. Is anybody aware of this?

wtarreau commented 5 years ago

On Sat, May 25, 2019 at 02:37:38PM -0700, MikeMurdo wrote:

There was a report of incorrect speed in cpuminer with my PR here: https://github.com/tpruvot/cpuminer-multi/pull/39. Is anybody aware of this?

Yes! I mentioned it when I brought the speedup patches. The problem is that the reported hash count is used to decide how often to print the stats, and if we reported the correct value, the stats output was printed 256 times too often. I've since figured out how to fix this; I have a patch somewhere, I'll send it eventually.

Willy

wtarreau commented 5 years ago

I finally uploaded my fixes in PR#39.

MikeMurdo commented 5 years ago

Perfect, I'll take it and rebuild my patch. Thanks man!

mcreator-kp commented 4 years ago

Btw, yesterday @jareso (author of optimized private miner for rfv2) showed up in our discord and left this message:

Right now his miner produces, on average, 18 MH/s per miner. Just wanted to let you know :)

Do you have the jareso experimental miner?