bschn2 / rainforest

The rainforest crypto currency algorithm
MIT License

RFv2 speed #15

Open itwysgsl opened 5 years ago

itwysgsl commented 5 years ago

Hello again @bschn2, I have concerns about v2 speed. Don't you think that rainforest became way too slow? My MacBook's Intel i7 produces ~30 H/s, and that's not enough to mine even one block at the lowest difficulty on the test network 😅

jdelorme3 commented 5 years ago

8000/s on my friend's ryzen 2700 at 16x4 GHz !!!

itwysgsl commented 5 years ago

Hello everyone, I'm setting up a new testnet with the fixed RFv2. I'm going to post some speed results soon.

wtarreau commented 5 years ago

On Sun, Apr 28, 2019 at 10:34:50AM -0700, iamstenman wrote:

Hello everyone, I'm setting new testnet with fixed RFv2. Going to post some speed results soon.

Does this mean there is an ip:port to connect to in order to get stuff to hash? I'm asking because when I tried sgminer on my Intel GPU some time ago there was no bench mode, and I had to connect to something more or less random that would provide work. I don't really know if that could have been part of the failures, so a public server could certainly help. It's just a question out of curiosity, I don't need any such thing right now anyway.

Thanks, Willy

itwysgsl commented 5 years ago

@wtarreau yes, exactly. It's a public test network whose main purpose is to test and tune software, etc.

P.s. Updated RFv2 produces around 2.5 kH/s on my MacBook's i7.

wtarreau commented 5 years ago

On Sun, Apr 28, 2019 at 11:24:33AM -0700, iamstenman wrote:

@wtarreau yes, exactly. It's a public test network whose main purpose is to test and tune software, etc.

OK, thanks for your explanation.

P.s. Updated RFv2 produces around 2.5 kH/s on my i7.

Ah, great!

Willy

jdelorme3 commented 5 years ago

And 2200 on my Raspberry Pi 3B+! Almost as fast as the Macbook! Thank you guys for your amazing work!!

bschn2 commented 5 years ago

Cc: @itwysgsl @djm34 for their implementations

Gentlemen,

I found another remaining issue which really had to be addressed. @itwysgsl I'm aware it's late for your deployment but better fix late than too late.

I was looking at the reasons for significant speed variations and figured that we didn't modulate the rambox access probability based on the number of loops, so some hashes were performing few changes to the rambox and others were performing a lot. My concern was not so much the variations as the likelihood that some miners could try to distribute hashes by arranging them based on the expected rambox changes, and try to combine them to minimize the amount of memory needed.

I have addressed this so that the probability is adjusted based on the number of loops, and now the amount of changes per hash varies much less (about 750 to 1350) and the performance is much more stable. I have updated the C and CL code. I also updated the test vectors.

Many thanks for your understanding, Bill.
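The idea of tying the access probability to the loop count can be pictured with a small sketch. This is not the RFv2 code: the function names, `base_p`, and `ref_loops` are hypothetical, and the scaling rule is only an assumed toy model in which hashes with fewer loops get a proportionally higher per-loop probability so that their expected number of rambox changes does not collapse.

```python
def access_probability(loops, base_p, ref_loops):
    """Hypothetical per-loop probability of touching the rambox.

    Hashes running at least ref_loops loops keep base_p unchanged; shorter
    ones get a proportionally higher probability, capped at 1.0.
    """
    if loops >= ref_loops:
        return base_p
    return min(1.0, base_p * ref_loops / loops)


def expected_changes(loops, base_p, ref_loops, modulated=True):
    """Expected number of rambox changes for one hash under this toy model."""
    p = access_probability(loops, base_p, ref_loops) if modulated else base_p
    return loops * p


if __name__ == "__main__":
    # Compare a flat probability with the modulated one for short hashes.
    for loops in (100, 200, 400):
        flat = expected_changes(loops, 0.1, 400, modulated=False)
        scaled = expected_changes(loops, 0.1, 400, modulated=True)
        print(loops, flat, scaled)
```

With a flat probability the expected change count shrinks with the loop count, while the modulated version keeps it level for short hashes. The real bounds reported in this thread (about 750 to 1350 changes per hash) come from the reference implementation, not from this model.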

jdelorme3 commented 5 years ago

OK just tried, was worried about performance changes, and it remains the same, even more stable: 2160 to 2180 on my Raspberry Pi 3B+. The 2200 I reported was an average; it was oscillating from 2100 to 2300 (10%) but now it's better (1%). Excellent work man!

bschn2 commented 5 years ago

Many thanks for checking, Sir!

itwysgsl commented 5 years ago

@bschn2 oh, how critical is this issue?

bschn2 commented 5 years ago

Hello Sir! "critical" is not the term, "concerning" is more suitable. But it typically is an overlooked design issue which can result in weakening the hashing difficulty for some hashes and, just like the previous CRC padding issue, who knows to what extent it will be exploited once software starts to squeeze the last nanoseconds out of a hash? I definitely don't want to see people start to merge their ramboxes between many threads by precomputing parts of the hashes and distributing them accordingly. For your usage, even if the schedule is tight, I think it's less troublesome to address it now than what it can result in if you have to hard fork again in 3 months.

itwysgsl commented 5 years ago

I see, thanks for info @bschn2 !

djm34 commented 5 years ago

hmm with the current change I am getting 300 H/s on gpu... (sure, there are probably ways to optimize, but it isn't a lot)

bschn2 commented 5 years ago

Hello Sir! This is not expected at all, it reminds me of when the RFV2_RAMBOX_HIST value was too low in the past. Are you sure it's at least 1536 as in the reference implementation? The average number of rounds didn't change here, we're only making sure that the situations with fewer rounds maintain a proportional rambox access frequency. Most iterations have a large number of loops and experience no change since the limit remained the same.

itwysgsl commented 5 years ago

@bschn2 nodes are updated and a new release is published

bschn2 commented 5 years ago

Well done, Sir! I feel better with this addressed.

djm34 commented 5 years ago

It is probably due to my implementation. I don't overwrite the rambox; instead I locally change the "prev" array, whose use I have modified. The main problem is that it has to scan through all the changes to see if something has been modified. (However, I couldn't manage to run too many ramboxes in parallel.)

bschn2 commented 5 years ago

I guess you're speaking about the for() loop here : https://github.com/MicroBitcoinOrg/CudaMiner/blob/4bc1f0cd17aff6979a337d52adce5b249a7b17d6/rainforest/rainforest_function.h#L569

Indeed, it will ruin performance. The purpose of the rambox was exactly to be fast (so fast that it's sensitive to memory latency), so any loop there will have a disastrous effect. I'm interested in seeing how this evolves over time: some implementations might, for example, decide to sacrifice per-thread performance by doing hash-table lookups in order to support many threads, while others might prefer to limit the number of threads and maximize their performance. Right now I'd suggest sticking to the latter approach, which is the same as the CPU code's. It allows running about 10 threads per GB of RAM, which means roughly 40 threads on a 4 GB graphics card. This is already way more than what a standard CPU will do, and should likely result in comparable to slightly better performance.
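The "about 10 threads per GB" figure can be checked with a quick calculation. The sketch below assumes that each thread needs one private rambox of 96 MB (the per-thread size mentioned later in this thread) and that nothing else competes for the memory; the function name is made up for illustration.

```python
def max_threads(memory_mb, rambox_mb=96):
    """Number of private ramboxes of rambox_mb that fit in memory_mb.

    Assumes the rambox is the only significant per-thread allocation.
    """
    return memory_mb // rambox_mb


if __name__ == "__main__":
    print(max_threads(1024))  # 1 GB of RAM
    print(max_threads(4096))  # 4 GB graphics card
```

Under these assumptions 1 GB holds 10 ramboxes and 4 GB holds 42, which matches the "roughly 40 threads" estimate once some memory is left for everything else.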

djm34 commented 5 years ago

what is the performance on cpu btw?

tbh, I am a bit reluctant to modify the rambox, because a second access to a same index doesn't occur so often

bschn2 commented 5 years ago

On my SkyLake 4.0 GHz with 8 threads:

./rfv2_test -b -t 8
3953 hashes, 1.094 sec, 8 threads, 3611.771 H/s, 451.471 H/s/thread
3828 hashes, 1.000 sec, 8 threads, 3827.843 H/s, 478.480 H/s/thread
3808 hashes, 1.000 sec, 8 threads, 3807.912 H/s, 475.989 H/s/thread
3863 hashes, 1.000 sec, 8 threads, 3862.911 H/s, 482.864 H/s/thread
3831 hashes, 1.000 sec, 8 threads, 3830.923 H/s, 478.865 H/s/thread

@jdelorme3 mentioned around 2200 H/s on his Raspberry Pi.

wtarreau commented 5 years ago

Hi!

Some numbers from my i6700K at 4.4:

$ ./rfv2_test -b -t 8
4301 hashes, 1.128 sec, 8 threads, 3814.046 H/s, 476.756 H/s/thread
4217 hashes, 1.000 sec, 8 threads, 4216.827 H/s, 527.103 H/s/thread
4164 hashes, 1.000 sec, 8 threads, 4163.900 H/s, 520.488 H/s/thread
4236 hashes, 1.000 sec, 8 threads, 4235.898 H/s, 529.487 H/s/thread
4135 hashes, 1.000 sec, 8 threads, 4134.909 H/s, 516.864 H/s/thread
4162 hashes, 1.000 sec, 8 threads, 4161.892 H/s, 520.236 H/s/thread
4143 hashes, 1.000 sec, 8 threads, 4142.905 H/s, 517.863 H/s/thread

NanoPI M4 base (2xA72 at 2.0 + 4xA53 at 1.5, RAM defaults):

$ ./rfv2_test -b -t 6
4366 hashes, 1.578 sec, 6 threads, 2766.324 H/s, 461.054 H/s/thread
3874 hashes, 1.000 sec, 6 threads, 3872.885 H/s, 645.481 H/s/thread
3686 hashes, 1.000 sec, 6 threads, 3684.979 H/s, 614.163 H/s/thread
3675 hashes, 1.000 sec, 6 threads, 3673.755 H/s, 612.292 H/s/thread
3640 hashes, 1.000 sec, 6 threads, 3638.239 H/s, 606.373 H/s/thread
3727 hashes, 1.001 sec, 6 threads, 3725.085 H/s, 620.848 H/s/thread

NanoPI M4 tuned (2xA72 at 2.1 + 4xA53 at 1.7, RAM at 928 MHz):

$ ./rfv2_test -b -t 6
5452 hashes, 1.371 sec, 6 threads, 3977.141 H/s, 662.857 H/s/thread
5360 hashes, 1.000 sec, 6 threads, 5359.234 H/s, 893.206 H/s/thread
5361 hashes, 1.000 sec, 6 threads, 5360.480 H/s, 893.413 H/s/thread
5367 hashes, 1.000 sec, 6 threads, 5366.501 H/s, 894.417 H/s/thread
5388 hashes, 1.000 sec, 6 threads, 5387.467 H/s, 897.911 H/s/thread
5422 hashes, 1.000 sec, 6 threads, 5421.458 H/s, 903.576 H/s/thread

Nanopi Fire3 (8xA53 at 1.6 GHz):

$ ./rfv2_test -b -t 8
6061 hashes, 2.582 sec, 8 threads, 2347.529 H/s, 293.441 H/s/thread
6212 hashes, 1.004 sec, 8 threads, 6187.534 H/s, 773.442 H/s/thread
6248 hashes, 1.004 sec, 8 threads, 6223.356 H/s, 777.919 H/s/thread
6131 hashes, 1.004 sec, 8 threads, 6106.902 H/s, 763.363 H/s/thread
6223 hashes, 1.004 sec, 8 threads, 6198.497 H/s, 774.812 H/s/thread
6415 hashes, 1.004 sec, 8 threads, 6389.735 H/s, 798.717 H/s/thread
6174 hashes, 1.004 sec, 8 threads, 6149.696 H/s, 768.712 H/s/thread

MacchiatoBin (4xA72 at 2.0 GHz):

$ ./rfv2_test -b -t 4
4068 hashes, 1.115 sec, 4 threads, 3649.455 H/s, 912.364 H/s/thread
4473 hashes, 1.000 sec, 4 threads, 4472.544 H/s, 1118.136 H/s/thread
4435 hashes, 1.000 sec, 4 threads, 4434.698 H/s, 1108.675 H/s/thread
4525 hashes, 1.000 sec, 4 threads, 4524.267 H/s, 1131.067 H/s/thread
4437 hashes, 1.000 sec, 4 threads, 4436.299 H/s, 1109.075 H/s/thread
4524 hashes, 1.000 sec, 4 threads, 4523.290 H/s, 1130.822 H/s/thread
4494 hashes, 1.000 sec, 4 threads, 4493.286 H/s, 1123.321 H/s/thread

UP Board (Atom X5-8350 at 4x1.68 GHz):

$ ./rfv2_test -b -t 4
759 hashes, 1.320 sec, 4 threads, 574.814 H/s, 143.704 H/s/thread
777 hashes, 1.000 sec, 4 threads, 776.880 H/s, 194.220 H/s/thread
758 hashes, 1.000 sec, 4 threads, 757.931 H/s, 189.483 H/s/thread
774 hashes, 1.000 sec, 4 threads, 773.927 H/s, 193.482 H/s/thread
760 hashes, 1.000 sec, 4 threads, 759.930 H/s, 189.983 H/s/thread

Hoping this helps!

djm34 commented 5 years ago

ok around 18kh/s with some corners cut... (gives from time to time a non validated result)

wtarreau commented 5 years ago

On Fri, May 03, 2019 at 06:19:01AM -0700, djm34 wrote:

ok around 18kh/s with some corners cut... (gives from time to time a non validated result)

That's impressive! What hardware is this ?

Willy

djm34 commented 5 years ago

1080ti (this is pretty much the limit of the current implementation) with 2^13 threads

well probably not that impressive considering the hardware... a cpu with 32/64 threads will beat that easily

wtarreau commented 5 years ago

On Fri, May 03, 2019 at 06:57:14AM -0700, djm34 wrote:

1080ti (this is pretty much the limit of the current implementation) with 2^13 thread

Wow! This must be a monster! How do you deal with the rambox with so many threads?

Willy

djm34 commented 5 years ago

The rambox is common to all threads. Locally, I match the rambox index against the ctx->changes index, and both write and read changes to the rambox through ctx.prev[]. The local index stores only rambox_idx/16 (assuming that consecutive index accesses to the rambox are rare, which is true in most cases) in a uint16_t, which allows running more threads.

I actually saw some hash table implementations which should do a better job (assuming they work) than my fast implementation.
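To make the trade-off concrete, here is a minimal sketch of the two approaches mentioned: a shared read-only rambox with a per-thread change log that is scanned on every read, versus a per-thread hash table. It is not djm34's CUDA code; the class names are hypothetical and the rambox_idx/16 compaction into uint16_t is deliberately left out.

```python
class LinearOverlay:
    """Per-thread view that logs changes in a list and scans it on each read."""

    def __init__(self, shared):
        self.shared = shared   # rambox shared by all threads, never written
        self.changes = []      # (index, value) pairs, oldest first

    def write(self, index, value):
        self.changes.append((index, value))

    def read(self, index):
        # Newest change wins, so scan from the end; cost grows with the log.
        for idx, value in reversed(self.changes):
            if idx == index:
                return value
        return self.shared[index]


class HashOverlay:
    """Same idea, but the change log is a hash table keyed by index."""

    def __init__(self, shared):
        self.shared = shared
        self.changes = {}

    def write(self, index, value):
        self.changes[index] = value

    def read(self, index):
        return self.changes.get(index, self.shared[index])


if __name__ == "__main__":
    shared = tuple(range(8))
    for view in (LinearOverlay(shared), HashOverlay(shared)):
        view.write(3, 99)
        print(view.read(3), view.read(4), shared[3])
```

Both views return the same values and leave the shared rambox untouched. The difference is the read cost: the list scan is proportional to the number of logged changes, which is the kind of loop described above as ruining performance, while the hash table trades that for extra memory and lookup overhead per access.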

bschn2 commented 5 years ago

Well, this sounds like a really great optimization, Sir! I'm really glad you manage to maintain a high enough performance level on this GPU without them being excessively high, it preserves the users' investment while offering great opportunities for newcomers to fairly compete with these setups using low power devices.

jdelorme3 commented 5 years ago

Hehe, 18k on a 1080Ti, this is only 9 times higher than my Raspberry Pi 3B+! With only 9 boards I have the same performance with only 30 watts! MBC will be the best coin to mine :-)

bschn2 commented 5 years ago

@jdelorme3 the CUDA implementation is still fairly recent and needs to work around some tradeoffs built into the rambox to limit scalability. But history and experience tell us that such optimizations will come and this performance will increase. With this said, your point is currently valid, and it has been my very first goal while working on this since last year. Further, look at @wtarreau's numbers on the Nanopi Fire3. This board costs the same as your board (around $35) and is about 3 times faster. It is an important point to consider when building a new setup, as it will further save energy and reduce electronic waste. And mining on your smartphone could be even twice as fast depending on the number of cores and their frequency.

wtarreau commented 5 years ago

FYI in PR #34 I fixed the time measurement causing the low value for the first line in bench mode that can be observed above in my test reports. It only affects the startup time, so the performance reported on subsequent lines is correct.
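For anyone comparing the bench-mode reports quoted in this thread, the steady-state rate can be summarized by skipping the first line of each run, since it includes the startup time. The sketch below only assumes the line format visible in the reports above; the function name and regular expression are made up for illustration and are not part of rfv2_test.

```python
import re

# Matches lines such as:
# "3953 hashes, 1.094 sec, 8 threads, 3611.771 H/s, 451.471 H/s/thread"
LINE_RE = re.compile(
    r"(\d+) hashes, ([\d.]+) sec, (\d+) threads, ([\d.]+) H/s, ([\d.]+) H/s/thread"
)


def steady_state_rate(output):
    """Mean total H/s over all report lines except the first one."""
    rates = [float(m.group(4)) for m in LINE_RE.finditer(output)]
    if len(rates) < 2:
        raise ValueError("need at least two report lines")
    steady = rates[1:]  # drop the first line, skewed by startup time
    return sum(steady) / len(steady)


if __name__ == "__main__":
    skylake = """\
3953 hashes, 1.094 sec, 8 threads, 3611.771 H/s, 451.471 H/s/thread
3828 hashes, 1.000 sec, 8 threads, 3827.843 H/s, 478.480 H/s/thread
3808 hashes, 1.000 sec, 8 threads, 3807.912 H/s, 475.989 H/s/thread
3863 hashes, 1.000 sec, 8 threads, 3862.911 H/s, 482.864 H/s/thread
3831 hashes, 1.000 sec, 8 threads, 3830.923 H/s, 478.865 H/s/thread
"""
    print(round(steady_state_rate(skylake), 1))  # about 3832.4 H/s
```

Lines that do not match the report format, such as the command line itself, are simply ignored.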

itwysgsl commented 5 years ago

@bschn2 @wtarreau we successfully switched to rfv2 few minutes ago!

wtarreau commented 5 years ago

On Tue, May 07, 2019 at 02:24:00AM -0700, iamstenman wrote:

@bschn2 @wtarreau we successfully switched to rfv2 few minutes ago!

Oh, congratulations!

Willy

itwysgsl commented 5 years ago

Little update on situation after algo switch: at this moment we again have around 98% of hashrate from unknown source, and I have no idea why :(

wtarreau commented 5 years ago

On Sun, May 12, 2019 at 03:25:01AM -0700, iamstenman wrote:

Little update on situation after algo switch: at this moment we again have around 98% of hashrate from unknown source, and I have no idea why :(

Just for my understanding, what does this mean exactly? Is it that someone is doing most of the work? If you know it's an unknown source, I'm assuming you can see the source(s). Are you able to tell whether this comes from multiple sources or a single one? Could this mean that, for example, there are too few people mining, and that the only ones present are the ones with available hardware to throw at this task and are collecting all the results?

Willy

itwysgsl commented 5 years ago

@wtarreau I consider all hashrate not coming from public pools as "unknown source". There is a website which allows monitoring that. I'm not saying it's a bad or good thing (or the fault) of RFv2 (I'm pretty sure the algo is fine), just letting you know how things are going.

wtarreau commented 5 years ago

On Sun, May 12, 2019 at 04:10:19AM -0700, iamstenman wrote:

@wtarreau I consider all hashrate not coming from public pools as "unknown source". There is a website which allows monitoring that. I'm not saying it's a bad or good thing (or the fault) of RFv2 (I'm pretty sure the algo is fine), just letting you know how things are going.

OK I see. How is the global rate computed if some pools are not public ? Also, is there any benefit for not going through the public ones ? For example let's imagine that a huge mining farm made of big hardware like you said was common doesn't want to pay a fee to a pool, it probably makes sense not to use them and save a few percent, especially if this small difference makes the difference with the electricity and cooling bill ? Sorry for looking a bit dumb here, just trying to understand motivations behind technical choices.

I'm seeing that the reported global hashrate (47 MH/s) roughly corresponds to what can be achieved with 10000 smartphones. Thus if you only had 10k users willing to run an application this could rebalance better.

Willy
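The estimate above is simple division: 47 MH/s spread over 10000 phones implies about 4700 H/s per phone. The sketch below repeats the calculation for other per-device rates quoted in this thread (about 2200 H/s for the Raspberry Pi 3B+, roughly 6200 H/s for the Nanopi Fire3); the function name is made up, and the rates are approximations taken from the reports above.

```python
def devices_needed(network_hs, device_hs):
    """Smallest whole number of devices whose combined rate reaches network_hs.

    Both rates are integers in H/s; ceiling division avoids float rounding.
    """
    return -(-network_hs // device_hs)


if __name__ == "__main__":
    network = 47_000_000  # 47 MH/s reported global hashrate
    for name, rate in (("smartphone", 4700), ("Raspberry Pi 3B+", 2200),
                       ("Nanopi Fire3", 6200)):
        print(name, devices_needed(network, rate))
```

This only shows how many devices would equal the current network rate; it says nothing about how rewards would actually be distributed among miners.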

itwysgsl commented 5 years ago

@wtarreau yeah, probably it's just big farm (it's most obvious explanation at the moment).

MikeMurdo commented 5 years ago

Few hours ago I submitted a PR for cpuminer-multi to integrate the new patches. It should boost adoption. GPUs tend to get most of the hashrate when a coin opens because big farms are readily available and you only need one of them picking the existing implementation and deploying massively to hundreds of nodes, while CPU ones take time to be noticed and adopted by the public. If GPUs are too strong initially it can scare CPU users who think they'll get too small shares. But if only 10k smartphones are needed to rebalance this, it will take time but will eventually work. How about we try to get the AMD miner up and running by the way ?

itwysgsl commented 5 years ago

@MikeMurdo working on sgminer right now, but not much is done by now (I don't even have AMD gpu for tests 😄).

jdelorme3 commented 5 years ago

@MikeMurdo nice, once merged I can write another article showing how to mine MBC with a Raspberry Pi 3B+ and mainline cpuminer. So much power at this cost, everyone should mine, at least to pay for electricity and a bit more!

bschn2 commented 5 years ago

Hello! @MikeMurdo is right about the risk of an initial strong start from GPUs, even though that's always a stressful period. However, what is wrong with the sgminer implementation? Doesn't it work properly? I concede I didn't spend as much time on it as on the rest of the code, but I did my best to keep it up to date.

itwysgsl commented 5 years ago

@bschn2 main issue with AMD miner development - lack of experience and AMD gpu for tests 😅

bschn2 commented 5 years ago

Big shiver down my spine now : looking at the sgminer patch, it's the old one for RFv1!!! I don't know where I've put all the tedious work I did for RFv2, I hope it's not lost as I experienced lots of difficulties allocating private 96 MB areas to each thread, I hope I won't have to do it again! What scares me is that on my son's PC I boot on a USB stick and develop in /tmp which is mounted into RAM. I must have placed this code somewhere. I hope it didn't get lost when renaming the branches.

itwysgsl commented 5 years ago

@bschn2 just in case, you can check whole history of rfv2.cl file changes here https://github.com/bschn2/rainforest/commits/master/rfv2.cl

wtarreau commented 5 years ago

On Sun, May 12, 2019 at 04:48:25AM -0700, bschn2 wrote:

I hope it didn't get lost when renaming the branches.

There is no such reason, and in any case your old work is still present. If you want to recheck all your branches, you can :

Once you find the commit, do two things :

Good luck! Willy

itwysgsl commented 5 years ago

Seems like I figured out the source of the unknown hashrate. A while ago, miners with the tag ccminer/jareso-experimental (ccminer is an Nvidia miner) joined one of the public pools, and they have almost 5 MH/s per 8 GPUs. Probably the rest of the miners use the same software.

On average, 1 Nvidia GPU gets around 551.8 kH/s.

wtarreau commented 5 years ago

On Sun, May 12, 2019 at 06:44:27AM -0700, iamstenman wrote:

Seems like I figured out the source of the unknown hashrate. A while ago, miners with the tag ccminer/jareso-experimental (ccminer is an Nvidia miner) joined one of the public pools, and they have almost 5 MH/s per 8 GPUs. Probably the rest of the miners use the same software.

Seems huge per GPU. Do you think these are the same types of GPUs with 3000+ cores as djm34's ? He saw ~18 kH/s on his device, which already looked big to me. Also I'm wondering how they deal with the rambox with so many threads.

Willy

itwysgsl commented 5 years ago

@wtarreau so far I have no details about this implementation, but I'm also pretty sure it's using regular Nvidia GPUs.

bschn2 commented 5 years ago

Gentlemen, I have uploaded a branch called "opencl-recovery" which contains an incomplete patch I found on my USB stick. It contains the C init code which performs the allocations and calls the OpenCL kernel. It still features the RFv1 C code, which I replaced last in the update process, so I take it as the backup I made after I saw the first valid hashes and before I replaced the C implementation. Also, it does not contain the CL file, which is a good hint that it's recent enough to rely on the official one, which only needs to be copied. I'm sharing this patch despite its being incomplete in the hope that it can help anyone (especially you, @itwysgsl), as I don't feel like doing it again for now, I'm afraid. Please all accept my apologies for not having been careful enough :(

MikeMurdo commented 5 years ago

@itwysgsl "On average, 1 Nvidia GPU gets around 551.8 kH/s." Seems a fairly high ratio to CPUs; very likely what they present as a single GPU is in fact a single CPU node with 5-7 GPUs connected to it. We've seen huge rigs on this coin in the past, and it's certain that the short block life encourages throwing massive hardware at it to win the race; if you can't manage to get a single share in one minute, you've wasted energy for nothing. Conversely, while there are few users, this energy is well invested when you can make 90% of the shares. So it's possible that your environment-friendly coin is responsible for a few tens of kW at its beginning, which is kinda amusing :)