Hello Sir, thank you very much, that's just in time; there is currently an open issue on this.
Hello,
I was able to fix the build errors on Windows simply by replacing const void *msg with const char *msg in static void rf256_update(rf256_ctx_t *ctx, const char *msg, size_t len). I also had to use __asm__ instead of asm with my MSVC 2017. Finally, the headers needed to change:
#ifdef _WIN32
#include <io.h>
#include <windows.h>
#else
#include <unistd.h>
#endif
Hello Sir,
The void vs char issue should be addressed with the latest update. For __asm__ vs asm, OK, I'll update it. Thank you for the includes. I would appreciate it if you could send a pull request with these fixes; I'll happily merge it. Thank you Sir!
Bill, please see PR #5 :)
Hello Sir, many thanks to you and @aivve. The pull request has now been merged.
You really should check my suggestions before implementing them, because I am struggling to make sgminer work: I managed to compile it with MSVC 2017 (I had to use a fork updated for MSVC 2015), but it doesn't produce correct hashes and keeps repeating rainforest_regenhash. So I am not sure whether my Windows build changes are to blame.
Update: here is the last commit
Damn, you're right, the diffs for cpuminer and sgminer are wrong. @MikeMurdo you missed the last replacement of msg to msg8 in rf256_update(), so the final word restarts from the beginning of the hash. Let me fix this before people start to pull that.
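For the record, the faulty pattern looks roughly like this (a sketch; rf256_one_round() is just an illustrative name and the real per-word processing is elided):

static void rf256_update(rf256_ctx_t *ctx, const void *msg, size_t len)
{
    const uint8_t *msg8 = msg;   /* byte pointer; also avoids void* arithmetic */

    while (len >= 4) {
        /* ... process one 32-bit word and advance ... */
        msg8 += 4;
        len  -= 4;
    }
    /* The bug: the final word was still read through <msg>, i.e. from the
     * start of the message, instead of through the advanced <msg8>: */
    /* wrong: rf256_one_round(ctx, *(const uint32_t *)msg);  */
    /* fixed: rf256_one_round(ctx, *(const uint32_t *)msg8); */
}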
Please apply this patch: https://github.com/bschn2/rainforest/commit/1b45e8ce91eb94d32d49eaed5c6053f7e6b4a306 or simply pull again. I'm quite confident the issue should be gone now. Otherwise I'm afraid I'll have to re-set up my lab to retest all this.
Apparently either there are still errors in my Windows version of sgminer, or the code in rainforest.cl and rainforest.c differs somehow, because the error persists. It would be good if someone could report whether it's working for them.
So I'll have to re-assemble my lab to give it a try. The main issue you can run into on Windows is the use of long, since it's 32-bit there. You should try to compile at -O0. If it suddenly works, it means some optimization doesn't work there (possibly an aliasing issue). Do you have any build warnings reported by the compiler?
Sir, I managed to partially reconnect my system but I'm getting hardware errors after regen_hash(), which I suspect could in fact be bad nonces or bad nonce verification at the end. I've seen the code has changed significantly; it will take me a bit longer to figure out exactly what's wrong there. Given that I noticed the original patch broke the OpenCL part with this final_s[] instead of hash[], I suspect a faulty cleanup when getting the patches ready for the final merge.
Oh dear, it's going to be a very long weekend again, I'm afraid. I'll keep you updated.
I am sorry for making you spend time on this. The only difference from the version patched by @jdelorme3 is the change of ((ulong*)final_s)[7] to ((ulong*)hash)[3], plus adding the missing #define SWAP4(x) as_uint(as_uchar4(x).wzyx). I thought it was just my particular build setup, but since you got the same errors it apparently isn't, although I'm told it works fine on Mac OS, so I'm confused. It's worth mentioning that I was trying to compile it on a modified version of sgminer, adapted to the newer MSVC 2015. So I will try to install Linux and compile there, or use other build options on Windows. I am trying to modify rainforest to work with CryptoNote, but so far sgminer doesn't even work as is. The algo itself works perfectly in cpuminer and in the built-in CPU daemon miners for CryptoNote.
Sir, you don't have to be sorry. I failed to release a properly working initial patch; that's what caused the confusion and code fragmentation, and I have to fix it. I will reactivate the debug code and figure out what is wrong. sgminer is not easy to work with due to the distributed nature of the code: both the .c and the .cl have to agree, which provides a double check, but it's easy to remember having fixed something while the fix actually went into the wrong file. I also realize that there is a divergence between the reference rainforest.cl and the patch; this is not good and I need to address it as well. Likely I'll update the sgminer patch so that it no longer contains it, to avoid any further confusion.
Regarding CryptoNote, what type of changes are necessary in your case? Do you need to reintegrate some changes into the reference implementation?
The changes are basically only in scanhash_rf256 for cpuminer, and in search and regenhash for sgminer. It is also necessary to add rainforest conditions, the same as for cryptonight. I suspect something in rainforest.cl gives a different result than the C code, so before making a CryptoNote version I want to get a working version for Bitcoin-like coins.
Hello @bschn2, I noticed that you are all chatting here :) Just wanted to let you know that today our network successfully switched to rainforest. So now you can get some real-world data for your algo. Thanks for your hard work!
Hello again,
I am sure the problem is with my sgminer, because the compiled version gives invalid shares on every algo I tried. I will start over with different build tools.
Are you using blocks at least 80 bytes long? The CL code (and maybe the C as well) was optimized with the padding commented out, since it is not necessary beyond something like 72 or 76 bytes, I don't remember exactly to be honest. So with 80 we are certain we can disable it, but below that you need to uncomment that code. @itwysgsl thank you for your kind words and for this info. sgminer is quite difficult to test without pools, so this will definitely simplify the process. I'm glad it suits your needs, and I hope that at your level you will be a significant actor of this change in the cryptomining landscape.
Hi, I don't think there's something wrong with the blocks.
Hello @bschn2,
If I wanted to use the hash twice, i.e. feed the output of the first pass (length 64) into a second pass, then I suppose I would need to uncomment that "optimized" code. Not all coins have an 80-byte input; for example, cpuminer won't mine correctly on the cryptonight algo if the block template contains more than 127 transactions.
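Schematically, something like this (just a sketch, assuming the reference rf256_hash() entry point with its 256-bit output):

#include <stddef.h>
#include <stdint.h>

void rf256_hash(void *out, const void *in, size_t len); /* reference entry point */

static void rf256_double(uint8_t out[32], const void *in, size_t len)
{
    uint8_t h1[32];

    rf256_hash(h1, in, len);         /* first pass over the block data */
    rf256_hash(out, h1, sizeof(h1)); /* second pass over the first digest */
}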
Hi there, user @wtarreau has made some optimisations and gets almost 10 times the performance on his NanoPi. He sent me his patch and I get nearly 4 times on my Raspberry Pi 3B+. I asked him to submit his patches here. I saw that WildRig optimised their code 7-8 times; possibly these are the same optimisations, so GPUs will again be 10 times faster than CPUs, not 100 times. Edit: the report is here: https://www.linkedin.com/feed/update/urn:li:activity:6487705540213374976
Hi, that's very interesting. I wonder if using the hash twice would reduce the GPU advantage.
Hi, yes indeed I played a bit this weekend with cpuminer upon Julien's request and found it very efficient at freezing my overclocked devices. I was then interested in looking closer at what the algo was doing to pull that much on the CPU, and found that half of the CPU time was lost in memcpy(). Not being familiar with the utility, it took me a while to figure out that it was the memcpy() used to initialize the rf256_ctx before attempting any hash. I counted and saw that in the worst case we'd have ~384 writes for about 80 bytes in, so it was not worth copying 2048 words. The problem is that the Cortex A53 has only a 64-bit read bus between the L1 cache and the datapath (128-bit for writes, however). So we were wasting slightly more than 1024 cycles just copying this. Thus keeping a history instead made sense. After this I tried to reduce the number of rounds, which in turn limited the history depth. And now my NanoPi Neo4 is doing 14-15 kH/s instead of 1.5k. And my devices are even more stressed, which allowed me to figure out their most reliable frequency :-) Julien found these patches to significantly improve his RPi as well (~3-4 times I think, but this device has no crypto extensions so that's already huge). So I accepted to take a bit of time to clean all this up, write less cryptic commit messages and upload them. The code is here in the arm-optims branch: https://github.com/wtarreau/cpuminer-rf-optim-sbc. I also put a build script there for these boards, because the build is not trivial out of the box and reports undefined symbols by default. Hoping this helps! Willy
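The idea, schematically (the names below are simplified, not the exact ones from my patch):

#include <stdint.h>

typedef struct {
    uint32_t idx;  /* rambox slot that was modified */
    uint64_t old;  /* previous value, kept to restore it */
} rf_hist_t;

/* during the hash: log each rambox write instead of working on a fresh copy */
static inline void rambox_write(uint64_t *rambox, rf_hist_t *hist, uint32_t *nhist,
                                uint32_t idx, uint64_t val)
{
    hist[*nhist].idx = idx;
    hist[*nhist].old = rambox[idx];
    (*nhist)++;
    rambox[idx] = val;
}

/* after the hash: undo the ~384 writes in reverse order, which is far cheaper
 * than memcpy()ing the whole context for every nonce */
static inline void rambox_undo(uint64_t *rambox, const rf_hist_t *hist, uint32_t nhist)
{
    while (nhist--)
        rambox[hist[nhist].idx] = hist[nhist].old;
}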
Hello Sir,
your changes are extremely interesting. I didn't think about the branch prediction issue and the fact that it would be faster to always write to the rambox! Also regarding the memcpy() you're absolutely right! The rf256_ctx is 16512 bytes, so it's even 2k cycles to copy it 64 bits at a time. What a waste of resources when you see how much was saved on divs and CRCs! And this is assuming that both the source and destination fit in L1, which is not the case. Have you tried hand-writing the memcpy() using LDNP to avoid keeping a copy of ctx_common in the cache? Sadly GPUs can be much faster on the memory operations (huge shared busses, so the copy happens fast but must be rare enough to avoid collisions, which increase latency). I had to limit the rambox size to avoid killing the performance in memcpy(). But with your history method, it might be possible to use a huge area which fits in L2 or L3 instead, and keep only the history in L1. This would further widen the gap with ASICs. This definitely is something to consider for a v2!
Now regarding your code, I don't know how to handle it. I think you should send it to the cpuminer's maintainer. But I'll have to study the reintegration of your patches into the reference design and in the OpenCL one as well so that everyone plays at the same level.
Thank you very much for this work, and rest assured that as a researcher I truly appreciate its value.
OK I can send a PR to the maintainer then. We'll see if he's interested. I don't want to invest too much time on this since I don't use this beyond testing :-)
I tried the LDNP instructions, as I already use them in one of my memory benchmark programs, but this didn't show any difference at all. I suspect the A53's L1 is not smart enough to consider the difference between a regular read and a non-temporal one.
If you're considering creating a v2, you should have a look at some ARMv8 instructions like RBIT, which reverses all bits (not present on x86, where it will require roughly 20 cycles). It apparently exists on CUDA but not OpenCL from what I'm reading, and even where it exists it's only 32-bit, so the 64-bit version is slower. Mixing it with your rotbox could be fun. There's also EXTR, which extracts portions of a double word and likely isn't trivial to implement on GPUs. I also found CLS, which counts the number of identical leading bits. It will require 4-5 instructions on x86 and probably the same on OpenCL/CUDA.
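For reference, here's roughly what the portable fallbacks look like on x86 (untested sketch; __builtin_clzll is the GCC/Clang builtin):

#include <stdint.h>

/* RBIT: one instruction on ARMv8; elsewhere ~5 mask-and-shift steps */
static inline uint64_t rbit64(uint64_t x)
{
    x = ((x & 0x5555555555555555ULL) << 1)  | ((x >> 1)  & 0x5555555555555555ULL);
    x = ((x & 0x3333333333333333ULL) << 2)  | ((x >> 2)  & 0x3333333333333333ULL);
    x = ((x & 0x0f0f0f0f0f0f0f0fULL) << 4)  | ((x >> 4)  & 0x0f0f0f0f0f0f0f0fULL);
    x = ((x & 0x00ff00ff00ff00ffULL) << 8)  | ((x >> 8)  & 0x00ff00ff00ff00ffULL);
    x = ((x & 0x0000ffff0000ffffULL) << 16) | ((x >> 16) & 0x0000ffff0000ffffULL);
    return (x << 32) | (x >> 32);
}

/* CLS: count leading bits equal to the sign bit (excluding the sign bit);
 * the first mismatch with the sign becomes the first set bit of t */
static inline int cls64(uint64_t x)
{
    uint64_t t = x ^ (uint64_t)((int64_t)x >> 1);
    return t ? __builtin_clzll(t) - 1 : 63;
}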
@aivve I think that chaining operations would do the opposite of what you're seeking: I suspect that GPUs with their huge busses will be extremely fast at doing the static initial and final work (e.g. memcpy()) while the others will be very slow. I think a better approach would be to make each round much heavier. If my A53 can run 2M rounds/second/core, we can make the chain much longer and involve many more operations (provided it still fits in the L1 I-cache). For each of these operations which runs faster on a CPU, this will close the gap between the two. And good luck with implementing huge sequential processing in ASICs. Just my two cents.
By the way, I think it is a mistake to use both div+mod on small devices: on ARM the division does not provide the modulo, so it requires an extra multiply and subtract, and the two will often be serialized because they use the same ALU ports. Also, you don't slow down a GPU much by making it compute the modulo once it has the quotient, so better stick to div only.
Also, I don't think GPUs implement modular arithmetic, only saturated arithmetic (but I could be wrong). For example, multiplying 64x64 bits and retrieving the highest 64 bits of the 128-bit result is probably not that easy there. On x86 it's trivial. On ARM64 it's just a mulhi if I'm not mistaken.
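For the record, on a 64-bit GCC/Clang target that high multiply is a one-liner (OpenCL does expose mul_hi() and CUDA __umul64hi(), though I haven't measured how fast they are):

#include <stdint.h>

/* high 64 bits of a 64x64-bit multiply: a single MUL on x86-64 (the high
 * half lands in RDX), a UMULH on ARM64 */
static inline uint64_t mulhi64(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}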
good luck! Willy
@wtarreau nice work! Now rainforest performs 4-5 times faster on my CPU :)
Then I guess you're on x86. I was amazed to see my $45 NanoPi Neo4 run the program 25% faster than my $600 PC running an overclocked i7-6700K at 4.4 GHz! This game is getting fun; I'm going to order an RPi just to see if it's possible to make some code run faster on it than on a PC (it will not be easy, but it would be fun).
PR now sent to upstream.
PR sent to upstream and merged.
@itwysgsl I see a lot of WildRig on skypool reaching around 450 MH/s per rig, or around 50 times my PC. Do you have an idea what setup they have?
Hello @jdelorme3, no idea actually. WildRig is a closed-source AMD miner, so I can only assume they have cards like Vega etc. Also, a few days ago a big amount of hashrate popped up on our network (>80% of the hashrate comes from an unknown source), and there are some hints (like "proxy" miners on pools, and some insider but unproven info) that it could be FPGAs. The developer of WildRig assumed that one round of Rainforest is not enough to fill the rambox of FPGAs, which gave them some advantage. @bschn2 what do you think about this?
@itwysgsl that's very interesting. I'm seeing that cards called "vega 64" are not that expensive. Given the specs (number of cores and memory bandwidth), it's tempting to buy one to see what's possible to do with it. I've never programmed a GPU and have no idea what similarities there are between this and a CPU. I remember SSLshader; maybe it's time to experiment with this again.
Regarding the ability to port to FPGA: from what I've found in the scan_rf256() function, it's indeed possible to iterate over the last round once you fix the rest, and with the history method instead of the full memcpy() it could be affordable to port this. That said, I also have strong doubts about the ability to easily port the number of CRC32 rounds and the 64-bit divides to FPGAs.
I personally think it's a mistake (don't take offense @bschn2) to focus on L1 cache speed. I'd instead go for the DRAM limitation. If you work on a 256 MB block, you pay the memory latency (one latency per access and per memory channel). My measurements indicate 65 ns (Skylake) to ~200 ns (low-end ARM boards). Combine this with CRC32 and RBIT for addressing and you could end up with very low scalability on core counts. I.e. if the PC requires 140 ns to compute what the ARM does in 5 ns, a single core gives the same speed, and with 8 cores you're only 2.5 times faster. And similarly, an FPGA would have to split memory into many channels and have tons of memory available.
I found (possibly unreliable) info indicating that GPUs would use GDDR5 at about 40ns access time. This would mean they'd be at most 5 times faster than the low-end ARM board, or 1.5 times faster than the x86 CPU.
@wtarreau as far as I know, a rig of 15 Vega cards produces ~1.5 GH/s on Rainforest.
Here are more stats:
120 MH/s per Vega64
63 MH/s per RX580
@itwysgsl thank you very much for this info! So one Vega64 only does 8 times my NanoPi Neo4, which costs only $45! The datasheet says it draws 295 W, my NanoPi around 7 W. Thus with 56 W worth of NanoPis, costing half the price of a Vega64, one would achieve the same level of performance. Is it common to assemble 15 such cards? That costs a lot, it should be around $5-6k!
@wtarreau such rigs are a pretty common thing; the number of cards can vary from a couple to a couple hundred, though. Mining is a business after all :)
Hello,
@itwysgsl it could be possible that the first part of the loop is computed on a regular CPU and that they offload only one round to an FPGA. Indeed, a single round of the rambox might fit in the SRAM provided by an FPGA, especially with @wtarreau's method using the history buffer. The div64 and the CRC combined would easily eat all available space on the FPGA but since some FPGAs come with embedded CPUs, one could imagine that the heavy part of the processing is performed by the small CPU and that the wide datapath is handled by the FPGA. This would save gates but not latency! This remains hardly affordable for more than one round, but if it's the last one it's already an issue. As mentioned by @wtarreau (by the way, no offense taken, Sir) indeed using a much larger rambox to prevent it from fitting into SRAM or cache would make sense. Some FPGAs have multiple memory controllers so in theory they could run multiple hashes in parallel but they would be limited by the number of gates needed to implement some of the complex operations.
From your stats, an RX580 is 4 times as fast as @wtarreau's ARM board. Given that it's expected to be twice as fast as the RX560 I tested, this seems roughly in line with my predictions.
In order to be certain you're free from FPGAs, it could be reasonable to run the hash twice, as suggested by @aivve. However, that requires you to fork, and I understand how problematic this can be; it would be wise to first see if the algorithm can be tailored to further rebalance performance.
Looking at hashpool.eu, it says the network hashrate is roughly 1500 times what a single rig would do according to your numbers, with an average rate equaling 60 Vega64 cards. This is not huge, it fits in a rack! It could simply indicate that some professional miners see an interest in mining your coin and are moving their huge computing power to it. Also, the numbers reported on mbc.skypool.co show half a GH/s per rig, which as per your stats would indicate $2k rigs made of 5 Vega64 cards. Place 12 of them in a rack behind a proxy and you have your 6 GH/s per proxy for $24k invested and 18 kW of power, or roughly $16k/yr. Some are willing to spend that much and even way more on mining when it's affordable. They could make substantial power savings by migrating to lower-power boards. Let's see how to make it even less profitable for them, to encourage them to adopt consumer devices instead and leave more room to individuals.
Another benefit in encouraging them to use low-power boards is that it will force them to acquire new hardware that they cannot use to mine other coins. So they will be torn between using their power-hungry GPUs that could be better used on other algorithms, and buying lots of hardware to compete with individuals and then stick to your coin. It forces them to take an investment risk and will keep many of them away.
@bschn2 yeah, switching to RainforestD is one of the options to solve the FPGA issue (right now we have 90% unknown hashrate), but I want to give the algo more time to reveal more flaws (consider this a public beta test of rainforest :D). Also, there is an Nvidia miner in development right now. I want to check how it performs on the original Rainforest first, and then consider switching to RainforestD :)
Sir, rest assured that I'm not trying to defend my design (I'm a researcher, and my daily job is to throw everything away each morning and restart from scratch), but I think that what we're observing right now is not really a set of flaws but insufficient resistance to tremendous amounts of power. It's also worth considering that the high frequency of your blocks certainly attracts a lot of professionals, who figure that with such power they can gain an advantage over those who take more time to get their first share; it's interesting to keep this in mind for the algorithm improvement. I truly appreciate your involvement in helping to refine it; I'm absolutely convinced that with such public exposure we can make it even better! Before switching to a double hash, just poke me, in case we can refine a few parts to further improve your resistance to high-power devices. Warmest regards
Sure thing @bschn2, thanks again for your hard work!
Yeah, let's abuse DRAM latency to put everyone at the same level, and call it "lopohash" (low-power hash) :-)
+1 for lopohash, actually my choice of the rainforest name was not the best given how difficult it is to find!
In fact, the idea of double hashing comes from @itwysgsl. I'm just trying to adapt this algo for CryptoNote, and so far I couldn't succeed in adding the precalc and history optimizations. It works well with the full hash applied twice.
I have run a number of DRAM latency tests on various machines; the results are here: http://git.1wt.eu/web?p=ramspeed.git;a=blob;f=data/results.txt
In short, a modern x86 takes 65 ns vs 90-180 ns for an ARM board. When running 4 concurrent threads, these times become about 130 ns for x86, and 270 ns for dual-channel ARM or 350 ns for single-channel. This means two things:
I think that running loops of pseudo-random memory accesses, followed by loops of difficult computations, would make things equal between the various players. If for example you want to achieve 10 kH/s, that's 100 µs per hash, which can be split into 70 µs CPU + 30 µs RAM on a PC, and 30 µs CPU + 70 µs RAM on a low-power board. I don't know how GPUs perform with respect to random accesses; I've read specs mentioning 2048-bit memory busses, so I'm assuming they have multiple channels. If you work over a 256 MB area, that's a maximum of 64 threads for a 16 GB card, and even if it has, say, 8 independent memory channels, this is still not orders of magnitude more than a PC. An FPGA will not be able to do better than the PC with 256 MB per thread, and will instead likely do much worse on the complex operations. As long as a small cheap device like a Raspberry Pi is faster and cheaper than the FPGA, there's no reason to go down that route.
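To make the memory part concrete, what I have in mind is a plain dependent pointer chase over the large area, something like this sketch (the real indices would of course be derived from the hash state):

#include <stddef.h>
#include <stdint.h>

/* each load depends on the previous value, so the loop runs at DRAM
 * latency speed, not bandwidth; <area> covers e.g. 256 MB of 64-bit words */
static uint64_t ram_walk(const uint64_t *area, size_t nwords, uint64_t seed, int loops)
{
    uint64_t v = seed;

    while (loops--)
        v = area[v % nwords] ^ (v * 0x9E3779B97F4A7C15ULL); /* mix to spread indices */
    return v;
}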
Regarding the double hash, I'd recommend against this. At first glance it can look appealing but in fact it is only more complex for software approaches and less for hardware where outer operations are peeled off. I remember doing this already for password cracking two decades ago. You can never remove much but those able to optimize certain parts and propagation paths in the algorithm are advantaged. Also, it becomes trivial to pipeline operations between two instances of the same chip, and unused silicon can efficiently be used for this while on a PC every cycle counts. Instead I'd rather make it difficult (with loops) and make sure that the code is highly optimized for software.
One painful thing with FPGAs is reconfiguration. If you don't want to reconfigure on each algorithm change, you have to waste tons of cells to implement the control logic. Simply running pseudo-random numbers of loops based on the contents would constitute a huge pain. For example, the rambox might work over a variable-sized area based on the input data, and iterate over a variable number of operations as well. Same for rotations/divisions.
You might want to look at RandomHash or we'll end up in ProgPOW or RandomX :) Thank you for comment on double hash, @wtarreau!
There's the Wild Keccak PoW that uses a scratchpad made from blockchain data. We devised a somewhat similar algo which will slow miners down a lot:
1) Hash the block data, including the nonce, into hash_1
2) Split hash_1 into 8 chunks and fetch the corresponding 8 blocks from the blockchain
3) Hash the eight blocks as one continuous block, using hash_1 as the salt, giving hash_2
4) Finally, hash using the generated hash_2 as the salt and the previous hash_1 as the password
Because changing the nonce will change the first hash, and thus the required blocks from the blockchain, I don't see how miners could bypass reading random blocks from the entire blockchain. The only potential issue is that miners could learn to query blocks from public nodes, DDoSing them; however, this would slow them down even more due to network latency.
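In pseudo-C it looks like this (hash(), hash_salted(), chunk_to_height(), get_block() and MAX_BLOCK_SIZE are placeholders, not real APIs):

static void pow_hash(uint8_t result[32], const uint8_t *block_data, size_t block_len)
{
    uint8_t hash_1[32], hash_2[32];
    uint8_t blob[8 * MAX_BLOCK_SIZE];
    size_t blob_len = 0;

    hash(hash_1, block_data, block_len);              /* 1) block data includes the nonce */

    for (int i = 0; i < 8; i++) {                     /* 2) 8 chunks -> 8 chain blocks */
        uint32_t height = chunk_to_height(hash_1, i); /*    map chunk i to a block height */
        blob_len += get_block(blob + blob_len, height);
    }

    hash_salted(hash_2, blob, blob_len, hash_1);      /* 3) blocks hashed, hash_1 as salt */
    hash_salted(result, hash_1, sizeof(hash_1), hash_2); /* 4) hash_1 as password, hash_2 as salt */
}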
@wtarreau thanks for your test report, this definitely deserves more investigation. Indeed, a 256 MB rambox sounds like a nice balance which will not exclude today's low-cost devices. One element to keep in mind however is that a larger memory area will increase TLB misses, so it would be wise to enable huge pages to limit this effect. On the other hand, it is difficult to use a variable-sized area on GPUs, because miners have to pre-allocate a certain amount of memory and share it between workers. If the block contents dictate the memory area size, it means you have to rebuild with different settings and thread counts for each block, which is not going to work well. Better to stick to the largest area all the time; it will be more reliable for GPUs.
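For Linux miners, requesting huge pages for the rambox can be as simple as this (a sketch using the standard mmap()/madvise() calls; RAMBOX_BYTES stands for whatever size is retained):

#include <sys/mman.h>

/* back the rambox with transparent huge pages to cut TLB misses
 * over a large area such as 256 MB */
void *rambox = mmap(NULL, RAMBOX_BYTES, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
if (rambox != MAP_FAILED)
    madvise(rambox, RAMBOX_BYTES, MADV_HUGEPAGE);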
@aivve thank you for the link and the description. I've read the RandomHash paper. While there are some novel concepts, it still mixes lots of algorithms that are readily available as inexpensive IP blocks for ASICs. Don't forget that all these algorithms were designed to be easy to implement in ASICs. To give you an idea, most of them are part of the x11 algorithm, for which you can buy an ASIC delivering 19 GH/s for $140. For sure it will take some time to adapt existing ASICs to mix them in a variable way, but at such a low price you don't count the number of IP blocks involved, and you can even waste some existing silicon by placing several such cheap ASICs around one FPGA for the control. This is why I really wanted to focus on difficult operations that generic CPU vendors have had to solve over the last decades. The explanation of the dual-mining paradigm is very informative, and it is also covered by what @wtarreau suggests regarding mixing RAM and CPU in the same hash.
@wtarreau you seem to have a wide range of hardware available based on your test results; would you be willing to further modify the rainforest implementation in cpuminer to measure such effects on different hardware? Note that cpuminer provides a benchmark mode, so you don't need to send invalid shares.
@bschn2 @wtarreau I also have a RaspberryPi 3B+ if needed, happy to run beta code!
@jdelorme3 I've ordered one last week already and should hopefully receive it this week, but thanks for the proposal!
@bschn2 I've started to take a look at it, making the rambox dynamically allocatable. I've noticed that changing its size from 2k entries to 256 MB reduces performance a bit, but increasing the number of rambox loops has a tremendous impact. It takes ages to report the performance in benchmark mode; I'm not sure why, probably there is a minimum number of loops required. I'll have to dig a little bit.
A few comments: I managed to fix the slowness above, there was a target number not updated in benchmark mode. I received my RPi last evening. It has slightly less than 1 GB of RAM for 4 cores, so a 256 MB work size will not fit; 128 or even 64 should be better. And it's slow as hell: it takes 10 times longer to write to memory than on my PC. 45 ms are needed to initialize 64 MB with a simple counter! So you cannot imagine recomputing everything for each hash, or it will at best emit a few hashes per second, and larger setups will do way more just because of this. Maybe it would be worth regenerating the rambox based only on the higher bits of the nonce, so that it happens, but not too often?
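Something along these lines (sketch; rf_raminit() and seed_from() are stand-ins for whatever fills the rambox):

#include <stdint.h>

/* regenerate the rambox only when the high bits of the nonce change,
 * so the ~45 ms initialization is amortized over ~64k hashes */
static void maybe_regen_rambox(uint64_t *rambox, const void *block_header, uint32_t nonce)
{
    static uint32_t last_hi = ~0U;

    if ((nonce >> 16) != last_hi) {
        last_hi = nonce >> 16;
        rf_raminit(rambox, seed_from(block_header, last_hi)); /* placeholder names */
    }
}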
However, I could make some trivial loops run 3.3 times faster on it than on my 4 GHz PC. The loop is made of div(cls(rbit)), so there's definitely huge potential here! It doesn't have the crypto extensions but it does have crc32: "Features : fp asimd evtstrm crc32 cpuid".
That's all for today, having to switch to real work now.
Hello Bill, some users are facing build errors and warnings, like here https://imgur.com/uEOLZpL or here https://gist.github.com/quagliero/90f493f123c7b1ddba5428ba0146329a#gistcomment-2800719.
I've addressed them; they were caused by missing types (ulong) and void* arithmetic (forbidden in C++). I have also updated the patches for cpuminer and sgminer. For yiimp I have not updated the patch, as tpruvot already merged and fixed it. But the cpuminer and sgminer patches should apply as-is to forks of older versions of these miners.
Please merge this, as adoption by some coins might be blocked just because of this.