bschn2 / rainforest

The rainforest cryptocurrency algorithm
MIT License

Build fixes #4

Closed MikeMurdo closed 5 years ago

MikeMurdo commented 5 years ago

Hello Bill, some users are facing build errors and warnings like here https://imgur.com/uEOLZpL or here https://gist.github.com/quagliero/90f493f123c7b1ddba5428ba0146329a#gistcomment-2800719 .

I've addressed them; the causes were a missing type (ulong) and void* arithmetic (which C++ forbids). I have also updated the patches for cpuminer and sgminer. For yiimp I have not updated it as tpruvot already merged and fixed it. But the cpuminer and sgminer patches are likely to be applied as-is on forks of older versions of these miners.
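
For illustration, here is a minimal sketch of the two kinds of fix (a missing type alias and a cast before pointer arithmetic). The names are made up for the example, this is not the actual patch:

#include <stdint.h>
#include <stddef.h>

typedef unsigned long ulong;   /* the type alias missing on some platforms */

static inline uint64_t read_cell(void *rambox, size_t offset)
{
    /* before: return *(uint64_t *)(rambox + offset);  -- GNU C only,
     * because arithmetic on void* is forbidden in C++ */
    /* after: cast to a byte pointer so the arithmetic is valid in both C and C++ */
    return *(uint64_t *)((uint8_t *)rambox + offset);
}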

Please merge this, as adoption by some coins might be stopped just because of these build issues.

bschn2 commented 5 years ago

Sir, thanks again for your tests. The CPU part is encouraging. 64 MB ought to be enough: it would already limit 4 GB GPUs to 64 threads and 16 GB ones to 256 threads, which is substantially lower than their 4k+ stream cores and will still create a lot of congestion on their 2048- or 4096-bit GDDR buses.

Regarding the rambox initialization from the nonce, be extremely careful. You always want small miners to have time to find a share during a block's life. With fast coins like MBC, a block might be found in one minute, and that must be enough to let a small player find a few shares, or he'll stop mining this coin and leave the room to the big ones only. Let's say a large pool connects 10k clients, with the largest being 100 times stronger than the smallest: you want to be able to see at least 1 million shares of the same difficulty per block. That's 16k shares/s, thus 16k hashes per second to validate on the pool servers! You certainly don't want the full hash to be too expensive there! Of course the difficulty adapts so that the large players don't emit 100 times more shares, but you cannot validate less than the speed of the small ones, and in practice, due to the difficulty distribution, you'll see at least double that. So you'd get at least 20k shares per minute, 400 per second. I estimate that a reasonable target is around 1-10k hashes per second per core for a strong x86 server. This should leave enough room to upscale and downscale various setups without impacting the fairness between all participants. And based on your numbers it's indeed not even imaginable to initialize 64MB of memory that fast. This is why relying on a pre-initialized rambox with few writes can be a reasonable alternative.
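
A back-of-the-envelope check of that pool-side validation budget, with the numbers from this comment (the little program is purely illustrative):

#include <stdio.h>

int main(void)
{
    double shares_per_block = 1.0e6;  /* ~1M same-difficulty shares per block */
    double block_time_s     = 60.0;   /* fast coin, roughly one-minute blocks */

    /* the pool must run one full hash per submitted share */
    printf("validation rate: ~%.0f hashes/s\n",
           shares_per_block / block_time_s);   /* ~16667 */
    return 0;
}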

wtarreau commented 5 years ago

Noted, thanks for the explanation. That's really all for today this time.

LinuXperia commented 5 years ago

My improvement suggestion is to use serialization of the nonce instead of parallelisation.

At the moment, miners with deep pockets have exclusive access to expensive hardware and can use that privileged position to abuse the network and the community.

This works by calculating each nonce in parallel on more expensive hardware and thereby overrunning the smaller miners.

So clearly the parallelisation of the nonce has to be prevented.

And this can easily be done with a small change in the code of the Rainforest algorithm.

Instead of using only one nonce to find a hash that is below a specific target, use a range of nonces: iterate the input hash with each new nonce in the range defined by the consensus network.

Keep the calculated hash of the current nonce and use it as the input for the next nonce in the range, then test whether the resulting hash is good enough to be submitted.

The nonce in the submitted block then records how many times the input hash has to be looped through the given range by the node until the final hash is acceptable.

Looping through the whole nonce range is probably also easy for every node, so maybe the nonce range is not even needed to avoid overly long verification of blocks by the nodes.

It is a little more work than testing just one nonce, yes, but it eliminates the whole parallelisation problem on a miner device.

The only thing that then additionally needs to be addressed is miners using different block header data for each thread to regain a parallel advantage, i.e. instead of calculating nonces in parallel, mining different blocks in parallel with different merkle roots and timestamps.
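
A minimal sketch of this serialized-nonce idea (rf_hash is just a stand-in name for the rainforest hash; the chaining structure is the proposal above, not existing code). Because step i needs the output of step i-1, the search for a single header cannot be spread across parallel threads:

#include <stdint.h>
#include <string.h>

/* hypothetical wrapper around the real hash function */
extern void rf_hash(uint8_t out[32], const uint8_t *in, size_t len);

/* Returns the number of chained iterations needed to meet the target
 * (this count becomes the block's nonce), or 0 if none within max_iters. */
static uint32_t find_serial_nonce(const uint8_t header[80],
                                  const uint8_t target[32],
                                  uint32_t max_iters)
{
    uint8_t h[32];

    rf_hash(h, header, 80);                  /* seed the chain with the header */
    for (uint32_t i = 1; i <= max_iters; i++) {
        rf_hash(h, h, 32);                   /* each step depends on the previous one */
        if (memcmp(h, target, 32) <= 0)      /* big-endian "hash below target", simplified */
            return i;
    }
    return 0;
}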

wtarreau commented 5 years ago

I like this idea! But I'm wondering, what makes one client find a share before another one? Do they have slightly different blocks to mine? If so, what prevents the ones with deep money pockets from presenting one different client per CPU/GPU/thread?

itwysgsl commented 5 years ago

@bschn2 @wtarreau little update on our hashrate situation: @andru-kun came up with some optimisations for the AMD miner and now it performs much better. For example, an RX570 GPU produces ~33GH/s compared to ~60MH/s before.

wtarreau commented 5 years ago

Wow, great! It would be nice to know what they've done so that the same could be merged into cpuminer. Otherwise this leaves the advantage to those with huge GPU farms. Maybe this weekend I'll have some time to test my proposed changes on my RPi and other small boards. I'm very interested in seeing how the algo performs with a 64 MB rambox!

itwysgsl commented 5 years ago

Well, since WildRig is closed-source software, @andru-kun keeps the right to share these edits to himself, I think :(

itwysgsl commented 5 years ago

Some people also noticed that hashaltcoin (developer of the F1 FPGAs) added MicroBitcoin and Rainforest to their profitability calculator https://www.hashaltcoin.com/cn/calculation

Hashaltcoin Rainforest

LinuXperia commented 5 years ago

I like this idea! But I'm wondering, what makes one client find a share before another one? Do they have slightly different blocks to mine?

It's the Rainforest CPU algorithm itself, as it is designed to perform very well on single-board computers using only one CPU thread.

Let me give you a small example.

You have 100 GPUs and, for simplification, let's say you mine only one block per GPU, and on average you need 30 seconds to find the serialized nonce on one such GPU.

So with all 100 GPUs, on average, you will be able to find a block in about 30 seconds.

Let's say I have just one Raspberry Pi instead of your 100 GPUs, but I have one small advantage you don't have: a little more speed, as the Rainforest algorithm favors my single-board computer. Because of this small advantage in serial computational speed, I can find a serialized nonce on my Raspberry Pi in 20 seconds on average per device.

So what does that mean?

No matter how many GPUs you have, whether 100 or 1000 GPUs calculating 1000 blocks in parallel at a time, if your average block find time on the GPUs is 30 seconds and my average time to find a block on my Raspberry Pi is 20 seconds, you as a big GPU farm miner will never outperform me with my single Raspberry Pi.

Having this serialized-nonce change in the Rainforest algo will give CPU mining an advantage over massively parallel architectures, as they lack the serial CPU speed and are better at parallelisation, and exactly that is prevented by the new proposal.

ClashLuke commented 5 years ago

You might want to look at RandomHash or we'll end up in ProgPOW or RandomX

@aivve About that, I actually wrote a variant of rainforest which is highly similar to ProgPoW and Ethash. It is based on your research @bschn2, but it uses everything you mentioned only once: one block of rotations, one block of additions, one block of divisions and one block of CRCs. With some optimisations, I was able to achieve approximately 4 cycles per byte on a Zen CPU. Yes, 4 cycles per byte results in a hashrate of 250MH/s on your test CPU.

About the RAM-box you've been talking about: as seen in Dagger Hashimoto, it's definitely not a good idea to regenerate the dataset/RAM-box every nonce or even every block. Assuming you have a blockchain which produces a block every 15 seconds but generating your rambox takes 14 seconds, you would have only one second left for hashing, which would result in extremely weird difficulty calculations.

No matter how many GPUs you have, whether 100 or 1000 GPUs calculating 1000 blocks in parallel at a time, if your average block find time on the GPUs is 30 seconds and my average time to find a block on my Raspberry Pi is 20 seconds, you as a big GPU farm miner will never outperform me with my single Raspberry Pi.

@LinuXperia

Actually, I will, with just 1.5 GPUs. So let's say you have 1000 RPis: you will find one block every 20s/1000 (50 blocks/s). To achieve this speed with my GPUs, I would have to get 1500 GPUs, which would result in 30s/1500 (50 blocks/s). Now, assuming that my GPU has a better hash/heat, hash/power or hash/investment ratio, there definitely are reasons for an investor to go for it.

The thing is, it won't ever be like that. A Vega64 has 8GB of HBM2 memory (which is a lot faster than the RAM on your RPi), so the I/O-intensive part is definitely a whole lot faster there. Additionally you can run 8x as many threads. The next thing would be the core clock: a Vega64 runs at 1.4GHz, an RPi3B+ at 1.2GHz. Since neither of those has any of the optimisations (AES, CRC) available, you can assume that my Vega will be significantly faster. In case it's bound to 1GB of RAM, it would be about 9.6 times faster; if it's only processing-dependent, we can safely say it's even worse.

Now, let's compare price/hashrate. My Vega costs about 340€ here in Germany, new, shipping included. Your Raspberry Pi costs about 41€; both come from trustworthy sellers. That means you pay 41€ per H/s, I pay 34€ per H/s. The next point would be thermal performance: a Vega64 runs at about 60°C when mining Ethash, while a Raspberry Pi can easily get above the 100°C mark when mining without a heatspreader, which would cost another 10€. The last possible point would be the power draw. If you hash a lot more efficiently than I do, it is definitely worth it. While a Vega64 can take up to 295W, it takes about 130W when throttled, so we can assume it can run on full power at 150W. Each of your RPis consumes about 5W, which means you have about one third of the power usage I have. Considering the other stats, I would highly suggest everyone buys a Vega64 instead of an RPi3B+.

edit because I messed up the quotes

ClashLuke commented 5 years ago

@bschn2

Thanks again for your tests. The CPU part is encouraging. 64 MB ought to be enough: it would already limit 4 GB GPUs to 64 threads and 16 GB ones to 256 threads, which is substantially lower than their 4k+ stream cores and will still create a lot of congestion on their 2048- or 4096-bit GDDR buses.

About that, you usually take the maximum available memory. Ethash currently uses about 4GiB, which means an ASIC needs at least 4GiB of high-performance memory. Existing Ethash ASICs simply utilise 72GiB of memory and a few processing cores. Think about it: most devices these days have 6GiB of RAM or more, so leaving one third to the system and running one thread is decent. A GPU would be limited via RAM as well: a Vega64 could run two threads, which means that (assuming you have the same delays) you could easily outperform it using a Threadripper.

Since I mentioned Ethash a couple of times now, I want to get back to the hash I wrote based on your research. (I am sincerely sorry if this counts as shameless advertising, I do not intend it to be; this thread is supposed to be about improvement, so I think it's definitely worth mentioning.) There is a variant of the hash which uses a 4GiB read-only scratchpad. Even assuming you had RAM as fast as L1 cache, you would need about twice as much time just to get data from the dataset as for the rest of the calculations. As you can see, it's clearly I/O-bound instead of calculation-bound, which gives it another strong level of ASIC resistance. It has a light mode, for verifiers, and a full mode, which needs the 4GiB dataset. The dataset is generated while the blockchain propagates, and changes happen every "epoch" (n blocks). For comparison, the full mode gives me about 1.1MH/s single-threaded on a Xeon E3-1225v2 with DDR3-1333 memory, while the light mode reaches about 1.2kH/s on the same CPU.
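
To make the full/light split concrete, here is a schematic sketch of the Ethash-style pattern just described (the general pattern only, not my actual code; item_from_cache is a hypothetical helper):

#include <stdint.h>
#include <stddef.h>

typedef struct { uint64_t w[8]; } item_t;        /* one 64-byte dataset item */

/* full mode (miner): the whole 4GiB dataset is resident in RAM */
static item_t lookup_full(const item_t *dataset, size_t idx)
{
    return dataset[idx];                          /* a single memory access */
}

/* light mode (verifier): regenerate only the item needed, from a small cache */
extern item_t item_from_cache(const uint8_t *cache, size_t idx);  /* hypothetical */

static item_t lookup_light(const uint8_t *cache, size_t idx)
{
    return item_from_cache(cache, idx);           /* recomputed, hence far slower */
}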

In my humble opinion, this hash summarizes this discussion very well and also gives all the hard work you put into the research a new purpose, considering what @itwysgsl just posted:

Some people also noticed that hashaltcoin (developer of the F1 FPGAs) added MicroBitcoin and Rainforest to their profitability calculator https://www.hashaltcoin.com/cn/calculation

Thanks for reading this rather long comment, I'm looking forward to hearing from you.

itwysgsl commented 5 years ago

Hello @ClashLuke, where can I check out your variant of rainforest? :)

About that, I actually wrote a variant of rainforest which is highly similar to ProgPoW and Ethash

itwysgsl commented 5 years ago

Also @bschn2, I think we have a serious hole in the Rainforest design:

Hole

itwysgsl commented 5 years ago

one more

ClashLuke commented 5 years ago

Hello @ClashLuke, where can I check out your variant of rainforest? :)

@itwysgsl right, I wrote about it but didn't link it. :thinking: Here it is: https://github.com/ClashLuke/Squash-Hash I highly doubt it has an issue like the one mentioned above.

bschn2 commented 5 years ago

Hello Sir,

Oh my! From the very beginning the algorithm was designed to hash a full block at once and not to iterate over only the last word in the block, hence the padding at the end. Now I understand the growing hash rates we've been witnessing. I think I have a quick fix. Let me have a look at it.

itwysgsl commented 5 years ago

Thanks @bschn2, looking forward to it!

wtarreau commented 5 years ago

Now I understand why the hash block looked bogus when I was increasing the rambox (it was always the same and I was looking for a bug in my changes). Indeed the nonce goes into the last round only. Just running the block backwards would make it first and force a full hash to be computed for every nonce. If you're going to change it, may I suggest some small improvements to increase the computational cost on big hardware? Use RBIT before each rotate, replace the modulo in the div with another RBIT, and add a bit of popcount and CLZ. And I'd like to see some FPU stuff there: IEEE754 defines a 53-bit mantissa for double floats, so mapping a 64-bit integer to a float loses some precision, and reproducing exactly how it is lost requires IEEE754 compatibility, which is painful to emulate when you don't have a fully compatible stack. You could for example use something like this in the operations: y = ((__int128_t)x * x) >> 64; y += sqrt(y * x + clz(y) + popcount(x)); The slightest incompatibility with IEEE754 will result in incorrect rounding and erroneous values. By definition, all modern CPUs will always produce the exact same result here.

ARM and x86 have several other nice things in common, like PCLMULQDQ applied to vectors. It should be a bit more expensive on GPUs (it will take multiple cycles), but it's easy in hardware, so one should not abuse it.
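
A self-contained sketch of the kind of FPU-dependent mixing step suggested above (illustrative only, not rainforest code; rf_fp_mix is a made-up name, and the operands of the sqrt are scaled down so the result always fits back into a 64-bit integer). It assumes a 64-bit GCC/Clang target with IEEE754 doubles:

#include <stdint.h>
#include <math.h>

static inline uint64_t rf_fp_mix(uint64_t x)
{
    /* high half of the 128-bit product: one instruction on 64-bit CPUs,
     * several cycles on most GPUs */
    uint64_t y = (uint64_t)(((__uint128_t)x * x) >> 64);

    /* sqrt() forces a round trip through the FPU; any deviation from
     * IEEE754 rounding changes the low bits of the result */
    double s = sqrt((double)(y >> 8) * (double)(x >> 8)
                    + __builtin_clzll(y | 1) + __builtin_popcountll(x));
    return y + (uint64_t)s;
}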

wtarreau commented 5 years ago

@ClashLuke thanks for the link. The explanation of the TLB cost is right, but it misses the point about parallelism. Memory buses are expensive. Your ASIC could be as fast as a PC with an equally wide memory bus (i.e. 2x64 bits). To become even 10 times faster when the DRAM access latency sets the limit, it would need 1280 bits using the same technology. That's roughly ~1400 pins dedicated to DRAM connectivity! And we're only speaking about 10 times faster. A GPU with 2048 bits would still beat it.

Also, I thought that the power costs were more important than the investment cost in this area when comparing an RPi to a Vega. Note that the RPi is dog slow; you can easily find significantly faster boards at around the same price, sometimes even lower.

Ideally, if I understand the paper well, the goal would be to make the lowest-power devices hash the fastest even if they are more expensive to buy. That would steer the investment effort towards low power consumption even for large actors.

ClashLuke commented 5 years ago

@wtarreau today I learned something new. So a GPU would be even faster, and still faster in my variant. But this honestly doesn't matter too much, since we, or at least I, focus on ASIC resistance and want speed on all devices, asymmetric memory utilization and ideally some minor changes which favor the average Joe instead of the big data center.

Also, I thought that the power costs were more important than the investment cost in this area when comparing an RPi to a Vega. Note that the RPi is dog slow; you can easily find significantly faster boards at around the same price, sometimes even lower.

As far as I'm concerned, only if it makes a big impact. Assuming that you mine a coin for a year, similar to what you do with ASICs like the S9, the power usage doesn't matter too much. If the RPis cost about 70€ more than the Vega64, the Vega runs at 150W and electricity costs an average of 10ct per kWh, then over 8,760 hours (24*365) at 150W you pay a total of 131.4€ for the Vega, while the RPis would be at 43.8€. The resulting difference is 87.6€, i.e. a profit of roughly 17.6€ when using the RPis for a year.
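
A quick re-check of that electricity arithmetic (purely illustrative; the 10 RPis at 5W each follow from the "one third of the power usage" figure above):

#include <stdio.h>

int main(void)
{
    double hours = 24.0 * 365.0;           /* 8760 hours in a year */
    double rate  = 0.10;                   /* 10 ct per kWh */
    double vega  = hours * 0.150 * rate;   /* one Vega64 at 150 W  -> 131.4 */
    double rpis  = hours * 0.050 * rate;   /* ~10 RPis at 5 W each ->  43.8 */

    printf("Vega64: %.1f, RPis: %.1f, difference: %.1f per year\n",
           vega, rpis, vega - rpis);       /* difference: 87.6 */
    return 0;
}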

Ideally, if I understand the paper well, the goal would be to make the lowest-power devices hash the fastest even if they are more expensive to buy. That would steer the investment effort towards low power consumption even for large actors.

For this one, let's think about what happened with GPUs and RAM for a second. The shortage made prices go up to about 2x, even 3x, of what they originally were. Yes, many people already have mobile phones, but this doesn't mean that large-scale miners won't buy many phones or specific CPUs. This would drastically increase the cost for everyone involved. So instead of paying your 100€ for a phone (who pays 1000€ for a phone, seriously), you pay 150€. In fact, there already are ARM server CPUs, such as the N1 or the Hi CPUs by Huawei, which are based on ARMv8. That means those miners wouldn't be forced to rely on phone CPUs or RPis (by the way, RPis are up to ARMv7, and those optimisations are ARMv8). So when enabling low-power devices, you also further empower servers. But not all of them, only a small group of servers, which means that in the end it will be (almost) centralised again.

I personally think it's best to go the way CryptoNight did, evening everything out a bit between CPUs and GPUs instead of focusing entirely on CPUs, but let's see how this plays out.

bschn2 commented 5 years ago

@ClashLuke I agree with you that it is of utter importance to be fair with all involved devices. I've been focusing on being extremely fair between CPUs and GPUs; if you look at my presentation, at the end the ARM, the x86 and the GPU all produced the same performance. However, I still think it is important to give a bonus to those eating less power. Ideally, if your smartphone can hash as fast as a 150W GPU, it becomes affordable for many people to try to cover a part of their mobile subscription by mining while on charge during the night. It could be provided by default by some operators (who would likely provide their own pools), and in this case most of the power is decentralized. It remains true that large facilities can buy a lot of devices to get a bigger share, but this is part of the economics; as long as they don't take all the shares, everyone finds their place there. And thanks for your demonstration showing that for a small player it can be more tempting to try a small silent device lying on a shelf 24x7 than a big loud card that cannot be used by students living in a studio.

bschn2 commented 5 years ago

So I will go with a mix of @aivve's and @wtarreau's proposals. Initially I was not fond of running multiple hashes in series because of the rambox initialization cost, which is higher for CPUs than for ASICs/FPGAs. But @LinuXperia's comment made me realize that we could very well not reinitialize the rambox between the iterations and create a serialization that cannot be pipelined nor parallelized, because the RAM space is the same for all instances. So I'll run some measurements this evening once back home, but I think that looping 16 times on rf256_hash() with a single rambox initialization and @wtarreau's history buffer will be much stronger and will derail any attempt at unrolling any step, since the rambox will have been altered by the nonce in the first round. And the history buffer will allow fast reinitialization that should not kill the performance.

I will experiment with larger ramboxes. I can use my son's PC with the RX560 GPU, he's out for the weekend. I will also remove the modulus from the divisions and will likely replace it with an inverted addition, e.g. rbit;add;rbit, which should take the same number of cycles as the division. I like @wtarreau's idea of playing with floats but I think we need more time to validate it. However, using a 128-bit integer multiply could prove efficient at stopping attempts to use an FPU for some heavy operations. Regarding PCLMUL, I believe CUDA has it and it's not very hard to implement in an ASIC/FPGA, so I'm not sure there would be any benefit in adding it (but we can check).
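
A rough sketch of the chained scheme described above: 16 calls to rf256_hash() over one shared, pre-initialized rambox, plus a small change journal so the few cells written for one nonce can be rolled back cheaply instead of refilling the whole 64 MB. The rf256_hash() signature, rambox_t and the journal fields are illustrative guesses, not the final rfv2 API:

#include <stdint.h>
#include <string.h>

#define RF_LOOPS 16

typedef struct {
    uint64_t *cells;            /* the pre-initialized rambox */
    size_t    hist_idx[1024];   /* cells touched while hashing this nonce */
    uint64_t  hist_val[1024];   /* their previous contents */
    size_t    hist_len;
} rambox_t;

/* hypothetical: hashes <in>, reads/writes the rambox and journals its writes */
extern void rf256_hash(uint8_t out[32], const uint8_t *in, size_t len,
                       rambox_t *rb);

static void rambox_rollback(rambox_t *rb)
{
    /* undo this nonce's few writes instead of re-filling 64 MB */
    while (rb->hist_len) {
        rb->hist_len--;
        rb->cells[rb->hist_idx[rb->hist_len]] = rb->hist_val[rb->hist_len];
    }
}

static void rfv2_chained_hash(uint8_t out[32], const uint8_t *msg, size_t len,
                              rambox_t *rb)
{
    uint8_t h[32];

    rf256_hash(h, msg, len, rb);             /* round 1 mixes the nonce into the rambox */
    for (int i = 1; i < RF_LOOPS; i++)
        rf256_hash(h, h, sizeof(h), rb);     /* each round depends on the previous one */

    memcpy(out, h, sizeof(h));
    rambox_rollback(rb);                     /* cheap restore before trying the next nonce */
}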

aivve commented 5 years ago

Regarding floats and GPUs, please take a look at CN-GPU and here. Unfortunately I didn't find any documentation other than this article. Just a quote from the Reddit post:

CN-GPU is a complete rewrite of the original CN PoW and is based on 32-bit floating point operations, namely addition, subtraction, multiplication and division. It doesn't rely on any fancy feature, instead anchoring its performance to the physical number of cores. The algorithm is a very thorough test of the IEEE 754 implementation and has been tested on Intel SSE2, Intel AVX, ARM7, ARM8, NVIDIA, AMD.

wtarreau commented 5 years ago

@aivve Impressive work on CN-GPU. I'd have gone for 64-bit but they probably have a good reason to stick to 32. It proves there are definitely some possibilities there. And after all, using the FPU in a general-purpose CPU just makes use of yet another set of features that adds complexity to be replicated in FPGAs and ASICs. @bschn2 I can test on some machines if you want. I'll stop my build farm at some point to replace the cooling mechanism but the rest should remain usable. BTW, it would be nice if you could take this opportunity to have generic files that can be included everywhere instead of having patches for different projects which reimplement the same files :-)

wtarreau commented 5 years ago

I wanted to go back to experimenting with my rambox changes and found the operation frustrating because there are so many files to adjust (sometimes in cpuminer, sometimes in rainforest) and it's difficult to keep track of changes. Thus I'll try to split your work into multiple files that can be integrated as-is into cpuminer, hoping it will be much less painful. Given that the copy in cpuminer was partially reindented and that aivve also reindented his version, I'll likely clean all this up as well to ease my progress through your code, and will submit a PR once done.

wtarreau commented 5 years ago

I have performed a number of cleanups and some code reorganization, plus reintegrated my optimizations for cpuminer. Now the code has rf_crc32.c for the CRC, rf_aes2r.c for the AES, rf_core.c for the core, rf_test.c for the tests and rf_cpuminer.c for the cpuminer-specific stuff. The .c files include each other so as not to have complex dependencies and to let the compiler optimize away what it doesn't need. I verified with rf_test.c that the output hash remains unchanged. I'm sure we can do much more but I don't want to interfere too much with your work. Where can I put it?
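
One possible arrangement of that layout, just to make the include chain concrete (the file names are from the comment above; the chain itself is a guess):

/* rf_cpuminer.c -- cpuminer-specific glue (scanhash loop, benchmark hooks) */
#include "rf_core.c"    /* the core; it in turn includes rf_crc32.c and rf_aes2r.c */

/* rf_test.c would instead include rf_core.c on its own to check that the
 * reference hash output stays unchanged. */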

bschn2 commented 5 years ago

Sir, thank you a lot for doing this! I know my C style isn't pretty; I've been too used to code-and-forget. I have created a new rfv2 branch, please send a pull request against it and I'll merge it. I will then redo my changes on top of yours.

LinuXperia commented 5 years ago

@wtarreau Amazing job! Thank you very much.

@bschn2 Your work has the potential to be the real Bitcoin hash algorithm. Thank you very much for all the work, effort and research behind this amazing code. Looking forward to your new updated version.

bschn2 commented 5 years ago

As suggested by @itwysgsl, the discussion about rfv2 moved here: #7