bschn2 / rainforest

The rainforest crypto currency algorithm
MIT License

Design discussion #38

Open wtarreau opened 5 years ago

wtarreau commented 5 years ago

This is the continuation of the discussion started at the end of #15.

@bschn2 you make a good point indeed. However, what can be done then is to compute a pre-hash of the whole initial message, use it as an IV for AES transformations for example, and fill the whole RAM area with the result. This guarantees that you need that much writable memory. Then you can perform some random walks there and mix the data until you find a pattern that works. The memory fill time then becomes the bottleneck. It's actually nice, as even on the A53, while the L1 cache data path is 64-bit for reads, it's still 128-bit for writes, so there is less of a bottleneck for small CPUs. Just as an estimate, if we use 32-bit 1600 MT/s DDR3/4, that's 6.4 GB/s theoretical peak, thus 10 ms to fill 64 MB. You'd cap at 100 H/s on a small machine; is this too low?

bschn2 commented 5 years ago

No, it is not too low; it's pretty usable, even on a fast-block coin like MBC. In one minute you can visit 6000 nonces, you just have to start with a low difficulty. I remember seeing cryptonight around 30 H/s or so some time ago and it had happy users at XMR.

Your idea of maxing out the write bandwidth is interesting. However, it is mandatory that the visited locations cannot be predicted before the area is full: e.g. you read a location and it gives you a pointer to the next one. Keep in mind that AES is trivial in hardware and that CRC32 is doable as well, albeit it takes a lot of space. 64 MB of SRAM is possible (but expensive). You could design an ASIC capable of issuing a 512-bit block every 12 ns and writing to SRAM in parallel this fast; it would take only 87 microseconds to fill this area in an ASIC. So you need to add substantial operations to either the fill or the reads of this area.

I'm inclined to think that writing very fast first, then performing very expensive lookups, might be a solution, but an ASIC could then be designed in two parts using different technologies, such as ASIC for writes and FPGA/CPU for reads. The other option is to count the time it takes to write a 64 B cache line on a target CPU and to insert enough complex operations (rol/ror, bit reversal, 64-bit div, CRC, AES, etc.) to keep the CPU busy during this time. In your example above you have 10 ns available, or 15 cycles on small CPUs to 40 cycles on big ones. It seems possible to place a div, rot, crc and aes there and have them computed for free. Some of them are very problematic for ASICs (div) and will not be easily parallelized. Also, using different operations for each word prevents pipelining of hardware units in ASICs, so it could be nice to cut each cache line into 8 64-bit words and perform a different operation on each of them.

It would also be possible to exploit slightly unaligned accesses during the writes to make the filling process very hard to parallelize. Imagine the mux/demux dealing with 8-bit granularity out of a 512-bit memory bus! That would require 6 layers of 256 muxes each, and would be enough to ruin any hope of 12 ns accesses on SRAM.

wtarreau commented 5 years ago

I'm currently revisiting this research with new elements. I've looked a bit at how GDDR5 works and, just like DDR4, it suffers from long delays after closing a row or after having accessed 4 different banks. You can typically expect around 13 ns for the first access, but when you have to add 43 ns to close the row because you stay in the same bank, random reads will still be heavily impacted. I also found that a GTX 1080 Ti has 3584 cores sharing 11 memory channels; that's 325 cores per channel! Thus if we can make the workload heavily dependent on memory access time, the extra cores will only be usable to compensate for the lack of crypto extensions. Given that a regular CPU fetches a full 64 B cache line at once from memory and that GDDR5X also fetches 64 B (32 B for GDDR5), it makes sense to design a workload relying on very fast operations involving these fetched bytes, so that machines with a small CPU count per memory channel (typically 4) remain memory-bound rather than CPU-bound.

I also found it interesting that high-end GPUs have a small ratio of FP64-capable cores. I'm seeing numbers like a 1:32 ratio on the GTX 1070 and 1:8 on some AMD GPUs. This could be a good way to limit the number of cores that can perform fast operations.

In the end I'm interested in involving a lot of slow operations. This way, even if you can optimize a part for certain hardware, you don't save much. Typically I think I'll pre-fill a work area with 64 MB of data depending on the block and on 24 bits from the nonce. This can take some time because it will be done only once every 256 hashes. On a GPU, if this fill takes as much time as computing 256 hashes, then half of the crypto time cannot be parallelized across all cores: one core fills the data, then 256 can use it (or more likely the CPU prepares the data and quickly uploads it). Such writes could make use of FP64, by the way. This would force any implementation (including FPGA/ASICs) to rely on soft cores and be CPU-bound on the write time. The reads would be fast but involve serial operations that are fast on small CPUs (crc, div, rot, add, rev) and slower on huge CPUs or GPUs. I was thinking it would be nice to involve some operations like sha2 in the fast path, but some devices like the RPi sadly still lack such basic extensions (and at this point they're the only ones living in the 20th century), so we would impact such devices too strongly.

I started experimenting with new code and figured it was convenient to reuse some of your existing code from rfv2, thus I'm not sure whether it's better to fork rfv2 into rfv3 (or anything else) or to restart from scratch and just import some code parts and attribute credits. I already managed to cut the files in a more portable way. Just let me know what option you prefer.

bschn2 commented 5 years ago

Good morning Sir!

Nice work, and glad to see you back working on this! The timing you're talking about is t_FAW; it stands for "four-activate window" and denotes a mandatory delay before issuing a fifth activate command. In other words it's a sliding window in which you cannot have more than 4 activate commands, regardless of whether they target banks of the same group or not. Please have a look here for more detailed information, it's quite up to date and well explained: https://www.systemverilog.io/ddr4-basics https://www.systemverilog.io/understanding-ddr4-timing-parameters

If you manage to create a purely memory-bound design, it could make a high-end GTX 1080 Ti only 11 times faster than any low-end device such as a Raspberry Pi 3B, which would be really fair in terms of cost, power consumption, and electronic waste.

I hadn't noticed this 1:X FP64 ratio on GPUs; it's also true that I didn't really consider using floating point in my initial work. That said, when you have 300 cores per memory channel, you'd still have ~10 FP64 cores per memory channel, so I don't think you will be able to draw much benefit from this limitation.

Regarding the code, I would suggest that you don't burden yourself with the existing layout if it does not best suit your needs. Feel free to pick whatever you need there; as you know I don't really claim anything (and quite a sizable part of this code is from you, if my memory serves me right). If you find a better split that could help further experimentation, feel free to contribute it, but if it's too much of a hassle, do not spend your valuable time on this.

Kind regards, B.S.

wtarreau commented 5 years ago

On Wed, Sep 11, 2019 at 11:40:05PM -0700, bschn2 wrote:

> Please have a look there for more detailed information, it's quite up to date and well explained: https://www.systemverilog.io/ddr4-basics https://www.systemverilog.io/understanding-ddr4-timing-parameters

Thanks a lot for the links, this is exactly what I was looking for! I could only find information in vendors' datasheets, which assume the reader already knows this well :-)

> If you manage to create a purely memory-bound design, it could make a high-end GTX 1080 Ti only 11 times faster than any low-end device such as a raspberry pi 3B, that would be really fair in terms of cost, power consumption, and electronic waste.

Yes, this is the idea. I still think this GDDR5X is a bit faster than DDR4, so maybe the card will be 20 times faster. But even 20 times faster remains OK in my opinion; the break-even would be between 30 and 50 times an RPi3B. I bought an RPi4B which runs fine at 2 GHz; that's 8 GHz worth of A72 for $35, which can make it an attractive device for many people.

> I didn't notice this 1:X FP64 ratio on GPUs, it's also true that I didn't really consider using floating point in my initial work. With this said, when you have 300 cores per memory channel, you'd still have ~10 FP64 cores per memory channel so I don't think you will be able to draw any benefit from this limitation.

In fact I was seeing it differently. I think this will be one way to keep the cores quite busy, while still making the workload much easier to implement on CPU/GPU than on FPGA/ASIC. Given that the rounding behavior is well defined in C99 and IEC 559 and is configurable with fesetround(), we should avoid the issues we faced lately, while making it hard for an ASIC to implement a compliant stack (adding inverted values while keeping a valid 52-bit mantissa with no rounding error is likely quite tricky). By the way, I'm fine with still having a 0.5-bit error when doing no more than 2^20 operations, because I'd then have 32-bit integers with a 20-bit fractional part that will not bleed into the integer part unless at least 1 bit of error appears. And my first tests indicate that the result indeed changes slightly when adjusting the rounding mode with fesetround().

> Regarding the code, I would suggest that you don't burden yourself with the existing layout if it does not best suit your needs. Feel free to pick whatever you need there; as you know I don't really claim anything (and quite a sizable part of this code is from you, if my memory serves me right). If you find a better split that could help further experimentation, feel free to contribute it, but if it's too much of a hassle, do not spend your valuable time on this.

OK, will check what's best then.

Thanks! Willy

bschn2 commented 5 years ago

Thank you for all the details. I think you have a plan here.