dogecoin / dogecoin


[feat] Added ARMv8 SHA support #2687

Closed edtubbs closed 2 years ago

edtubbs commented 2 years ago

  • Implemented ARMv8 intrinsics for SHA-1 and SHA-256
  • Added configuration argument to enable support
  • Added experimental build to CI environment

Replaces https://github.com/dogecoin/dogecoin/pull/2620

edtubbs commented 2 years ago

@patricklodder I'm testing this in an aarch64 docker container to push it along.

patricklodder commented 2 years ago

> I'm testing this in an aarch64 docker container to push it along.

Alright. Per your comment https://github.com/dogecoin/dogecoin/pull/2620#issuecomment-956469233 I was actually taking it ez on this one. I'll figure out a way to get myself some aarch64 cloud host somewhere and help out.

edtubbs commented 2 years ago

> > I'm testing this in an aarch64 docker container to push it along.
>
> Alright. Per your comment #2620 (comment) I was actually taking it ez on this one. I'll figure out a way to get myself some aarch64 cloud host somewhere and help out.

Thanks! Michi ran out of memory on the ODROID last week, but I'll get an update.

mohammedabdualhammed commented 2 years ago

Implemented ARMv8 intrinsics for SHA-1 and SHA-256; added configuration argument to enable support; added experimental build to CI environment

Replaces #2620

edtubbs commented 2 years ago

@patricklodder I had added a build variable for the native armv8.2 compiler that isn't valid for the cross compiler; more changes are needed.

edtubbs commented 2 years ago

Testing on armv8 hardware is needed; tests are passing in docker container.

edtubbs commented 2 years ago

Initial target results

Samsung Galaxy S10

UserLAnd (Ubuntu/aarch64-linux-gnu)

On average, test_dogecoin executes roughly 12 seconds faster with the SHA-1 and SHA-256 ARMv8 intrinsics.

Architecture:        aarch64
Byte Order:          Little Endian                    
CPU(s):              8                                
On-line CPU(s) list: 0-7                              
Thread(s) per core:  1                                
Core(s) per socket:  2                                
Socket(s):           3
Vendor ID:           Qualcomm                         
Model:               14                               
Stepping:            0xd                              
CPU max MHz:         2841.6001                        
CPU min MHz:         300.0000                         
BogoMIPS:            38.40                            
L1d cache:           unknown sizes                     
L1i cache:           unknown size                     
L2 cache:            unknown size
L3 cache:            unknown size                     
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dc

ed@localhost:~/dogecoin/src/test$ time ./test_dogecoin
Running 253 test cases...

*** No errors detected

real    2m15.408s
user    1m35.336s
sys     1m3.254s
ed@localhost:~/dogecoin/src/test$ time ./test_dogecoin
Running 253 test cases...

*** No errors detected

real    2m11.307s
user    1m34.588s
sys     0m56.653s
ed@localhost:~/dogecoin/src/test$ time ./test_dogecoin
Running 253 test cases...

*** No errors detected

real    2m11.657s
user    1m34.207s
sys     0m57.890s
ed@localhost:~/dogecoin/src/test$ time test_dogecoin  
Running 253 test cases...

*** No errors detected

real    2m1.563s
user    1m23.319s
sys     1m0.220s
ed@localhost:~/dogecoin/src/test$ time test_dogecoin
Running 253 test cases...                                                                                   

*** No errors detected

real    1m59.015s                                     
user    1m21.764s                                     
sys     0m57.242s                                     
ed@localhost:~/dogecoin/src/test$ time test_dogecoin  
Running 253 test cases...

*** No errors detected

real    1m58.234s                                     
user    1m22.354s
sys     0m54.212s                                     
ed@localhost:~/dogecoin/src/test
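
The six timed runs above can be averaged with a short script. This is only a sketch in Python: it assumes the first three runs (`./test_dogecoin`) are the build without intrinsics and the last three (`test_dogecoin`) the build with them, which the output itself does not label, and `times` is a hypothetical helper rather than anything in the repository.

```python
import re
from statistics import mean


def times(transcript: str, field: str = "real") -> list[float]:
    """Return every `field` time (real, user or sys) in seconds from `time` output."""
    pattern = re.compile(rf"^{field}[ \t]+(\d+)m([\d.]+)s[ \t]*$", re.MULTILINE)
    return [int(m) * 60 + float(s) for m, s in pattern.findall(transcript)]


# Figures copied from the runs above. Assumption: the first block is the build
# without intrinsics, the second the build with them.
WITHOUT = """
real    2m15.408s
user    1m35.336s
real    2m11.307s
user    1m34.588s
real    2m11.657s
user    1m34.207s
"""
WITH = """
real    2m1.563s
user    1m23.319s
real    1m59.015s
user    1m21.764s
real    1m58.234s
user    1m22.354s
"""

saved_real = mean(times(WITHOUT, "real")) - mean(times(WITH, "real"))
saved_user = mean(times(WITHOUT, "user")) - mean(times(WITH, "user"))
print(f"real: {saved_real:.1f}s saved, user: {saved_user:.1f}s saved")
```

Under that assumption the gap comes out at about 13.2 s of wall-clock time and about 12.2 s of user time, which is in line with the "roughly 12 seconds" figure.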

patricklodder commented 2 years ago

@edtubbs can you try with bench/bench_dogecoin? It shows pretty things like:

#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles
[..]
SHA1,480,0.002021752297878,0.002681218087673,0.002173943817616,4447828,5898645,4782669
SHA256,176,0.004970729351044,0.008211925625801,0.006007801402699,10935567,18066257,13217128
SHA256_32b,4,0.363700032234192,0.368765473365784,0.366232752799988,800138735,811282725,805710730
SHA512,288,0.003498315811157,0.004499971866608,0.003790136012766,7696331,9899870,8338290
[..]

edtubbs commented 2 years ago

> @edtubbs can you try with bench/bench_dogecoin?

bench_dogecoin without intrinsics

ed@localhost:~/dogecoin/src/bench$ ./bench_dogecoin
#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles
[..]
SHA1,384,0.002652443945408,0.002673551440239,0.002658174683650,0,0,0
SHA256,208,0.005069121718407,0.005084991455078,0.005073384596751,0,0,0
SHA256_32b,4,0.363403558731079,0.363472461700439,0.363438010215759,0,0,0
SHA512,288,0.003672942519188,0.003682434558868,0.003676412834062,0,0,0
[..]

bench_dogecoin with SHA1 and SHA-256 ARMv8 intrinsics

ed@localhost:~/dogecoin/src/bench$ ./bench_dogecoin_arm
#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles
[..]
SHA1,1280,0.000865176320076,0.000872939825058,0.000867213308811,0,0,0
SHA256,960,0.001058697700500,0.001063376665115,0.001060704141855,0,0,0
SHA256_32b,10,0.101818919181824,0.102169990539551,0.101997494697571,0,0,0
SHA512,288,0.003674030303955,0.003680281341076,0.003676753905084,0,0,0
[..]
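
To turn the two bench outputs above into speedup factors, the `average` column can be divided row by row. A minimal sketch in Python, assuming the CSV layout shown in the header line (`average` is the fifth column); `averages` is a hypothetical helper, not part of bench_dogecoin.

```python
import csv
import io


def averages(bench_csv: str) -> dict[str, float]:
    """Map benchmark name -> average seconds from bench_dogecoin CSV output."""
    result = {}
    for row in csv.reader(io.StringIO(bench_csv)):
        # Skip the '#Benchmark,...' header, '[..]' elision markers and blank lines.
        if len(row) < 5 or row[0].startswith("#"):
            continue
        result[row[0]] = float(row[4])
    return result


# Rows copied from the two runs above.
GENERIC = """#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles
[..]
SHA1,384,0.002652443945408,0.002673551440239,0.002658174683650,0,0,0
SHA256,208,0.005069121718407,0.005084991455078,0.005073384596751,0,0,0
SHA256_32b,4,0.363403558731079,0.363472461700439,0.363438010215759,0,0,0
SHA512,288,0.003672942519188,0.003682434558868,0.003676412834062,0,0,0
[..]
"""
ARMV8 = """#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles
[..]
SHA1,1280,0.000865176320076,0.000872939825058,0.000867213308811,0,0,0
SHA256,960,0.001058697700500,0.001063376665115,0.001060704141855,0,0,0
SHA256_32b,10,0.101818919181824,0.102169990539551,0.101997494697571,0,0,0
SHA512,288,0.003674030303955,0.003680281341076,0.003676753905084,0,0,0
[..]
"""

generic, armv8 = averages(GENERIC), averages(ARMV8)
speedup = {name: generic[name] / armv8[name] for name in generic if name in armv8}
for name, factor in speedup.items():
    print(f"{name}: {factor:.2f}x")
```

On these numbers SHA1 is about 3.1x faster, SHA256 about 4.8x, SHA256_32b about 3.6x, and SHA512 is unchanged, as expected since this build only covers SHA-1 and SHA-256.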

michilumin commented 2 years ago

Working on some stats on this, please don't close. Native build takes 4+ hours on these small ARM machines, and needs to be done 2-3 times, so the process is pretty lengthy. Will have review shortly.

michilumin commented 2 years ago

Machine:
Odroid C4
Amlogic S905X3 12nm Processor  (4-core ARM Cortex-A55 @ 2GHz, ArmV8.2-A)
4GiB DDR4

lscpu output:

Architecture:        aarch64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
Vendor ID:           ARM
Model:               0
Model name:          Cortex-A55
Stepping:            r1p0
CPU max MHz:         1908.0000
CPU min MHz:         100.0000
BogoMIPS:            48.00
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp

==== no intrinsics build ====

Run 1:
Running 255 test cases...

*** No errors detected

real    4m11.937s
user    3m51.316s
sys     1m18.404s

Run 2:
Running 255 test cases...

*** No errors detected

real    4m12.818s
user    3m51.896s
sys     1m21.384s

Run 3:
Running 255 test cases...

*** No errors detected

real    4m12.339s
user    3m51.612s
sys     1m20.088s

==== armv8 intrinsics build ====

Run1:
Running 255 test cases...

*** No errors detected

real    3m55.323s
user    3m37.196s
sys     1m38.788s

Run2:
Running 255 test cases...

*** No errors detected

real    3m47.701s
user    3m32.784s
sys     1m24.536s

Run3:
Running 255 test cases...

*** No errors detected

real    3m52.609s
user    3m36.496s
sys     1m32.540s

==== armv8.2 intrinsics build ====

Run1:
Running 255 test cases...

*** No errors detected

real    3m47.797s
user    3m32.912s
sys     1m25.616s

Run2:
Running 255 test cases...

*** No errors detected

real    3m52.898s
user    3m36.076s
sys     1m33.088s

Run3:
Running 255 test cases...

*** No errors detected

real    3m53.211s
user    3m35.080s
sys     1m35.316s

---------------------------------------

So armv8 is around 15-17 seconds faster than no-intrinsics; there is no real change going from armv8 to armv8.2 (sha512).

More tests incoming.

patricklodder commented 2 years ago

@michilumin that's great! The reason SHA512 makes little difference here is that it's mostly used for seeding the wallet / BIP32 key derivation. If you want full benchmark stats on all the crypto functions, you can run src/bench/bench_dogecoin.

edtubbs commented 2 years ago

I’ve pushed changes to SHA-512 that pass tests on a native build; I will post more results soon.

@patricklodder the experimental cross build in the CI environment fails, I think due to the version of the compiler. Can you recommend changes to ci.xml for g++ 8 or higher?

patricklodder commented 2 years ago

> Can you recommend changes to ci.xml for g++ 8 or higher?

edtubbs commented 2 years ago

> • for experimental, switch to focal - because bionic supports 4.8 and 7

That works, thanks!

> • to lessen impact of a move from experimental to release - maybe we can try compiling it with the clang from depends instead of gcc? It's possibly easier to upgrade that than gcc for release.

Interesting, are you suggesting we create a new package with this code?

patricklodder commented 2 years ago

> are you suggesting we create a new package with this code?

Only once we release with this (I think we'll need a runtime guard for that?)

Would probably need a separate gitian descriptor for aarch64. For depends, we already have clang under native_cctools because we use it for macOS builds.

I'd prefer it if we can make macOS build on focal though, because then we can use newer compilers for everything, but I've not been able to make that work without changing the minimum supported target OS 😕

edtubbs commented 2 years ago

> Only once we release with this (I think we'll need a runtime guard for that?)

Agreed, we can read the capability bits from the target OS.
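
As an illustration of what reading the capability bits could look like on Linux/aarch64, here is a sketch in Python. The bit positions are the `HWCAP_SHA1`, `HWCAP_SHA2` and `HWCAP_SHA512` values from the Linux arm64 `asm/hwcap.h` header, and the word is fetched with the libc `getauxval(AT_HWCAP)` call. The helper names are hypothetical, and nothing here is taken from the PR's code; a C++ guard would make the same `getauxval` call directly.

```python
import ctypes
import platform
import sys

AT_HWCAP = 16           # auxiliary-vector key for the hardware capability word
HWCAP_SHA1 = 1 << 5     # Linux arm64 asm/hwcap.h
HWCAP_SHA2 = 1 << 6
HWCAP_SHA512 = 1 << 21


def read_hwcap():
    """Return the AT_HWCAP word on Linux/aarch64, or None on any other platform."""
    if not sys.platform.startswith("linux") or platform.machine() != "aarch64":
        return None
    libc = ctypes.CDLL(None)
    libc.getauxval.restype = ctypes.c_ulong
    libc.getauxval.argtypes = [ctypes.c_ulong]
    return libc.getauxval(AT_HWCAP)


def sha_capabilities(hwcap: int) -> dict[str, bool]:
    """Decode the SHA-related bits of an arm64 HWCAP word."""
    return {
        "sha1": bool(hwcap & HWCAP_SHA1),
        "sha2": bool(hwcap & HWCAP_SHA2),
        "sha512": bool(hwcap & HWCAP_SHA512),
    }


hwcap = read_hwcap()
if hwcap is None:
    # Not on Linux/aarch64: fall back to an example word with bits 0-7 set,
    # i.e. fp asimd evtstrm aes pmull sha1 sha2 crc32.
    hwcap = 0xFF
print(sha_capabilities(hwcap))
```

A runtime guard would select the intrinsics path only when the matching bit is set and fall back to the generic implementation otherwise.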

> Would probably need a separate gitian descriptor for aarch64. For depends, we already have clang under native_cctools because we use it for macOS builds.
>
> I'd prefer it if we can make macOS build on focal though, because then we can use newer compilers for everything, but I've not been able to make that work without changing the minimum supported target OS 😕

I can attempt to build with clang natively.

patricklodder commented 2 years ago

> I can attempt to build with clang natively

I am sure that clang supports aarch64 as --target for cross-compile through llvm, but am not sure what version we need for it to work with crypto extensions. I have checked the included binaries from the native_cctools package, but there's currently no llc exported (probably because we use Apple's tools for macOS instead of llvm?) - so we would need to fiddle a bit with packages for this to be doable.

Note that as long as we're experimental, it doesn't matter much though, so we can solve it separately from this PR.

michilumin commented 2 years ago

On Apple M1, with the sha1, sha2 and sha512 extensions clearly available, bench_dogecoin currently shows no difference with and without the --with-armv8-crypto switch. I need to examine this further; something isn't kicking in, I think. Ed most recently worked with the config switches, so I'll loop back with him and make sure nothing got missed.

michilumin commented 2 years ago

Ok, building on my end was the issue. Fixed. Results:

M1 cpu under emulation (Parallels, Ubuntu for ARM-64 20.04)

Architecture:                    aarch64
CPU op-mode(s):                  64-bit
Byte Order:                      Little Endian
CPU(s):                          6
On-line CPU(s) list:             0-5
Thread(s) per core:              1
Core(s) per socket:              6
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       ARM
Model:                           0
Stepping:                        r0p0
BogoMIPS:                        48.00
Flags:                           fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp flagm2 frint

====Apple M1 ARM - No Intrinsics====

Run 1:
Running 253 test cases...

*** No errors detected

real    0m27.327s
user    0m29.657s
sys     0m8.270s

Run 2:
Running 253 test cases...

*** No errors detected

real    0m27.676s
user    0m29.914s
sys     0m8.724s

Run 3:
Running 253 test cases...

*** No errors detected

real    0m27.875s
user    0m30.630s
sys     0m8.607s

Bench:

#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles

SHA1,832,0.001208953559399,0.001216325908899,0.001211082992645,0,0,0
SHA256,384,0.002816252410412,0.002833746373653,0.002820481856664,0,0,0
SHA256_32b,6,0.183485984802246,0.183998942375183,0.183686653772990,0,0,0
SHA512,576,0.001792609691620,0.001800376921892,0.001796402864986,0,0,0

====Apple M1 ARM - armv8 Intrinsics====

Run 1:
Running 253 test cases...

*** No errors detected

real    0m21.332s
user    0m23.379s
sys     0m8.168s

Run 2:
Running 253 test cases...

*** No errors detected

real    0m21.578s
user    0m23.542s
sys     0m8.689s

Run 3:
Running 253 test cases...

*** No errors detected

real    0m21.201s
user    0m23.176s
sys     0m8.295s

Bench:

#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles

SHA1,2304,0.000454682856798,0.000457312911749,0.000455250032246,0,0,0
SHA256,2304,0.000478355214000,0.000480824150145,0.000479226518008,0,0,0
SHA256_32b,28,0.035967946052551,0.036407589912415,0.036169716290065,0,0,0
SHA512,576,0.001792661845684,0.001804739236832,0.001795532802741,0,0,0

====Apple M1 ARM - armv82 intrinsics====

Run 1:
Running 253 test cases...

*** No errors detected

real    0m20.623s
user    0m22.875s
sys     0m7.008s

Run 2:
Running 253 test cases...

*** No errors detected

real    0m21.324s
user    0m23.442s
sys     0m8.565s

Run 3:
Running 253 test cases...

*** No errors detected

real    0m21.070s
user    0m23.621s
sys     0m7.514s

Bench:

#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles

SHA1,2304,0.000454593449831,0.000457273796201,0.000455633633667,0,0,0
SHA256,2304,0.000478317029774,0.000482515431941,0.000479137628443,0,0,0
SHA256_32b,28,0.036400437355042,0.036925077438354,0.036725819110870,0,0,0
SHA512,1408,0.000725327059627,0.000730296596885,0.000726992095059,0,0,0

Not insignificant: roughly 6-7 seconds faster per run with intrinsics than without. There is not much difference in the test run times between armv8 and armv8.2/sha512, but bench shows the SHA512 path certainly is working and improving performance.

michilumin commented 2 years ago

FriendlyElec NanoPi NEO2 (512 MB RAM, 1 GHz Cortex A53):

Architecture:        aarch64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
Vendor ID:           ARM
Model:               4
Model name:          Cortex-A53
Stepping:            r0p4
CPU max MHz:         1008.0000
CPU min MHz:         480.0000
BogoMIPS:            48.00
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid

====NanoPi NEO2, Cortex A53, No Intrinsics====

Run 1:
Running 253 test cases...

*** No errors detected

real    6m43.300s
user    6m43.994s
sys     1m25.625s

Run 2:
Running 253 test cases...

*** No errors detected

real    6m41.898s
user    6m42.436s
sys     1m24.858s

Run 3:
Running 253 test cases...

*** No errors detected

real    6m43.464s
user    6m42.865s
sys     1m28.414s

Bench:

#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles

SHA1,120,0.008695006370544,0.008890479803085,0.008735750118891,0,0,0
SHA256,64,0.016333758831024,0.016511976718903,0.016383234411478,0,0,0
SHA256_32b,2,1.302595496177673,1.302595496177673,1.302595496177673,0,0,0
SHA512,96,0.010611474514008,0.010874986648560,0.010642292598883,0,0,0

====NanoPi NEO2, Cortex A53, armv8 intrinsics====

Run 1:
Running 253 test cases...

*** No errors detected

real    6m6.121s
user    6m5.458s
sys     1m19.750s

Run 2:
Running 253 test cases...

*** No errors detected

real    6m4.145s
user    6m3.978s
sys     1m20.102s

Run 3:
Running 253 test cases...

*** No errors detected

real    6m7.309s
user    6m6.720s
sys     1m20.155s

Bench:

#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles

SHA1,384,0.002659626305103,0.002908438444138,0.002721820647518,0,0,0
SHA256,352,0.003035306930542,0.003156565129757,0.003112667663531,0,0,0
SHA256_32b,4,0.452487587928772,0.453146934509277,0.452817261219025,0,0,0
SHA512,96,0.010622859001160,0.010808378458023,0.010716758668423,0,0,0

====NanoPi NEO2, Cortex A53 : Does not appear to support armv82/sha512====
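
The missing support can be confirmed from the lscpu Flags line, which lists `sha1` and `sha2` but no `sha512` on this board. A small Python sketch of that check follows; it assumes the armv8 build needs `sha1` and `sha2` and the armv82 build additionally needs `sha512`, and both helper names and the build labels are hypothetical.

```python
def cpu_flags(lscpu_text: str) -> set[str]:
    """Return the flag names from the 'Flags:' line of lscpu output."""
    for line in lscpu_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Flags":
            return set(value.split())
    return set()


def supported_builds(flags: set[str]) -> list[str]:
    """List the build variants the CPU should be able to run (assumed mapping)."""
    builds = ["no intrinsics"]
    if {"sha1", "sha2"} <= flags:
        builds.append("armv8")
        if "sha512" in flags:
            builds.append("armv8.2")
    return builds


# Flags lines copied from the lscpu outputs in this thread (M1 line unwrapped).
NANOPI = "Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid"
M1 = (
    "Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp "
    "cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 asimdfhm dit "
    "uscat ilrcpc flagm ssbs sb paca pacg dcpodp flagm2 frint"
)

print("NanoPi NEO2:", supported_builds(cpu_flags(NANOPI)))
print("Apple M1:   ", supported_builds(cpu_flags(M1)))
```

For the NanoPi NEO2 this lists only the no-intrinsics and armv8 builds, while the M1 flags also allow the armv8.2 one.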

Intrinsics versus no intrinsics makes around a 37-second difference in the test run.

michilumin commented 2 years ago

Hardkernel ODroid C4 (4096 MB RAM, 2 GHz Cortex A55):

Architecture:        aarch64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              4
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  4
Socket(s):           1
Vendor ID:           ARM
Model:               0
Model name:          Cortex-A55
Stepping:            r1p0
CPU max MHz:         1908.0000
CPU min MHz:         100.0000
BogoMIPS:            48.00
Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp

====ODroid C4, Cortex A55, No Intrinsics====

Run 1:
Running 253 test cases...

*** No errors detected

real    3m53.445s
user    3m34.588s
sys     1m32.692s

Run 2:
Running 253 test cases...

*** No errors detected

real    3m51.352s
user    3m35.144s
sys     1m31.816s

Run 3:
Running 253 test cases...

*** No errors detected

real    3m55.525s
user    3m37.208s
sys     1m36.500s

Bench:

#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles

SHA1,224,0.004558563232422,0.004636228084564,0.004577254610402,0,0,0
SHA256,120,0.008610486984253,0.008799910545349,0.008637017011642,0,0,0
SHA256_32b,2,0.665253520011902,0.665253520011902,0.665253520011902,0,0,0
SHA512,192,0.005632936954498,0.005719125270844,0.005649875849485,0,0,0

====ODroid C4, Cortex A55, armv8 intrinsics====

Run 1:
Running 253 test cases...

*** No errors detected

real    3m34.343s
user    3m18.140s
sys     1m31.160s

Run 2:
Running 253 test cases...

*** No errors detected

real    3m29.497s
user    3m12.600s
sys     1m32.828s

Run 3:
Running 253 test cases...

*** No errors detected

real    3m35.300s
user    3m13.636s
sys     1m31.332s

Bench:

#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles

SHA1,768,0.001343995332718,0.001371115446091,0.001347934827209,0,0,0
SHA256,640,0.001578316092491,0.001609813421965,0.001587722077966,0,0,0
SHA256_32b,6,0.217849493026733,0.218177437782288,0.218026638031006,0,0,0
SHA512,192,0.005609989166260,0.005717694759369,0.005649911860625,0,0,0

====ODroid C4, Cortex A55 : Does not appear to support armv82/sha512====

Around 20-25 seconds of difference between the builds with and without intrinsics.

edtubbs commented 2 years ago

> Could use a squash on the commits because not all of them are functional, and I have 1 question inline.

I can do that. Into a single commit?

patricklodder commented 2 years ago

> single commit?

I'd recommend at least squashing the last 3 commits. The rest I think we can live with; I think it's too much of a hassle to declutter the others.

patricklodder commented 2 years ago

@michilumin did you want to re-ack per @edtubbs request?

michilumin commented 2 years ago

Merging, contingent on future work re Patrick's comments. I have also seen that the builds core dump when run on CPUs without the needed instructions; hopefully in the future they can fail gracefully and notify, or notify and bypass. For experimental though, as Pat said, all good since it needs to be explicitly built. 👍

> @michilumin did you want to re-ack per @edtubbs request?

Looks like I was merging while you were typing. All good and of course, ACK, including your recommendations, Pat.