Closed edtubbs closed 2 years ago
@patricklodder I'm testing this in an aarch64 docker container to push it along.
I'm testing this in an aarch64 docker container to push it along.
Alright. Per your comment https://github.com/dogecoin/dogecoin/pull/2620#issuecomment-956469233 I was actually taking it ez on this one. I'll figure out a way to get myself some aarch64 cloud host somewhere and help out.
Thanks! Michi ran out of memory on the ODROID last week, but I'll get an update.
Implemented ARMv8 intrinsics for SHA-1 and SHA-256. Added a configuration argument to enable support. Added an experimental build to the CI environment.
Replaces #2620
@patricklodder I had added a build variable for the native armv8.2 compiler that isn't valid for the cross compiler, so more changes are needed.
Testing on armv8 hardware is needed; tests are passing in docker container.
On average, test_dogecoin executes roughly 12 seconds faster with SHA1 and SHA-256 ARMv8 intrinsics.
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 3
Vendor ID: Qualcomm
Model: 14
Stepping: 0xd
CPU max MHz: 2841.6001
CPU min MHz: 300.0000
BogoMIPS: 38.40
L1d cache: unknown sizes
L1i cache: unknown size
L2 cache: unknown size
L3 cache: unknown size
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dc
ed@localhost:~/dogecoin/src/test$ time ./test_dogecoin
Running 253 test cases...
*** No errors detected
real 2m15.408s
user 1m35.336s
sys 1m3.254s
ed@localhost:~/dogecoin/src/test$ time ./test_dogecoin
Running 253 test cases...
*** No errors detected
real 2m11.307s
user 1m34.588s
sys 0m56.653s
ed@localhost:~/dogecoin/src/test$ time ./test_dogecoin
Running 253 test cases...
*** No errors detected
real 2m11.657s
user 1m34.207s
sys 0m57.890s
ed@localhost:~/dogecoin/src/test$ time test_dogecoin
Running 253 test cases...
*** No errors detected
real 2m1.563s
user 1m23.319s
sys 1m0.220s
ed@localhost:~/dogecoin/src/test$ time test_dogecoin
Running 253 test cases...
*** No errors detected
real 1m59.015s
user 1m21.764s
sys 0m57.242s
ed@localhost:~/dogecoin/src/test$ time test_dogecoin
Running 253 test cases...
*** No errors detected
real 1m58.234s
user 1m22.354s
sys 0m54.212s
ed@localhost:~/dogecoin/src/test
@edtubbs can you try with bench/bench_dogecoin? It shows pretty things like:
#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles
[..]
SHA1,480,0.002021752297878,0.002681218087673,0.002173943817616,4447828,5898645,4782669
SHA256,176,0.004970729351044,0.008211925625801,0.006007801402699,10935567,18066257,13217128
SHA256_32b,4,0.363700032234192,0.368765473365784,0.366232752799988,800138735,811282725,805710730
SHA512,288,0.003498315811157,0.004499971866608,0.003790136012766,7696331,9899870,8338290
[..]
bench_dogecoin without intrinsics
ed@localhost:~/dogecoin/src/bench$ ./bench_dogecoin
#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles
[..]
SHA1,384,0.002652443945408,0.002673551440239,0.002658174683650,0,0,0
SHA256,208,0.005069121718407,0.005084991455078,0.005073384596751,0,0,0
SHA256_32b,4,0.363403558731079,0.363472461700439,0.363438010215759,0,0,0
SHA512,288,0.003672942519188,0.003682434558868,0.003676412834062,0,0,0
[..]
bench_dogecoin with SHA1 and SHA-256 ARMv8 intrinsics
ed@localhost:~/dogecoin/src/bench$ ./bench_dogecoin_arm
#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles
[..]
SHA1,1280,0.000865176320076,0.000872939825058,0.000867213308811,0,0,0
SHA256,960,0.001058697700500,0.001063376665115,0.001060704141855,0,0,0
SHA256_32b,10,0.101818919181824,0.102169990539551,0.101997494697571,0,0,0
SHA512,288,0.003674030303955,0.003680281341076,0.003676753905084,0,0,0
[..]
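To make the two bench_dogecoin runs above easier to compare, here is a minimal Python sketch that divides the "average" column of the baseline by that of the intrinsics build. The helper names (`averages`, `speedups`) are made up for illustration and are not part of the dogecoin tree; the rows are copied from the output above. On these numbers it works out to roughly 3.1x for SHA1, 4.8x for SHA256, 3.6x for SHA256_32b, and no change for SHA512.

```python
import csv
import io

# Rows copied from the two bench_dogecoin runs quoted above.
BASELINE = """\
SHA1,384,0.002652443945408,0.002673551440239,0.002658174683650,0,0,0
SHA256,208,0.005069121718407,0.005084991455078,0.005073384596751,0,0,0
SHA256_32b,4,0.363403558731079,0.363472461700439,0.363438010215759,0,0,0
SHA512,288,0.003672942519188,0.003682434558868,0.003676412834062,0,0,0
"""
INTRINSICS = """\
SHA1,1280,0.000865176320076,0.000872939825058,0.000867213308811,0,0,0
SHA256,960,0.001058697700500,0.001063376665115,0.001060704141855,0,0,0
SHA256_32b,10,0.101818919181824,0.102169990539551,0.101997494697571,0,0,0
SHA512,288,0.003674030303955,0.003680281341076,0.003676753905084,0,0,0
"""


def averages(text):
    """Map benchmark name -> value of the 'average' column (5th field)."""
    return {row[0]: float(row[4]) for row in csv.reader(io.StringIO(text)) if row}


def speedups(baseline, candidate):
    """Return baseline average / candidate average for each shared benchmark."""
    base, cand = averages(baseline), averages(candidate)
    return {name: base[name] / cand[name] for name in base if name in cand}


if __name__ == "__main__":
    for name, factor in speedups(BASELINE, INTRINSICS).items():
        print(f"{name}: {factor:.2f}x")
```

SHA512 staying flat is expected here, since this build only adds SHA1 and SHA-256 intrinsics.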
Working on some stats on this, please don't close. Native build takes 4+ hours on these small ARM machines, and needs to be done 2-3 times, so the process is pretty lengthy. Will have review shortly.
Machine:
Odroid C4
Amlogic S905X3 12nm Processor (4-core ARM Cortex-A55 @ 2GHz, ArmV8.2-A)
4GiB DDR4
lscpu output:
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Vendor ID: ARM
Model: 0
Model name: Cortex-A55
Stepping: r1p0
CPU max MHz: 1908.0000
CPU min MHz: 100.0000
BogoMIPS: 48.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
==== no intrinsics build ====
Run 1:
Running 255 test cases...
*** No errors detected
real 4m11.937s
user 3m51.316s
sys 1m18.404s
Run 2:
Running 255 test cases...
*** No errors detected
real 4m12.818s
user 3m51.896s
sys 1m21.384s
Run 3:
Running 255 test cases...
*** No errors detected
real 4m12.339s
user 3m51.612s
sys 1m20.088s
==== armv8 intrinsics build ====
Run1:
Running 255 test cases...
*** No errors detected
real 3m55.323s
user 3m37.196s
sys 1m38.788s
Run2:
Running 255 test cases...
*** No errors detected
real 3m47.701s
user 3m32.784s
sys 1m24.536s
Run3:
Running 255 test cases...
*** No errors detected
real 3m52.609s
user 3m36.496s
sys 1m32.540s
==== armv8.2 intrinsics build ====
Run1:
Running 255 test cases...
*** No errors detected
real 3m47.797s
user 3m32.912s
sys 1m25.616s
Run2:
Running 255 test cases...
*** No errors detected
real 3m52.898s
user 3m36.076s
sys 1m33.088s
Run3:
Running 255 test cases...
*** No errors detected
real 3m53.211s
user 3m35.080s
sys 1m35.316s
---------------------------------------
So armv8 is around 15-17 seconds faster than no-intrinsics in user time (about 20 seconds in real time); no real change going to armv8.2 (sha512) from armv8.
More tests incoming.
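For anyone who wants to reproduce the arithmetic behind that summary, a small Python sketch follows. It assumes the `real`/`user` stamps are in the `XmY.YYYs` format printed by `time` above; the names `seconds`, `mean_times` and `RUNS` are illustrative only. With the Odroid C4 figures quoted above, it gives roughly 20 s less real time and 16 s less user time for armv8 versus no intrinsics.

```python
import re
from statistics import mean

_TIME = re.compile(r"(\d+)m([\d.]+)s")


def seconds(stamp):
    """Convert a `time`-style stamp such as '4m11.937s' to seconds."""
    m = _TIME.fullmatch(stamp)
    if m is None:
        raise ValueError(f"unrecognised time stamp: {stamp!r}")
    return int(m.group(1)) * 60 + float(m.group(2))


# (real, user) pairs copied from the three runs of each build above.
RUNS = {
    "no intrinsics": [("4m11.937s", "3m51.316s"),
                      ("4m12.818s", "3m51.896s"),
                      ("4m12.339s", "3m51.612s")],
    "armv8": [("3m55.323s", "3m37.196s"),
              ("3m47.701s", "3m32.784s"),
              ("3m52.609s", "3m36.496s")],
    "armv8.2": [("3m47.797s", "3m32.912s"),
                ("3m52.898s", "3m36.076s"),
                ("3m53.211s", "3m35.080s")],
}


def mean_times(runs):
    """Return (mean real seconds, mean user seconds) for a list of runs."""
    return (mean(seconds(r) for r, _ in runs),
            mean(seconds(u) for _, u in runs))


if __name__ == "__main__":
    base_real, base_user = mean_times(RUNS["no intrinsics"])
    for build in ("armv8", "armv8.2"):
        real, user = mean_times(RUNS[build])
        print(f"{build}: real -{base_real - real:.1f}s, user -{base_user - user:.1f}s")
```

Three runs per build is a small sample, so the averages should be read as indicative rather than precise.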
@michilumin that's great! The reason you don't see SHA512 much is that it's mostly used for seeding the wallet / BIP32 key derivation. If you want to have full benchmark stats on all the crypto functions, you can run src/bench/bench_dogecoin.
I’ve pushed changes to SHA-512 that pass tests on a native build, will post more results soon.
@patricklodder the experimental cross build in the CI environment fails, I think due to the version of the compiler. Can you recommend changes to ci.xml for g++ 8 or higher?
Can you recommend changes to ci.xml for g++ 8 or higher?
- for experimental, switch to focal - because bionic supports 4.8 and 7
That works, thanks!
- to lessen impact of a move from experimental to release - maybe we can try compiling it with the clang from depends instead of gcc? It's possibly easier to upgrade that than gcc for release.
Interesting, are you suggesting we create a new package with this code?
are you suggesting we create a new package with this code?
Only once we release with this (I think we'll need a runtime guard for that?)
Would probably need a separate gitian descriptor for aarch64. For depends, we already have clang under native_cctools because we use it for macOS builds.
I'd prefer it if we can make macOS build on focal though, because then we can use newer compilers for everything, but I've not been able to make that work without changing the minimum supported target OS 😕
Only once we release with this (I think we'll need a runtime guard for that?)
Agreed, we can read capabilities bits from the target OS
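As a rough illustration of the capability check being discussed, the Python sketch below decides which intrinsics builds a CPU could run from the flag names that lscpu prints in this thread (`sha1`, `sha2`, `sha512`). The `REQUIRED_FLAGS` mapping and the function names are assumptions made for this example, not code from the PR. A real guard inside the C++ binary would more likely query the OS directly (on Linux, for instance, via `getauxval(AT_HWCAP)`) instead of parsing text.

```python
# Hypothetical mapping from build flavour to the CPU flags it relies on;
# the flag names are the ones lscpu prints in the outputs in this thread.
REQUIRED_FLAGS = {
    "armv8": {"sha1", "sha2"},
    "armv82": {"sha1", "sha2", "sha512"},
}


def cpu_flags(cpuinfo_text):
    """Collect flags from a 'Flags:' (lscpu) or 'Features:' (/proc/cpuinfo) line."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() in ("flags", "features"):
            flags.update(value.split())
    return flags


def usable_builds(cpuinfo_text):
    """List the build flavours whose required flags are all present."""
    flags = cpu_flags(cpuinfo_text)
    return [name for name, needed in REQUIRED_FLAGS.items() if needed <= flags]


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as fh:
            text = fh.read()
    except OSError:
        text = ""
    builds = usable_builds(text)
    print("usable intrinsics builds:", builds or "none, fall back to the generic code")
```

Fed the flag lines posted in this thread, this would report only `armv8` for the Cortex-A53 and Cortex-A55 boards and both flavours for the M1, which matches the "does not appear to support armv82/sha512" notes in the results.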
Would probably need a separate gitian descriptor for aarch64. For depends, we already have clang under native_cctools because we use it for macOS builds.
I'd prefer it if we can make macOS build on focal though, because then we can use newer compilers for everything, but I've not been able to make that work without changing the minimum supported target OS 😕
I can attempt to build with clang natively
I can attempt to build with clang natively
I am sure that clang supports aarch64 as --target for cross-compile through llvm, but am not sure what version we need for it to work with crypto extensions. I have checked the included binaries from the native_cctools package, but there's currently no llc exported (probably because we use Apple's tools for macOS instead of llvm?) - so we would need to fiddle a bit with packages for this to be doable.
Note that as long as we're experimental, it doesn't matter much though, so we can solve it separately from this PR.
So, on Apple M1, with the sha1, sha2 and sha512 extensions clearly available, bench_dogecoin currently shows no difference with and without the --with-armv8-crypto switch. So I need to examine this further; something isn't kicking in, I think. Ed most recently worked with the config switches, so I'll loop back with him here and make sure that something didn't get missed.
Ok, building on my end was the issue. Fixed. Results:
M1 cpu under emulation (Parallels, Ubuntu for ARM-64 20.04)
Architecture: aarch64
CPU op-mode(s): 64-bit
Byte Order: Little Endian
CPU(s): 6
On-line CPU(s) list: 0-5
Thread(s) per core: 1
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: ARM
Model: 0
Stepping: r0p0
BogoMIPS: 48.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 asimddp sha512 asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp flagm2 frint
====Apple M1 ARM - No Intrinsics====
Run 1:
Running 253 test cases...
*** No errors detected
real 0m27.327s
user 0m29.657s
sys 0m8.270s
Run 2:
Running 253 test cases...
*** No errors detected
real 0m27.676s
user 0m29.914s
sys 0m8.724s
Run 3:
Running 253 test cases...
*** No errors detected
real 0m27.875s
user 0m30.630s
sys 0m8.607s
Bench:
#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles
SHA1,832,0.001208953559399,0.001216325908899,0.001211082992645,0,0,0
SHA256,384,0.002816252410412,0.002833746373653,0.002820481856664,0,0,0
SHA256_32b,6,0.183485984802246,0.183998942375183,0.183686653772990,0,0,0
SHA512,576,0.001792609691620,0.001800376921892,0.001796402864986,0,0,0
====Apple M1 ARM - armv8 Intrinsics====
Run 1:
Running 253 test cases...
*** No errors detected
real 0m21.332s
user 0m23.379s
sys 0m8.168s
Run 2:
Running 253 test cases...
*** No errors detected
real 0m21.578s
user 0m23.542s
sys 0m8.689s
Run 3:
Running 253 test cases...
*** No errors detected
real 0m21.201s
user 0m23.176s
sys 0m8.295s
Bench:
#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles
SHA1,2304,0.000454682856798,0.000457312911749,0.000455250032246,0,0,0
SHA256,2304,0.000478355214000,0.000480824150145,0.000479226518008,0,0,0
SHA256_32b,28,0.035967946052551,0.036407589912415,0.036169716290065,0,0,0
SHA512,576,0.001792661845684,0.001804739236832,0.001795532802741,0,0,0
====Apple M1 ARM - armv82 intrinsics====
Run 1:
Running 253 test cases...
*** No errors detected
real 0m20.623s
user 0m22.875s
sys 0m7.008s
Run 2:
Running 253 test cases...
*** No errors detected
real 0m21.324s
user 0m23.442s
sys 0m8.565s
Run 3:
Running 253 test cases...
*** No errors detected
real 0m21.070s
user 0m23.621s
sys 0m7.514s
Bench:
#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles
SHA1,2304,0.000454593449831,0.000457273796201,0.000455633633667,0,0,0
SHA256,2304,0.000478317029774,0.000482515431941,0.000479137628443,0,0,0
SHA256_32b,28,0.036400437355042,0.036925077438354,0.036725819110870,0,0,0
SHA512,1408,0.000725327059627,0.000730296596885,0.000726992095059,0,0,0
Not insignificant: roughly 6-7 seconds saved per run with intrinsics versus without. Not much difference in the test run times with armv8.2/sha512, but bench shows it certainly is working and improving performance (the SHA512 average drops from about 0.0018 to about 0.0007).
FriendlyElec NanoPi NEO2 (512 MB RAM, 1 GHz Cortex-A53):
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Vendor ID: ARM
Model: 4
Model name: Cortex-A53
Stepping: r0p4
CPU max MHz: 1008.0000
CPU min MHz: 480.0000
BogoMIPS: 48.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
====NanoPi NEO2, Cortex A53, No Intrinsics====
Run 1:
Running 253 test cases...
*** No errors detected
real 6m43.300s
user 6m43.994s
sys 1m25.625s
Run 2:
Running 253 test cases...
*** No errors detected
real 6m41.898s
user 6m42.436s
sys 1m24.858s
Run 3:
Running 253 test cases...
*** No errors detected
real 6m43.464s
user 6m42.865s
sys 1m28.414s
Bench:
#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles
SHA1,120,0.008695006370544,0.008890479803085,0.008735750118891,0,0,0
SHA256,64,0.016333758831024,0.016511976718903,0.016383234411478,0,0,0
SHA256_32b,2,1.302595496177673,1.302595496177673,1.302595496177673,0,0,0
SHA512,96,0.010611474514008,0.010874986648560,0.010642292598883,0,0,0
====NanoPi NEO2, Cortex A53, armv8 intrinsics====
Run 1:
Running 253 test cases...
*** No errors detected
real 6m6.121s
user 6m5.458s
sys 1m19.750s
Run 2:
Running 253 test cases...
*** No errors detected
real 6m4.145s
user 6m3.978s
sys 1m20.102s
Run 3:
Running 253 test cases...
*** No errors detected
real 6m7.309s
user 6m6.720s
sys 1m20.155s
Bench:
#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles
SHA1,384,0.002659626305103,0.002908438444138,0.002721820647518,0,0,0
SHA256,352,0.003035306930542,0.003156565129757,0.003112667663531,0,0,0
SHA256_32b,4,0.452487587928772,0.453146934509277,0.452817261219025,0,0,0
SHA512,96,0.010622859001160,0.010808378458023,0.010716758668423,0,0,0
====NanoPi NEO2, Cortex A53 : Does not appear to support armv82/sha512====
Intrinsics versus no intrinsics makes around a 37-second difference in the test run.
Hardkernel ODroid C4 (4096 MB RAM, 2 GHz Cortex-A55):
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Vendor ID: ARM
Model: 0
Model name: Cortex-A55
Stepping: r1p0
CPU max MHz: 1908.0000
CPU min MHz: 100.0000
BogoMIPS: 48.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp
====ODroid C4, Cortex A55, No Intrinsics====
Run 1:
Running 253 test cases...
*** No errors detected
real 3m53.445s
user 3m34.588s
sys 1m32.692s
Run 2:
Running 253 test cases...
*** No errors detected
real 3m51.352s
user 3m35.144s
sys 1m31.816s
Run 3:
Running 253 test cases...
*** No errors detected
real 3m55.525s
user 3m37.208s
sys 1m36.500s
Bench:
#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles
SHA1,224,0.004558563232422,0.004636228084564,0.004577254610402,0,0,0
SHA256,120,0.008610486984253,0.008799910545349,0.008637017011642,0,0,0
SHA256_32b,2,0.665253520011902,0.665253520011902,0.665253520011902,0,0,0
SHA512,192,0.005632936954498,0.005719125270844,0.005649875849485,0,0,0
====ODroid C4, Cortex A55, armv8 intrinsics====
Run 1:
Running 253 test cases...
*** No errors detected
real 3m34.343s
user 3m18.140s
sys 1m31.160s
Run 2:
Running 253 test cases...
*** No errors detected
real 3m29.497s
user 3m12.600s
sys 1m32.828s
Run 3:
Running 253 test cases...
*** No errors detected
real 3m35.300s
user 3m13.636s
sys 1m31.332s
Bench:
#Benchmark,count,min,max,average,min_cycles,max_cycles,average_cycles
SHA1,768,0.001343995332718,0.001371115446091,0.001347934827209,0,0,0
SHA256,640,0.001578316092491,0.001609813421965,0.001587722077966,0,0,0
SHA256_32b,6,0.217849493026733,0.218177437782288,0.218026638031006,0,0,0
SHA512,192,0.005609989166260,0.005717694759369,0.005649911860625,0,0,0
====ODroid C4, Cortex A55 : Does not appear to support armv82/sha512====
around 20-25 seconds difference, with and without intrinsics.
Could use a squash on the commits because not all of them are functional, and I have 1 question inline.
I can do that, into a single commit?
single commit?
I'd recommend to at least squash the last 3 commits. The rest I think we can live with - I think it's too much a hassle to declutter the others.
@michilumin did you want to re-ack per @edtubbs request?
Merging, contingent on future work re: Patrick's comments. I have also seen that the builds core dump when run on CPUs without the needed instructions; hopefully in the future they can "fail gracefully and notify", or notify and bypass. For experimental, though, as Pat said, all good since it needs to be explicitly built. 👍
@michilumin did you want to re-ack per @edtubbs request?
Looks like I was merging while you were typing. All good and of course, ACK, including your recommendations, Pat.
Implemented ARMv8 intrinsics for SHA-1 and SHA-256. Added configuration argument to enable support. Added experimental build to CI environment.
Replaces https://github.com/dogecoin/dogecoin/pull/2620