crypto: uniform runtime capability detection

patricklodder commented 1 month ago

Because we want to check for hardware capabilities at runtime among a set of current experimental features when they mature, it's good to have a uniform, central facility that checks capabilities. Detection code is riddled with macros making it hard to review/debug, because for each architecture/compiler there can be quirks.

This PR does 3 things:

It backports compat/cpuid.h from Bitcoin Core, to provide a generic, low level interface for reading out x86 CPU properties. This helps us in the future outside of crypto algorithms too, because we'll need it for RNG enhancements; see for example this finding on a proposed RDSEED implementation.
It implements a detection function in crypto/hwcap.*, where all capabilities can be tested (once), and decoupled from implementation selectors. It can be easily extended by enumerating new booleans into the HardwareCapabilities struct, which can then be detected in the DetectHWCapabilities() function. The values in this struct can also be overridden in init.cpp in the future, for example: there are plenty products where you would not want to run RDSEED or AVX2 on because either the implementation is terribly slow or has serious security flaws. This construct allows us to create a facility that turns those off using a startup argument that overrides the detection. This would allow users to simply run a released binary but runtime override what features to use, sparing them the need for compile expertise.
Implementation on the scrypt-sse2 experimental code, which is now a lot cleaner. I've slightly refactored it to have namespaces, like the Bitcoin sha256 and the sha512 proposal from #3188 has.

Testing

I think this needs a couple of specific tests that we don't have a CI for:

Compiling with clang
Compiling with MSVC
Compiling with newer mingw
Sync at least testnet with the experimental scrypt code, and once without it, because scrypt is of course part of consensus code.

victorsk2019 commented 1 month ago

Hi,

I ran bench_dogecoin for this PR and original 1.15.0-dev branch on my Dell XPS Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz system without explicitly enabling scrypt-sse2 support.

For this PR I've got for scrypt: Scrypt,2048,0.000507116317749,0.000533625483513,0.000520408619195,1119821,1178271,1149121

For original 1.15.0-dev benchmark run I've got: Scrypt,4096,0.000255910679698,0.000260507687926,0.000257457257248,565081,575221,568493

Judging by the results (and skimming over the code), my understanding is that this PR detects and automatically sets SSE2 CPU features if they are supported by hardware? This is really awesome! Does this mean sse2 and avx2 features can be regarded as standard and not that much "experimental" features anymore? The reason why I ask is in my net-p2p/dogecoin-qt I make it a requirement to specify experimental use flag as part of avx2 and scrypt-sse2 selection options (to model our current requirement for these features.

However, if CPU supports such features as avx2 and/or sse2, it is more preferable in Gentoo ecosystem to automatically detect and set CPU features during package installation without explicitly specifying them with USE flag (something similar to what this PR aims at doing), provided the user correctly specifies them for CPU_FLAGS_X86 variable in make.conf. Does this mean I can modify net-p2p/dogecoin-qt ebuild to automatically use sse2 and avx2 during package installation without experimental requirement?

Thank you, Victor.

patricklodder commented 1 month ago

Thank you.

without explicitly enabling scrypt-sse2 support

This PR does not change the experimental-ness of scrypt-sse2 and you should still pass --enable-experimental --enable-scrypt-sse2

For this PR I've got [..]

Those are odd results, could you please attach your config.log for both versions?

Judging by the results (and skimming over the code), my understanding is that this PR detects and automatically sets SSE2 CPU features if they are supported by hardware?

It already did that before - I just made the method available generically.

However, if CPU supports such features as avx2 and/or sse2, it is more preferable in Gentoo ecosystem to automatically detect and set CPU features during package installation

We can support that once we take this out of experimental. However, the results you posted are not consistent with what I would expect!

victorsk2019 commented 1 month ago

Hi,

Thank you for clarifying. Attaching both config.log files as requested. config_3512.log config_orig.log

patricklodder commented 1 month ago

I reviewed the logs... your result is a 2x worse performance when compiling without SSE2 enabled. I will try to reproduce these results, because that would be an unintended effect.

victorsk2019 commented 1 month ago

Hi,

Just a quick note that I also ran test on another Linux installation in VirtualBox and got following compilation error: sync.cpp:60:14: error: ‘set’ in namespace ‘std’ does not name a template type 60 | typedef std::set<std::pair<void*, void*> > InvLockOrders; | ^~~ sync.cpp:14:1: note: ‘std::set’ is defined in header ‘<set>’; did you forget to ‘#include <set>’? 13 | #include <boost/thread.hpp> +++ |+#include <set> 14 | sync.cpp:72:5: error: ‘InvLockOrders’ does not name a type; did you mean ‘LockOrders’? 72 | InvLockOrders invlockorders; | ^~~~~~~~~~~~~ | LockOrders sync.cpp: In function ‘void push_lock(void*, const CLockLocation&, bool)’: sync.cpp:123:18: error: ‘struct LockData’ has no member named ‘invlockorders’; did you mean ‘lockorders’? 123 | lockdata.invlockorders.insert(p2); | ^~~~~~~~~~~~~ | lockorders sync.cpp: In function ‘void DeleteLock(void*)’: sync.cpp:172:18: error: ‘struct LockData’ has no member named ‘invlockorders’; did you mean ‘lockorders’? 172 | lockdata.invlockorders.erase(invitem); | ^~~~~~~~~~~~~ | lockorders sync.cpp:175:5: error: ‘InvLockOrders’ has not been declared 175 | InvLockOrders::iterator invit = lockdata.invlockorders.lower_bound(item); | ^~~~~~~~~~~~~ sync.cpp:176:12: error: ‘invit’ was not declared in this scope; did you mean ‘int’? 176 | while (invit != lockdata.invlockorders.end() && invit->first == cs) { | ^~~~~ | int sync.cpp:176:30: error: ‘struct LockData’ has no member named ‘invlockorders’; did you mean ‘lockorders’? 176 | while (invit != lockdata.invlockorders.end() && invit->first == cs) { | ^~~~~~~~~~~~~ | lockorders sync.cpp:179:18: error: ‘struct LockData’ has no member named ‘invlockorders’; did you mean ‘lockorders’? 179 | lockdata.invlockorders.erase(invit++); | ^~~~~~~~~~~~~ | lockorders make[2]: *** [Makefile:6600: libdogecoin_util_a-sync.o] Error 1 make[2]: Leaving directory '/home/me/projects/fromgit/dogecoin/PR-3512/dogecoin/src' make[1]: *** [Makefile:9882: all-recursive] Error 1 make[1]: Leaving directory '/home/me/projects/fromgit/dogecoin/PR-3512/dogecoin/src' make: *** [Makefile:695: all-recursive] Error 1

This error was fixed by adding #include <set> in sync.cpp (as message advises). I ran bench_dogecoin on that installation and got this for Scrypt:

Scrypt,1792,0.000569975003600,0.000772126019001,0.000594207617853,1258510,1704819,1312008

config_vbox.log

patricklodder commented 1 month ago

@victorsk2019 2 notes:

Using --enable-debug may not be optimal when doing benchmarks, as that compiles with -O0 (and your make log must be full of warnings if you do this on Gentoo). Without it, you also do not get the include error.
On 1.15 you can use --enable-c++14 instead of adding the CXX flag; you can see in your config.log that it adds 2 c++ standard instructions, but only does checks for c++11. (And after #3494 goes in, I will also propose --enable-c++17.)

victorsk2019 commented 1 month ago

Thank you for additional information. I've recompiled this PR's code without --enable-debug and got the following bench results for Scrypt (this is probably what was expected): Scrypt,4096,0.000250319950283,0.000264337286353,0.000253840873484,552764,583719,560546 Maybe had something to do with mutexes locking ~~processes~~ threads causing slower benchmark performance 😕

dogecoin / dogecoin

crypto: uniform runtime capability detection #3512

Testing