JayDDee / cpuminer-opt

Optimized multi algo CPU miner

AVX512 KNL extensions #240

Closed moroznah closed 4 years ago

moroznah commented 4 years ago

Trying to compile with AVX512 support. It's not a Skylake CPU.

lscpu output:

```
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              256
On-line CPU(s) list: 0-255
Thread(s) per core:  4
Core(s) per socket:  64
Socket(s):           1
NUMA node(s):        1
Vendor ID:           GenuineIntel
CPU family:          6
Model:               87
Model name:          Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz
Stepping:            1
CPU MHz:             1427.980
CPU max MHz:         1301.0000
CPU min MHz:         1000.0000
BogoMIPS:            2599.80
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
NUMA node0 CPU(s):   0-255
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl est tm2 ssse3 fma cx16 xtpr pdcm sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ring3mwait cpuid_fault epb ibrs ibpb fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms avx512f rdseed adx avx512pf avx512er avx512cd xsaveopt dtherm ida arat pln pts
```

With -march=native (or -march=knl) it compiles and runs, but doesn't seem to use AVX512 on supported algos. For example, phi2:

```
SW built on Feb 10 2020 with GCC 8.3.0.
CPU features:  AVX2 AES
SW features:   AVX2 AES
Algo features: AVX512 VAES
```

If I manually specify -march=skylake-avx512, the miner dies with an illegal instruction. I'm guessing Skylake has additional AVX512 extensions that the KNL platform lacks.

Is there an easy way to fix this?

Thanks!

JayDDee commented 4 years ago

What a beast! You have to post your hashrates.

cpuminer-opt's requirements are the same as Skylake-X: AVX512F, AVX512VL, AVX512DQ & AVX512BW.

The only one in your list is AVX512F.

I know KNC had KNCNI to indicate AVX512 support, but I don't see that in your list and I don't know exactly what it includes.

I'm not concerned with VL; it just backports some advanced instructions to the older, smaller vectors. I could probably remove that requirement, but I don't think it would help anyone.

DQ & BW (doubleword/quadword, byte/word) are almost certainly a problem. I think you might be out of luck. The files that fail to compile will give a clue how bad the compatibility issue is.

moroznah commented 4 years ago

FYI, some benchmarks I have documented with cpuminer-opt, GCC 7/8. Many are dated, but the figures most likely haven't changed much since AVX512 is not available:

| Algo | Algo alias | Hashrate | Miner software | Threads/Intensity | Last updated |
|---|---|---|---|---|---|
| Xevan | xevan | 0.450 Mh/s | cpuminer-opt | 256 | 2018-11-19 17:28:24 |
| Argon2d-crds | argon2d250 | 265.000 kh/s | cpuminer-opt | 256 | 2019-12-05 13:13:09 |
| Argon2d-dyn | argon2d500 | 80.000 kh/s | cpuminer-opt | 256 | 2019-12-05 13:07:05 |
| Argon2d-uis | argon2d4096 | 19.000 kh/s | cpuminer-opt | 256 | 2019-12-05 13:12:17 |
| Lyra2z | lyra2z | 2.860 Mh/s | cpuminer-opt | 256 | 2018-11-19 17:26:12 |
| M7M | m7m | 0.310 Mh/s | cpuminer-opt | 256 | 2018-11-19 17:26:03 |
| YeScrypt | yescrypt | 14.500 kh/s | cpuminer-opt | 256 | 2018-11-19 17:25:55 |
| Lyra2z330 | lyra2z330 | 7.600 kh/s | cpuminer-opt | 128 | 2019-10-08 22:32:13 |
| Yespower | yespower | 4.400 kh/s | cpuminer-opt | 128 | 2019-10-16 12:13:08 |
| Allium | allium | 3.300 Mh/s | cpuminer-opt | 256 | 2018-11-19 17:17:41 |
| Blake(2s) | blake2s | 0.480 Gh/s | cpuminer-opt | 256 | 2018-11-19 17:17:30 |
| C11 | c11 | 3.400 Mh/s | cpuminer-opt | 256 | 2018-11-19 17:16:55 |
| HMQ1725 | hmq1725 | 0.600 Mh/s | cpuminer-opt | 256 | 2018-11-19 17:17:07 |
| PHI1612 | phi1612 | 2.670 Mh/s | cpuminer-opt | 256 | 2018-11-19 17:16:27 |
| Polytimos | polytimos | 2.200 Mh/s | cpuminer-opt | 256 | 2018-11-19 17:16:16 |
| TimeTravel | timetravel | 4.200 Mh/s | cpuminer-opt | 256 | 2018-11-19 17:16:02 |
| Tribus | tribus | 11.400 Mh/s | cpuminer-opt | 256 | 2018-11-19 17:15:44 |
| YeScryptR32 | yescryptr32 | 1.880 kh/s | cpuminer-opt | 256 | 2018-11-26 23:44:33 |
| YeScryptR16 | yescryptr16 | 3.730 kh/s | cpuminer-opt | 256 | 2018-11-28 15:05:49 |
| YeScryptR8 | yescryptr8 | 14.400 kh/s | cpuminer-opt | 256 | 2018-11-28 15:13:46 |
| YeScryptR8G | yescryptr8g | 18.100 kh/s | cpuminer-opt | 256 | 2019-05-08 13:14:55 |
| Power2b | power2b | 4.400 kh/s | cpuminer-opt | 128 | 2019-10-14 20:41:06 |

Did some digging into the instruction sets on both KNL and Skylake-X servers and the cpuminer-opt code, although I'm not really a coder, more of an MCU guy. As you mentioned, these platforms share just two common sets, only one of which is usable:

Skylake-X and KNL common instruction sets:
- AVX512F - Foundation
- AVX512CD - Conflict detection

Skylake-X additional instruction sets:
- AVX512BW - Byte and word
- AVX512DQ - Doubleword and quadword
- AVX512VL - Vector length

KNL additional instruction sets:
- AVX512PF - Prefetch
- AVX512ER - Exponential and reciprocal

It would seem cpuminer-opt takes advantage of the F, BW, DQ and VL sets, although I don't know whether F is easily separable from the others in the code. There are some CPU miners which use just F with performance gains (for example https://github.com/bogdanadnan/ariominer). There is one miner that may be using the KNL-specific PF and ER sets, plus the bandwidth of the onboard 16GB MCDRAM with hugepages, but it's closed source (http://www.lukminer.net/releases/) so no usable examples...

It looks like Intel will keep chaotically enabling/disabling specific extensions on different platforms, per the last table here: https://en.wikipedia.org/wiki/AVX-512#cite_note-reinders512-1. So the question of which extensions are available on which CPUs will almost certainly come back to haunt us at some point. Good examples are IFMA and VBMI, which are usable for performance gains in mining but are absent on Skylake-X.

I could share SSH access to a Phi node for tinkering. The easiest thing to do, it would seem, is to just enable the F instructions during compilation. Implementing PF and ER is probably a lot of work, and I don't see how ER can be used; perhaps just PF.

JayDDee commented 4 years ago

Those are pretty big numbers for a CPU.

AVX512F is the foundation which supports the new registers, mem load/store and some basic ops like XOR. You can't really do much with it alone, but you can't do anything without it.

The main issues are integer arithmetic (requires DQ) and byte swapping (requires BW). Any 512-bit implementation would have to downgrade to 256 bits for those ops.

A while back I considered the possibility of a CPU mining rig filled with KNLs instead of GPUs, but found they weren't really suitable for mining and were way too expensive.

JayDDee commented 4 years ago

FYI here's my go-to reference for AVX*

https://software.intel.com/sites/landingpage/IntrinsicsGuide

moroznah commented 4 years ago

> FYI here's my go-to reference for AVX*
>
> https://software.intel.com/sites/landingpage/IntrinsicsGuide

Thanks for the read; with those references it's much easier to read your code. I went digging in older commits with the first AVX512 implementations, in blake2b 8-way and blake2s 16-way. I see lots of BW functions there (vpblendmw, vpaddb) but also a lot of F (vpaddq, vpandd). Obviously these blake2 implementations could be subdivided for the F-only case, with the integer and byte-swap work done in 256-bit ops. However, it's questionable whether there would be big performance benefits, since most likely BW and DQ are saving most of the CPU cycles in the code, not simple XORs etc.

A Phi was actually cheaper than a single 1080 Ti (per CPU), and that's not just the CPUs but fully assembled plug-n-play barebones. These are normally socketed LGA3647 stones, not PCIe cards. They are still possibly available; large volumes are required to get sane prices. What sold me primarily is density, 80 CPUs per 42U rack (2U left for switches), and not dealing with glitchy multi-GPU rigs on desktop boards with risers. There's also the point of it being enterprise-level hardware, with IPMI, quality fans and redundant PSUs. No regrets so far; apart from mining I have other jobs for them at times. If only mining, I'd say no, though perhaps with your expertise it would be different :)

JayDDee commented 4 years ago

Right now I'm drooling over the TR 3990X. It's too bad its AVX2 performance is so poor and AVX512 non-existent. And it's overpriced to subsidize their smaller CPUs vs Intel.

As far as mixing 512 and 256, there are a lot of pitfalls that I've learned the hard way.

Conversion is sometimes faster through the cache than moving from reg to reg. Inserting or extracting the high lanes is a 3-clock op. It's much easier and more efficient to mix sizes when dealing with memory-resident buffers than with smaller register variables.

Anything that shifts lanes is also very expensive. Some permutes can be implemented using bit shift and rotate, which is faster.

Some purpose-built instructions (gather/scatter) are very slow; it's faster to load to an integer reg then move to zmm, or to operate on ints directly.

Any data interleaving of more than 2 sources is a multistage operation.

But immediate constants are the biggest pain in the ass, and they are used frequently as permute indexes. Generating a 512-bit vector from 64-bit immediate data is 10 instructions with a worst-case latency of 25 clocks. By reordering I think I've reduced it to around 17. There are a few shortcuts when a sequence of ints is repeated.

JayDDee commented 4 years ago

I have another question. I don't have a CPU capable of testing support for more than 64 threads, so I really don't know how well it works, especially when using --cpu-affinity.

The CLI doesn't accept integers bigger than 64 bits, so it's impossible to specify an affinity mask for more than 64 threads without a rewrite of the code. So I replicate the 64-bit affinity mask until all cores are covered. Familiar masks like 0xaaaaaaaaaaaaaaaa should still work as before, but over a larger set of CPUs.

You could try a few things to break it, if you're willing to do the testing of any fixes. The timing is good because I'm trying to stabilize things after the heavy churn of the last few months. I expected some breakage, but there was more than I was comfortable with.

And I always try to exploit users to help out. That's the price of free SW. :)

moroznah commented 4 years ago

> Right now I'm drooling over the TR 3990X. It's too bad its AVX2 performance is so poor and AVX512 non-existent. And it's overpriced to subsidize their smaller CPUs vs Intel.
>
> As far as mixing 512 and 256, there are a lot of pitfalls that I've learned the hard way.
>
> Conversion is sometimes faster through the cache than moving from reg to reg. Inserting or extracting the high lanes is a 3-clock op. It's much easier and more efficient to mix sizes when dealing with memory-resident buffers than with smaller register variables.
>
> Anything that shifts lanes is also very expensive. Some permutes can be implemented using bit shift and rotate, which is faster.
>
> Some purpose-built instructions (gather/scatter) are very slow; it's faster to load to an integer reg then move to zmm, or to operate on ints directly.
>
> Any data interleaving of more than 2 sources is a multistage operation.
>
> But immediate constants are the biggest pain in the ass, and they are used frequently as permute indexes. Generating a 512-bit vector from 64-bit immediate data is 10 instructions with a worst-case latency of 25 clocks. By reordering I think I've reduced it to around 17. There are a few shortcuts when a sequence of ints is repeated.

About cache/registers: I'm not deep into conversion and/or moving costs on modern CPUs, but we had interesting results concerning memcpy. There's this: https://software.intel.com/en-us/articles/memcpy-memset-optimization-and-control , but I don't know how effective it is on other CPUs. I was running an in-house memory-bound HPC app for some time; the dev who wrote it from scratch got large performance benefits after switching from regular memcpy. It could be beneficial on memory-bound algos. This, however, implies using the icc compiler.

I haven't seen much initiative in mining software to maintain compatibility across compilers; almost everyone universally uses gcc. Personally I've had success using clang and icc after small tweaks to the code in some miners. A fork of nheqminer for the verushash algo gave an almost 10% performance boost after just changing the compiler, without significant mods to the code. Ironically, the Intel compiler has boosted performance on AMD too, rofl. I think I opened an issue here a long time ago about icc and forgot to look for an answer :) There is some testing happening now in the xmrig project regarding different compilers, with varying results: https://github.com/xmrig/xmrig/issues/1512

> I have another question. I don't have a CPU capable of testing support for more than 64 threads, so I really don't know how well it works, especially when using --cpu-affinity.
>
> The CLI doesn't accept integers bigger than 64 bits, so it's impossible to specify an affinity mask for more than 64 threads without a rewrite of the code. So I replicate the 64-bit affinity mask until all cores are covered. Familiar masks like 0xaaaaaaaaaaaaaaaa should still work as before, but over a larger set of CPUs.
>
> You could try a few things to break it, if you're willing to do the testing of any fixes. The timing is good because I'm trying to stabilize things after the heavy churn of the last few months. I expected some breakage, but there was more than I was comfortable with.
>
> And I always try to exploit users to help out. That's the price of free SW. :)

I've encountered problems with multiple miners when setting affinity, as no one really cares(d) much about anything with more than 64 threads. With Phis it's not usually a problem, since they are a single NUMA node in quadrant mode without external RAM, so setting affinity is not required. Some time ago (I don't remember the exact software) I ran a miner that unintentionally didn't even support more than 128 threads; I had to launch 2 instances and insert crutches for affinity using Linux tools.

But it gets more interesting with DP/MP systems. For example, I have a DP 104-thread Platinum system, which almost definitely should benefit from fixed affinity and is sufficient for testing numbers greater than 64. Spare time is limited as always, but let's do some testing. Found your mail on bitcointalk; sent you my contact details.

JayDDee commented 4 years ago

Replied to your email.

The subject question has been answered in the negative. Closing.

mattvirus commented 1 year ago

@moroznah did you find other optimizations for the Xeon Phi platform? Looked for your email but could not find it...