fireice-uk / xmr-stak

Free Monero RandomX Miner and unified CryptoNight miner
GNU General Public License v3.0

Xeon Phi x100 support #2679

Open nikolobok opened 4 years ago

nikolobok commented 4 years ago

The old Intel Xeon Phi x100 (Knights Corner) has an x86 ISA, 4-way SMT per core, 512-bit SIMD units, a 32 KB L1 instruction cache, a 32 KB L1 data cache, a coherent L2 cache (512 KB per core), and an ultra-wide ring bus connecting the processors and memory (GDDR5). AES and AVX are not supported, but it has its own 512-bit SIMD instruction set, IMCI. The Intel OpenCL 14.2 driver also supports the x100 coprocessor. So could it be used to compute the RandomX algorithm? And are there any pitfalls in adapting the current implementation?

stu-l commented 4 years ago

As far as I'm aware, AVX is supported on all Xeon Phi chips, even the early ones, all the way up to AVX-512. Since AVX has some AES intrinsics, I would have expected them to be supported as well, but I can't confirm this (especially as there are a number of AVX variants). Since each Xeon Phi core is compatible with any x86-64 core, the implication would be that it could run the RandomX algorithm. However, on a quick investigation, memory might be an issue. Running RandomX in fast mode requires 2080MB per instance, which would be 57 x 2080MB, or about 119GB, on my Xeon Phi (an early one), but the chip only has 28.5MB of on-chip cache. So not all cores would be utilised. (ref: https://github.com/tevador/RandomX).
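The arithmetic in this comment can be reproduced with a short sketch. It only restates the figures quoted in the thread (57 cores, 2080MB per fast-mode instance, 512 KB of L2 per core); the function names are hypothetical, and whether each core really needs its own 2080MB copy is an assumption of the comment, not something the sketch checks.

```python
# Figures as quoted in the thread; names are illustrative only.
CORES = 57                 # cores on the commenter's board
FAST_MODE_MB = 2080        # fast-mode memory per instance, as quoted
L2_PER_CORE_KB = 512       # L2 per core, from the opening post


def fast_mode_total_gb(instances, per_instance_mb=FAST_MODE_MB):
    """Total memory in GB if every instance holds its own fast-mode copy."""
    return instances * per_instance_mb / 1000


def total_l2_mb(cores, l2_per_core_kb=L2_PER_CORE_KB):
    """Aggregate L2 cache across all cores, in MB."""
    return cores * l2_per_core_kb / 1024


if __name__ == "__main__":
    print(f"fast mode, one copy per core: {fast_mode_total_gb(CORES):.2f} GB")
    print(f"aggregate L2 cache: {total_l2_mb(CORES):.1f} MB")
```

The two numbers differ by more than three orders of magnitude, which is the gap the comment points at.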

Spaceguide commented 4 years ago

That's a bit far off... You need 2MB per thread, so 57 x 2MB (L3 cache) would be 114MB, which means your Phi could run 14 instances. It is exactly the big L3 cache that makes the newer AMD chips superior now. You do need access to huge pages in memory, and the memory has to be fast; I guess some 2048MB would be enough here. I guess you can put DDR memory into a Phi machine? Are there no Phis available with more than 28MB of L3?
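A minimal sketch of the cache-bound estimate in this reply, assuming (as the reply does) a 2MB scratchpad per thread and 28.5MB of usable cache. The function names are hypothetical, and the sketch ignores everything else that competes for cache.

```python
SCRATCHPAD_MB = 2.0   # per-thread scratchpad, as stated in the reply


def scratchpad_total_mb(threads, scratchpad_mb=SCRATCHPAD_MB):
    """Cache needed to keep every thread's scratchpad resident."""
    return threads * scratchpad_mb


def max_cache_resident_threads(cache_mb, scratchpad_mb=SCRATCHPAD_MB):
    """How many whole scratchpads fit into the given amount of cache."""
    return int(cache_mb // scratchpad_mb)


if __name__ == "__main__":
    print(scratchpad_total_mb(57))            # cache wanted for 57 threads
    print(max_cache_resident_threads(28.5))   # threads that fit in 28.5MB
```

With these inputs the sketch gives 114MB wanted and 14 threads that fit, matching the numbers in the reply.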

nikolobok commented 4 years ago

All of this needs some low-level coding and testing. By the way, why are the scratchpad memory constants of the RandomX algorithm chosen by people? Could they be adapted dynamically, based on some statistics and the network mining difficulty?

stu-l commented 4 years ago

The Phi runs optimally when all of its cores are busy (57 on my boards), and the spec goes on to say that it works best with 4 threads per core, all within 6GB (my board). These chips only have L1 and L2 cache, no L3. The Phi is connected to the machine via the PCIe bus, so for any sort of performance all data needs to be present locally on the board. This is an issue with all add-on boards: the data must be local. Also, the job running on the board must 'make back' the time it took to transfer the data in the first place, otherwise there is no point in offloading it.
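The 'make back the transfer time' condition can be written down as a simple inequality. The sketch below is only an illustration: the function names are hypothetical, and the timings and bus bandwidth in the usage section are made-up placeholder values, not measurements of any Phi board.

```python
def hardware_threads(cores=57, threads_per_core=4):
    """Hardware threads available when every core runs 4-way SMT."""
    return cores * threads_per_core


def transfer_seconds(payload_bytes, bus_bytes_per_s):
    """Time to move a payload across the bus at a given sustained rate."""
    return payload_bytes / bus_bytes_per_s


def offload_pays_off(host_s, device_s, transfer_s):
    """Offloading wins only if device time plus transfer beats the host."""
    return device_s + transfer_s < host_s


if __name__ == "__main__":
    # Placeholder numbers, chosen only to exercise the functions.
    payload = 2080 * 10**6            # bytes moved to the board
    bus_rate = 6 * 10**9              # assumed sustained bytes per second
    t = transfer_seconds(payload, bus_rate)
    print(hardware_threads())
    print(offload_pays_off(host_s=10.0, device_s=4.0, transfer_s=t))
```

For a long-running miner the one-off transfer is amortised over the whole run, so the condition matters most for short jobs or for data that has to be re-sent frequently.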

ghost commented 4 years ago

Mining RandomX on a Xeon Phi looks attractive, since the GDDR5 memory is fast and large enough to serve all threads, but a major problem is how to port the code to the IMCI instruction set. Even though AES is not supported, a software AES implementation can be used.
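To illustrate what 'software AES' means here, the following is a minimal sketch of one AES encryption round (SubBytes, ShiftRows, MixColumns, AddRoundKey) without any hardware AES instruction. It is not the RandomX or xmr-stak code; the function names are hypothetical, the state is assumed to be 16 bytes in column-major order, and the sample input is the first round of the worked example in FIPS-197 Appendix B. A real port would use precomputed lookup tables rather than this slow, readable form.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:          # reduce when the product overflows 8 bits
            a ^= 0x11B
        b >>= 1
    return result


def rotl8(x, n):
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def build_sbox():
    """Derive the AES S-box: field inverse followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = next(y for y in range(1, 256) if gf_mul(x, y) == 1)
        sbox.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                    ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return sbox


SBOX = build_sbox()


def mix_column(col):
    """Apply the MixColumns matrix to one 4-byte column."""
    a0, a1, a2, a3 = col
    return [
        gf_mul(a0, 2) ^ gf_mul(a1, 3) ^ a2 ^ a3,
        a0 ^ gf_mul(a1, 2) ^ gf_mul(a2, 3) ^ a3,
        a0 ^ a1 ^ gf_mul(a2, 2) ^ gf_mul(a3, 3),
        gf_mul(a0, 3) ^ a1 ^ a2 ^ gf_mul(a3, 2),
    ]


def soft_aes_enc_round(state, round_key):
    """One full AES encryption round on a 16-byte column-major state."""
    subbed = [SBOX[b] for b in state]
    # ShiftRows: row r is rotated left by r columns.
    shifted = [subbed[r + 4 * ((c + r) % 4)]
               for c in range(4) for r in range(4)]
    mixed = []
    for c in range(4):
        mixed.extend(mix_column(shifted[4 * c:4 * c + 4]))
    return bytes(m ^ k for m, k in zip(mixed, round_key))


if __name__ == "__main__":
    # State at the start of round 1 and round key 1 from FIPS-197 Appendix B.
    state = bytes.fromhex("193de3bea0f4e22b9ac68d2ae9f84808")
    key1 = bytes.fromhex("a0fafe1788542cb123a339392a6c7605")
    print(soft_aes_enc_round(state, key1).hex())
```

The same four steps map naturally onto table lookups and XORs, which is why a software fallback is feasible on hardware without AES instructions, at a speed cost.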

nikolobok commented 4 years ago

There is no L3, and L2 is too little for the Monero RandomX constants, so it seems it will be slow. I also find RandomX too synthetic for securing a blockchain, and the regular forking makes it very risky to invest in. My vision is to remove the hand-tuned parameters of the mining algorithm and make it dynamically self-configuring, based on network statistics. A register-based virtual machine will not succeed in the long term. It should be switched to a more generic virtual machine to be truly independent of any people. I will return later with a proposal for a next-gen mining algorithm; it is all ready for implementation.