Assembler version of cryptonight_v8 is slow

tevador commented 6 years ago

xmr-stak 2.5.0, compiled on Ubuntu 16.04. On Ryzen 7 1700, I'm getting the following hashrates with 8 threads (@ 3.35 GHz):

monero	asm	H/s	percent
v7	-	536.5	100%
v8	off	499.1	93.0%
v8	amd_avx	477.5	89.0%
v8	intel_avx	481.2	89.7%

All settings are equal, just the "asm" option is changed as noted in the table.

Clearly the performance of the asm versions is not optimal, especially amd_avx, which is advertised for this CPU family in the description and is selected with the autogenerated config.
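For reference, the "percent" column can be reproduced from the raw hashrates. This is only a small sketch using the numbers from the table above; the variable and function names are made up for illustration.

```python
# Hashrates (H/s) copied from the table above: Ryzen 7 1700, 8 threads.
BASELINE_V7 = 536.5  # cryptonight v7 result, used as the 100% reference

v8_results = {
    "off": 499.1,
    "amd_avx": 477.5,
    "intel_avx": 481.2,
}


def relative_percent(hashrate, baseline=BASELINE_V7):
    """Return hashrate as a percentage of the baseline, rounded to one decimal."""
    return round(100.0 * hashrate / baseline, 1)


percent = {asm: relative_percent(h) for asm, h in v8_results.items()}

for asm, p in percent.items():
    print(f"v8 asm={asm}: {p}% of v7")
```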

psychocrypt commented 6 years ago

You are the first where it is slow. Are you running Linux in a VM?

tevador commented 6 years ago

No, it's not a VM. The results were obtained using the --benchmark option.

tevador commented 6 years ago

I looked into the testing issue https://github.com/fireice-uk/xmr-stak/issues/1832 and saw only results from Ryzen 2000 series. No tests for Ryzen 1000 series.

tevador commented 6 years ago

Is the "amd_avx" assembly the same version as "ryzen" asm by @SChernykh ? Perhaps it's optimized just for Ryzen 5 2600.

psychocrypt commented 6 years ago

It is the same.

Spudz76 commented 6 years ago

It is for anything Hammer (iirc) or newer, not specifically Ryzen or its capabilities. Similarly, the Intel one runs on anything Ivy/Sandy or newer, but it might not be the best on the latest CPUs (Haswell or the various Lakes). That goes especially for Ryzen, which is weird from an architecture standpoint (cache-to-core pathing etc).

The raw C++ version gets whatever "upgrades" the compiler sees fit, which could end up being faster if it decides correctly. In the Intel or older AMD cases, the hand-written ASM modules end up better than what the compiler decides. Ryzen must have decent compiler support: AMD has been working closely with compiler groups, so there may be good first-tier support for intrinsics. They were not so hands-on in the past, so compilers had to guess more and optimized less (it is best when those who built the CPU also tune the compiler to match, versus a black-box guessing game and raw PDFs of specs).

Intel still mostly lets the gcc/llvm groups guess at what works best for each case; even their own compiler (ICC) pretty much generates the same junk GCC does by itself. So Intel gets more help from terse ASM that it can't get wrong, while AMD has its hands in compilers, so the compiler actually generates pretty nice ASM. AMD also has their own fork of LLVM with in-house optimizations, but the features from that end up in mainstream LLVM reasonably quickly. You might try their compiler kit for a few more hashes, or it might make no difference, like ICC.

I also choose to disable as many of the Spectre mitigations as possible, in kernel and compilers. Mining benefits from those features and from speculation, and I am not concerned about fringe security implications that will never happen in my use case. Some of the compilers need to be told not to generate "safe" slow code and to just make fast code without worrying about who is sniffing what or what can smash your stack. It is pointless to be that careful on a single-user dedicated mining rig (behind firewalls, that never launches a browser).

SChernykh commented 6 years ago

@tevador

> I looked into the testing issue #1832 and saw only results from Ryzen 2000 series. No tests for Ryzen 1000 series.

There shouldn't be much difference. Ryzen 2000 series is essentially the same core but with faster L3 cache. And of course both asm versions should be faster than non-asm version.

tevador commented 6 years ago

> There shouldn't be much difference. Ryzen 2000 series is essentially the same core but with faster L3 cache.

I know it's essentially the same architecture. Perhaps changing the cache latency may affect what the optimal machine code looks like?

> And of course both asm versions should be faster than non-asm version.

I thought the same, but they are slower according to the benchmark.

Here is the complete config used in the benchmark:

"cpu_threads_conf" :
[
    { "low_power_mode" : false, "no_prefetch" : false, "asm" : "off", "affine_to_cpu" : 1 },
    { "low_power_mode" : false, "no_prefetch" : false, "asm" : "off", "affine_to_cpu" : 3 },
    { "low_power_mode" : false, "no_prefetch" : false, "asm" : "off", "affine_to_cpu" : 5 },
    { "low_power_mode" : false, "no_prefetch" : false, "asm" : "off", "affine_to_cpu" : 7 },
    { "low_power_mode" : false, "no_prefetch" : false, "asm" : "off", "affine_to_cpu" : 9 },
    { "low_power_mode" : false, "no_prefetch" : false, "asm" : "off", "affine_to_cpu" : 11 },
    { "low_power_mode" : false, "no_prefetch" : false, "asm" : "off", "affine_to_cpu" : 13 },
    { "low_power_mode" : false, "no_prefetch" : false, "asm" : "off", "affine_to_cpu" : 15 },
],
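To rerun the comparison with a different "asm" value, the same block can be regenerated instead of edited by hand. The following is just a sketch: the helper names are hypothetical and not part of xmr-stak, and only the four fields shown in the config above are emitted.

```python
import json


def make_thread_conf(asm, cpus, no_prefetch=False, low_power_mode=False):
    """Build one config entry per CPU index, mirroring the fields used above."""
    return [
        {
            "low_power_mode": low_power_mode,
            "no_prefetch": no_prefetch,
            "asm": asm,
            "affine_to_cpu": cpu,
        }
        for cpu in cpus
    ]


def render(entries):
    """Render the entries in the same layout as the config fragment above."""
    lines = ['"cpu_threads_conf" :', "["]
    for entry in entries:
        body = ", ".join(f'"{key}" : {json.dumps(value)}' for key, value in entry.items())
        lines.append("    { " + body + " },")
    lines.append("],")
    return "\n".join(lines)


# Eight threads pinned to the odd logical CPUs 1, 3, ..., 15, as in the benchmark.
config_text = render(make_thread_conf("amd_avx", range(1, 16, 2)))
print(config_text)
```

Changing the first argument to "off" or "intel_avx" gives the other two variants from the table.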

SChernykh commented 6 years ago

@tevador Maybe something was running in the background and spoiled benchmark results. Can you try to benchmark with just 1 CPU thread and compare different versions?

tevador commented 6 years ago

> Maybe something was running in the background and spoiled benchmark results.

It's a headless Linux machine. Nothing else runs there and also I repeated the tests. Single core results are about the same percent-wise.

Here are results for Ryzen 5 1600 (with 8 threads):

monero	asm	H/s	percent
v7	-	478.6	100%
v8	off	443.2	92.6%
v8	amd_avx	433.3	90.5%
v8	intel_avx	438.7	91.7%

asm is still slower, although a bit less than on the 1700.

tevador commented 5 years ago

I might have figured out why the asm versions are slower.

It seems that the no_prefetch option is ignored when the asm is set to anything but "off".

Ryzen 1700:

asm	no_prefetch	H/s
off	false	499.3
off	true	449.9
amd_avx	false	477.5	89.0%
amd_avx	true	477.5

So while the asm version is faster than asm=off without prefetch (477.5 vs. 449.9 H/s), enabling prefetch with asm=off on Ryzen 1000 series increases performance more than the optimized assembly version does (499.3 H/s).

Can the "amd_avx" version be updated to support prefetch?

SChernykh commented 5 years ago

@tevador Ryzen 5 2600:

asm=amd_avx gives 580 H/s
asm=off and no_prefetch=false gives 567 H/s
asm=off and no_prefetch=true gives 562 H/s

tevador commented 5 years ago

@SChernykh Yes, it seems that the cache was redesigned in Ryzen 2000 series. For the older Ryzens, prefetch makes a much bigger difference (~50 H/s).
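The difference can be worked out from the asm=off numbers posted above for both CPUs. A quick sketch (the dictionary layout and function name are only for illustration):

```python
# asm=off hashrates (H/s) copied from the comments above.
# Note: no_prefetch=false means prefetch is ON.
results = {
    "Ryzen 7 1700": {"prefetch_on": 499.3, "prefetch_off": 449.9},
    "Ryzen 5 2600": {"prefetch_on": 567.0, "prefetch_off": 562.0},
}


def prefetch_gain(r):
    """Hashrate gained by enabling prefetch, in H/s, rounded to one decimal."""
    return round(r["prefetch_on"] - r["prefetch_off"], 1)


gains = {cpu: prefetch_gain(r) for cpu, r in results.items()}

for cpu, gain in gains.items():
    print(f"{cpu}: prefetch adds {gain} H/s")
```

That works out to about 49 H/s on the 1700 against 5 H/s on the 2600.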

SChernykh commented 5 years ago

@tevador It's not that simple to "update" the asm code to support prefetch. Every new processor means starting from scratch to create an optimized version. And I only have the new Ryzen, so I can't make an optimized version without real hardware. You should just stick with asm=off and no_prefetch=false if it works best for you. Have you tried asm=intel by the way? Maybe it would work better on your system.

tevador commented 5 years ago

@SChernykh I'm aware of that, but at least for scratchpad explode/implode, it could be enabled quite easily without modifying the assembly code. At the moment, the asm versions have hardcoded "false" for explode/implode prefetch.

Here is a similar patch in CryptoGoblin: https://github.com/Dead2/CryptoGoblin/commit/49cc1623ba23d72ac336cfeeba8a1b48dc874596
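To make the proposal concrete, here is a toy model of the behaviour as I understand it from the benchmarks. It is not xmr-stak code: the function name and the `honor_option` flag are hypothetical and only illustrate the difference between the hardcoded "false" and passing the option through.

```python
def explode_implode_prefetch(asm, no_prefetch, honor_option=False):
    """Toy model: return True if scratchpad explode/implode would use prefetch.

    asm          -- value of the "asm" config option ("off", "amd_avx", ...)
    no_prefetch  -- value of the "no_prefetch" config option
    honor_option -- hypothetical switch modelling the proposed change
    """
    if asm == "off" or honor_option:
        # The config option decides.
        return not no_prefetch
    # Observed behaviour with asm enabled: prefetch is off regardless of the option.
    return False


for asm in ("off", "amd_avx"):
    for no_prefetch in (False, True):
        now = explode_implode_prefetch(asm, no_prefetch)
        proposed = explode_implode_prefetch(asm, no_prefetch, honor_option=True)
        print(f"asm={asm} no_prefetch={no_prefetch}: now={now} proposed={proposed}")
```

In this model the two amd_avx rows behave identically today, which would match the identical 477.5 H/s results.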

> Have you tried asm=intel by the way?

Yes, it gives ~481 H/s, slightly more than amd, but still less than prefetch.

fireice-uk / xmr-stak

Assembler version of cryptonight_v8 is slow #1942