Open tevador opened 6 years ago
You are the first to report that it is slow. Are you running Linux in a VM?
No, it's not a VM.
The results were obtained using the `--benchmark` option.
I looked into the testing issue https://github.com/fireice-uk/xmr-stak/issues/1832 and saw only results from Ryzen 2000 series. No tests for Ryzen 1000 series.
Is the "amd_avx" assembly the same version as "ryzen" asm by @SChernykh ? Perhaps it's optimized just for Ryzen 5 2600.
It is the same.
It is for anything Hammer (iirc) or newer, not specifically for Ryzen or its capabilities. Similarly, the Intel one runs on anything Ivy/Sandy or newer, but it might not be the best on the latest CPUs (Haswell or the various Lakes). That applies especially to Ryzen, which is weird from an architecture standpoint (cache-to-core pathing etc.).
The raw C++ version gets whatever "upgrades" the compiler sees fit, which could end up being faster if it decides correctly. In the Intel or older AMD cases, the hand-written ASM modules end up better than what the compiler decides. Ryzen must have decent compiler support: AMD has been working closely with compiler groups, so there may be good first-tier support for intrinsics. They were not so hands-on in the past, so compilers had to guess more and optimized less (it is best when those who built the CPU also tune the compiler to match, versus a black-box guessing game and raw PDFs of specs).
Intel still mostly lets the gcc/llvm groups guess at what works best for each case; even their own compiler (ICC) pretty much generates the same junk GCC does by itself. So Intel gets more help from terse ASM that the compiler can't get wrong, while AMD has its hands in the compilers, so the compiler actually generates pretty nice ASM. AMD also has their own fork of LLVM with in-house optimizations, but the features from that end up in mainstream LLVM reasonably quickly. You might try their compiler kit for a few more hashes, or it might make no difference, like ICC.
I also choose to disable as many of the Spectre mitigations as possible, in the kernel and compilers. Mining benefits from those features and from speculation, and I am not concerned about fringe security implications that will never happen in my use case. Some of the compilers need to be told not to generate "safe" slow code and to just make fast code without worrying about who is sniffing what or what can smash your stacks. It is pointless to be that careful on a single-user dedicated mining rig (behind firewalls, that never launches a browser).
@tevador
> I looked into the testing issue #1832 and saw only results from Ryzen 2000 series. No tests for Ryzen 1000 series.
There shouldn't be much difference. Ryzen 2000 series is essentially the same core but with faster L3 cache. And of course both asm versions should be faster than non-asm version.
> There shouldn't be much difference. Ryzen 2000 series is essentially the same core but with faster L3 cache.
I know it's essentially the same architecture. Perhaps changing the cache latency may affect what the optimal machine code looks like?
> And of course both asm versions should be faster than non-asm version.
I thought the same, but they are slower according to the benchmark.
Here is the complete config used in the benchmark:
"cpu_threads_conf" :
[
{ "low_power_mode" : false, "no_prefetch" : false, "asm" : "off", "affine_to_cpu" : 1 },
{ "low_power_mode" : false, "no_prefetch" : false, "asm" : "off", "affine_to_cpu" : 3 },
{ "low_power_mode" : false, "no_prefetch" : false, "asm" : "off", "affine_to_cpu" : 5 },
{ "low_power_mode" : false, "no_prefetch" : false, "asm" : "off", "affine_to_cpu" : 7 },
{ "low_power_mode" : false, "no_prefetch" : false, "asm" : "off", "affine_to_cpu" : 9 },
{ "low_power_mode" : false, "no_prefetch" : false, "asm" : "off", "affine_to_cpu" : 11 },
{ "low_power_mode" : false, "no_prefetch" : false, "asm" : "off", "affine_to_cpu" : 13 },
{ "low_power_mode" : false, "no_prefetch" : false, "asm" : "off", "affine_to_cpu" : 15 },
],
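For anyone repeating this comparison, here is a minimal sketch of how the per-variant config fragments could be generated instead of edited by hand. It only mirrors the layout of the config quoted above; the helper name `make_cpu_threads_conf` and its parameters are hypothetical and not part of xmr-stak, and the "asm" values are just the ones mentioned in this thread.

```python
def make_cpu_threads_conf(asm, cpus, no_prefetch=False, low_power_mode=False):
    """Render a "cpu_threads_conf" fragment as text.

    Hypothetical helper: it copies the layout of the config quoted in this
    thread and does not validate the values against xmr-stak itself.
    """
    def flag(value):
        # The config uses lowercase JSON-style booleans.
        return "true" if value else "false"

    rows = [
        '{ "low_power_mode" : %s, "no_prefetch" : %s, "asm" : "%s", "affine_to_cpu" : %d },'
        % (flag(low_power_mode), flag(no_prefetch), asm, cpu)
        for cpu in cpus
    ]
    return "\n".join(['"cpu_threads_conf" :', "["] + rows + ["],"])


# Same affinity pattern as the config above: CPUs 1, 3, ..., 15 (8 threads).
CPUS = list(range(1, 16, 2))

# One fragment per variant to benchmark.
variants = {
    asm: make_cpu_threads_conf(asm, CPUS)
    for asm in ("off", "amd_avx", "intel_avx")
}

print(variants["amd_avx"])
```

Each fragment differs only in the "asm" field, which matches the statement that all other settings were kept equal between runs.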
@tevador Maybe something was running in the background and spoiled benchmark results. Can you try to benchmark with just 1 CPU thread and compare different versions?
> Maybe something was running in the background and spoiled benchmark results.
It's a headless Linux machine. Nothing else runs there and also I repeated the tests. Single core results are about the same percent-wise.
Here are results for Ryzen 5 1600 (with 8 threads):
monero | asm | H/s | percent |
---|---|---|---|
v7 | - | 478.6 | 100% |
v8 | off | 443.2 | 92.6% |
v8 | amd_avx | 433.3 | 90.5% |
v8 | intel_avx | 438.7 | 91.7% |
asm is still slower, although a bit less than on the 1700.
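The percent column in the Ryzen 5 1600 table above is each hashrate relative to the v7 result. A short sketch that recomputes it from the posted numbers (the variable and function names are only for illustration):

```python
# Hash rates (H/s) copied from the Ryzen 5 1600 table above.
results = [
    ("v7", "-", 478.6),
    ("v8", "off", 443.2),
    ("v8", "amd_avx", 433.3),
    ("v8", "intel_avx", 438.7),
]


def percent_of(rate, baseline):
    """Hashrate as a percentage of the baseline, rounded to one decimal."""
    return round(100.0 * rate / baseline, 1)


baseline = results[0][2]  # the v7 hashrate is the 100% reference
percents = {asm: percent_of(rate, baseline) for _, asm, rate in results}

for version, asm, rate in results:
    print(f"{version:3} {asm:10} {rate:6.1f} H/s  {percents[asm]:5.1f}%")
```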
I might have figured out why the asm versions are slower.
It seems that the `no_prefetch` option is ignored when `asm` is set to anything but "off".
Ryzen 1700:
asm | no_prefetch | H/s | |
---|---|---|---|
off | false | 499.3 | |
off | true | 449.9 | |
amd_avx | false | 477.5 | 89.0% |
amd_avx | true | 477.5 | |
So while the asm version mines faster without prefetch, enabling prefetch on Ryzen 1000 series actually increases performance more than the optimized assembly version.
Can the "amd_avx" version be updated to support prefetch?
@tevador Ryzen 5 2600:
asm=amd_avx gives 580 H/s
asm=off and no_prefetch=false gives 567 H/s
asm=off and no_prefetch=true gives 562 H/s
@SChernykh Yes, it seems that the cache was redesigned in Ryzen 2000 series. For the older Ryzens, prefetch makes a much bigger difference (~50 H/s).
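To put the two CPUs side by side, this small sketch computes the prefetch gain (with asm=off) and the gain of amd_avx over the best non-asm setting, using only the hashrates posted above. The dictionary layout and function names are assumptions made for the example.

```python
# Hash rates (H/s) posted in this thread.
# "prefetch_on" means asm=off with no_prefetch=false,
# "prefetch_off" means asm=off with no_prefetch=true.
rates = {
    "Ryzen 7 1700": {"prefetch_on": 499.3, "prefetch_off": 449.9, "amd_avx": 477.5},
    "Ryzen 5 2600": {"prefetch_on": 567.0, "prefetch_off": 562.0, "amd_avx": 580.0},
}


def prefetch_gain(cpu):
    """H/s gained by enabling prefetch with asm=off."""
    r = rates[cpu]
    return round(r["prefetch_on"] - r["prefetch_off"], 1)


def asm_gain(cpu):
    """H/s gained (or lost, if negative) by amd_avx over asm=off with prefetch."""
    r = rates[cpu]
    return round(r["amd_avx"] - r["prefetch_on"], 1)


for cpu in rates:
    print(f"{cpu}: prefetch {prefetch_gain(cpu):+.1f} H/s, amd_avx {asm_gain(cpu):+.1f} H/s")
```

On these numbers prefetch is worth about 49 H/s on the 1700 but only 5 H/s on the 2600, which is why amd_avx loses to asm=off on the older chip and wins on the newer one.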
@tevador It's not that simple to "update" the asm code to support prefetch. Every new processor means starting from scratch to create an optimized version. And I only have a new Ryzen, so I can't make an optimized version without real hardware. You should just stick with asm=off and no_prefetch=false if it works best for you. Have you tried asm=intel by the way? Maybe it would work better on your system.
@SChernykh I'm aware of that, but at least for scratchpad explode/implode, it could be enabled quite easily without modifying the assembly code. At the moment, the asm versions have hardcoded "false" for explode/implode prefetch.
Here is a similar patch in CryptoGoblin: https://github.com/Dead2/CryptoGoblin/commit/49cc1623ba23d72ac336cfeeba8a1b48dc874596
> Have you tried asm=intel by the way?
Yes, it gives ~481 H/s, slightly more than amd, but still less than prefetch.
xmr-stak 2.5.0, compiled on Ubuntu 16.04. On Ryzen 7 1700, I'm getting the following hashrates with 8 threads (@ 3.35 GHz):
All settings are equal, just the "asm" option is changed as noted in the table.
Clearly the performance of the asm versions is not optimal, especially amd_avx, which is advertised for this CPU family in the description and is selected with the autogenerated config.