fireice-uk / xmr-stak-cpu

Monero CPU miner
GNU General Public License v3.0

Benchmark MULX #78

Closed fireice-uk closed 7 years ago

fireice-uk commented 7 years ago

I pushed the code integrating the mulx instruction and generally cleaning up the hash function template system. Can you give it a run on different machines?

psychocrypt commented 7 years ago

Yes, I will try it today.

psychocrypt commented 7 years ago

Current results: with `-mbmi2` the miner won't run on older CPUs without BMI2. IMO the issue is that `-mbmi2` lets the compiler emit BMI2 instructions in code paths that are not guarded by your runtime check for BMI2 support.

The result is a runtime error: `Illegal instruction`.

I am searching for a solution that keeps MULX while still supporting old CPUs.
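
One possible shape for such a guard (a minimal sketch, assuming a reasonably recent GCC or Clang; the `mul_hi*` names are illustrative, not the miner's actual API): compile the BMI2 path with a per-function target attribute and select it at runtime, so no BMI2 instruction is reachable on CPUs that lack the feature.

```cpp
#include <cstdint>
#include <immintrin.h>

// BMI2 path: the target attribute scopes BMI2 code generation to this
// function alone, so the rest of the translation unit stays plain SSE2.
__attribute__((target("bmi2")))
static uint64_t mul_hi_bmi2(uint64_t a, uint64_t b)
{
    unsigned long long hi;
    (void)_mulx_u64(a, b, &hi); // MULX: returns the low half, stores the high half
    return hi;
}

// Portable fallback for CPUs without BMI2.
static uint64_t mul_hi_generic(uint64_t a, uint64_t b)
{
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

static uint64_t mul_hi(uint64_t a, uint64_t b)
{
    // GCC/Clang builtin; performs the CPUID feature check at runtime.
    if (__builtin_cpu_supports("bmi2"))
        return mul_hi_bmi2(a, b);
    return mul_hi_generic(a, b);
}
```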

psychocrypt commented 7 years ago

@fireice-uk do you see any benefits of MULX on your test system?

psychocrypt commented 7 years ago

Added support for MULX without requiring the full set of BMI2 instructions in #79.

psychocrypt commented 7 years ago

Test 2 x E5-2630 v3 @ 2.40GHz

| version used | threads | no_prefetch | cache | hash/s | huge pages | system |
| --- | --- | --- | --- | --- | --- | --- |
| #77 | 16 | false** | 2x20MiB | 781 | no | ubuntu 14.04 |
| dev | 16 | false | 2x20MiB | 747 | no | ubuntu 14.04 |
| dev, disabled MULX* | 16 | false | 2x20MiB | 763 | no | ubuntu 14.04 |
| dev | 16 | true | 2x20MiB | 803 | no | ubuntu 14.04 |
| #79 | 16 | true | 2x20MiB | 741 | no | ubuntu 14.04 |
| dev, disabled MULX* | 16 | true | 2x20MiB | 805 | no | ubuntu 14.04 |

Test 2 x E5-2609 @ 2.40GHz (BMI2 not supported)

| version used | threads | no_prefetch | cache | hash/s | huge pages | system |
| --- | --- | --- | --- | --- | --- | --- |
| #77 | 8 | false** | 2x10MiB | 357 | no | ubuntu 14.04 |
| #79 | 8 | false | 2x10MiB | 357 | no | ubuntu 14.04 |
| #79 | 8 | true | 2x10MiB | 362 | no | ubuntu 14.04 |

Test i7-4600U CPU @ 2.10GHz

| version used | threads | no_prefetch | cache | hash/s | huge pages | system |
| --- | --- | --- | --- | --- | --- | --- |
| #77 | 2 | false** | 4MiB | 117 | no | ubuntu 16.04 |
| dev | 2 | true | 4MiB | 121 | no | ubuntu 16.04 |
| dev, disabled MULX* | 2 | true | 4MiB | 120 | no | ubuntu 16.04 |

\* hard-disabled within the code
\*\* `false` in #77 is equivalent to `true` in the current dev

psychocrypt commented 7 years ago

From my tests above I would say we should change the default config and set `no_prefetch : true` by default, because it is faster on all the systems I tested.
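
For reference, a sketch of the corresponding per-thread entries in config.txt (layout assumed from the project's config format; the exact fields may differ between versions):

```
"cpu_threads_conf" :
[
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 0 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 1 },
],
```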

Maybe we should think about adding a runtime option in the config file to disable MULX even if the processor supports BMI2; this would allow the user to test whether this option gives any benefit. But I think it would be more useful to remove MULX and use only the compile flag `-mbmi2`. MULX is not improving performance.

#79 shows a slowdown on systems with BMI2 support (the missing `-mbmi2` flag is the reason).

fireice-uk commented 7 years ago

On my system I got a small 1-2% increase, but it might have been a fluke since that's 1-2 H/s.

MSVC is much nicer to work with in this particular case, as it will always emit the instruction you asked for in the mnemonic regardless of optimisation settings, so we simply need to make gcc go along with this behaviour and we are all set.

One particular decision point in light of the benchmarks is whether we always use MULX if available or if we leave it up to the user, who is already probably very confused by prefetch and thread affinity settings. Any thoughts on that one?

fireice-uk commented 7 years ago

@psychocrypt

> On systems with BMI2 support we lose performance (compared to the current dev) because of the missing -mbmi2 compile flag. The performance is not coming from MULX; instead it comes from other compiler optimizations based on BMI2 features.

Can you point me to the specific data point indicating that? 99% of the time is spent inside explode - main loop - implode, and 90% inside the main loop and I can't think of anything else there that would use BMI2. The speedup might be coming from a better SSE - AVX switch strategy which might be enabled for BMI2 code.

If we confirm that this is indeed the case, I will look at the binaries to see what is happening exactly.

psychocrypt commented 7 years ago

> Can you point me to the specific data point indicating that? 99% of the time is spent inside explode - main loop - implode, and 90% inside the main loop and I can't think of anything else there that would use BMI2. The speedup might be coming from a better SSE - AVX switch strategy which might be enabled for BMI2 code.

Yes, it could be that the compiler optimizes the context switch between SSE2 and AVX. But it could also be that the compiler is merging two SSE2 128-bit operations into one AVX(2) instruction. The most likely optimization would be to use BMI2 bit-selection instructions to perform `idx0 & 0x1FFFF0`.
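
For context, this is the kind of masking in question (a self-contained sketch, not the miner's exact code): the constant 0x1FFFF0 selects a 16-byte-aligned slot inside the 2 MiB scratchpad, and the speculation above is that with `-mbmi2` the compiler may implement such bit selection with BMI2 bit-field instructions.

```cpp
#include <cstdint>
#include <emmintrin.h>

// Sketch: map the 64-bit index to a 16-byte-aligned offset in the
// 2 MiB scratchpad. idx0 & 0x1FFFF0 drops the low 4 bits (alignment)
// and everything above bit 20 (scratchpad size).
static inline __m128i* scratchpad_slot(uint8_t* pad, uint64_t idx0)
{
    return reinterpret_cast<__m128i*>(pad + (idx0 & 0x1FFFF0));
}
```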

> One particular decision point in light of the benchmarks is whether we always use MULX if available or if we leave it up to the user, who is already probably very confused by prefetch and thread affinity settings. Any thoughts on that one?

IMO we should skip the explicit usage of MULX for the next release. In my tests I found no system where we get a benefit from the explicit usage.

> If we confirm that this is indeed the case, I will look at the binaries to see what is happening exactly.

Yes we should look into the binaries to understand the optimization much better.

fireice-uk commented 7 years ago

@psychocrypt

Can you point me to the specific data point indicating that? I'm asking because from the first table you seem to have conflicting results: 747 vs 763 for no prefetch, 803 vs 805 for prefetch.

> IMO we should skip the explicit usage of MULX for the next release.

Ah, here we get the per-feature branches. And I thought that if monero makes do with one, we would be set with two =). I would much rather finish digging into mulx first.

> Yes we should look into the binaries to understand the optimization much better.

If you pin down where we get the BMI2-but-not-MULX boost just post the binaries and I will disassemble and find the reason.

psychocrypt commented 7 years ago

> Can you point me to the specific data point indicating that? I'm asking because from the first table you seem to have conflicting results: 747 vs 763 for no prefetch, 803 vs 805 for prefetch.

747 H/s was with `no_prefetch : false`; after removing MULX from the code I got 763 H/s (a performance increase). After disabling memory prefetching with `no_prefetch : true`, but with MULX, I got 803 H/s. Again removing the MULX usage from the code (by hand), the performance increased a little to 805 H/s.

> Ah, here we get the per-feature branches. And I thought that if monero makes do with one, we would be set with two =). I would much rather finish digging into mulx first.

Why do you think that MULX can give us any benefit over `_umul128`? Does MULX have a higher throughput?

> If you pin down where we get the BMI2-but-not-MULX boost just post the binaries and I will disassemble and find the reason.

Currently I can't pin it down to a specific part of the hash function. BMI2 is used for the full function `cryptonight_hash` and the compiler is free to implicitly optimize all parts. I can check whether I get more information if I isolate the loop `for(size_t i = 0; i < ITERATIONS; i++)`.

psychocrypt commented 7 years ago

> If you pin down where we get the BMI2-but-not-MULX boost just post the binaries and I will disassemble and find the reason.

> Currently I can't pin it down to a specific part of the hash function. BMI2 is used for the full function `cryptonight_hash` and the compiler is free to implicitly optimize all parts. I can check whether I get more information if I isolate the loop `for(size_t i = 0; i < ITERATIONS; i++)`.

I used `no_prefetch : true` for all tests.

  1. I isolated this for loop and put the code into a static inline function, passing all arguments by reference, and tested compiling the full project with SSE2 and only the isolated part with BMI2 (see the sketch after this list). I used #80 as the base, where all MULX usage was deleted. The performance with this change is equal to #80, where the full function `cryptonight_hash` is compiled with BMI2.
  2. Then I switched back to the current dev and removed only the MULX usage by hand. The performance was equal to 1.
  3. I used the dev with MULX, and the performance was a little bit worse than with 2.
  4. I used the code base from 2. and changed the CMake file (I removed `-mbmi2`). Now the code was the current dev without the BMI2 compiler flag and without MULX. I got the same performance as in 2. and 1.
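
A minimal sketch of the isolation from step 1, assuming GCC (the function name and loop body are placeholders, not the real kernel): the target attribute confines BMI2 code generation to the one function while the rest of the translation unit is built with plain SSE2 flags.

```cpp
#include <cstddef>
#include <cstdint>

constexpr size_t ITERATIONS = 0x80000; // cryptonight main-loop count

// Only this function is compiled as if -mbmi2 were passed; everything
// else in the file keeps the SSE2-only flags. The body is a stand-in
// for the real mixing loop.
__attribute__((target("bmi2"))) static inline
void main_loop_bmi2(uint64_t& a, uint64_t& b)
{
    for (size_t i = 0; i < ITERATIONS; i++)
    {
        a += b * 0x9E3779B97F4A7C15ull; // placeholder work
        b ^= a >> 7;
    }
}
```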

All in all I came to the conclusion that BMI2 does not give any performance increase. IMO this means that #80 can be closed and we should open a new pull request where we remove MULX and the compiler flag `-mbmi2`.

I repeated the test twice to avoid any mistakes.

@fireice-uk Could you please test the performance on your test systems with the current dev branch and with the changes from 4.?

fireice-uk commented 7 years ago

My results tend to confirm your findings. I see a small boost, but well within the 5 H/s error margin.

However, we don't see a reduction either. Would you be opposed to leaving it in for now, slated for removal in the next release if we don't find any evidence of a speedup by then?

The main reason I expected MULX to be faster is that it doesn't have fixed registers, unlike mul (that's the same reason inline asm tends to slow things down), but optimisation is always guesswork.
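
To illustrate the register point (a hedged sketch, not the miner's actual code): plain MUL always writes its result to RDX:RAX, forcing the compiler to shuffle values around it, while MULX takes one source implicitly in RDX but can write any two destination registers and leaves the flags untouched.

```cpp
#include <cstdint>

// 64x64 -> 128-bit multiply. With default flags GCC emits MUL here
// (results pinned to RDX:RAX); with BMI2 enabled it may emit MULX,
// which can pick arbitrary destination registers and does not
// clobber the flags.
static inline uint64_t mul128(uint64_t a, uint64_t b, uint64_t* hi)
{
    unsigned __int128 r = (unsigned __int128)a * b;
    *hi = (uint64_t)(r >> 64);
    return (uint64_t)r;
}
```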

psychocrypt commented 7 years ago

We can keep it; I will re-add mulx as an option in #80. But we still need #80 to support a binary with 2 code paths.

fireice-uk commented 7 years ago

I will have a look at implementing it without that right now. The other thing that I would like to do before Friday is autoconfig for CPU.

psychocrypt commented 7 years ago

If we can live for now with autoconfig without affinity support, I can provide it with hwloc. This means checking the system for the cache sizes and calculating the number of threads to utilize the cache. Pinning with hwloc is also possible, but I am not sure if the OS core indices can be mapped to hwloc.
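
A rough sketch of that autoconfig idea using the hwloc 1.x C API (treat the exact calls and the 2 MiB-per-thread rule as assumptions, not the eventual implementation): sum the last-level cache sizes and derive the thread count from the scratchpad size.

```cpp
#include <hwloc.h>
#include <cstdio>

int main()
{
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    // Sum the sizes of all L3 caches in the topology.
    unsigned long long l3_total = 0;
    hwloc_obj_t obj = nullptr;
    while ((obj = hwloc_get_next_obj_by_type(topo, HWLOC_OBJ_CACHE, obj)))
        if (obj->attr->cache.depth == 3)
            l3_total += obj->attr->cache.size;

    // One cryptonight thread wants a 2 MiB scratchpad resident in cache.
    unsigned threads = (unsigned)(l3_total / (2ull * 1024 * 1024));
    std::printf("suggested threads: %u\n", threads);

    hwloc_topology_destroy(topo);
    return 0;
}
```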

fireice-uk commented 7 years ago

For now I would like to do autoconfig that does two things:

hwloc will only apply to servers (single-cpu machines won't gain anything), so the standard of technical competence tends to go up there

fireice-uk commented 7 years ago

@psychocrypt this should fix the build: on gcc the bmi2 switch will apply only to the inline function, while MSVC will emit the instruction anyway since we asked for it.

Also I will do the autoconfig in a separate branch -> looks like we will evolve to the branch-per-issue model anyway.

psychocrypt commented 7 years ago

I will test it tonight, but if the wrapper function is called `_mulx_u64`, don't we create an infinite loop? Could we move the wrapper into a namespace, or drop the `_` at the beginning? A leading `_` should not be used because it is reserved for intrinsics.

fireice-uk commented 7 years ago

There is no loop. This is a sneaky parameter overload, as for gcc `unsigned long long` != `uint64_t` (don't ask me why). The reason I did it like that is that no extra code is needed for MSVC. The other way to do it would be to simply define `compat_mulx_u64` for both compilers, but given that

`_mulx_u64(uint64_t, uint64_t, uint64_t*)` ≡ `_mulx_u64(unsigned long long, unsigned long long, unsigned long long*)`

I don't see how this can trip somebody up.
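
A minimal sketch of the overload being described, assuming GCC on an LP64 Linux target where `uint64_t` is `unsigned long` (the actual code in dev may differ in detail): the inner call resolves to the real intrinsic because its `unsigned long long` parameters are an exact match, so there is no recursion.

```cpp
#include <immintrin.h>
#include <cstdint>

// Overload of the intrinsic for uint64_t (= unsigned long on LP64 GCC).
// The target attribute scopes BMI2 to this one inline, so the rest of
// the file can stay SSE2-only. MSVC needs none of this: its intrinsic
// already accepts these types and always emits MULX.
__attribute__((target("bmi2"))) static inline
uint64_t _mulx_u64(uint64_t a, uint64_t b, uint64_t* hi)
{
    // Exact-match overload resolution selects the real intrinsic here.
    return _mulx_u64((unsigned long long)a, (unsigned long long)b,
                     (unsigned long long*)hi);
}
```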

psychocrypt commented 7 years ago

The current dev implementation slows down performance compared to e.g. #79.

Test 2 x E5-2630 v3 @ 2.40GHz

760 H/s compared to 803 H/s here

Test 2 x E5-2609 @ 2.40GHz (BMI2 not supported)

353 H/s compared to 362 H/s

Update: I ran a validation test to check whether something is wrong with my test system; with #80 I still get 803 H/s, and 362 H/s on the other system.

fireice-uk commented 7 years ago

It looks like bad benchmarking to be honest - 353 vs 363, especially as the code there will be 1:1 identical.

psychocrypt commented 7 years ago

@fireice-uk The rates are always stable and I can reproduce the values on each run. I also copied the mulx code from #79 directly into the file `cryptonight_aesni.hpp` and can reproduce the slowdown of the current dev.

Did you compare the results of dev with the results from #80 on your system?

fireice-uk commented 7 years ago

Just going to give them a run since the results seem extremely odd. Are you comparing dev to #79, #80, or another version in https://github.com/fireice-uk/xmr-stak-cpu/issues/78#issuecomment-297121596 ?

psychocrypt commented 7 years ago

#79 and #80 both get the same results. Please test with #80, because my online branch for #79 is deleted.

psychocrypt commented 7 years ago

I found the issue: if you look at the results earlier in this thread, pull request #79 also got only 741 H/s, because of MULX.

Now I run both tests again:

Test 2 x E5-2630 v3 @ 2.40GHz

| version used | threads | no_prefetch | cache | hash/s | huge pages | system |
| --- | --- | --- | --- | --- | --- | --- |
| #79 | 16 | true | 2x20MiB | 745 | no | ubuntu 14.04 |
| #80 | 16 | true | 2x20MiB | 803 | no | ubuntu 14.04 |

fireice-uk commented 7 years ago

I benchmarked in single-thread mode in a command-line environment and I can confirm the results. The numbers were (master / #80 / dev): 36.9 / 36.8 / 34.5.

Thanks for persisting with this one. I will have a look at wtf is happening there tomorrow.

psychocrypt commented 7 years ago

> Thanks for persisting with this one. I will have a look at wtf is happening there tomorrow.

This decrease is coming from mulx.

Maybe it is something like what is described here:

> Some processors (definitely Xeon E5 v3, but perhaps others) incur a delay of about 10 microseconds (about 30,000 cycles) shortly after any core begins using 256-bit registers (if it has not used 256-bit registers in the last millisecond). (This appears to be a delay required to "turn on" the upper 128 bits of the vector pipelines -- similar to a frequency change, but occurring even if the frequency does not change, so it is probably a delay due to a voltage change.) This stall only occurs when using 256-bit registers -- AVX/AVX2 encodings of scalar and 128-bit register operations do not generate this delay.
> ...

But I tested using `vzeroall`, which was suggested here on Stack Overflow, and it does not solve the slowdown.
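
For reference, a sketch of that kind of mitigation (assuming GCC intrinsics; per the comment above it did not remove the slowdown here): clear the upper YMM lanes before returning to SSE-only code.

```cpp
#include <immintrin.h>

// Zero the upper 128 bits of all YMM registers so subsequent SSE
// instructions avoid the AVX<->SSE transition penalty. VZEROALL
// (_mm256_zeroall) clears the full registers instead.
__attribute__((target("avx"))) static inline void leave_avx_section()
{
    _mm256_zeroupper(); // emits VZEROUPPER
}
```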

fireice-uk commented 7 years ago

This is the AVX switch - I already tried that :).

How come #80 has the same results as the master if MULX is the cause?

psychocrypt commented 7 years ago

> How come #80 has the same results as the master if MULX is the cause?

Because in #80 the MULX usage is removed (as in master). The usage of bmi2 in #80 does not bring any benefit (as detected in this post).

Are you running on a test system with huge memory pages enabled? If so, then you should not get any speed increase with #80 compared to the master branch.

fireice-uk commented 7 years ago

@psychocrypt correct, I agree with you now. One more thing before the whole MULX goes in the trash - do you have the same results on AMD chips?

fireice-uk commented 7 years ago

Looks like our competition switched to the SSE2 algos that I wrote back in Oct too and dropped mulx - I guess imitation is the sincerest form of flattery :)

psychocrypt commented 7 years ago

> @psychocrypt correct, I agree with you now. One more thing before the whole MULX goes in the trash - do you have the same results on AMD chips?

Yes, same results on the AMD system. I can rerun it tonight and post the results if needed.

fireice-uk commented 7 years ago

I removed the code in ec1e41cd8c78492ac5c657d0434d84bd9c491fb5. I will close this issue and the PR, since it is clear that that direction is a no-go.