PoW modifications (shuffles and integer math) discussion and tests

SChernykh commented 6 years ago

The original discussion starts here: https://github.com/monero-project/monero/issues/3545#issuecomment-380117524

The GPU version of the shuffle and integer math modifications is here: https://github.com/SChernykh/xmr-stak-amd

You can post your performance test results, suggestions, and concerns here.

AMD Ryzen 7 1700 @ 3.6 GHz, 8 threads

Mod	Hashrate	Performance level
-	600.8 H/s	100.0%
INT_MATH	588.0 H/s	97.9%
SHUFFLE	586.6 H/s	97.6%
Both mods	572.0 H/s	95.2%

AMD Ryzen 5 2600 @ 4.0 GHz, 1 thread

Mod	Hashrate	Performance level
-	97.0 H/s	100.0%
INT_MATH	91.7 H/s	94.5%
SHUFFLE	94.6 H/s	97.5%
Both mods	91.3 H/s	94.1%
Both mods (PGO build)	93.5 H/s	96.4%
Both mods (ASM optimized)	94.8 H/s	97.7%

AMD Ryzen 5 2600 @ 4.0 GHz, 8 threads (affinity 0,2,4,5,6,8,10,11)

Mod	Hashrate	Performance level
-	657.6 H/s	100.0%
INT_MATH	613.3 H/s	93.3%
SHUFFLE	647.0 H/s	98.4%
Both mods	612.3 H/s	93.1%
Both mods (PGO build)	622.4 H/s	94.6%
Both mods (ASM optimized)	636.0 H/s	96.7%

Intel Pentium G5400 (Coffee Lake, 2 cores, 4 MB Cache, 3.70 GHz), 2 threads

Mod	Hashrate	Performance level
-	146.5 H/s	100.0%
INT_MATH	141.0 H/s	96.2%
SHUFFLE	145.3 H/s	99.2%
Both mods	140.5 H/s	95.9%

Intel Core i5 3210M (Ivy Bridge, 2 cores, 3 MB Cache, 2.80 GHz), 1 thread

Mod	Hashrate	Performance level
-	72.7 H/s	100.0%
INT_MATH	66.3 H/s	91.2%
SHUFFLE	71.1 H/s	97.8%
Both mods	66.3 H/s	91.2%
Both mods (PGO build)	66.3 H/s	91.2%
Both mods (ASM optimized)	69.6 H/s	95.7%

Intel Core i7 2600K (Sandy Bridge, 4 cores, 8 MB Cache, 3.40 GHz), 1 thread

Mod	Hashrate	Performance level
-	85.6 H/s	100.0%
Both mods	70.6 H/s	82.5%
Both mods (PGO build)	76.5 H/s	89.4%
Both mods (ASM optimized)	79.2 H/s	92.5%

Intel Core i7 7820X (Skylake-X, 8 cores, 11 MB Cache, 3.60 GHz), 1 thread

Mod	Hashrate	Performance level
-	68.3 H/s	100.0%
INT_MATH	65.9 H/s	96.5%
SHUFFLE	67.3 H/s	98.5%
Both mods	65.0 H/s	95.2%

The XMR-STAK version used here is old, so don't expect the same numbers that you see on your mining rigs. What matters here are the relative numbers for the original and modified Cryptonight versions.
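
The "Performance level" column in these tables is the modified hashrate relative to the unmodified baseline on the same hardware. A minimal C sketch of that calculation follows; the function name `performance_level` is made up for illustration and is not part of xmr-stak.

```c
#include <assert.h>

/* Relative performance in percent: modded hashrate / baseline hashrate.
 * The function name is illustrative and not part of xmr-stak. */
static double performance_level(double baseline_hs, double modded_hs)
{
    assert(baseline_hs > 0.0);   /* a zero baseline has no meaningful ratio */
    return 100.0 * modded_hs / baseline_hs;
}
```

For the Ryzen 7 1700 INT_MATH row, `performance_level(600.8, 588.0)` gives about 97.9.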

Radeon RX 560 on Windows 10 (overclocked): core @ 1196 MHz, memory @ 2200 MHz, 1 Click PBE Timing Straps, monitor plugged in, intensity 1024, worksize 32:

Mod	Hashrate	Performance level
-	477.1 H/s	100.0%
INT_MATH	448.4 H/s	94.0%
SHUFFLE	457.6 H/s	95.9%
Both mods	447.0 H/s	93.7%
Both mods strided*	469.8 H/s	98.5%

* strided_index = 2, mem_chunk = 2 (64 bytes)

Radeon RX 560 on Windows 10 (RX 550 simulation): core @ 595 MHz, memory @ 2200 MHz, 1 Click PBE Timing Straps, monitor plugged in, intensity 1024, worksize 32:

Mod	Hashrate	Performance level
-	394.3 H/s	100.0%
INT_MATH	357.4 H/s	90.6%
SHUFFLE	343.2 H/s	87.0%
Both mods	316.4 H/s	80.2%
Both mods, intensity 1440*	321.1 H/s	81.4%

* Increasing intensity to 1440 improved performance with both mods, but made performance worse in the other cases.

It looks like the RX 550 needs GPU core overclocking to handle the new modifications properly.

GeForce GTX 1080 Ti 11 GB on Windows 10: core 2000 MHz, memory 11800 MHz, monitor plugged in, intensity 1280, worksize 8:

Mod	Hashrate	Performance level
-	908.4 H/s	100.0%
INT_MATH	902.7 H/s	99.4%
SHUFFLE	848.6 H/s	93.4%
Both mods	846.7 H/s	93.2%

GeForce GTX 1060 6 GB on Windows 10: all stock, monitor plugged in, intensity 800, worksize 8:

Mod	Hashrate	Performance level
-	453.6 H/s	100.0%
INT_MATH	452.2 H/s	99.7%
SHUFFLE	422.6 H/s	93.2%
Both mods	421.5 H/s	92.9%

GeForce GTX 1050 2 GB on Windows 10: core 1721 MHz, memory 1877 MHz, monitor unplugged, intensity 448, worksize 8:

Mod	Hashrate	Performance level
-	319.9 H/s	100.0%
INT_MATH	318.1 H/s	99.4%
SHUFFLE	292.5 H/s	91.4%
Both mods	291.0 H/s	91.0%

tevador commented 6 years ago

RX 550 (2 GB, 640 shaders) / Ubuntu 16.04

Mode	Intensity/Worksize	Hashrate
-	600/8	395 H/s
`-DINT_MATH_MOD -DSQRT_OPT_LEVEL=0`	760/32	277 H/s
`-DINT_MATH_MOD -DSQRT_OPT_LEVEL=1`	760/32	345 H/s
`-DINT_MATH_MOD -DSQRT_OPT_LEVEL=2`	760/32	319 H/s
`-DSHUFFLE_MOD -DINT_MATH_MOD -DSQRT_OPT_LEVEL=0`	760/32	218 H/s
`-DSHUFFLE_MOD -DINT_MATH_MOD -DSQRT_OPT_LEVEL=1`	760/32	254 H/s
`-DSHUFFLE_MOD -DINT_MATH_MOD -DSQRT_OPT_LEVEL=2`	760/32	259 H/s

The results are a bit strange. Hashrate without any mods dropped from 425 to 395. INT_MATH with optimization 1 is faster without shuffle, optimization 2 is faster with shuffle mod.

SChernykh commented 6 years ago

> The results are a bit strange. Hashrate without any mods dropped from 425 to 395

It was calculated incorrectly before.

> INT_MATH with optimization 1 is faster without shuffle, optimization 2 is faster with shuffle mod.

Optimization 1 is for NVIDIA cards only; AMD cards don't need it. It's really strange, because optimization 2 actually does fewer computations than optimization 1.

SChernykh commented 6 years ago

@tevador @MoneroCrusher I've improved my shuffle mod GPU code significantly. There is almost no slowdown with shuffle now and much better performance with both mods on RX 550. Can you check it? And we still need someone with Vega 56/64...

MoneroCrusher commented 6 years ago

@SChernykh @tevador I can check for both RX 550 (8 CU & 10 CU) and Vega 56 (and Vega 56 with 64 BIOS Flashed) in a couple hours. Only used the Vega on Windows so far. Are Linux drivers finally up to date? What should I use?

SChernykh commented 6 years ago

@MoneroCrusher You can test on Windows as well, it's not a problem. I've added .sln file for Visual Studio so you can compile it.

P.S. Community edition of Visual Studio (which is free) should be enough for compiling.

SChernykh commented 6 years ago

In the meantime, I've tried to overclock memory on my RX 560 (only memory, I left GPU core clock at default 1196 MHz), here are the results:

Radeon RX 560 on Windows 10 (overclocked): core @ 1196 MHz, memory @ 2275 MHz, monitor plugged in, intensity 1000, worksize 32:

Mod	Hashrate	Performance level
-	407.2 H/s	100.0%
INT_MATH	406.5 H/s	99.8%
SHUFFLE	389.0 H/s	95.5%
Both mods	386.3 H/s	94.9%

Is 2275 MHz a good speed for the memory on RX 560? I didn't get any CPU/GPU mismatch errors and I can't overclock it further - MSI Afterburner just doesn't let me do it.

@MoneroCrusher Did you try to test your Vega?

MoneroCrusher commented 6 years ago

@SChernykh You did those tests with 1 click timing straps? Can you try to do 2 threads? Did not test anything yet but will do now. Would be happy if you could provide me with the Windows binary. I don't have visual studio.

SChernykh commented 6 years ago

@MoneroCrusher

No idea about timing straps. Whatever is default on the stock card I guess. I only used MSI Afterburner and changed memory frequency, that's all.
Performance is the same with two threads at intensity 500.
Added the Windows binary.

MoneroCrusher commented 6 years ago

@SChernykh Can you do the tests with PBE 1 click timing straps? More real life then. Thanks for the Windows binary, btw!

I ran the tests now. At first I wrongly used worksize 8 for the mods because I hadn't seen that you guys were using WS 32, so I corrected that afterwards and tried both WS 16 and 32.

Gigabyte RX 550 2 GB, 8 CU, 2 Threads (432/432), 1220/2150, 1 Click PBE Timing Straps, Ubuntu 16.04

Mod	Hashrate (WS 8)	Hashrate (WS 16)	Hashrate (WS 32)
No Mod	467 H/s	440 H/s	Crash
SHUFFLE	409 H/s	453 H/s	Crash
INT_MATH	223 H/s	302 H/s	360 H/s
Both mods	202 H/s	267 H/s	316 H/s

Sapphire RX 550 2 GB, 10 CU, 2 Threads (432/432), 1220/2150, 1 Click PBE Timing Straps, Ubuntu 16.04

Mod	Hashrate (WS 8)	Hashrate (WS 16)	Hashrate (WS 32)
No Mod	528 H/s	479 H/s	470 H/s
SHUFFLE	419 H/s	458 H/s	419 H/s
INT_MATH	229 H/s	354 H/s	353 H/s
Both mods	217 H/s	309 H/s	315 H/s

Vega RX 56, 56 CU, 2 Threads (2016/1716), 950/1417, Windows 10

Mod	Hashrate (WS 8)	Hashrate (WS 16)	Hashrate (WS 32)
No Mod	1650 H/s	1632 H/s	1613 H/s
SHUFFLE	1588 H/s	1639 H/s	1591 H/s
INT_MATH	1052 H/s	1411 H/s	1471 H/s
Both mods	1026 H/s	1321 H/s	1303 H/s

So worksize 32 helps INT_MATH the most, worksize 16 helps SHUFFLE the most, and worksize 8 helps the unmodified version the most. Is there some way to align them?

Also, could somebody ELI5 me why RandomJS permanently prevents ASICs?

SChernykh commented 6 years ago

> Can you do the tests with PBE 1 click timing straps? More real life then

I'll do it this evening. Hopefully it won't brick my card. Thanks for the numbers for Vega 56. It seems that it can handle shuffle mod perfectly. As for integer math mod, it's 89% performance compared to no mods and 81% performance for shuffle+int_math compared to shuffle mod. Can you try to tweak parameters some more? Also overclocking GPU core should really help. We need to know how good it can perform.

> Also, could somebody ELI5 me why RandomJS permanently prevents ASICs?

Any ASIC that can run random code is basically a CPU. Read this comment: https://github.com/monero-project/monero/issues/3545#issuecomment-398206972

SChernykh commented 6 years ago

> Vega RX 56, 56 CU, 2 Threads (2016/1716), 950/1417

Are the last 2 numbers GPU core and memory clocks? You really need to push GPU core to the maximum - integer math mod adds a lot of computations.

MoneroCrusher commented 6 years ago

@SChernykh No, HBM mem has different clocks. 950 is mem and 1417 is core.

Is it necessary for both mods to be implemented for ASIC resistance or would one of them be enough?

SChernykh commented 6 years ago

> No, HBM mem has different clocks. 950 is mem and 1417 is core.

Ok, it looks like it's time to leave only 1 square root in int_math mod to make it easier for GPUs. Two square roots were kind of overkill anyway.

> Is it necessary for both mods to be implemented for ASIC resistance or would one of them be enough?

They target different classes of ASICs/FPGAs. Shuffle mod targets devices with external memory, making them 4 times slower. Integer math mod targets devices with on-chip memory, making them 8-10 times slower because of high division and square root latency. They work best together. Remove one mod, and you will enable an efficient ASIC/FPGA again - either with on-chip SRAM or with HBM external memory.

MoneroCrusher commented 6 years ago

Will going from 2 square roots to 1 square root make it an easier game for FPGAs? Is it somehow possible to make the RX 550 better? Quite a few people are mining on those, and it would be a pity if they did 40-50% worse than other GPUs in a pre-fork/post-fork comparison.

SChernykh commented 6 years ago

> Will going from 2 square roots to 1 square root make it an easier game for FPGAs?

Leaving just 1 square root won't make it easier for FPGA. The point of having a square root is that they'll still need to implement it and waste space on chip for it and that it has high computation latency.

> Is it somehow possible to make RX 550 better?

Leaving 1 square root should help. The problem with the RX 550 is that it is unbalanced, unlike other Radeons. If you calculate the ratio of GFLOPS to memory bandwidth, it is in the range of 20-25 GFLOPS per GB/s for all Radeons from the RX 560 up to the Vega 64. The RX 550 has only about 10 GFLOPS per GB/s, which is two times worse.
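
The ratio can be estimated with a few lines of C. This is only a sketch: the function names are invented, peak GFLOPS is taken as 2 FLOPs (one fused multiply-add) per shader processor per clock, and the 112 GB/s memory bandwidth used in the example below is an assumed stock figure for both cards, not a number from this thread.

```c
#include <assert.h>

/* Peak single-precision GFLOPS, assuming each shader processor retires
 * one fused multiply-add (2 FLOPs) per clock. Names are illustrative. */
static double peak_gflops(int shader_processors, double core_clock_mhz)
{
    return 2.0 * shader_processors * core_clock_mhz / 1000.0;
}

/* Compute-to-bandwidth ratio in GFLOPS per GB/s. */
static double compute_to_bandwidth(double gflops, double bandwidth_gb_s)
{
    assert(bandwidth_gb_s > 0.0);
    return gflops / bandwidth_gb_s;
}
```

With 1024 shader processors at 1196 MHz (the RX 560 settings used above) and the assumed 112 GB/s, the ratio comes out near 21.9. With 512 shader processors (8 CU) at 1220 MHz and the same bandwidth it is about 11.2, roughly half.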

SChernykh commented 6 years ago

I think I'll make the number of square roots configurable for convenience. You'll be able to test 0, 1 or 2 square roots in int_math mod.

MoneroCrusher commented 6 years ago

Please find a way to disadvantage all GPUs the same, if that's in any way possible!

SChernykh commented 6 years ago

It's possible to slow down all GPUs from RX 560 up to Vega 64 the same, I can't guarantee it with RX 550. But we still have time for experimenting, the next fork is in September/October.

MoneroCrusher commented 6 years ago

@SChernykh So nice you came up with this algo. So in your opinion it will permanently remove ASICs and FPGAs from the network? It also uses much less power than ProgPOW. What's your opinion about ProgPOW?

I hope we can fork the PoW sooner if it's production ready. There are reasons to believe FPGAs/ASICs are already on the network. Edit: really hope there is a way to disadvantage 560-vega more than 550 to balance it out, but let's test!

SChernykh commented 6 years ago

As for FPGAs that are coming in August (BCU1525) - they'll be slowed down from 20 KH/s to less than 5 KH/s (even down to 2 KH/s if my assumptions about division and square root latencies are correct) which will make them much worse than Vega 56/64 in terms of performance per $, so they'll not be mining Cryptonight at all.

As for possible ASICs: devices with external memory will still be ~2.5 times faster than Vega 56/64 if they use the same HBM2 memory. Given that they'll certainly be more expensive, they won't be a serious competition. Devices with on-chip memory like those 220 KH/s Bitmain miners will be down to 20-30 KH/s, still at 550 watts. Much less dangerous to the network.

ProgPOW is perfectly tuned for GPUs, but it's not CPU-friendly like Cryptonight which is a minus for decentralization. ProgPOW ASICs won't be economically viable at all.

SChernykh commented 6 years ago

I've tested RX 560 with one click timing straps, could overclock memory to 2200 MHz where it started giving CPU/GPU mismatch errors (1-3 errors per test run), but I could still test the performance.

Radeon RX 560 on Windows 10 (overclocked): core @ 1196 MHz, memory @ 2200 MHz, 1 Click PBE Timing Straps, monitor plugged in, intensity 1000, worksize 32:

Mod	Hashrate	Performance level
-	448.5 H/s	100.0%
INT_MATH	447.3 H/s	99.7%
SHUFFLE	446.8 H/s	99.6%
Both mods	439.3 H/s	97.9%

It looks like new memory timings made things better compared to plain memory overclock: 97.9% vs 94.9% performance for plain memory overclock.

P.S. I didn't see any difference between 8, 16 and 32 worksizes for version without mods.

SChernykh commented 6 years ago

I've tested it again at 2150 MHz memory - there were no CPU/GPU mismatch errors at all. I wanted to be sure that errors didn't influence test results.

Radeon RX 560 on Windows 10 (overclocked): core @ 1196 MHz, memory @ 2150 MHz, 1 Click PBE Timing Straps, monitor plugged in, intensity 1024, worksize 32:

Mod	Hashrate	Performance level
-	447.4 H/s	100.0%
INT_MATH	438.0 H/s	97.9%
SHUFFLE	440.6 H/s	98.5%
Both mods	434.9 H/s	97.2%

No mods version worked better with 2 threads @ 512 intensity, all versions with mods worked better with 1 thread @ 1024 intensity.

SChernykh commented 6 years ago

> really hope there is a way to disadvantage 560-vega more than 550 to balance it out..but let's test!

It's just impossible because 560 has exactly the same memory but 2 times more powerful GPU core. Whatever you do, 560 will be faster than 550.

SChernykh commented 6 years ago

I removed one square root from integer math mod, also tested different thread count, intensity and worksize for different versions. The best I could get:

466.7 H/s (up from 447.4 H/s) for version without mods: 2 threads, intensity 512, worksize 8
439.1 H/s (up from 434.9 H/s) for version with both mods (one square root per iteration): 1 thread, intensity 1024, worksize 32

This is for RX 560. @MoneroCrusher Can you test again on Vega 56 and RX 550? I've updated the repository.

SChernykh commented 6 years ago

@MoneroCrusher If you haven't started testing yet, don't do it for now. I have some very cool changes incoming for the integer math mod. These changes both improve GPU performance AND slow down ASICs/FPGAs two times more compared to the current integer math mod.

P.S. GPU performance didn't really improve, it stayed the same. But still very cool.

MoneroCrusher commented 6 years ago

I started tests and it has gotten better (around 20% for RX 550, but still not at parity like 56/7/80; I haven't tested the Vega yet), but I'll wait then.

Very cool! So the advantage of ASIC will only be 2-3x after your mod?

SChernykh commented 6 years ago

> Very cool! So the advantage of ASIC will only be 2-3x after your mod?

Yes. We're now talking about ~15x slowdown for the coming BCU1525 FPGA (20 KH/s -> 1.4 KH/s) and similar slowdown for Bitmain ASICs (220 KH/s -> 15 KH/s). Strange thing: these changes improved int_math mod performance when it's applied alone (10% better), but int_math + shuffle stayed the same, even got 1% slower. I'm sure it can be improved further.
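
As a quick check of the quoted figures, the slowdown factor is just the ratio of the hashrate before the change to the hashrate after it. The helper name below is hypothetical.

```c
#include <assert.h>

/* Slowdown factor: hashrate before the change divided by hashrate after. */
static double slowdown_factor(double before_khs, double after_khs)
{
    assert(after_khs > 0.0);
    return before_khs / after_khs;
}
```

`slowdown_factor(20.0, 1.4)` is about 14.3 and `slowdown_factor(220.0, 15.0)` is about 14.7, both in line with the "~15x" estimate.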

SChernykh commented 6 years ago

I've committed it to the repository: https://github.com/SChernykh/xmr-stak-amd/commit/566f30c42f9ffe4e8fc610a35242d4a8ba0ea063

The trick was to prevent parallel calculation of division and square roots. Now they have to be done in sequence, effectively doubling the latency for ASIC/FPGA. You can start testing now.
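
To illustrate the idea of forcing the division and the square root into sequence, the sketch below contrasts two hypothetical loop steps. It is not the code from the commit above: the function names, the way the results are combined, and the integer square root routine are all assumptions made for the example. In the first step the two operations have independent inputs, so hardware can overlap them. In the second, the square root input depends on the quotient, so it cannot start until the division has finished.

```c
#include <assert.h>
#include <stdint.h>

/* Integer square root (floor), classic bit-by-bit method. */
static uint64_t isqrt_u64(uint64_t n)
{
    uint64_t result = 0;
    uint64_t bit = (uint64_t)1 << 62;   /* highest power of four in 64 bits */

    while (bit > n)
        bit >>= 2;
    while (bit != 0) {
        if (n >= result + bit) {
            n -= result + bit;
            result = (result >> 1) + bit;
        } else {
            result >>= 1;
        }
        bit >>= 2;
    }
    return result;
}

/* Independent inputs: division and square root can run in parallel. */
static uint64_t step_independent(uint64_t value, uint64_t divisor)
{
    assert(divisor != 0);
    uint64_t quotient = value / divisor;
    uint64_t root = isqrt_u64(value);
    return quotient + root;
}

/* Chained inputs: the square root needs the quotient first, so the two
 * latencies add up instead of overlapping. */
static uint64_t step_chained(uint64_t value, uint64_t divisor)
{
    assert(divisor != 0);
    uint64_t quotient = value / divisor;
    uint64_t root = isqrt_u64(value + quotient);   /* depends on quotient */
    return quotient + root;
}
```

On a CPU both versions cost about the same, because the operations are executed one after another anyway. The difference matters for a pipelined ASIC/FPGA design, which would otherwise hide one latency behind the other.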

SChernykh commented 6 years ago

I've managed to improve integer math mod a bit more: from 254.3 H/s to 275.6 H/s on my simulated RX 550 when combined with shuffle mod, so it's 8% speed up. But I still need numbers for the real RX 550 and Vega 56 to know where it is at now.

P.S. And I've improved it some more from 275.6 H/s up to 277.0 H/s, so 9% speed up compared to the current version. I don't know what else can be done there without making it easier for ASIC/FPGA. I'm out of ideas for today & waiting for the numbers.

MoneroCrusher commented 6 years ago

Just tested with your last version:

Gigabyte RX 550 2 GB, 8 CU, 2 Threads (432/432), 1220/2150, 1 Click PBE Timing Straps, Ubuntu 16.04

Mod	Hashrate (WS 8)	Hashrate (WS 16)	Hashrate (WS 32)
No Mod	471 H/s	438 H/s	412 H/s
SHUFFLE	408 H/s	456 H/s	430 H/s
INT_MATH	268 H/s	357 H/s	418 H/s
Both mods	231 H/s	297 H/s	340 H/s

Sapphire RX 550 2 GB, 10 CU, 2 Threads (432/432), 1170/2100, 1 Click PBE Timing Straps, Ubuntu 16.04

Mod	Hashrate (WS 8)	Hashrate (WS 16)	Hashrate (WS 32)
No Mod	521 H/s	478 H/s	470 H/s
SHUFFLE	420 H/s	456 H/s	429 H/s
INT_MATH	269 H/s	418 H/s	417 H/s
Both mods	246 H/s	336 H/s	338 H/s

Vega RX 56, 56 CU, 2 Threads (2016/1736), 900/1417 (950 mem was too much), Windows 10

Mod	Hashrate (WS 8)	Hashrate (WS 16)	Hashrate (WS 32)
No Mod	1700 H/s	1692 H/s	1633 H/s
SHUFFLE	1590 H/s	1605 H/s	1526 H/s
INT_MATH	1267 H/s	1605 H/s	1615 H/s
Both mods	1039 H/s	1337 H/s	1147 H/s

Integer math really got a big improvement, but with both mods it is the same, if not a little worse.

I tried overclocking the core as much as possible on the Vega and set the power limit to +47%. I set the core to 1736 MHz, but it actually only showed numbers between 1605 and 1655 MHz. I then hashed at 1490 H/s with both mods but was drawing 334 W (!) at the wall (just the Vega plus an Intel Celeron, which consumes about 25 W).

I tried going from 2 threads to one thread on all cards and in every case I got worse hashrates. Don't know if I did something wrong.

Appreciate your work.

SChernykh commented 6 years ago

Ok, I'll experiment some more and update the code in the evening.

SChernykh commented 6 years ago

@MoneroCrusher I got performance up from 254 to 280 H/s - 10% improvement. You can get latest from the repository and test again.

P.S. I've managed to run tests on my RX 560 with memory @ 2200 MHz and cooler @ 100% without mismatch errors. Updated first post with my test results. Got 477 H/s without mods which is a pretty good number for real-life RX 560 mining, so we can consider RX 560 results fully relevant now.

P.P.S. Overclocking GPU core from 1196 MHz to 1450 MHz improved hashrate with both mods from 446.7 H/s to 456.9 H/s.

mobilepolice commented 6 years ago

https://github.com/SChernykh/xmr-stak-amd/commit/97eb7159265c8d26afa84297a05bb76583ac7f4a#diff-d652535bba8ff7cb31febfa7038c1ab4

Unfortunately it doesn't look like m128i_u32 and _u64 are portable, being MSVC constructs only. I don't know how this affects the overall program but I'm unable to compile with c++ 5+

edit: just to add to that, I've got a fairly sizable farm with multiple vendors of RX550 and RX560 cards that I wrote custom memory timing straps for, so was looking to mess with this and find a relationship between core speed and performance. There's not much of that relationship with the current CN v1 variant (stock speeds to overclocking mem speeds don't buy you much)

SChernykh commented 6 years ago

@mobilepolice Yes, I'm developing in Visual Studio most of the time. But fixing it for GCC is not a problem, I'll do it today. I'm also not done with optimizations yet, expect current numbers to improve.

SChernykh commented 6 years ago

@mobilepolice Fixed it, you should be able to build it now: https://github.com/SChernykh/xmr-stak-amd/commit/de285f4ed3ce29a86a4dfa7817acb18feac94ab9

SChernykh commented 6 years ago

@MoneroCrusher @mobilepolice I've submitted a new portion of optimizations, performance with integer math and both mods together became significantly better.

mobilepolice commented 6 years ago

@SChernykh I really appreciate the effort on this. Unfortunately, _umul128 and _subborrow_u64 are missing now, two more MSVC-only intrinsics. It looks like the subborrow may be in the very latest version of gcc-7 if I compile it by hand, but I hadn't got that far. I hashed out a _umul128, but I'm not sure if it's correct (math stuff in C/C++ is very new to me, so my apologies).

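#include <stdint.h>

// Uses unsigned __int128, a GCC/Clang extension.
static inline uint64_t _umul128(uint64_t a, uint64_t b, uint64_t* hi) {
        // Cast one operand first so the multiplication is done in 128 bits.
        unsigned __int128 z = (unsigned __int128)a * b;
        *hi = (uint64_t)(z >> 64);      // high 64 bits of the product
        return (uint64_t)z;             // low 64 bits of the product
}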
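
The other missing intrinsic, `_subborrow_u64`, can also be written in portable C. The sketch below assumes the MSVC/Intel semantics, where the function computes `a - b - borrow_in`, stores the low 64 bits in `*out`, and returns the borrow out. The name `subborrow_u64_portable` is made up for this example, and this is not the fix that went into the repository.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Portable stand-in for _subborrow_u64: *out = a - b - borrow_in,
 * return value is 1 if the subtraction wrapped around, else 0. */
static inline unsigned char subborrow_u64_portable(unsigned char borrow_in,
                                                   uint64_t a, uint64_t b,
                                                   uint64_t *out)
{
    assert(out != NULL);
    uint64_t borrow = borrow_in ? 1u : 0u;
    uint64_t diff = a - b;                    /* wraps modulo 2^64 */
    unsigned char borrow_out =
        (unsigned char)((a < b) | (diff < borrow));
    *out = diff - borrow;
    return borrow_out;
}
```

For example, subtracting 5 from 3 with no incoming borrow stores 2^64 - 2 in `*out` and returns 1.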

SChernykh commented 6 years ago

@mobilepolice Wait a bit more till I fix this. I'll also have some more optimizations ready, good that you haven't started testing yet.

mobilepolice commented 6 years ago

Nope! I haven't. I'm about to run out for stuff to work on today but I will be back in about six hours and pick this back up again so take your time! Thanks

SChernykh commented 6 years ago

@mobilepolice I've submitted the fix and also some smaller performance improvements. This time I actually checked that it compiles successfully on Ubuntu 18.04. You can start testing now.

mobilepolice commented 6 years ago

@SChernykh My apologies it's taken me so long to get back to this. BitTube/IPBC had a fork today and I was messing around with a miner and a bunch of settings while working on these tests.

On the below data:

The CNv1 (Monero v7) is taken with a new-ish build of xmr-stak, so these numbers are just for reference with regard to what I'm doing today on these cards.

I typically run 2 threads and 448 intensity. There are actually a few MB left over for more intensity, either in 2-thread mode or in single-thread mode (752 is the limit in single thread; 730-740 is "safe" for guaranteed startup). I've found it's not wise to run right up against the limit: if for any reason there's less memory available than there should be at that time, bad things happen. Cards crash, the OS crashes, strange hangs, etc.

Also, on CNv1/Monerov7 I found that using compute units between 8 but less than 16 usually resulted in less performance. Your code seems to behave differently as you'll see in the data below.

I only had one CPU/GPU nonce mismatch, marked with (1) in the table. It might be that card, as my memory timings are very tight on these cards (Elpida memory on everything in this list).

I didn't test the 550's very much as there's not a lot to be gained there. I also saw you were running 1024 intensity on some cards, Are they 4GB?

Unroll 8 seems to work better for me in nearly every case involving INT_MATH.

There's one spurious result, the RX560 Unroll 16 single thread BOTH is better than the Unroll 8 2 thread. Compared to the 560D, this should be the other way around. (338 vs 446 respectively). I will retest tomorrow.

Card	Compute Units	Memory (GB)	Core (MHz)	Mem (MHz)	Threads	Intensity	Worksize	Stock	Shuffle	Int_Math	Both	CNv1 (new xmr-stak, No Unroll)
XFX RX560D (Unroll 16)	14	2	1150	1750	2	448	8	503	459	408	289	542
XFX RX560D (Unroll 16)	14	2	1150	1750	1	732	8	475	434	383	333	-
XFX RX560D (Unroll 16)	14	2	1150	1750	1	744	14	478	442	476	438	-
XFX RX560D (Unroll 8)	14	2	1150	1750	2	448	8	503	460	506	446	542
XFX RX560D (Unroll 8)	14	2	1150	1750	1	744	14	477	442	475	438	-
XFX RX560D (Unroll 4)	14	2	1150	1750	2	448	8	503	460	501	440	542
Sapphire RX550 (Unroll 16)	8	2	1325	1760	2	448	8	469	430	312	257	520
Sapphire RX550 (Unroll 8)	8	2	1325	1760	2	448	8	469	437	343	297	520
Sapphire RX550 (Unroll 4)	8	2	1325	1760	2	448	8	469	439	336	296	520
VisionTek/XFX/HIS RX550 (Unroll 16)	8	2	1325	1760	2	448	8	469	430	312	257	520
XFX/HIS RX560 (Unroll 32)	16	2	1150	1750	2	448	16	502	387	308	273	540
XFX/HIS RX560 (Unroll 16)	16	2	1150	1750	2	448	16	518	481	334	276	540
XFX/HIS RX560 (Unroll 16)	16	2	1150	1750	1	744	16	492	457	490(1)	455	-
XFX/HIS RX560 (Unroll 8)	16	2	1150	1750	2	448	16	515	473	490	442	540
XFX/HIS RX560 (Unroll 4)	16	2	1150	1750	2	448	16	516	474	487	439	540
XFX/HIS RX560 (Unroll 2)	16	2	1150	1750	2	448	16	515	474	477	433	540
XFX/HIS RX560 (Unroll 1)	16	2	1150	1750	2	448	16	516	473	493	432	540

mobilepolice commented 6 years ago

This might be a slightly easier way to look at this data.

SChernykh commented 6 years ago

@mobilepolice Thanks for the numbers. I have a 4 GB RX 560 card with 16 compute units. Since the integer math mod does a lot of computations, the optimal intensity for it is N*64, where N is the number of compute units. Intensity 1024 ensures that all 1024 shader processors have things to do. Lower intensity results in proportionally lower performance.
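
The N*64 rule is easy to write down as a helper. This is a minimal sketch; the function and constant names are invented for the example.

```c
#include <assert.h>

/* 64 shader processors per compute unit, as in the N*64 rule above. */
enum { SHADERS_PER_COMPUTE_UNIT = 64 };

/* Intensity that gives every shader processor one hash to work on. */
static int full_occupancy_intensity(int compute_units)
{
    assert(compute_units > 0);
    return compute_units * SHADERS_PER_COMPUTE_UNIT;
}
```

By this rule a 16 CU card wants intensity 1024, a 14 CU card 896, and an 8 CU card 512.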

Loop unroll setting is totally independent of intensity or thread count by the way. You only need to find it once for each mod and then mess with intensity/threads/worksize.

What parameters did you use on new xmr-stak?

P.S. Did you test worksize=32? It gives the best results on my RX 560 with integer math mod and both mods.

mobilepolice commented 6 years ago

@SChernykh No I didn't test worksize 32 but when I reboot the rig tomorrow that I was using to test (it's remote to me right now) I can try that. it looks like from previous tests ws32 works well on the RX 550's. Any other test situations you'd like me to work through I definitely will.

Definitely the more intensity that can be run, the better performance. Unfortunately it looks like to get anything above the numbers I've been running would take a 4GB card, since I just run the cards out of memory otherwise.
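
The memory ceiling follows from Cryptonight's scratchpad size. The sketch below assumes the standard 2 MiB scratchpad per hash and ignores all other buffers and driver overhead, so it gives a lower bound rather than an exact figure. The function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Cryptonight scratchpad size per hash (assumed: the standard 2 MiB). */
#define SCRATCHPAD_BYTES (2u * 1024u * 1024u)

/* Minimum GPU memory needed for the scratchpads alone. */
static uint64_t scratchpad_memory_bytes(int threads, int intensity)
{
    assert(threads > 0 && intensity > 0);
    return (uint64_t)threads * (uint64_t)intensity * SCRATCHPAD_BYTES;
}
```

Two threads at intensity 448 need 1792 MiB for scratchpads, which fits on a 2 GB card. A single thread at intensity 1024 needs 2048 MiB, the card's entire memory, so that setting only works on a 4 GB card.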

Assuming your 439H/s from your RX560 test a couple days back holds the same numbers, my 2GB card outperforms that a bit. the difference is going to be in timings purely, but that should be an indicator that you can probably squeeze more out of that 4GB card.

for xmr-stak I almost universally use 2 threads, 448 intensity, strided_index 1 and comp_mode true

SChernykh commented 6 years ago

@mobilepolice My RX560 does 447 H/s (both mods) with the latest code and worksize=32. I'm mostly interested in performance with both mods vs performance without mods, so pay attention to tweaking both mods' parameters. I guess I also need to implement strided_index and mem_chunk parameters for proper comparison.

mobilepolice commented 6 years ago

@SChernykh I wouldn't sweat the strided_index and mem_chunk stuff so much, I only gave numbers from the new xmr-stak as a reference I had today. I expect reimplementation into the new xmr-stak to pick up a little performance as well so I understand that the discrepancies right now between new and old with these changes are large.

It looks like RX560 to RX560, I'm seeing about an 11% performance drop compared to your ~5%. This is likely due to the 2GB card not being able to keep the compute units as busy as yours can. This doesn't bother me, I would rather have the performance loss and impact ASIC/FPGA more forcefully than keep my measly few percent. :)

Tomorrow I will test worksize 16/32 and unroll 16/8/4/2/1 in combination on the RX 560

SChernykh commented 6 years ago

> It looks like RX560 to RX560, I'm seeing about an 11% performance drop compared to your ~5%. This doesn't bother me, I would rather have the performance loss and impact ASIC/FPGA more forcefully than keep my measly few percent. :)

I guess every single card will get the same performance drop in the end, so relative performance (aka the actual profit per card) will stay the same (+-5% max). RX 550 is the only sad exception so far, it stands out from other Radeons due to very weak GPU core.

SChernykh commented 6 years ago

@mobilepolice I don't know why, but I can't get better numbers with the latest xmr-stak, no matter what combination of settings I try. strided_index=1 even makes things worse for me. Just wanted to test it before implementing it in my code, but I'm not sure now.

SChernykh commented 6 years ago

Huh, I found an interesting pattern: increasing intensity to 1440 improved INT_MATH and both mods performance, but made performance worse for no mods and SHUFFLE mod. It looks like integer math mod really likes worksize=32 and also strongly reacts to higher intensity, much stronger than the original Cryptonight.

Radeon RX 560 on Windows 10 (RX 550 simulation): core @ 595 MHz, memory @ 2200 MHz, 1 Click PBE Timing Straps, monitor plugged in, intensity 1024, worksize 32:

Mod	Hashrate	Performance level
-	394.3 H/s	100.0%
INT_MATH	347.6 H/s	88.2%
INT_MATH, intensity 1440	351.4 H/s	89.1%
SHUFFLE	356.0 H/s	90.3%
Both mods	303.9 H/s	77.1%
Both mods, intensity 1440	319.7 H/s	81.1%

mobilepolice commented 6 years ago

@SChernykh I wasn't able to get out to my shop today and reset the frozen rig that I was using for this testing. I will be able to get out there tomorrow and report back on my results then. In the meantime, I've been looking at adjusting cli-miner.cpp so that if you specify one index in the config.txt file, it programmatically finds the maximum intensity that can be set on a card for n threads, tests all of the different variables, and reports a hash rate for each. I'm not sure how feasible this is. As it is now, I've been doing most of it with a bash script, except that the script only varies the shuffle/int-math settings programmatically.
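
A sweep like this can be sketched as a few nested loops over the tunables. Everything in the block is hypothetical: `gpu_config`, `sweep_best`, and the `bench_fn` callback are invented for illustration and are not part of cli-miner.cpp, and `fake_bench` is a stand-in that returns made-up numbers so the sketch can run without a GPU. Rounding the intensity down to a multiple of the worksize is also an assumption about what the miner accepts.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int threads;
    int intensity;   /* per thread */
    int worksize;
} gpu_config;

/* Benchmark callback: returns the measured hashrate for one configuration. */
typedef double (*bench_fn)(const gpu_config *cfg);

/* Try every thread count and worksize, splitting the intensity budget
 * evenly across threads, and return the fastest configuration. */
static gpu_config sweep_best(bench_fn bench, int max_total_intensity,
                             double *best_rate)
{
    static const int worksizes[] = { 8, 16, 32 };
    gpu_config best = { 0, 0, 0 };
    double best_hs = -1.0;

    assert(bench != NULL && best_rate != NULL);
    for (int threads = 1; threads <= 2; ++threads) {
        for (size_t i = 0; i < sizeof worksizes / sizeof worksizes[0]; ++i) {
            gpu_config cfg;
            cfg.threads = threads;
            cfg.worksize = worksizes[i];
            /* Round down to a multiple of the worksize. */
            cfg.intensity = (max_total_intensity / threads)
                            / cfg.worksize * cfg.worksize;
            double hs = bench(&cfg);
            if (hs > best_hs) {
                best_hs = hs;
                best = cfg;
            }
        }
    }
    *best_rate = best_hs;
    return best;
}

/* Stand-in benchmark with made-up numbers; a real one would run the miner. */
static double fake_bench(const gpu_config *cfg)
{
    return 100.0 + cfg->worksize + cfg->threads;
}
```

A real version would replace `fake_bench` with a function that starts the miner with the given settings, waits for the hashrate to settle, and returns it. The loops could be extended with the unroll factor and the mod selection in the same way.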
