Open SChernykh opened 6 years ago
RX 550 (2 GB, 640 shaders) / Ubuntu 16.04
Mode | Intensity/Worksize | Hashrate |
---|---|---|
- | 600/8 | 395 H/s |
-DINT_MATH_MOD -DSQRT_OPT_LEVEL=0 | 760/32 | 277 H/s |
-DINT_MATH_MOD -DSQRT_OPT_LEVEL=1 | 760/32 | 345 H/s |
-DINT_MATH_MOD -DSQRT_OPT_LEVEL=2 | 760/32 | 319 H/s |
-DSHUFFLE_MOD -DINT_MATH_MOD -DSQRT_OPT_LEVEL=0 | 760/32 | 218 H/s |
-DSHUFFLE_MOD -DINT_MATH_MOD -DSQRT_OPT_LEVEL=1 | 760/32 | 254 H/s |
-DSHUFFLE_MOD -DINT_MATH_MOD -DSQRT_OPT_LEVEL=2 | 760/32 | 259 H/s |
The results are a bit strange. Hashrate without any mods dropped from 425 to 395. INT_MATH with optimization 1 is faster without shuffle, optimization 2 is faster with shuffle mod.
> The results are a bit strange. Hashrate without any mods dropped from 425 to 395

It was calculated incorrectly before.

> INT_MATH with optimization 1 is faster without shuffle, optimization 2 is faster with shuffle mod.

Optimization 1 is for NVIDIA cards only, AMD cards don't need it. Really strange, because optimization 2 actually does fewer computations than optimization 1.
@tevador @MoneroCrusher I've improved my shuffle mod GPU code significantly. There is almost no slowdown with shuffle now and much better performance with both mods on RX 550. Can you check it? And we still need someone with Vega 56/64...
@SChernykh @tevador I can check for both RX 550 (8 CU & 10 CU) and Vega 56 (and Vega 56 with 64 BIOS Flashed) in a couple hours. Only used the Vega on Windows so far. Are Linux drivers finally up to date? What should I use?
@MoneroCrusher You can test on Windows as well, it's not a problem. I've added .sln file for Visual Studio so you can compile it.
P.S. Community edition of Visual Studio (which is free) should be enough for compiling.
In the meantime, I've tried to overclock memory on my RX 560 (only memory, I left GPU core clock at default 1196 MHz), here are the results:
Radeon RX 560 on Windows 10 (overclocked): core @ 1196 MHz, memory @ 2275 MHz, monitor plugged in, intensity 1000, worksize 32:
Mod | Hashrate | Performance level |
---|---|---|
- | 407.2 H/s | 100.0% |
INT_MATH | 406.5 H/s | 99.8% |
SHUFFLE | 389.0 H/s | 95.5% |
Both mods | 386.3 H/s | 94.9% |
Is 2275 MHz a good speed for the memory on RX 560? I didn't get any CPU/GPU mismatch errors and I can't overclock it further - MSI Afterburner just doesn't let me do it.
@MoneroCrusher Did you try to test your Vega?
@SChernykh You did those tests with 1 click timing straps? Can you try to do 2 threads? Did not test anything yet but will do now. Would be happy if you could provide me with the Windows binary. I don't have visual studio.
@MoneroCrusher
@SChernykh Can you do the tests with PBE 1 click timing straps? More real life then. Thanks for the Windows binary, btw!
I did tests now and wrongly used Worksize 8 first for the mods, but didn't see you guys were using WS 32, so I corrected it afterwards and tried both WS 16 and 32.
Gigabyte RX 550 2 GB, 8 CU, 2 Threads (432/432), 1220/2150, 1 Click PBE Timing Straps, Ubuntu 16.04
Mod | Hashrate (WS 8) | Hashrate (WS 16) | Hashrate (WS 32) |
---|---|---|---|
No Mod | 467 H/s | 440 H/s | Crash |
SHUFFLE | 409 H/s | 453 H/s | Crash |
INT_MATH | 223 H/s | 302 H/s | 360 H/s |
Both mods | 202 H/s | 267 H/s | 316 H/s |
Sapphire RX 550 2 GB, 10 CU, 2 Threads (432/432), 1220/2150, 1 Click PBE Timing Straps, Ubuntu 16.04
Mod | Hashrate (WS 8) | Hashrate (WS 16) | Hashrate (WS 32) |
---|---|---|---|
No Mod | 528 H/s | 479 H/s | 470 H/s |
SHUFFLE | 419 H/s | 458 H/s | 419 H/s |
INT_MATH | 229 H/s | 354 H/s | 353 H/s |
Both mods | 217 H/s | 309 H/s | 315 H/s |
Vega RX 56, 56 CU, 2 Threads (2016/1716), 950/1417, Windows 10
Mod | Hashrate (WS 8) | Hashrate (WS 16) | Hashrate (WS 32) |
---|---|---|---|
No Mod | 1650 H/s | 1632 H/s | 1613 H/s |
SHUFFLE | 1588 H/s | 1639 H/s | 1591 H/s |
INT_MATH | 1052 H/s | 1411 H/s | 1471 H/s |
Both mods | 1026 H/s | 1321 H/s | 1303 H/s |
So Worksize 32 helps INT_MATH more, while worksize 16 helps Shuffle more, while worksize 8 helps no mod more. Is there some way to somehow align them?
Also, could somebody ELI5 me why RandomJS permanently prevents ASICs?
> Can you do the tests with PBE 1 click timing straps? More real life then

I'll do it this evening. Hopefully it won't brick my card. Thanks for the numbers for Vega 56. It seems that it can handle shuffle mod perfectly. As for integer math mod, it's 89% performance compared to no mods and 81% performance for shuffle+int_math compared to shuffle mod. Can you try to tweak parameters some more? Also overclocking GPU core should really help. We need to know how good it can perform.
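The percentages above follow from the Vega 56 table: take each configuration's best hashrate across the three worksizes and divide by the best hashrate of the configuration it is compared against. A minimal sketch of that arithmetic, where the helper names `best_of` and `relative_percent` are illustrative and not part of xmr-stak:

```cpp
#include <algorithm>
#include <array>

// Vega 56 hashrates in H/s for worksize 8, 16 and 32, copied from the table above.
const std::array<double, 3> kNoMod   = {1650, 1632, 1613};
const std::array<double, 3> kShuffle = {1588, 1639, 1591};
const std::array<double, 3> kIntMath = {1052, 1411, 1471};
const std::array<double, 3> kBoth    = {1026, 1321, 1303};

// Best hashrate across the tested worksizes.
inline double best_of(const std::array<double, 3>& h) {
    return *std::max_element(h.begin(), h.end());
}

// Performance of `modded` relative to `baseline`, in percent.
inline double relative_percent(double modded, double baseline) {
    return 100.0 * modded / baseline;
}
```

With these inputs, `relative_percent(best_of(kIntMath), best_of(kNoMod))` is about 89.2 and `relative_percent(best_of(kBoth), best_of(kShuffle))` is about 80.6, which round to the 89% and 81% quoted.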
> Also, could somebody ELI5 me why RandomJS permanently prevents ASICs?

Any ASIC that can run random code is basically a CPU. Read this comment: https://github.com/monero-project/monero/issues/3545#issuecomment-398206972
> Vega RX 56, 56 CU, 2 Threads (2016/1716), 950/1417

Are the last 2 numbers GPU core and memory clocks? You really need to push GPU core to the maximum - integer math mod adds a lot of computations.
@SChernykh No, HBM mem has different clocks. 950 is mem and 1417 is core.
Is it necessary for both mods to be implemented for ASIC resistance or would one of them be enough?
On 27.06.18 at 16:09, SChernykh wrote:

> Vega RX 56, 56 CU, 2 Threads (2016/1716), 950/1417
>
> Are the last 2 numbers GPU core and memory clocks? You really need to push GPU core to the maximum - integer math mod adds a lot of computations.

> No, HBM mem has different clocks. 950 is mem and 1417 is core.

Ok, it looks like it's time to leave only 1 square root in int_math mod to make it easier for GPUs. Two square roots were kind of overkill anyway.
> Is it necessary for both mods to be implemented for ASIC resistance or would one of them be enough?

They target different classes of ASICs/FPGAs. Shuffle mod targets devices with external memory, making them 4 times slower. Integer math mod targets devices with on-chip memory, making them 8-10 times slower because of high division and square root latency. They work best together. Remove one mod, and you will enable an efficient ASIC/FPGA again - either with on-chip SRAM or with HBM external memory.
Will going from 2 square root to 1 square root make it easier game for FPGA? Is it somehow possible to make RX 550 better? Quite a few people are mining on those, and it would be a pity if they did 40-50% worse than other GPUs in comparison (pre-fork/after-fork).
> Will going from 2 square root to 1 square root make it easier game for FPGA?

Leaving just 1 square root won't make it easier for FPGA. The point of having a square root is that they'll still need to implement it and waste space on chip for it and that it has high computation latency.
> Is it somehow possible to make RX 550 better?

Leaving 1 square root should help. The problem with RX 550 is that it is unbalanced, unlike other Radeons. If you calculate the ratio of GFLOPs to memory bandwidth, it will be in the range of 20-25 GFLOPs per GB/s for all Radeons starting from RX 560 and up to Vega 64. RX 550 has only about 10 GFLOPs per GB/s, two times worse.
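The ratio can be estimated from shader count and clocks. The sketch below rests on assumptions that are not stated in this thread: 2 FLOPs per shader per clock (one fused multiply-add), GDDR5 moving 4 transfers per memory clock, and a 128-bit memory bus on both cards. The function names are illustrative.

```cpp
// Peak single-precision throughput in GFLOPS, assuming 2 FLOPs
// (one fused multiply-add) per shader per clock.
inline double peak_gflops(int shaders, double core_mhz) {
    return shaders * 2.0 * core_mhz / 1000.0;
}

// Peak GDDR5 bandwidth in GB/s, assuming 4 transfers per memory clock.
inline double gddr5_bandwidth_gbs(double mem_mhz, int bus_bits) {
    return mem_mhz * 4.0 * bus_bits / 8.0 / 1000.0;
}

// Compute-to-bandwidth ratio in GFLOPs per GB/s.
inline double compute_to_bandwidth(int shaders, double core_mhz,
                                   double mem_mhz, int bus_bits) {
    return peak_gflops(shaders, core_mhz) /
           gddr5_bandwidth_gbs(mem_mhz, bus_bits);
}
```

For example, `compute_to_bandwidth(1024, 1196, 1750, 128)` (a 16 CU RX 560 at the clocks used in this thread) gives roughly 22, while `compute_to_bandwidth(512, 1196, 1750, 128)` (an 8 CU RX 550 at the same clocks) gives roughly 11.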
I think I'll make the number of square roots configurable for convenience. You'll be able to test 0, 1 or 2 square roots in int_math mod.
Please find a way to disadvantage all GPUs the same, if that's in any way possible!
It's possible to slow down all GPUs from RX 560 up to Vega 64 the same, I can't guarantee it with RX 550. But we still have time for experimenting, the next fork is in September/October.
@SChernykh So nice you came up with this algo. So in your opinion it will permanently remove ASICs and FPGAs from the network? And also it uses much less power than ProgPOW. What's your opinion about ProgPOW?
I hope we could fork POW sooner if it's production ready. There are reasons to believe FPGA/ASICs are already on network.. Edit: really hope there is a way to disadvantage 560-vega more than 550 to balance it out..but let's test!
As for FPGAs that are coming in August (BCU1525) - they'll be slowed down from 20 KH/s to less than 5 KH/s (even down to 2 KH/s if my assumptions about division and square root latencies are correct) which will make them much worse than Vega 56/64 in terms of performance per $, so they'll not be mining Cryptonight at all.
As for possible ASICs: devices with external memory will still be ~2.5 times faster than Vega 56/64 if they use the same HBM2 memory. Given that they'll certainly be more expensive, they won't be a serious competition. Devices with on-chip memory like those 220 KH/s Bitmain miners will be down to 20-30 KH/s, still at 550 watts. Much less dangerous to the network.
ProgPOW is perfectly tuned for GPUs, but it's not CPU-friendly like Cryptonight which is a minus for decentralization. ProgPOW ASICs won't be economically viable at all.
I've tested RX 560 with one click timing straps, could overclock memory to 2200 MHz where it started giving CPU/GPU mismatch errors (1-3 errors per test run), but I could still test the performance.
Radeon RX 560 on Windows 10 (overclocked): core @ 1196 MHz, memory @ 2200 MHz, 1 Click PBE Timing Straps, monitor plugged in, intensity 1000, worksize 32:
Mod | Hashrate | Performance level |
---|---|---|
- | 448.5 H/s | 100.0% |
INT_MATH | 447.3 H/s | 99.7% |
SHUFFLE | 446.8 H/s | 99.6% |
Both mods | 439.3 H/s | 97.9% |
It looks like new memory timings made things better compared to plain memory overclock: 97.9% vs 94.9% performance for plain memory overclock.
P.S. I didn't see any difference between 8, 16 and 32 worksizes for version without mods.
I've tested it again at 2150 MHz memory - there were no CPU/GPU mismatch errors at all. I wanted to be sure that errors didn't influence test results.
Radeon RX 560 on Windows 10 (overclocked): core @ 1196 MHz, memory @ 2150 MHz, 1 Click PBE Timing Straps, monitor plugged in, intensity 1024, worksize 32:
Mod | Hashrate | Performance level |
---|---|---|
- | 447.4 H/s | 100.0% |
INT_MATH | 438.0 H/s | 97.9% |
SHUFFLE | 440.6 H/s | 98.5% |
Both mods | 434.9 H/s | 97.2% |
No mods version worked better with 2 threads @ 512 intensity, all versions with mods worked better with 1 thread @ 1024 intensity.
> really hope there is a way to disadvantage 560-vega more than 550 to balance it out..but let's test!

It's just impossible because 560 has exactly the same memory but 2 times more powerful GPU core. Whatever you do, 560 will be faster than 550.
I removed one square root from integer math mod, also tested different thread count, intensity and worksize for different versions. The best I could get:
This is for RX 560. @MoneroCrusher Can you test again on Vega 56 and RX 550? I've updated the repository.
@MoneroCrusher If you haven't started testing yet, don't do it for now. I have some very cool changes incoming for the integer math mod. These changes both improve GPU performance AND slow down ASIC/FPGA two times more compared to the current integer math mod.
P.S. GPU performance didn't really improve, it stayed the same. But still very cool.
I started tests and it has gotten better (around 20% for RX 550, but still not parity like 56/7/80, haven't tested Vega yet) but I'll wait then.
Very cool! So the advantage of ASIC will only be 2-3x after your mod?
> Very cool! So the advantage of ASIC will only be 2-3x after your mod?

Yes. We're now talking about ~15x slowdown for the coming BCU1525 FPGA (20 KH/s -> 1.4 KH/s) and similar slowdown for Bitmain ASICs (220 KH/s -> 15 KH/s). Strange thing: these changes improved int_math mod performance when it's applied alone (10% better), but int_math + shuffle stayed the same, even got 1% slower. I'm sure it can be improved further.
I've committed it to the repository: https://github.com/SChernykh/xmr-stak-amd/commit/566f30c42f9ffe4e8fc610a35242d4a8ba0ea063
The trick was to prevent parallel calculation of division and square roots. Now they have to be done in sequence, effectively doubling the latency for ASIC/FPGA. You can start testing now.
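To illustrate the idea: when the division and the square root take independent inputs, hardware can run both at once, so the step costs only the longer of the two latencies. When the square root's input depends on the quotient, the cost is the sum, which is about double when the two latencies are similar. The sketch below only shows this dependency pattern. It is not the actual integer math mod, and the function names and operand mixing are made up for the example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Independent inputs: the division and the square root can overlap.
inline uint64_t step_parallel(uint64_t a, uint64_t b, uint64_t c) {
    uint64_t q = a / (b | 1);  // "| 1" keeps the divisor non-zero
    uint64_t s = static_cast<uint64_t>(std::sqrt(static_cast<double>(c)));
    return q ^ s;
}

// Chained inputs: the square root cannot start until the quotient is known.
inline uint64_t step_sequential(uint64_t a, uint64_t b, uint64_t c) {
    uint64_t q = a / (b | 1);
    uint64_t s = static_cast<uint64_t>(std::sqrt(static_cast<double>(c + q)));
    return q ^ s;
}

// Critical-path latency in cycles for each arrangement.
inline int latency_parallel(int div_cycles, int sqrt_cycles) {
    return std::max(div_cycles, sqrt_cycles);
}
inline int latency_sequential(int div_cycles, int sqrt_cycles) {
    return div_cycles + sqrt_cycles;
}
```

A GPU hides this extra latency by switching to other hashes in flight, whereas a pipeline built around one low-latency hash path has nothing else to overlap it with.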
I've managed to improve integer math mod a bit more: from 254.3 H/s to 275.6 H/s on my simulated RX 550 when combined with shuffle mod, so it's 8% speed up. But I still need numbers for the real RX 550 and Vega 56 to know where it is at now.
P.S. And I've improved it some more from 275.6 H/s up to 277.0 H/s, so 9% speed up compared to the current version. I don't know what else can be done there without making it easier for ASIC/FPGA. I'm out of ideas for today & waiting for the numbers.
Just tested with your last version:
Gigabyte RX 550 2 GB, 8 CU, 2 Threads (432/432), 1220/2150, 1 Click PBE Timing Straps, Ubuntu 16.04
Mod | Hashrate (WS 8) | Hashrate (WS 16) | Hashrate (WS 32) |
---|---|---|---|
No Mod | 471 H/s | 438 H/s | 412 H/s |
SHUFFLE | 408 H/s | 456 H/s | 430 H/s |
INT_MATH | 268 H/s | 357 H/s | 418 H/s |
Both mods | 231 H/s | 297 H/s | 340 H/s |
Sapphire RX 550 2 GB, 10 CU, 2 Threads (432/432), 1170/2100, 1 Click PBE Timing Straps, Ubuntu 16.04
Mod | Hashrate (WS 8) | Hashrate (WS 16) | Hashrate (WS 32) |
---|---|---|---|
No Mod | 521 H/s | 478 H/s | 470 H/s |
SHUFFLE | 420 H/s | 456 H/s | 429 H/s |
INT_MATH | 269 H/s | 418 H/s | 417 H/s |
Both mods | 246 H/s | 336 H/s | 338 H/s |
Vega RX 56, 56 CU, 2 Threads (2016/1736), 900/1417 (950 mem was too much), Windows 10
Mod | Hashrate (WS 8) | Hashrate (WS 16) | Hashrate (WS 32) |
---|---|---|---|
No Mod | 1700 H/s | 1692 H/s | 1633 H/s |
SHUFFLE | 1590 H/s | 1605 H/s | 1526 H/s |
INT_MATH | 1267 H/s | 1605 H/s | 1615 H/s |
Both mods | 1039 H/s | 1337 H/s | 1147 H/s |
Integer Math really got a big improvement but both mods are same if not a little worse.
I tried overclocking core as much as possible on the Vega and set power limit to +47%. I set Core to 1736 but actually it only showed numbers between 1605-1655 MHz. I then hashed at 1490 H/s both mods but was drawing 334W (!) at the wall (just Vega + Intel Celeron which consumes like 25W).
I tried going from 2 threads to one thread on all cards and in every case I got worse hashrates. Don't know if I did something wrong.
Appreciate your work.
Ok, I'll experiment some more and update the code in the evening.
@MoneroCrusher I got performance up from 254 to 280 H/s - 10% improvement. You can get latest from the repository and test again.
P.S. I've managed to run tests on my RX 560 with memory @ 2200 MHz and cooler @ 100% without mismatch errors. Updated first post with my test results. Got 477 H/s without mods which is a pretty good number for real-life RX 560 mining, so we can consider RX 560 results fully relevant now.
P.P.S. Overclocking GPU core from 1196 MHz to 1450 MHz improved hashrate with both mods from 446.7 H/s to 456.9 H/s.
Unfortunately it doesn't look like m128i_u32 and _u64 are portable, being MSVC constructs only. I don't know how this affects the overall program but I'm unable to compile with c++ 5+
edit: just to add to that, I've got a fairly sizable farm with multiple vendors of RX550 and RX560 cards that I wrote custom memory timing straps for, so was looking to mess with this and find a relationship between core speed and performance. There's not much of that relationship with the current CN v1 variant (stock speeds to overclocking mem speeds don't buy you much)
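On the portability problem mentioned above: `m128i_u32` and `m128i_u64` are union members that MSVC exposes on `__m128i`, while GCC and Clang define `__m128i` as an opaque vector type. One compiler-neutral way to read individual lanes is to copy the register into a plain array. This is a minimal sketch for x86 targets; `lane_u32` and `lane_u64` are hypothetical helper names, and this is not necessarily how the repository fixed it.

```cpp
#include <cstdint>
#include <cstring>
#include <emmintrin.h>  // SSE2: __m128i

// Read 32-bit lane i (0..3) of v; lane 0 is the lowest 32 bits.
inline uint32_t lane_u32(__m128i v, int i) {
    uint32_t lanes[4];
    std::memcpy(lanes, &v, sizeof(lanes));
    return lanes[i];
}

// Read 64-bit lane i (0..1) of v; lane 0 is the lowest 64 bits.
inline uint64_t lane_u64(__m128i v, int i) {
    uint64_t lanes[2];
    std::memcpy(lanes, &v, sizeof(lanes));
    return lanes[i];
}
```

Compilers typically turn the `memcpy` into a register move, so the helpers should cost about the same as the MSVC member access.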
@mobilepolice Yes, I'm developing in Visual Studio most of the time. But fixing it for GCC is not a problem, I'll do it today. I'm also not done with optimizations yet, expect current numbers to improve.
@mobilepolice Fixed it, you should be able to build it now: https://github.com/SChernykh/xmr-stak-amd/commit/de285f4ed3ce29a86a4dfa7817acb18feac94ab9
@MoneroCrusher @mobilepolice I've submitted a new portion of optimizations, performance with integer math and both mods together became significantly better.
@SChernykh Really appreciate the effort on this. Unfortunately, _umul128 and _subborrow_u64 are missing as well; they are more MSVC-only constructs. It looks like the subborrow may be in the very latest version of gcc-7 if I compile by hand, but I hadn't got that far. I hashed out a _umul128 but I'm not sure if it's correct (math stuff in C/C++ is very new to me, so my apologies):
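#include <stdint.h>
static inline uint64_t _umul128(uint64_t a, uint64_t b, uint64_t* hi) {
    // Widen before multiplying; a * b on two uint64_t operands truncates to 64 bits.
    unsigned __int128 z = (unsigned __int128)a * b;
    *hi = (uint64_t)(z >> 64);  // high 64 bits of the product
    return (uint64_t)z;         // low 64 bits of the product
}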
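For the other missing intrinsic, a fallback can be written in plain unsigned arithmetic. The sketch below mirrors the documented behavior of `_subborrow_u64` (compute `a - b - borrow_in`, store the 64-bit result, return the outgoing borrow) under a different, hypothetical name so that it cannot clash with a compiler-provided version. It is an illustration, not the fix that went into the repository.

```cpp
#include <cstdint>

// Hypothetical stand-in for _subborrow_u64: computes a - b - borrow_in,
// stores the low 64 bits in *diff and returns the outgoing borrow (0 or 1).
inline unsigned char subborrow_u64_portable(unsigned char borrow_in,
                                            uint64_t a, uint64_t b,
                                            uint64_t* diff) {
    const uint64_t c = borrow_in ? 1 : 0;
    *diff = a - b - c;  // unsigned arithmetic wraps modulo 2^64
    // A borrow occurs exactly when b + c exceeds a; comparing this way
    // avoids overflow when b is already UINT64_MAX.
    return ((a < b) || (a == b && c != 0)) ? 1 : 0;
}
```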
@mobilepolice Wait a bit more till I fix this. I'll also have some more optimizations ready, good that you haven't started testing yet.
Nope! I haven't. I'm about to run out for stuff to work on today but I will be back in about six hours and pick this back up again so take your time! Thanks
@mobilepolice I've submitted the fix and also some smaller performance improvements. This time I actually checked that it compiles successfully on Ubuntu 18.04. You can start testing now.
@SChernykh My apologies it's taken me so long to get back to this. BitTube/IPBC had a fork today and I was messing around with a miner and a bunch of settings while working on these tests.
On the below data:
The CNv1 (Monero v7) is taken with a new-ish build of xmr-stak, so these numbers are just for reference with regard to what I'm doing today on these cards.
I typically run 2 threads and 448 intensity. There's actually a few MB left over for more intensity, either in 2 threads mode or single thread (752 is the limit in single thread, 730-740 is "safe" for guaranteed startup) I've found it's not wise to run right up against the limit, as if for any reason there's less memory than there should be available at that time, bad things happen. Cards crash, OS crashes, strange hangs. etc.
Also, on CNv1/Monerov7 I found that using compute units between 8 but less than 16 usually resulted in less performance. Your code seems to behave differently as you'll see in the data below.
I only had one CPU/GPU nonce mismatch, identified by the (1). This might be down to that card, as my memory timings are very tight on these cards (Elpida memory on everything in this list).
I didn't test the 550s very much as there's not a lot to be gained there. I also saw you were running 1024 intensity on some cards; are they 4GB?
Unroll 8 seems to work better for me in nearly every case involving INT_MATH.
There's one spurious result, the RX560 Unroll 16 single thread BOTH is better than the Unroll 8 2 thread. Compared to the 560D, this should be the other way around. (338 vs 446 respectively). I will retest tomorrow.
Card | Compute Units | Memory (GB) | Core (MHz) | Mem (MHz) | Threads | Intensity | Worksize | Stock | Shuffle | Int_Math | Both | CNv1 (new xmr-stak, No Unroll) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
XFX RX560D (Unroll 16) | 14 | 2 | 1150 | 1750 | 2 | 448 | 8 | 503 | 459 | 408 | 289 | 542 |
XFX RX560D (Unroll 16) | 14 | 2 | 1150 | 1750 | 1 | 732 | 8 | 475 | 434 | 383 | 333 | - |
XFX RX560D (Unroll 16) | 14 | 2 | 1150 | 1750 | 1 | 744 | 14 | 478 | 442 | 476 | 438 | - |
XFX RX560D (Unroll 8) | 14 | 2 | 1150 | 1750 | 2 | 448 | 8 | 503 | 460 | 506 | 446 | 542 |
XFX RX560D (Unroll 8) | 14 | 2 | 1150 | 1750 | 1 | 744 | 14 | 477 | 442 | 475 | 438 | - |
XFX RX560D (Unroll 4) | 14 | 2 | 1150 | 1750 | 2 | 448 | 8 | 503 | 460 | 501 | 440 | 542 |
Sapphire RX550 (Unroll 16) | 8 | 2 | 1325 | 1760 | 2 | 448 | 8 | 469 | 430 | 312 | 257 | 520 |
Sapphire RX550 (Unroll 8) | 8 | 2 | 1325 | 1760 | 2 | 448 | 8 | 469 | 437 | 343 | 297 | 520 |
Sapphire RX550 (Unroll 4) | 8 | 2 | 1325 | 1760 | 2 | 448 | 8 | 469 | 439 | 336 | 296 | 520 |
VisionTek/XFX/HIS RX550 (Unroll 16) | 8 | 2 | 1325 | 1760 | 2 | 448 | 8 | 469 | 430 | 312 | 257 | 520 |
XFX/HIS RX560 (Unroll 32) | 16 | 2 | 1150 | 1750 | 2 | 448 | 16 | 502 | 387 | 308 | 273 | 540 |
XFX/HIS RX560 (Unroll 16) | 16 | 2 | 1150 | 1750 | 2 | 448 | 16 | 518 | 481 | 334 | 276 | 540 |
XFX/HIS RX560 (Unroll 16) | 16 | 2 | 1150 | 1750 | 1 | 744 | 16 | 492 | 457 | 490(1) | 455 | - |
XFX/HIS RX560 (Unroll 8) | 16 | 2 | 1150 | 1750 | 2 | 448 | 16 | 515 | 473 | 490 | 442 | 540 |
XFX/HIS RX560 (Unroll 4) | 16 | 2 | 1150 | 1750 | 2 | 448 | 16 | 516 | 474 | 487 | 439 | 540 |
XFX/HIS RX560 (Unroll 2) | 16 | 2 | 1150 | 1750 | 2 | 448 | 16 | 515 | 474 | 477 | 433 | 540 |
XFX/HIS RX560 (Unroll 1) | 16 | 2 | 1150 | 1750 | 2 | 448 | 16 | 516 | 473 | 493 | 432 | 540 |
This might be a slightly easier way to look at this data.
@mobilepolice Thanks for the numbers. I have a 4 GB RX 560 card with 16 compute units. Since integer math mod does a lot of computations, optimal intensity for it is N*64 where N is the number of compute units. Intensity 1024 ensures that all 1024 shader processors have things to do. Lower intensity results in proportionally lower performance.
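The N*64 rule and its memory cost can be sketched as follows. The 2 MiB scratchpad per hash is a property of Cryptonight that is not stated in this thread, so treat it as an assumption of the example; the function names are illustrative.

```cpp
constexpr int kShadersPerCU = 64;   // per the N*64 rule above
constexpr int kScratchpadMiB = 2;   // assumed Cryptonight scratchpad size per hash

// Intensity that gives every shader processor one hash to work on.
constexpr int full_occupancy_intensity(int compute_units) {
    return compute_units * kShadersPerCU;
}

// VRAM needed for the scratchpads alone at a given intensity, in MiB.
constexpr int scratchpad_mib(int intensity) {
    return intensity * kScratchpadMiB;
}
```

For a 16 CU card, `full_occupancy_intensity(16)` is 1024 and `scratchpad_mib(1024)` is 2048, so the scratchpads alone would fill a 2 GB card. That would be consistent with the 2 GB cards in the table topping out at single-thread intensities around 744.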
Loop unroll setting is totally independent of intensity or thread count by the way. You only need to find it once for each mod and then mess with intensity/threads/worksize.
What parameters did you use on new xmr-stak?
P.S. Did you test worksize=32? It gives the best results on my RX 560 with integer math mod and both mods.
@SChernykh No I didn't test worksize 32 but when I reboot the rig tomorrow that I was using to test (it's remote to me right now) I can try that. it looks like from previous tests ws32 works well on the RX 550's. Any other test situations you'd like me to work through I definitely will.
Definitely the more intensity that can be run, the better performance. Unfortunately it looks like to get anything above the numbers I've been running would take a 4GB card, since I just run the cards out of memory otherwise.
Assuming your 439H/s from your RX560 test a couple days back holds the same numbers, my 2GB card outperforms that a bit. the difference is going to be in timings purely, but that should be an indicator that you can probably squeeze more out of that 4GB card.
for xmr-stak I almost universally use 2 threads, 448 intensity, strided_index 1 and comp_mode true
@mobilepolice My RX560 does 447 H/s (both mods) with the latest code and worksize=32. I'm mostly interested in performance with both mods vs performance without mods, so pay attention to tweaking both mods' parameters. I guess I also need to implement strided_index and mem_chunk parameters for proper comparison.
@SChernykh I wouldn't sweat the strided_index and mem_chunk stuff so much, I only gave numbers from the new xmr-stak as a reference I had today. I expect reimplementation into the new xmr-stak to pick up a little performance as well so I understand that the discrepancies right now between new and old with these changes are large.
It looks like RX560 to RX560, I'm seeing about an 11% performance drop compared to your ~5%. This is likely due to the 2GB card not being able to keep the compute units as busy as yours can. This doesn't bother me, I would rather have the performance loss and impact ASIC/FPGA more forcefully than keep my measly few percent. :)
Tomorrow I will test worksize 16/32 and unroll 16/8/4/2/1 in combination on the RX 560
> It looks like RX560 to RX560, I'm seeing about an 11% performance drop compared to your ~5%. This doesn't bother me, I would rather have the performance loss and impact ASIC/FPGA more forcefully than keep my measly few percent. :)

I guess every single card will get the same performance drop in the end, so relative performance (aka the actual profit per card) will stay the same (+-5% max). RX 550 is the only sad exception so far, it stands out from other Radeons due to very weak GPU core.
@mobilepolice I don't know why, but I can't get better numbers with the latest xmr-stak, no matter what combination of settings I try. strided_index=1 even makes things worse for me. Just wanted to test it before implementing it in my code, but I'm not sure now.
Huh, I found an interesting pattern: increasing intensity to 1440 improved INT_MATH and both mods performance, but made performance worse for no mods and SHUFFLE mod. It looks like integer math mod really likes worksize=32 and also strongly reacts to higher intensity, much stronger than the original Cryptonight.
Radeon RX 560 on Windows 10 (RX 550 simulation): core @ 595 MHz, memory @ 2200 MHz, 1 Click PBE Timing Straps, monitor plugged in, intensity 1024, worksize 32:
Mod | Hashrate | Performance level |
---|---|---|
- | 394.3 H/s | 100.0% |
INT_MATH | 347.6 H/s | 88.2% |
INT_MATH, intensity 1440 | 351.4 H/s | 89.1% |
SHUFFLE | 356.0 H/s | 90.3% |
Both mods | 303.9 H/s | 77.1% |
Both mods, intensity 1440 | 319.7 H/s | 81.1% |
@SChernykh I wasn't able to get out to my shop today and reset the frozen rig that I was using for this testing. I will be able to get out there tomorrow and report back on my testing results then. In the meantime I've been looking at adjusting cli-miner.cpp so that if you specify one index in the config.txt file, it will programmatically find the maximum intensity that can be set on a card for n threads, test all of the different variables and report a hash rate for each. Not sure how feasible this is; as it is now, I've been doing most of it with a bash script, except I'm only programmatically doing the shuffle/int-math settings in said script.
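The enumeration part of such a sweep could look like the sketch below. It only builds the list of configurations to try. The `-DSHUFFLE_MOD` and `-DINT_MATH_MOD` flags are the ones from the first table in this thread; the `TestCase` struct and `build_sweep` function are hypothetical, and how the miner would consume worksize and unroll values is not shown here.

```cpp
#include <string>
#include <vector>

// One configuration to benchmark.
struct TestCase {
    std::string defines;  // compile-time flags selecting the mods
    int worksize;
    int unroll;
};

// Build every combination of mod selection, worksize and unroll factor.
inline std::vector<TestCase> build_sweep(const std::vector<int>& worksizes,
                                         const std::vector<int>& unrolls) {
    const std::vector<std::string> mods = {
        "",                              // no mods
        "-DSHUFFLE_MOD",
        "-DINT_MATH_MOD",
        "-DSHUFFLE_MOD -DINT_MATH_MOD",  // both mods
    };
    std::vector<TestCase> cases;
    for (const auto& m : mods)
        for (int ws : worksizes)
            for (int u : unrolls)
                cases.push_back({m, ws, u});
    return cases;
}
```

Calling `build_sweep({8, 16, 32}, {1, 2, 4, 8, 16})` yields 60 configurations. Since the unroll setting is independent of intensity and thread count, it could be fixed per mod first to shrink the grid before sweeping the remaining parameters.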
The original discussion starts here: https://github.com/monero-project/monero/issues/3545#issuecomment-380117524
GPU version of shuffle and integer math modifications is here: https://github.com/SChernykh/xmr-stak-amd
You can post your performance test results, also your suggestions and concerns here.
AMD Ryzen 7 1700 @ 3.6 GHz, 8 threads
AMD Ryzen 5 2600 @ 4.0 GHz, 1 thread
AMD Ryzen 5 2600 @ 4.0 GHz, 8 threads (affinity 0,2,4,5,6,8,10,11)
Intel Pentium G5400 (Coffee Lake, 2 cores, 4 MB Cache, 3.70 GHz), 2 threads
Intel Core i5 3210M (Ivy Bridge, 2 cores, 3 MB Cache, 2.80 GHz), 1 thread
Intel Core i7 2600K (Sandy Bridge, 4 cores, 8 MB Cache, 3.40 GHz), 1 thread
Intel Core i7 7820X (Skylake-X, 8 cores, 11 MB Cache, 3.60 GHz), 1 thread
XMR-STAK used is an old version, so don't expect the same numbers that you have on your mining rigs. What's important here are relative numbers of original and modified Cryptonight versions.
Radeon RX 560 on Windows 10 (overclocked): core @ 1196 MHz, memory @ 2200 MHz, 1 Click PBE Timing Straps, monitor plugged in, intensity 1024, worksize 32:
* strided_index = 2, mem_chunk = 2 (64 bytes)
Radeon RX 560 on Windows 10 (RX 550 simulation): core @ 595 MHz, memory @ 2200 MHz, 1 Click PBE Timing Straps, monitor plugged in, intensity 1024, worksize 32:
* Increasing intensity to 1440 improved both mods performance, but made performance worse in other cases.
It looks like RX 550 needs GPU core overclocking to properly handle new modifications.
GeForce GTX 1080 Ti 11 GB on Windows 10: core 2000 MHz, memory 11800 MHz, monitor plugged in, intensity 1280, worksize 8:
GeForce GTX 1060 6 GB on Windows 10: all stock, monitor plugged in, intensity 800, worksize 8:
GeForce GTX 1050 2 GB on Windows 10: core 1721 MHz, memory 1877 MHz, monitor unplugged, intensity 448, worksize 8: