fireice-uk / xmr-stak

Free Monero RandomX Miner and unified CryptoNight miner
GNU General Public License v3.0

Test final Monero POW cryptonight_v8 #1851

Open psychocrypt opened 6 years ago

psychocrypt commented 6 years ago

Monero is changing its PoW in October 2018. Please test the implementation of the new algorithm against the test pool (http://killallasics.moneroworld.com/).

You can find the source code of xmr-stak in pull request #1850 or download the zipped source directly.

Please report only the speed comparison between cryptonight_v7 and cryptonight_v8 here. If you find any bugs, please report them in pull request #1850. Please also take the time to mine for a few minutes against the testnet pool to check that you do not get invalid results.

How to bench the system:

Please start the miner once with ./xmr-stak to create pools.txt and all other config files. Change cryptonight_v8 to cryptonight_v7 to measure the performance of the current Monero PoW. Please do not forget to remove the backend configs when you switch the algorithm, because "strided_index" : 1 is not allowed for cryptonight_v8.
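A quick way to spot a stale backend config before switching algorithms is to scan it for the disallowed setting. The sketch below is a hypothetical helper, not part of xmr-stak; it only assumes the `"strided_index" : N` syntax seen in the backend configs posted in this thread.

```python
import re

# Hypothetical helper, not part of xmr-stak: detect "strided_index" : 1,
# which is stated above to be not allowed for cryptonight_v8.
STRIDED_INDEX = re.compile(r'"strided_index"\s*:\s*(\d+)')


def disallowed_strided_index(config_text: str) -> bool:
    """Return True if any thread entry sets "strided_index" to 1."""
    return any(int(value) == 1 for value in STRIDED_INDEX.findall(config_text))


# Example lines written in the style of the backend configs in this thread.
stale_entry = '"affine_to_cpu" : false, "strided_index" : 1, "mem_chunk" : 2,'
fresh_entry = '"affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,'
print(disallowed_strided_index(stale_entry))
print(disallowed_strided_index(fresh_entry))
```

If the check reports a hit, deleting the backend config and letting the miner regenerate it, as described above, is probably simpler than editing it by hand.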

CPU:

./xmr-stak  --currency cryptonight_v8 --noAMD --noNVIDIA --benchmark 8 --benchwait  20 --benchwork 30

CUDA/AMD OpenCL:

./xmr-stak --currency cryptonight_v8  --benchmark 8 --benchwait  20 --benchwork 30

CUDA is currently not supported. I am currently trying to get some performance out of it.

NVIDIA via OpenCL

./xmr-stak --currency cryptonight_v8 --openCLVendor NVIDIA --benchmark 8 --benchwait  20 --benchwork 30
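To run the same benchmark for both algorithms without retyping the flags, the command lines above can be generated. This is only a sketch: the helper name and the backend labels are made up for illustration, and only flags that appear in the commands above are used.

```python
# Sketch: rebuild the benchmark command lines listed above.
# bench_command and its backend labels are hypothetical names.
def bench_command(currency, backend="gpu"):
    cmd = ["./xmr-stak", "--currency", currency]
    if backend == "cpu":
        # CPU-only run: disable both GPU backends.
        cmd += ["--noAMD", "--noNVIDIA"]
    elif backend == "nvidia-opencl":
        # Run NVIDIA cards through the OpenCL backend.
        cmd += ["--openCLVendor", "NVIDIA"]
    cmd += ["--benchmark", "8", "--benchwait", "20", "--benchwork", "30"]
    return cmd


for currency in ("cryptonight_v7", "cryptonight_v8"):
    print(" ".join(bench_command(currency, backend="cpu")))
```

The returned list could be passed to `subprocess.run`, but remember to remove the backend configs between the v7 and v8 runs.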

Template for speed reporting:

OS: XX
Backend: (including the type e.g. AMD RX570)
 - CPU
  - NVIDIA (native CUDA or via OpenCL)
  - AMD 
# if the CPU/GPU is overclocked please add the modifications here
speed:  
 - cryptonight_v7: XXX H/s
 - cryptonight_v8: XXX H/s
Miner config: please add here your config for the backend
SChernykh commented 6 years ago

I didn't run the benchmark this time; I just started mining for a few minutes and checked the highest reported hashrate.

OS: Windows 7
Backend: CPU, Intel Core i5 3210M

speed:  
 - cryptonight_v7: 74.3 H/s
 - cryptonight_v8: 67.3 H/s

config:
"cpu_threads_conf" :
[
    { "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 0 },
],

This is strange - I get 68.3 H/s with xmrig on this machine. It uses the same asm code, the only difference is that I used Visual Studio 2017 to compile xmr-stak and MSYS2 with GCC 8.2.0 to compile xmrig.

SChernykh commented 6 years ago
OS: Windows 10
Backend: CPU, AMD Ryzen 5 2600 @ 4 GHz

speed:  
 - cryptonight_v7: 630.6 H/s
 - cryptonight_v8: 627.3 H/s

config:
"cpu_threads_conf" :
[
    { "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 0 },
    { "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 2 },
    { "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 4 },
    { "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 5 },
    { "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 6 },
    { "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 8 },
    { "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 10 },
    { "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 11 },
],

xmrig showed identical performance this time.

SChernykh commented 6 years ago
OS: Windows 7
Backend: CPU, Intel Core i7 2600k

speed:  
 - cryptonight_v7: 287.5 H/s
 - cryptonight_v8: 264.3 H/s

config:
"cpu_threads_conf" :
[
    { "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 0 },
    { "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 2 },
    { "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 4 },
    { "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 6 },
],

Again, xmrig showed a bit higher hashrate - 267.9 H/s.
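For anyone who wants to try other core pinnings, the cpu_threads_conf blocks in these reports follow a regular pattern and can be generated. A minimal sketch, assuming one thread per listed core and the same settings for every thread; the function name is made up.

```python
# Sketch: emit a cpu_threads_conf block in the same layout as the reports above.
# Assumes every thread uses low_power_mode false and no_prefetch true.
def cpu_threads_conf(cores, asm="intel"):
    lines = ['"cpu_threads_conf" :', "["]
    for core in cores:
        lines.append(
            '    { "low_power_mode" : false, "no_prefetch" : true, '
            f'"asm" : "{asm}", "affine_to_cpu" : {core} }},'
        )
    lines.append("],")
    return "\n".join(lines)


# One thread on every second logical core, as in the i7 2600k report.
print(cpu_threads_conf([0, 2, 4, 6]))
```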

psychocrypt commented 6 years ago

@SChernykh Thanks for your tests. A difference of a few hashes is possible, depending on which port of the test pool you are mining on. If you are lucky and find a lot of hashes, the hash rate will go down by a few hashes. But let us wait for other results.

Big thanks again for the asm code.

SChernykh commented 6 years ago

I've also tested RX 560: hashrate numbers here are what was reported as highest when mining. I double checked it - performance is identical, I was mining v7 against v7 pool and v8 against v8 pool, all shares were accepted. I tried a few different configs, but I couldn't find faster settings for v7.

OS: Windows 10
Backend: OpenCL, AMD Radeon RX 560 4GB, 1 click PBE timing straps, core @ 1196 MHz, memory @ 2200 MHz

speed:  
 - cryptonight_v7: 469.3 H/s
 - cryptonight_v8: 469.3 H/s

config for v7:
"gpu_threads_conf" : [
  // gpu: Baffin memory:2752
  // compute units: 16
  { "index" : 0,
    "intensity" : 512, "worksize" : 32,
    "affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
    "unroll" : 8, "comp_mode" : true
  },
  { "index" : 0,
    "intensity" : 512, "worksize" : 32,
    "affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
    "unroll" : 8, "comp_mode" : true
  },
],

config for v8:
"gpu_threads_conf" : [
  // gpu: Baffin memory:2752
  // compute units: 16
  { "index" : 0,
    "intensity" : 1024, "worksize" : 32,
    "affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
    "unroll" : 16, "comp_mode" : true
  },
],
SChernykh commented 6 years ago

60 second benchmark on the same RX 560 with configs listed above:

speed:  
 - cryptonight_v7: 466.2 H/s
 - cryptonight_v8: 461.8 H/s

I'm not sure what numbers to trust more.

SChernykh commented 6 years ago

@MoneroCrusher @mobilepolice @kio3i0j9024vkoenio @Bathmat This is the final code for the next Monero PoW, it would be good if you tested it on everything you got and posted results here.

Edit: everything except NVIDIA GPUs, CUDA version is not ready yet.

Spudz76 commented 6 years ago

CPU is i7-2600 non-K cache 8M

OS: XX
Backend: CPU
speed:  
 - cryptonight_v7: 161 H/s
 - cryptonight_v8: 162 H/s
Miner config:
    { "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 0 },
    { "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 2 },
    { "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 4 },
    { "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 6 },

Also tried prefetch both ways and all three asm settings; in all cases the above was the winner.
High variance between tests, I blame Firefox in the background, but I ran a bunch of times and took the best.
Did the N=2 optimize get in? I skipped l_p_m:2 tests for now.

SChernykh commented 6 years ago

@Spudz76 it's better to test without anything running in background, right after reboot. Double hash asm code is not added to xmr-stak yet.

plavirudar commented 6 years ago

Nvidia OpenCL causes severe lag for my computer (1080ti) and makes it unusable while mining. Previously, using CUDA with bfactor=12 and bsleep=100 causes no slowdown of the computer while doing word processing and basic videos.

Is there a tweak that I'm missing here? Currently it's so laggy that it causes music to stutter and even pressing "h" to show hashrate isn't working properly.

Bathmat commented 6 years ago

Nvidia OpenCL causes severe lag for my computer (1080ti) and makes it unusable while mining. Previously, using CUDA with bfactor=12 and bsleep=100 causes no slowdown of the computer while doing word processing and basic videos.

Is there a tweak that I'm missing here?

@plavirudar you could try lowering the intensity... might help.

@psychocrypt do you expect to be able to get a working CUDA version? I have an older Win7 rig with a GTX970 and 2x GTX1050s that OpenCL is not playing nice with. I'll keep troubleshooting though.

plavirudar commented 6 years ago

@Bathmat I started off with intensity 896 (the recommended) and it was hashing ~1000h/s with ethlargement+600MHz mem OC (which is similar to its performance on cnh/cnv1); however, the computer was unusable. When I dropped intensity to 640, the computer was still unusable and the hashrate fell to 800, which was lower than its performance on cnv2 CUDA.

plavirudar commented 6 years ago

@Spudz76 Are you sure you're using the right CPU? I have an i7-2600 non-k and I'm getting 220 with "asm":"off", 256 with "asm":"intel" , 250 with "asm":"ryzen" and 270 with CNv7 (so ~5% slowdown after using the best asm optimizations). That CPU is also running a bunch of random shitcoin daemons as well, so it's not even performing at max.

Not sure what you meant by OS:XX, I tried mine on Ubuntu 16.04 with threads 0,1,2,3.

w104tcl commented 6 years ago

OS: Kubuntu
Backend:
 - CPU
 - Intel Core i5-3320M
 - stock settings
speed:

Spudz76 commented 6 years ago

@plavirudar Win7 and all sorts of garbage running in the background (my daily driver desktop box). Definitely not rebooting, let alone closing all these tabs. But I have others I can test that are only for mining (though not an i7-2600); most of them are non-AES and Linux, so I was testing the applicable AES-capable stuff I have first.

Also, this is about the comparison from v7 to v8, not global competition. I ran so many passes and took the highest, which should account for background task variances. So whatever I have holding me back is doing the same thing to v8, and the delta still applies (in this case v8 was faster). I don't mine on this box normally, but it is an extra test point (and has a GTX970 as well). I did tell Firefox to quit fiddling with the GPU for offloading and a few other easy avoidance measures (mostly to open more VRAM).

Spudz76 commented 6 years ago

Same Win7 box as above GTX970-4GB stock / driver profile max performance + P0

Backend: OpenCL->NVIDIA
speed:  
 - cryptonight_v7: 448 H/s
 - cryptonight_v8: 378 H/s
Miner config:
  // gpu: GeForce GTX 970 memory:3968
  // compute units: 13
  { "index" : 0,
    "intensity" : 832, "worksize" : 8,
    "affine_to_cpu" : false, "strided_index" : 0, "mem_chunk" : 2,
    "unroll" : 4, "comp_mode" : false
  },

832 = 13 * 8 * 8 = smx * 8 * worksize, which seemed to work best.
Unroll 4 and 8 gave the same performance.
Strided 2 and chunk 2,3,4 did not change much (mostly the same, some slightly worse). Similar with worksizes and intensities.
In all cases Windows chokes the entire time the miner is hashing.
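The intensity rule of thumb from this post (smx * 8 * worksize) is easy to try for other cards. A small sketch; the function name is made up, and the factor of 8 is just what worked best on this GTX 970, not a general recommendation.

```python
# Sketch of the rule of thumb above: intensity = smx * 8 * worksize.
# The multiplier 8 is taken from this one GTX 970 test only.
def opencl_intensity(compute_units, worksize, multiplier=8):
    return compute_units * multiplier * worksize


# GTX 970: 13 compute units, worksize 8.
print(opencl_intensity(13, 8))  # 832
```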

Bathmat commented 6 years ago
OS: Win10 1803
Backend: AMD GPUs (4), all bios modded with 1 click timings (all hynix mem, 2000 clock)
0: RX580 8GB Gigabyte, 99W
1: RX480 4GB AMD brand, 82W
2: RX570 4GB Sapphire ITX, 101W
3: RX470 4GB Sapphire, 80W
speed:  
 - cryptonight_v7: 3530 H/s
 - cryptonight_v8: 3350 H/s
HASHRATE REPORT - AMD
| ID |    10s |    60s |    15m | ID |    10s |    60s |    15m |
|  0 |  394.0 |  394.0 |   (na) |  1 |  395.2 |  393.7 |   (na) |
|  2 |  425.5 |  425.5 |   (na) |  3 |  424.9 |  425.2 |   (na) |
|  4 |  435.3 |  435.0 |   (na) |  5 |  434.4 |  434.7 |   (na) |
|  6 |  420.7 |  421.1 |   (na) |  7 |  421.2 |  421.1 |   (na) |
Totals (AMD):  3351.1 3350.5    0.0 H/s
-----------------------------------------------------------------
Totals (ALL):   3351.1 3350.5    0.0 H/s

amd.txt config:

"gpu_threads_conf" : [
  // gpu: Ellesmere memory:3920
  // compute units: 36
  { "index" : 0,
    "intensity" : 896, "worksize" : 16,
    "affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
    "unroll" : 4, "comp_mode" : true
  },
  { "index" : 0,
    "intensity" : 896, "worksize" : 8,
    "affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
    "unroll" : 4, "comp_mode" : true
  },
  // gpu: Ellesmere memory:3712
  // compute units: 36
  { "index" : 1,
    "intensity" : 896, "worksize" : 16,
    "affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
    "unroll" : 4, "comp_mode" : true
  },
  { "index" : 1,
    "intensity" : 896, "worksize" : 8,
    "affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
    "unroll" : 4, "comp_mode" : true
  },
  // gpu: Ellesmere memory:3712
  // compute units: 32
  { "index" : 2,
    "intensity" : 896, "worksize" : 16,
    "affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
    "unroll" : 4, "comp_mode" : true
  },
  { "index" : 2,
    "intensity" : 896, "worksize" : 8,
    "affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
    "unroll" : 4, "comp_mode" : true
  },
  // gpu: Ellesmere memory:3712
  // compute units: 32
  { "index" : 3,
    "intensity" : 896, "worksize" : 16,
    "affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
    "unroll" : 4, "comp_mode" : true
  },
  { "index" : 3,
    "intensity" : 896, "worksize" : 8,
    "affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
    "unroll" : 4, "comp_mode" : true
  },
],

Performance is in line with what I was expecting. Power consumption is from HWMonitor for CNv8-2. Power use for CNv7 was lower by 3-7 watts depending on the GPU, but again, this was expected. It turns out that changing unroll didn't seem to have an effect. I tested all 8, all 4, all 1, and a mix of 8 on W:16 threads and 4 on W:8 threads, and total hashrate remained within 10 h/s for each.

Note: I did try a single thread test; however, hashrate was 7% slower than dual thread, and power consumption was the same as dual thread.
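For comparing runs, the per-thread numbers in the HASHRATE REPORT above can be pulled out of the pasted table and summed. This is a sketch that assumes the two-threads-per-row layout shown in this post; the function name is made up.

```python
# Sketch: extract the 60s column from the HASHRATE REPORT pasted above.
REPORT = """\
| ID |    10s |    60s |    15m | ID |    10s |    60s |    15m |
|  0 |  394.0 |  394.0 |   (na) |  1 |  395.2 |  393.7 |   (na) |
|  2 |  425.5 |  425.5 |   (na) |  3 |  424.9 |  425.2 |   (na) |
|  4 |  435.3 |  435.0 |   (na) |  5 |  434.4 |  434.7 |   (na) |
|  6 |  420.7 |  421.1 |   (na) |  7 |  421.2 |  421.1 |   (na) |
"""


def per_thread_60s(report):
    """Map thread ID to its 60s hashrate."""
    rates = {}
    for line in report.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Each row holds two threads, four cells per thread: ID, 10s, 60s, 15m.
        for i in range(0, len(cells) - 3, 4):
            if cells[i].isdigit():  # skips the header row
                rates[int(cells[i])] = float(cells[i + 2])
    return rates


rates = per_thread_60s(REPORT)
print(len(rates), "threads, 60s total:", round(sum(rates.values()), 1), "H/s")
```

The sum of the displayed values comes out slightly below the reported total of 3350.5 H/s, presumably because the miner sums unrounded values.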

kio3i0j9024vkoenio commented 6 years ago

OS: Ubuntu 16.04

Backend: CPU Only

Note that using SChernykh XMR-Stak-CPU latest code with all the same v8 changes and the optimized asm for 1x and 2x threads produces:

Miner config:

"cpu_threads_conf" :
[

{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "ryzen", "affine_to_cpu" : 0 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "ryzen", "affine_to_cpu" : 1 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "ryzen", "affine_to_cpu" : 2 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "ryzen", "affine_to_cpu" : 3 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "ryzen", "affine_to_cpu" : 4 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "ryzen", "affine_to_cpu" : 5 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "ryzen", "affine_to_cpu" : 6 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "ryzen", "affine_to_cpu" : 7 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "ryzen", "affine_to_cpu" : 8 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "ryzen", "affine_to_cpu" : 9 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "ryzen", "affine_to_cpu" : 10 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "ryzen", "affine_to_cpu" : 11 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "ryzen", "affine_to_cpu" : 12 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "ryzen", "affine_to_cpu" : 13 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "ryzen", "affine_to_cpu" : 14 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "ryzen", "affine_to_cpu" : 15 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "ryzen", "affine_to_cpu" : 16 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "ryzen", "affine_to_cpu" : 17 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "ryzen", "affine_to_cpu" : 18 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "ryzen", "affine_to_cpu" : 19 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "ryzen", "affine_to_cpu" : 20 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "ryzen", "affine_to_cpu" : 21 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "ryzen", "affine_to_cpu" : 22 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "ryzen", "affine_to_cpu" : 23 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "ryzen", "affine_to_cpu" : 24 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "ryzen", "affine_to_cpu" : 25 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "ryzen", "affine_to_cpu" : 26 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "ryzen", "affine_to_cpu" : 27 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "ryzen", "affine_to_cpu" : 28 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "ryzen", "affine_to_cpu" : 29 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "ryzen", "affine_to_cpu" : 30 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "ryzen", "affine_to_cpu" : 31 },
],
kio3i0j9024vkoenio commented 6 years ago

OS: Ubuntu 16.04

Backend: 8x Nvidia GTX 750 1GB

CUDA

cryptonight_v7 1x GTX 750 1GB: varies from 228 to 249 H/s for each card
cryptonight_v7 8x GTX 750 1GB: 1897 H/s - 100%

OpenCL

cryptonight_v8 1x GTX 750 1GB: varies from 118 to 130 H/s for each card
cryptonight_v8 8x GTX 750 1GB: 1011 H/s - 53.3%

Losing almost half of the hash rate going from V7 to V8 is brutal.

Below are the auto-generated GPU config files from v7 and v8 for the first two GPUs; the remaining six have the exact same settings as the second GPU. The first GPU has a display attached.

Miner config:

V7 CUDA Nvidia.txt configuration

// gpu: GeForce GTX 750 architecture: 50
//      memory: 859/976 MiB
//      smx: 4

{ "index" : 0,
"threads" : 30, "blocks" : 12,
"bfactor" : 8, "bsleep" :  100,
"affine_to_cpu" : false, "sync_mode" : 3,
},

// gpu: GeForce GTX 750 architecture: 50
//      memory: 943/981 MiB
//      smx: 4

{ "index" : 1,
"threads" : 32, "blocks" : 12,
"bfactor" : 2, "bsleep" :  0,
"affine_to_cpu" : false, "sync_mode" : 3,
},

V8 OpenCL AMD.txt Nvidia configuration

// gpu: GeForce GTX 750 memory:848
// compute units: 4
{ "index" : 0,
"intensity" : 416, "worksize" : 8,
"affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
"unroll" : 8, "comp_mode" : true
},

// gpu: GeForce GTX 750 memory:853
// compute units: 4
{ "index" : 1,
"intensity" : 416, "worksize" : 8,
"affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
"unroll" : 8, "comp_mode" : true
},
kio3i0j9024vkoenio commented 6 years ago

OS: Ubuntu 16.04

Backend: 8x Nvidia GTX 750 1GB

CUDA

cryptonight_v7 1x GTX 750 1GB: varies from 228 to 249 H/s for each card
cryptonight_v7 8x GTX 750 1GB: 1897 H/s - 100%

OpenCL

cryptonight_v8 1x GTX 750 1GB: varies from 156 to 180 H/s for each card
cryptonight_v8 8x GTX 750 1GB: 1373 H/s - 72.4%

Going from V7 to V8 is now a little less brutal but still a 27.6% lower hash rate than v7.
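The percentages in these two reports can be reproduced from the 8-card totals. A minimal sketch; the function name is made up.

```python
# Sketch: v8 hashrate as a percentage of v7, using the 8x GTX 750 totals above.
def relative_hashrate(v7, v8):
    return 100.0 * v8 / v7


V7_TOTAL = 1897  # H/s, CUDA
for label, v8_total in (("auto config", 1011), ("tuned config", 1373)):
    pct = relative_hashrate(V7_TOTAL, v8_total)
    print(f"{label}: {pct:.1f}% of v7, {100 - pct:.1f}% lower")
```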

Below are the GPU config files for v7 and v8

I have tweaked the v8 settings from the auto generated to the best settings I could obtain by various changes and retesting.

Unroll of 4 is the best; going to 8 reduces performance.
Intensity of 352 is also the best; going with the auto-defined 416 kills performance.
Also, changing worksize to either 12 or 4 kills performance.

Miner config:

V7 CUDA Nvidia.txt configuration

// gpu: GeForce GTX 750 architecture: 50
//      memory: 859/976 MiB
//      smx: 4

{ "index" : 0,
"threads" : 30, "blocks" : 12,
"bfactor" : 8, "bsleep" :  100,
"affine_to_cpu" : false, "sync_mode" : 3,
},

// gpu: GeForce GTX 750 architecture: 50
//      memory: 943/981 MiB
//      smx: 4

{ "index" : 1,
"threads" : 32, "blocks" : 12,
"bfactor" : 2, "bsleep" :  0,
"affine_to_cpu" : false, "sync_mode" : 3,
},

V8 OpenCL AMD.txt Nvidia configuration

// gpu: GeForce GTX 750 memory:848
// compute units: 4
{ "index" : 0,
"intensity" : 352, "worksize" : 8,
"affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
"unroll" : 4, "comp_mode" : true
},

// gpu: GeForce GTX 750 memory:853
// compute units: 4
{ "index" : 1,
"intensity" : 352, "worksize" : 8,
"affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
"unroll" : 4, "comp_mode" : true
},
SChernykh commented 6 years ago

OS: Kubuntu
Backend:
 - CPU
 - Intel Core i5-3320M
 - stock settings
speed:
 - cryptonight_v7: 74.1 H/s
 - cryptonight_v8: 74.6 H/s
    { "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 0 },
    { "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 2 },

I can confirm these numbers. I got 73.6 H/s on CNv7 and 74.2 H/s on CNv8 with Core i5-3210M and these settings. Even though it has only 3 MB cache, a second CPU thread helps a lot more when running CNv8.

Bathmat commented 6 years ago

@kio3i0j9024vkoenio did you try "strided_index" : 0?

kio3i0j9024vkoenio commented 6 years ago

Just tried "strided_index" : 0 and the results are exactly the same as with "strided_index" : 2:

cryptonight_v7 1x GTX 750 1GB: varies from 228 to 249 H/s for each card
cryptonight_v7 8x GTX 750 1GB: 1897 H/s - 100%

cryptonight_v8 1x GTX 750 1GB: varies from 156 to 180 H/s for each card
cryptonight_v8 8x GTX 750 1GB: 1376 H/s - 72.5%

EDIT

I have tried many other changes to the config file and the absolute best I can get with OpenCL is:

cryptonight_v8 1x GTX 750 1GB: varies from 158 to 181 H/s for each card
cryptonight_v8 8x GTX 750 1GB: 1392 H/s - 73.4%

The final config is:

// gpu: GeForce GTX 750 memory:848
// compute units: 4 
{ "index" : 0,
"intensity" : 352, "worksize" : 8,
"affine_to_cpu" : false, "strided_index" : 0, "mem_chunk" : 0,
"unroll" : 2, "comp_mode" : false
}, 

I hope that the CUDA version can be made available soon and I hope for better results with it.

Bathmat commented 6 years ago

I have a Win7 rig with 3 Nvidia GPUs that is giving me issues with OpenCL... GPUs are one GTX970, and 2 GTX1050. If I run just 1 gpu, hashrates are about what I expect for OpenCL; however, if I try to run all 3, hashrate drops significantly and watching HWmonitor shows that GPU Utilization will only be 100% for 1 gpu at a time and it rotates between the gpus (thus causing the low hashrate). Does anyone know how to force each GPU to work simultaneously using OpenCL and Win7?

Thoughts @Spudz76, @kio3i0j9024vkoenio? I've tried Googling, but my searches are coming up empty. Perhaps something in nvidia-smi? I've never really used nvidia-smi, so I'm not very familiar.

EDIT: P.S. this rig works just fine on CNv7 and CUDA

psychocrypt commented 6 years ago

Everyone can now check the performance of the native CUDA backend. Please note that the default config for CUDA devices is completely different from the old configs. From my first checks, it looks like old Kepler GPUs will have only 1/3 of the performance compared to v7.

plavirudar commented 6 years ago

1080ti, +600 mem, ethlargement, threads 64, blocks 28, bfactor 12, bsleep 100:

980h/s cnv1, 755h/s cnv2 with 64/28, 770h/s cnv2 with 64/56, 790h/s cnv2 with autogenerated 4/224

Bathmat commented 6 years ago

OS: Win10
GPU: GTX1050 (+160 core, +98 mem = 1987 core, 3600 mem)

CNv7 (CUDA): 325 h/s
CNv8 (CUDA): 283 h/s (87% of CNv7)

EDIT: CNv8 (OpenCL): 296 h/s 🤷‍♂️

Here are some of the configs I tested:

Auto-suggested config:

  // gpu: GeForce GTX 1050 architecture: 61
  //      memory: 1641/2048 MiB
  //      smx: 5
  { "index" : 0,
    "threads" : 4, "blocks" : 40,
    "bfactor" : 8, "bsleep" :  25,
    "affine_to_cpu" : false, "sync_mode" : 3,
  },

HASHRATE REPORT - NVIDIA
| ID |    10s |    60s |    15m |
|  0 |  189.8 |  189.8 |   (na) |
Totals (NVIDIA):   189.8  189.8    0.0 H/s
-----------------------------------------------------------------
Totals (ALL):    189.8  189.8    0.0 H/s

Same config as CNv7:

  { "index" : 0,
    "threads" : 32, "blocks" : 20,
    "bfactor" : 8, "bsleep" :  100,
    "affine_to_cpu" : false, "sync_mode" : 3,
  },

HASHRATE REPORT - NVIDIA
| ID |    10s |    60s |    15m |
|  0 |  275.2 |  275.5 |   (na) |
Totals (NVIDIA):   275.2  275.5    0.0 H/s
-----------------------------------------------------------------
Totals (ALL):    275.2  275.5    0.0 H/s

Best config I could find:

  { "index" : 0,
    "threads" : 32, "blocks" : 10,
    "bfactor" : 8, "bsleep" :  100,
    "affine_to_cpu" : false, "sync_mode" : 3,
  },

HASHRATE REPORT - NVIDIA
| ID |    10s |    60s |    15m |
|  0 |  283.8 |  283.6 |   (na) |
Totals (NVIDIA):   283.8  283.6    0.0 H/s
-----------------------------------------------------------------
Totals (ALL):    283.8  283.6    0.0 H/s

Interestingly, with the last config HWmonitor showed only 36% of the memory being used, whereas it shows 68% being used with CNv7. I saw a similar hashrate using "blocks" : 15 (280h/s) with 52% of memory being used. Trying to use more memory by increasing threads or blocks resulted in lower hashrate.

Bathmat commented 6 years ago

OS: Win10
GPU: GTX-1060 6GB (+150 core, +500 mem = 2000 core, 4300 mem)

CNv7 (CUDA): 520 h/s
CNv8 (CUDA): 458 h/s (88% of CNv7)

Auto-suggested config:

  // gpu: GeForce GTX 1060 6GB architecture: 61
  //      memory: 5080/6144 MiB
  //      smx: 10
  { "index" : 0,
    "threads" : 4, "blocks" : 80,
    "bfactor" : 6, "bsleep" :  25,
    "affine_to_cpu" : false, "sync_mode" : 3,
  },

HASHRATE REPORT - NVIDIA
| ID |    10s |    60s |    15m |
|  0 |  276.5 |  276.1 |   (na) |
Totals (NVIDIA):   276.5  276.1    0.0 H/s
-----------------------------------------------------------------
Totals (ALL):    276.5  276.1    0.0 H/s

Same config as CNv7:

  { "index" : 0,
    "threads" : 32, "blocks" : 30,
    "bfactor" : 8, "bsleep" :  100,
    "affine_to_cpu" : false, "sync_mode" : 3,
  },

HASHRATE REPORT - NVIDIA
| ID |    10s |    60s |    15m |
|  0 |  414.4 |  413.3 |   (na) |
Totals (NVIDIA):   414.4  413.3    0.0 H/s
-----------------------------------------------------------------
Totals (ALL):    414.4  413.3    0.0 H/s

Best config I could find:

  { "index" : 0,
    "threads" : 32, "blocks" : 20,
    "bfactor" : 8, "bsleep" :  100,
    "affine_to_cpu" : false, "sync_mode" : 3,
  },

HASHRATE REPORT - NVIDIA
| ID |    10s |    60s |    15m |
|  0 |  457.6 |  458.2 |   (na) |
Totals (NVIDIA):   457.6  458.2    0.0 H/s
-----------------------------------------------------------------
Totals (ALL):    457.6  458.2    0.0 H/s

@psychocrypt it appears the auto-config for GTX-10xx gpus doesn't work very well. With both of my tests, the best hashrate was found by multiplying SMX times 2 and setting "threads" : 32. Perhaps this should be the auto-config with CNv8?
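The suggestion above (blocks = SMX times 2, threads = 32) can be written down as a tiny helper. This is only a sketch of the proposal based on the two GTX-10xx tests reported here, not the miner's actual auto-config logic; the function name is made up.

```python
# Sketch of the proposed rule for GTX-10xx cards on CNv8:
# "threads" : 32 and "blocks" : smx * 2. Not xmr-stak's real auto-config.
def suggested_cuda_threads_blocks(smx):
    return {"threads": 32, "blocks": 2 * smx}


for name, smx in (("GTX 1050", 5), ("GTX 1060 6GB", 10)):
    cfg = suggested_cuda_threads_blocks(smx)
    print(f'{name}: "threads" : {cfg["threads"]}, "blocks" : {cfg["blocks"]}')
```

For the two cards tested this reproduces the "Best config I could find" entries (32/10 and 32/20).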

psychocrypt commented 6 years ago

Thanks for the tests. I am currently not sure what is best for the auto adjustment. I will try your suggestion. Since the memory access patterns used in cn8 are different from cn7, I first need enough feedback like yours to decide what we choose for the auto config.

Bathmat commented 6 years ago

OS: Win7
GPU: GTX-970 (+200 core, +200 mem = 1470 core, 3700 mem)

CNv7 (CUDA) : 480 h/s
CNv8 (CUDA) : 383 h/s (79.7%)

Best config:

  // gpu: GeForce GTX 970 architecture: 52
  //      memory: 3884/4096 MiB
  //      smx: 13
  { "index" : 0,
    "threads" : 4, "blocks" : 104,
    "bfactor" : 10, "bsleep" :  100,
    "affine_to_cpu" : false, "sync_mode" : 3,
  },

@psychocrypt With this gpu, it appears your auto-config was the best performance. Actually auto-config had "bfactor" : 6, "bsleep" : 25 and gave 397 h/s, but I bumped it up to 10/100 to reduce lag in Windows (the monitor runs off this gpu).

The whole rig is:
0: GTX970
1: GTX1050
2: GTX1050

CNv7 (CUDA) : 1100 h/s
CNv8 (CUDA) : 912 h/s

Config:

  // gpu: GeForce GTX 970 architecture: 52
  //      memory: 3884/4096 MiB
  //      smx: 13
  { "index" : 0,
    "threads" : 4, "blocks" : 104,
    "bfactor" : 10, "bsleep" :  100,
    "affine_to_cpu" : false, "sync_mode" : 3,
  },
  // gpu: GeForce GTX 1050 architecture: 61
  //      memory: 1913/2048 MiB
  //      smx: 5
  { "index" : 1,
    "threads" : 32, "blocks" : 10,
    "bfactor" : 8, "bsleep" :  100,
    "affine_to_cpu" : false, "sync_mode" : 3,
  },
  // gpu: GeForce GTX 1050 architecture: 61
  //      memory: 1913/2048 MiB
  //      smx: 5
  { "index" : 2,
    "threads" : 32, "blocks" : 10,
    "bfactor" : 8, "bsleep" :  100,
    "affine_to_cpu" : false, "sync_mode" : 3,
  },

Threads:
382, 265, 265 = 912

This is actually not bad considering I couldn't get more than one gpu to work at a time with OpenCL. At least this rig won't be completely useless with CNv8 now. 😆

Bathmat commented 6 years ago

Actually @psychocrypt, this config gave the same performance with that GTX-970 (382 h/s):

  { "index" : 0,
    "threads" : 13, "blocks" : 39,
    "bfactor" : 10, "bsleep" :  100,
    "affine_to_cpu" : false, "sync_mode" : 3,
  },
Bathmat commented 6 years ago

Retested my Win10 GTX-1050 and 1060:

GTX-1050: 290.8 h/s
Config:

  { "index" : 0,
    "threads" : 12, "blocks" : 40,
    "bfactor" : 8, "bsleep" :  100,
    "affine_to_cpu" : false, "sync_mode" : 3,
  },

GTX-1060 6GB: 459.3 h/s
Config:

  { "index" : 0,
    "threads" : 10, "blocks" : 80,
    "bfactor" : 8, "bsleep" :  100,
    "affine_to_cpu" : false, "sync_mode" : 3,
  },

@psychocrypt 🤷‍♂️

Bathmat commented 6 years ago

New best results with Win7 rig:
CNv7 Total: 1100 h/s
CNv8:
GTX-970: 393 h/s
GTX1050: 285 h/s (times 2)
Total: 963 h/s (87.5%)

  // gpu: GeForce GTX 970 architecture: 52
  //      memory: 3884/4096 MiB
  //      smx: 13
  { "index" : 0,
    "threads" : 5, "blocks" : 104,
    "bfactor" : 10, "bsleep" :  100,
    "affine_to_cpu" : false, "sync_mode" : 3,
  },
  // gpu: GeForce GTX 1050 architecture: 61
  //      memory: 1913/2048 MiB
  //      smx: 5
  { "index" : 1,
    "threads" : 11, "blocks" : 40,
    "bfactor" : 8, "bsleep" :  100,
    "affine_to_cpu" : false, "sync_mode" : 3,
  },
  // gpu: GeForce GTX 1050 architecture: 61
  //      memory: 1913/2048 MiB
  //      smx: 5
  { "index" : 2,
    "threads" : 11, "blocks" : 40,
    "bfactor" : 8, "bsleep" :  100,
    "affine_to_cpu" : false, "sync_mode" : 3,
  },
psychocrypt commented 6 years ago

I added asm code for double hash to my PR

Spudz76 commented 6 years ago

newest set of CPU patches builds and runs fine on Windows (CPU only tested)

Bathmat commented 6 years ago

OS: Win10
CPU: i3-7350k @ 4.5 GHz (4MB L3)
CNv7 : 165 h/s
CNv8 (asm : off) : 139 h/s
CNv8 (asm : intel) : 151+ h/s
CNv8 (asm : ryzen) : ~144 h/s

"cpu_threads_conf" :
[
    { "low_power_mode" : false, "no_prefetch" : true,  "asm" : "intel", "affine_to_cpu" : 0 },
    { "low_power_mode" : false, "no_prefetch" : true,  "asm" : "intel", "affine_to_cpu" : 2 },

],

P.S. This is my everyday pc with a bunch of stuff running in the background (Chrome, etc...). The CNv7 hashrate is without anything running, but the CNv8 had all that running, so not a fair comparison.

toy1111 commented 6 years ago

Win10
CPU: AMD 1950X (15 cores, core 0 commented out)
cn7: 1290hs
cn8: 1225hs (asm:off)
cn8: 1270hs (asm:ryzen)
(corrected typos...)

Great that CPU mining will still be good with cn8 and nearly the same with the asm code (great work). Curious about Vega results. I'm seeing similar hash drops in cn8 as with other cn-heavy variants, about 1400hs vs 1900+hs. I was hopeful, since CPU mining was nearly the same, that Vegas would still be strong, but that doesn't seem to be the case with cn8.

SChernykh commented 6 years ago

@toynn

cn7: 1290hs
cn8: 1925hs (asm:off)
cn8: 1970hs (asm:ryzen)

Is there a typo in cn7 number? Was it supposed to be 1990 H/s? Vega should be fine with cn8 (at least with Cast XMR): https://github.com/SChernykh/xmr-stak-cpu/issues/1#issuecomment-425674350 - you probably didn't use the optimal config for xmr-stak. Try different combinations of worksize = 16/32 and unroll = 8/16, as well as running 1 or 2 GPU threads with different or same intensities.

kio3i0j9024vkoenio commented 6 years ago

OS: Ubuntu 16.04

Backend: CPU Only

CPU 4x Xeon E7-8837's in a HP DL580 G7

cryptonight_v7: 1624 H/s - 100%
cryptonight_v8: 1522 H/s - 93.7%

Miner config:

"cpu_threads_conf" :
[
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "intel", "affine_to_cpu" : 0 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "intel", "affine_to_cpu" : 1 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "intel", "affine_to_cpu" : 2 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "intel", "affine_to_cpu" : 3 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "intel", "affine_to_cpu" : 4 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "intel", "affine_to_cpu" : 5 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "intel", "affine_to_cpu" : 6 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "intel", "affine_to_cpu" : 7 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "intel", "affine_to_cpu" : 8 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "intel", "affine_to_cpu" : 9 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "intel", "affine_to_cpu" : 10 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "intel", "affine_to_cpu" : 11 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "intel", "affine_to_cpu" : 12 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "intel", "affine_to_cpu" : 13 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "intel", "affine_to_cpu" : 14 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "intel", "affine_to_cpu" : 15 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "intel", "affine_to_cpu" : 16 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "intel", "affine_to_cpu" : 17 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "intel", "affine_to_cpu" : 18 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "intel", "affine_to_cpu" : 19 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "intel", "affine_to_cpu" : 20 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "intel", "affine_to_cpu" : 21 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "intel", "affine_to_cpu" : 22 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "intel", "affine_to_cpu" : 23 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "intel", "affine_to_cpu" : 24 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "intel", "affine_to_cpu" : 25 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "intel", "affine_to_cpu" : 26 },
{ "low_power_mode" : true, "no_prefetch" : false,  "asm" : "intel", "affine_to_cpu" : 27 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "intel", "affine_to_cpu" : 28 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "intel", "affine_to_cpu" : 29 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "intel", "affine_to_cpu" : 30 },
{ "low_power_mode" : false, "no_prefetch" : true,  "asm" : "intel", "affine_to_cpu" : 31 },
],
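
The 32-entry block above follows a regular pattern: groups of four cores alternate between `low_power_mode` true / `no_prefetch` false and the reverse. A small sketch that regenerates the same lines, in case someone wants to adapt it to a different core count. The function name and its parameters are assumptions for illustration, and the reason for alternating the settings is not stated in the post:

```python
def cpu_threads_conf(num_cpus=32, group=4, asm="intel"):
    """Build the lines of a cpu_threads_conf block like the one above.

    Cores are split into groups of `group`; even-numbered groups get
    low_power_mode=true / no_prefetch=false, odd-numbered groups the reverse.
    """
    lines = ['"cpu_threads_conf" :', "["]
    for cpu in range(num_cpus):
        low_power = (cpu // group) % 2 == 0
        lines.append(
            '{ "low_power_mode" : %s, "no_prefetch" : %s,  "asm" : "%s", "affine_to_cpu" : %d },'
            % (str(low_power).lower(), str(not low_power).lower(), asm, cpu)
        )
    lines.append("],")
    return lines


if __name__ == "__main__":
    print("\n".join(cpu_threads_conf()))
```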

kio3i0j9024vkoenio commented 6 years ago

OS: Ubuntu 16.04

Backend: 8x Nvidia GTX 750 1GB

CUDA V7

cryptonight_v7 1x GTX 750 1GB: varies from 228 to 249 H/s for each card
cryptonight_v7 8x GTX 750 1GB: 1897 H/s - 100%

OpenCL V8

cryptonight_v8 1x GTX 750 1GB: varies from 158 to 182 H/s for each card
cryptonight_v8 8x GTX 750 1GB: 1382 H/s - 72.8%

https://github.com/fireice-uk/xmr-stak/issues/1851#issuecomment-424208972

OpenCL (AMD.TXT) Config for each GTX 750 GPU:

// gpu: GeForce GTX 750 memory:848
// compute units: 4 
{ "index" : 0-7,
"intensity" : 352, "worksize" : 8,
"affine_to_cpu" : false, "strided_index" : 0, "mem_chunk" : 0,
"unroll" : 2, "comp_mode" : false
}

CUDA V8

cryptonight_v8 1x GTX 750 1GB: varies from 137 to 158 H/s for each card
cryptonight_v8 8x GTX 750 1GB: 1178 H/s - 62.1%

CUDA (NVIDIA.TXT) Config for each GTX 750 GPU:

// gpu: GeForce GTX 750 architecture: 50                                                
//      memory: 836/976 MiB                                                
//      smx: 4                                                
{ "index" : 0-7,                                                
  "threads" : 32, "blocks" : 8,                                                
  "bfactor" : 2, "bsleep" :  0,                                                
  "affine_to_cpu" : false, "sync_mode" : 3,                                                
  },

The above CUDA Nvidia Configuration is the best that could be obtained by changing the "threads" and "blocks" numbers.

The auto-generated values of "threads" : 4, "blocks" : 32 only produce 822 H/s (98 to 106 H/s for each card).

Other configurations tried:

"threads" : 32, "blocks" : 12 produces 1059 H/s (123 to 138 H/s for each card)
"threads" : 40, "blocks" : 8 produces 1173 H/s (138 to 154 H/s for each card)

Trying T64/B4, T44/B8, T24/B8 all produced worse results.

Bathmat's suggestion of setting "blocks" to SMX times 2 and "threads" to 32, instead of the auto-generated values, works out best here as well.

https://github.com/fireice-uk/xmr-stak/issues/1851#issuecomment-425781831

After all this the best solution for my 50+ GTX 750/750 Ti mining operation is to run OpenCL and not CUDA because CUDA V8 is 14.8% worse than OpenCL V8 currently.

Still, a loss of 27.2% for V8 on OpenCL vs V7 kinda sucks, as other GPUs seem to lose only 3-7% going from V7 to V8.

Bathmat commented 6 years ago

@kio3i0j9024vkoenio I also got similar performance by setting "threads" : 10-12 and using "blocks" : SMX times 8, but my oldest GPU is a 970.

Edit: actually the best result with my 970 was "threads" : 5 and "blocks" : SMX times 8
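
The threads/blocks settings discussed in these posts all have the form "threads = N, blocks = SMX count times k". A rough sketch that expands such pairs for a given SMX count; the helper name `cuda_candidates` and its default list of pairs are assumptions for illustration (the pairs are the ones mentioned above), not something xmr-stak provides:

```python
def cuda_candidates(smx, settings=((32, 2), (12, 8), (10, 8), (5, 8))):
    """Turn (threads, blocks-per-SMX) pairs into threads/blocks values.

    `smx` is the SMX count the miner prints in the generated config
    comment (e.g. "smx: 4" for the GTX 750 above).
    """
    return [
        {"threads": threads, "blocks": smx * per_smx}
        for threads, per_smx in settings
    ]


if __name__ == "__main__":
    for candidate in cuda_candidates(smx=4):
        print(candidate)
```

For the GTX 750 (smx: 4) the first pair gives "threads" : 32, "blocks" : 8, which is the configuration reported as best for that card. Which pair wins on other cards would still have to be measured.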

Bathmat commented 6 years ago

Curious about Vega results. I'm seeing the similar hash drops in cn8 as with other cn-heavy variants - about 1400hs vs 1900+hs. I was hopeful since CPU mining was nearly the same that Vegas would still be strong but that doesn't seem to be the case with cn8.

I'm about to buy a Vega just so I can test it, lol. Only problem is that it might take a week to get, so I'm SOL if its performance with v8 is bad, haha.

Edit, just saw this, so I guess it's fine.... Might as well pull the trigger!

kio3i0j9024vkoenio commented 6 years ago

In my posts above I tested XMR-STAK V8 for CPU only (4x Xeon E7-8837's: 1522 H/s) and GPU only (8x GTX 750 1GB OpenCL: 1382 H/s). Adding them together gives 2904 H/s, or 82.5% of the 3521 H/s I was getting for V7.

However, when running CPU and GPU (OpenCL) together in XMR-STAK, the actual total is only 2740 H/s, a drop of 164 H/s.

The best I was able to obtain running both CPU and GPU OpenCL is 2845 H/s, by disabling a single thread (core #7 on CPU #0) and then setting all of the GPUs to "affine_to_cpu" : 7 in the AMD.txt config, which gained back 105 H/s.

So my results for my HP DL580 G7 with 8x Nvidia GTX 750's are:

V7: 3521 H/s - 100%
V8: 2845 H/s - 80.8%
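
The arithmetic behind these totals, as a short sketch. The hashrates are the ones reported in this thread; the function name and the dictionary keys are made up for illustration:

```python
def combined_summary(cpu_v7, gpu_v7, cpu_v8, gpu_v8, measured_v8):
    """Compare a measured combined v8 hashrate with the v7 baseline and
    with the sum of the separate CPU-only and GPU-only v8 runs."""
    baseline = cpu_v7 + gpu_v7      # v7 total (CPU + GPU)
    sum_of_parts = cpu_v8 + gpu_v8  # v8 CPU-only plus v8 GPU-only
    return {
        "v7_total": baseline,
        "v8_sum_of_parts": sum_of_parts,
        "shortfall": sum_of_parts - measured_v8,
        "percent_of_v7": round(100.0 * measured_v8 / baseline, 1),
    }


if __name__ == "__main__":
    # First combined run (2740 H/s), then the tuned run (2845 H/s).
    print(combined_summary(1624, 1897, 1522, 1382, 2740))
    print(combined_summary(1624, 1897, 1522, 1382, 2845))
```

The shortfall is simply the gap between the separate runs added together and the combined run; the sketch does not say anything about its cause.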

toy1111 commented 6 years ago

@toynn

cn7: 1290hs cn8: 1925hs (asm:off) cn8: 1970hs (asm:ryzen)

Is there a typo in cn7 number? Was it supposed to be 1990 H/s? Vega should be fine with cn8 (at least with Cast XMR): SChernykh/xmr-stak-cpu#1 (comment) - you probably didn't use the optimal config for xmr-stak. Try different combinations of worksize = 16/32 and unroll = 8/16, as well as running 1 or 2 GPU threads with different or same intensities.

@SChernykh Yes typos - too much Vega on my mind I guess. cn8: 1225hs (asm:off) cn8: 1270hs (asm:ryzen) Corrected post and thank you for catching that. I'll test my Vegas again and report back.

toy1111 commented 6 years ago

I've been testing at http://killallasics.moneroworld.com:7777/ but I keep getting "Result rejected" (low difficulty). Difficulty starts at 10000 and adjusts lower. Is this expected?

SChernykh commented 6 years ago

@toynn It's not expected. Are you sure you're running the correct algorithm? Double check your pools.txt config.

enerc commented 6 years ago

cn7 on R9 Fury stock ROCm 1.9 - "intensity" : 896, "worksize" : 32, "strided_index" : 2, "mem_chunk" : 2, "unroll" : 8: 690 H/s
cn8 on R9 Fury stock ROCm 1.9 - "intensity" : 896, "worksize" : 32, "strided_index" : 2, "mem_chunk" : 2, "unroll" : 8: 434 H/s

cn7 on Vega 56 stock ROCm 1.9 - 2 threads - "intensity" : 1792, "worksize" : 32, "strided_index" : 2, "mem_chunk" : 2, "unroll" : 8: 800 + 800 -> 1600 H/s
cn8 on Vega 56 stock ROCm 1.9 - 2 threads - "intensity" : 1792, "worksize" : 32, "strided_index" : 2, "mem_chunk" : 2, "unroll" : 8: 476 + 476 -> 953 H/s

Fury and Vega are both seeing a significant drop in H/s.

SChernykh commented 6 years ago

@enerc Why 476+476? People managed to get much better numbers for Vega before.

enerc commented 6 years ago

@SChernykh I don't know. During the test the GPU is pulling 190 W according to the rocm-smi monitor, running SCLK 1630 MHz and MCLK 945 MHz, temp 52°C. I tried different worksize/mem_chunk/unroll combinations and they had little impact.

SChernykh commented 6 years ago

@enerc You need much higher intensity for Vega 56/64 to achieve good numbers.