Open psychocrypt opened 6 years ago
I didn't run the benchmark this time; I just started mining for a few minutes and checked the highest reported hashrate.
OS: Windows 7
Backend: CPU, Intel Core i5 3210M
speed:
- cryptonight_v7: 74.3 H/s
- cryptonight_v8: 67.3 H/s
config:
"cpu_threads_conf" :
[
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 0 },
],
This is strange - I get 68.3 H/s with xmrig on this machine. It uses the same asm code, the only difference is that I used Visual Studio 2017 to compile xmr-stak and MSYS2 with GCC 8.2.0 to compile xmrig.
OS: Windows 10
Backend: CPU, AMD Ryzen 5 2600 @ 4 GHz
speed:
- cryptonight_v7: 630.6 H/s
- cryptonight_v8: 627.3 H/s
config:
"cpu_threads_conf" :
[
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 0 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 2 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 4 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 5 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 6 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 8 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 10 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 11 },
],
xmrig showed identical performance this time.
OS: Windows 7
Backend: CPU, Intel Core i7 2600k
speed:
- cryptonight_v7: 287.5 H/s
- cryptonight_v8: 264.3 H/s
config:
"cpu_threads_conf" :
[
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 0 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 2 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 4 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 6 },
],
Again, xmrig showed a bit higher hashrate - 267.9 H/s.
@SChernykh Thanks for your tests. A difference of a few hashes is possible, depending on which port of the test pool you are mining on. If you are lucky and find a lot of hashes, then the hash rate will go down by a few hashes. But let us wait for other results.
Big thanks again for the asm code.
I've also tested RX 560: hashrate numbers here are what was reported as highest when mining. I double checked it - performance is identical, I was mining v7 against v7 pool and v8 against v8 pool, all shares were accepted. I tried a few different configs, but I couldn't find faster settings for v7.
OS: Windows 10
Backend: OpenCL, AMD Radeon RX 560 4GB, 1 click PBE timing straps, core @ 1196 MHz, memory @ 2200 MHz
speed:
- cryptonight_v7: 469.3 H/s
- cryptonight_v8: 469.3 H/s
config for v7:
"gpu_threads_conf" : [
// gpu: Baffin memory:2752
// compute units: 16
{ "index" : 0,
"intensity" : 512, "worksize" : 32,
"affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
"unroll" : 8, "comp_mode" : true
},
{ "index" : 0,
"intensity" : 512, "worksize" : 32,
"affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
"unroll" : 8, "comp_mode" : true
},
],
config for v8:
"gpu_threads_conf" : [
// gpu: Baffin memory:2752
// compute units: 16
{ "index" : 0,
"intensity" : 1024, "worksize" : 32,
"affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
"unroll" : 16, "comp_mode" : true
},
],
60 second benchmark on the same RX 560 with configs listed above:
speed:
- cryptonight_v7: 466.2 H/s
- cryptonight_v8: 461.8 H/s
I'm not sure what numbers to trust more.
@MoneroCrusher @mobilepolice @kio3i0j9024vkoenio @Bathmat This is the final code for the next Monero PoW, it would be good if you tested it on everything you got and posted results here.
Edit: everything except NVIDIA GPUs, CUDA version is not ready yet.
CPU is i7-2600 non-K cache 8M
OS: XX
Backend: CPU
speed:
- cryptonight_v7: 161 H/s
- cryptonight_v8: 162 H/s
Miner config:
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 0 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 2 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 4 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 6 },
Also tried prefetch both ways and all three asm settings; in all cases the above was the winner. High variance between tests (I blame Firefox in the background), but I ran it a bunch of times and took the best. Did the N=2 optimization get in? I skipped the l_p_m:2 tests for now.
@Spudz76 it's better to test without anything running in background, right after reboot. Double hash asm code is not added to xmr-stak yet.
Nvidia OpenCL causes severe lag for my computer (1080ti) and makes it unusable while mining. Previously, using CUDA with bfactor=12 and bsleep=100 caused no slowdown of the computer while doing word processing and basic videos.
Is there a tweak that I'm missing here? Currently it's so laggy that it causes music to stutter and even pressing "h" to show hashrate isn't working properly.
@plavirudar you could try lowering the intensity... might help @psychocrypt do you expect to be able to get a working CUDA version? I have an older Win7 rig with a GTX970 and 2x GTX1050s that OpenCL is not playing nice with. I'll keep troubleshooting though.
@Bathmat I started off with intensity 896 (the recommended value) and it was hashing ~1000 h/s with ethlargement + 600 MHz mem OC (which is similar to its performance on cnh/cnv1); however, the computer was unusable. When I dropped intensity to 640, the computer was still unusable, but the hashrate fell to 800, which was lower than its performance on cnv2 CUDA.
@Spudz76 Are you sure you're using the right CPU? I have an i7-2600 non-K and I'm getting 220 with "asm":"off", 256 with "asm":"intel", 250 with "asm":"ryzen" and 270 with CNv7 (so ~5% slowdown after using the best asm optimizations). That CPU is also running a bunch of random shitcoin daemons as well, so it's not even performing at max.
Not sure what you meant by OS:XX, I tried mine on Ubuntu 16.04 with threads 0,1,2,3.
OS: Kubuntu
Backend:
- CPU
- Intel Core i5-3320M
- stock settings
speed:
- cryptonight_v7: 74.1 H/s
- cryptonight_v8: 74.6 H/s
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 0 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 2 },
@plavirudar Win7 and all sorts of garbage running in the background (my daily driver desktop box). Definitely not rebooting, let alone closing all these tabs. But I have others I can test that are only for mining (not an i7-2600 though); most of them are non-AES and Linux, so I was testing the applicable AES-capable stuff I have first.
Also, this is about comparison from v7 to v8, not global competition. I ran so many passes and took the highest, which should account for background task variances. So whatever I have holding me back is doing the same thing to v8, and the delta still applies (and in this case v8 was faster). I don't mine on this box normally, but it is an extra test point (and has a GTX970 as well). I did tell Firefox to quit fiddling with the GPU for offloading and a few other easy avoidance measures (mostly to open more VRAM).
Same Win7 box as above GTX970-4GB stock / driver profile max performance + P0
Backend: OpenCL->NVIDIA
speed:
- cryptonight_v7: 448 H/s
- cryptonight_v8: 378 H/s
Miner config:
// gpu: GeForce GTX 970 memory:3968
// compute units: 13
{ "index" : 0,
"intensity" : 832, "worksize" : 8,
"affine_to_cpu" : false, "strided_index" : 0, "mem_chunk" : 2,
"unroll" : 4, "comp_mode" : false
},
832 = 13 * 8 * 8 = smx * 8 * worksize, which seemed to work best.
Unroll 4 and 8 gave the same performance.
Strided 2 and chunk 2, 3, 4 did not change much (mostly the same, some slightly worse).
Similar with worksizes and intensities.
In all cases Windows chokes the entire time the miner is hashing.
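As a rough illustration of the arithmetic in this post, the intensity that worked best here is a multiple of the compute unit count and the worksize. The sketch below only reproduces that rule of thumb; the function name and the `waves` parameter are hypothetical, and it is not how xmr-stak computes its own auto-config.

```python
def intensity_from_smx(smx: int, worksize: int, waves: int = 8) -> int:
    """Rule of thumb from this post: intensity = smx * 8 * worksize.

    `waves` is a hypothetical name for the constant factor 8.
    """
    return smx * waves * worksize


# GTX 970 from the config above: 13 compute units, worksize 8.
print(intensity_from_smx(13, 8))  # 13 * 8 * 8 = 832
```

Other cards may well prefer a different factor, so treat the result as a starting point for manual tuning.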
OS: Win10 1803
Backend: AMD GPUs (4), all bios modded with 1 click timings (all hynix mem, 2000 clock)
0: RX580 8GB Gigabyte, 99W
1: RX480 4GB AMD brand, 82W
2: RX570 4GB Sapphire ITX, 101W
3: RX470 4GB Sapphire, 80W
speed:
- cryptonight_v7: 3530 H/s
- cryptonight_v8: 3350 H/s
HASHRATE REPORT - AMD
| ID | 10s | 60s | 15m | ID | 10s | 60s | 15m |
| 0 | 394.0 | 394.0 | (na) | 1 | 395.2 | 393.7 | (na) |
| 2 | 425.5 | 425.5 | (na) | 3 | 424.9 | 425.2 | (na) |
| 4 | 435.3 | 435.0 | (na) | 5 | 434.4 | 434.7 | (na) |
| 6 | 420.7 | 421.1 | (na) | 7 | 421.2 | 421.1 | (na) |
Totals (AMD): 3351.1 3350.5 0.0 H/s
-----------------------------------------------------------------
Totals (ALL): 3351.1 3350.5 0.0 H/s
amd.txt config:
"gpu_threads_conf" : [
// gpu: Ellesmere memory:3920
// compute units: 36
{ "index" : 0,
"intensity" : 896, "worksize" : 16,
"affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
"unroll" : 4, "comp_mode" : true
},
{ "index" : 0,
"intensity" : 896, "worksize" : 8,
"affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
"unroll" : 4, "comp_mode" : true
},
// gpu: Ellesmere memory:3712
// compute units: 36
{ "index" : 1,
"intensity" : 896, "worksize" : 16,
"affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
"unroll" : 4, "comp_mode" : true
},
{ "index" : 1,
"intensity" : 896, "worksize" : 8,
"affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
"unroll" : 4, "comp_mode" : true
},
// gpu: Ellesmere memory:3712
// compute units: 32
{ "index" : 2,
"intensity" : 896, "worksize" : 16,
"affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
"unroll" : 4, "comp_mode" : true
},
{ "index" : 2,
"intensity" : 896, "worksize" : 8,
"affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
"unroll" : 4, "comp_mode" : true
},
// gpu: Ellesmere memory:3712
// compute units: 32
{ "index" : 3,
"intensity" : 896, "worksize" : 16,
"affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
"unroll" : 4, "comp_mode" : true
},
{ "index" : 3,
"intensity" : 896, "worksize" : 8,
"affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
"unroll" : 4, "comp_mode" : true
},
],
Performance is in line with what I was expecting. Power consumption is from HWMonitor for CNv8-2. Power use for CNv7 was lower by 3-7 watts depending on the GPU, but again, this was expected. Turns out that changing unroll didn't seem to have an effect. I tested all 8, all 4, all 1 and a mix of 8 on W:16 threads, 4 on W:8 threads, and total hashrate remained within 10 h/s for each.
Note: I did try a single thread test; however, hashrate was 7% slower than dual thread, and power consumption was the same as dual thread.
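To compare cards rather than miner threads, the 60s column of the hashrate report can be grouped per GPU. This is a minimal sketch that assumes thread IDs follow the order of the entries in amd.txt (two threads per "index"); that mapping is an assumption and is not stated in the report itself.

```python
from collections import defaultdict

# 60s column of the hashrate report in this post, keyed by miner thread ID.
thread_rates = {0: 394.0, 1: 393.7, 2: 425.5, 3: 425.2,
                4: 435.0, 5: 434.7, 6: 421.1, 7: 421.1}

# Assumption: threads 2*i and 2*i + 1 belong to the GPU with "index" : i,
# because amd.txt lists two thread entries per GPU in that order.
THREADS_PER_GPU = 2


def per_gpu_totals(rates, threads_per_gpu):
    """Sum per-thread hashrates into per-GPU hashrates."""
    totals = defaultdict(float)
    for thread_id, rate in rates.items():
        totals[thread_id // threads_per_gpu] += rate
    return dict(totals)


totals = per_gpu_totals(thread_rates, THREADS_PER_GPU)
for gpu, rate in sorted(totals.items()):
    print(f"GPU {gpu}: {rate:.1f} H/s")
print(f"Total: {sum(totals.values()):.1f} H/s")
```

The summed total can differ by a few tenths from the miner's own "Totals" line, presumably because the report rounds each thread's value before printing.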
OS: Ubuntu 16.04
Backend: CPU Only
CPU 4x Xeon E7-8837's in a HP DL580 G7
cryptonight_v7: 1624 H/s - 100%
cryptonight_v8: 1293 H/s - 79.6%
Note that using SChernykh XMR-Stak-CPU latest code with all the same v8 changes and the optimized asm for 1x and 2x threads produces:
Miner config:
"cpu_threads_conf" :
[
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "ryzen", "affine_to_cpu" : 0 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "ryzen", "affine_to_cpu" : 1 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "ryzen", "affine_to_cpu" : 2 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "ryzen", "affine_to_cpu" : 3 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "ryzen", "affine_to_cpu" : 4 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "ryzen", "affine_to_cpu" : 5 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "ryzen", "affine_to_cpu" : 6 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "ryzen", "affine_to_cpu" : 7 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "ryzen", "affine_to_cpu" : 8 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "ryzen", "affine_to_cpu" : 9 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "ryzen", "affine_to_cpu" : 10 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "ryzen", "affine_to_cpu" : 11 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "ryzen", "affine_to_cpu" : 12 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "ryzen", "affine_to_cpu" : 13 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "ryzen", "affine_to_cpu" : 14 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "ryzen", "affine_to_cpu" : 15 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "ryzen", "affine_to_cpu" : 16 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "ryzen", "affine_to_cpu" : 17 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "ryzen", "affine_to_cpu" : 18 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "ryzen", "affine_to_cpu" : 19 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "ryzen", "affine_to_cpu" : 20 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "ryzen", "affine_to_cpu" : 21 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "ryzen", "affine_to_cpu" : 22 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "ryzen", "affine_to_cpu" : 23 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "ryzen", "affine_to_cpu" : 24 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "ryzen", "affine_to_cpu" : 25 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "ryzen", "affine_to_cpu" : 26 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "ryzen", "affine_to_cpu" : 27 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "ryzen", "affine_to_cpu" : 28 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "ryzen", "affine_to_cpu" : 29 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "ryzen", "affine_to_cpu" : 30 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "ryzen", "affine_to_cpu" : 31 },
],
OS: Ubuntu 16.04
Backend: 8x Nvidia GTX 750 1GB
CUDA
cryptonight_v7 1x GTX 750 1GB: varies from 228 to 249 H/s for each card
cryptonight_v7 8x GTX 750 1GB: 1897 H/s - 100%
OpenCL
cryptonight_v8 1x GTX 750 1GB: varies from 118 to 130 H/s for each card
cryptonight_v8 8x GTX 750 1GB: 1011 H/s - 53.3%
Losing almost half of the hash rate going from V7 to V8 is brutal.
Below are the auto-config GPU config files from v7 and v8 for the first two GPUs; the remaining six have the exact same settings as the second GPU. The first GPU has a display attached.
Miner config:
V7 CUDA Nvidia.txt configuration
// gpu: GeForce GTX 750 architecture: 50
// memory: 859/976 MiB
// smx: 4
{ "index" : 0,
"threads" : 30, "blocks" : 12,
"bfactor" : 8, "bsleep" : 100,
"affine_to_cpu" : false, "sync_mode" : 3,
},
// gpu: GeForce GTX 750 architecture: 50
// memory: 943/981 MiB
// smx: 4
{ "index" : 1,
"threads" : 32, "blocks" : 12,
"bfactor" : 2, "bsleep" : 0,
"affine_to_cpu" : false, "sync_mode" : 3,
},
V8 OpenCL AMD.txt Nvidia configuration
// gpu: GeForce GTX 750 memory:848
// compute units: 4
{ "index" : 0,
"intensity" : 416, "worksize" : 8,
"affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
"unroll" : 8, "comp_mode" : true
},
// gpu: GeForce GTX 750 memory:853
// compute units: 4
{ "index" : 1,
"intensity" : 416, "worksize" : 8,
"affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
"unroll" : 8, "comp_mode" : true
},
OS: Ubuntu 16.04
Backend: 8x Nvidia GTX 750 1GB
CUDA
cryptonight_v7 1x GTX 750 1GB: varies from 228 to 249 H/s for each card
cryptonight_v7 8x GTX 750 1GB: 1897 H/s - 100%
OpenCL
cryptonight_v8 1x GTX 750 1GB: varies from 156 to 180 H/s for each card
cryptonight_v8 8x GTX 750 1GB: 1373 H/s - 72.4%
Going from V7 to V8 is now a little less brutal but still a 27.6% lower hash rate than v7.
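For anyone repeating these comparisons, the percentages quoted in this thread are simply the v8 hashrate divided by the v7 hashrate. A small sketch using the 8x GTX 750 totals from this post (the function name is hypothetical):

```python
def relative_hashrate(new_hs: float, baseline_hs: float) -> float:
    """Return new_hs as a percentage of baseline_hs."""
    return 100.0 * new_hs / baseline_hs


v7_total = 1897.0  # cryptonight_v7, 8x GTX 750, CUDA
v8_total = 1373.0  # cryptonight_v8, 8x GTX 750, OpenCL

pct = relative_hashrate(v8_total, v7_total)
print(f"v8 is {pct:.1f}% of v7, a {100.0 - pct:.1f}% drop")
```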
Below are the GPU config files for v7 and v8
I have tweaked the v8 settings from the auto generated to the best settings I could obtain by various changes and retesting.
Unroll of 4 is the best; going to 8 reduces performance.
Intensity of 352 is also the best; going with the auto-defined 416 kills performance.
Also, changing worksize to either 12 or 4 kills performance.
Miner config:
V7 CUDA Nvidia.txt configuration
// gpu: GeForce GTX 750 architecture: 50
// memory: 859/976 MiB
// smx: 4
{ "index" : 0,
"threads" : 30, "blocks" : 12,
"bfactor" : 8, "bsleep" : 100,
"affine_to_cpu" : false, "sync_mode" : 3,
},
// gpu: GeForce GTX 750 architecture: 50
// memory: 943/981 MiB
// smx: 4
{ "index" : 1,
"threads" : 32, "blocks" : 12,
"bfactor" : 2, "bsleep" : 0,
"affine_to_cpu" : false, "sync_mode" : 3,
},
V8 OpenCL AMD.txt Nvidia configuration
// gpu: GeForce GTX 750 memory:848
// compute units: 4
{ "index" : 0,
"intensity" : 352, "worksize" : 8,
"affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
"unroll" : 4, "comp_mode" : true
},
// gpu: GeForce GTX 750 memory:853
// compute units: 4
{ "index" : 1,
"intensity" : 352, "worksize" : 8,
"affine_to_cpu" : false, "strided_index" : 2, "mem_chunk" : 2,
"unroll" : 4, "comp_mode" : true
},
OS: Kubuntu
Backend:
- CPU
- Intel Core i5-3320M
- stock settings
speed:
- cryptonight_v7: 74.1 H/s
- cryptonight_v8: 74.6 H/s
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 0 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 2 },
I can confirm these numbers. I got 73.6 H/s on CNv7 and 74.2 H/s on CNv8 with Core i5-3210M and these settings. Even though it has only 3 MB cache, second CPU thread helps a lot more when running CNv8.
@kio3i0j9024vkoenio did you try "strided_index" : 0?
Just tried "strided_index" : 0 and the results are exactly the same as with "strided_index" : 2:
cryptonight_v7 1x GTX 750 1GB: varies from 228 to 249 H/s for each card
cryptonight_v7 8x GTX 750 1GB: 1897 H/s - 100%
cryptonight_v8 1x GTX 750 1GB: varies from 156 to 180 H/s for each card
cryptonight_v8 8x GTX 750 1GB: 1376 H/s - 72.5%
EDIT
I have tried many other changes to the config file and the absolute best I can get with OpenCL is:
cryptonight_v8 1x GTX 750 1GB: varies from 158 to 181 H/s for each card
cryptonight_v8 8x GTX 750 1GB: 1392 H/s - 73.4%
The final config is:
// gpu: GeForce GTX 750 memory:848
// compute units: 4
{ "index" : 0,
"intensity" : 352, "worksize" : 8,
"affine_to_cpu" : false, "strided_index" : 0, "mem_chunk" : 0,
"unroll" : 2, "comp_mode" : false
},
I hope that the CUDA version can be made available soon and I hope for better results with it.
I have a Win7 rig with 3 Nvidia GPUs that is giving me issues with OpenCL... GPUs are one GTX970, and 2 GTX1050. If I run just 1 gpu, hashrates are about what I expect for OpenCL; however, if I try to run all 3, hashrate drops significantly and watching HWmonitor shows that GPU Utilization will only be 100% for 1 gpu at a time and it rotates between the gpus (thus causing the low hashrate). Does anyone know how to force each GPU to work simultaneously using OpenCL and Win7?
Thoughts @Spudz76, @kio3i0j9024vkoenio? I've tried Googling, but my searches are coming up empty. Perhaps something in nvidia-smi? I've never really used nvidia-smi, so I'm not very familiar.
EDIT: P.S. this rig works just fine on CNv7 and CUDA
Everyone can now check the performance of the native CUDA backend. Please note that the default config for CUDA devices is completely different from the old configs. From my first checks it looks like old Kepler GPUs will have only 1/3 of the performance compared to v7.
1080ti, +600 mem, ethlargement, threads 64, blocks 28, bfactor 12, bsleep 100:
980h/s cnv1, 755h/s cnv2 with 64/28, 770h/s cnv2 with 64/56, 790h/s cnv2 with autogenerated 4/224
OS: Win10
GPU: GTX1050 (+160 core, +98 mem = 1987 core, 3600 mem)
CNv7 (CUDA): 325 h/s
CNv8 (CUDA): 283 h/s (87% of CNv7)
EDIT: CNv8 (OpenCL): 296 h/s 🤷‍♂️
Here are some of the configs I tested:
Auto-suggested config:
// gpu: GeForce GTX 1050 architecture: 61
// memory: 1641/2048 MiB
// smx: 5
{ "index" : 0,
"threads" : 4, "blocks" : 40,
"bfactor" : 8, "bsleep" : 25,
"affine_to_cpu" : false, "sync_mode" : 3,
},
HASHRATE REPORT - NVIDIA
| ID | 10s | 60s | 15m |
| 0 | 189.8 | 189.8 | (na) |
Totals (NVIDIA): 189.8 189.8 0.0 H/s
-----------------------------------------------------------------
Totals (ALL): 189.8 189.8 0.0 H/s
Same config as CNv7:
{ "index" : 0,
"threads" : 32, "blocks" : 20,
"bfactor" : 8, "bsleep" : 100,
"affine_to_cpu" : false, "sync_mode" : 3,
},
HASHRATE REPORT - NVIDIA
| ID | 10s | 60s | 15m |
| 0 | 275.2 | 275.5 | (na) |
Totals (NVIDIA): 275.2 275.5 0.0 H/s
-----------------------------------------------------------------
Totals (ALL): 275.2 275.5 0.0 H/s
Best config I could find:
{ "index" : 0,
"threads" : 32, "blocks" : 10,
"bfactor" : 8, "bsleep" : 100,
"affine_to_cpu" : false, "sync_mode" : 3,
},
HASHRATE REPORT - NVIDIA
| ID | 10s | 60s | 15m |
| 0 | 283.8 | 283.6 | (na) |
Totals (NVIDIA): 283.8 283.6 0.0 H/s
-----------------------------------------------------------------
Totals (ALL): 283.8 283.6 0.0 H/s
Interestingly, HWmonitor showed the last config using only 36% of the memory, whereas it shows 68% being used with CNv7. I saw a similar hashrate using "blocks" : 15 (280 h/s), with 52% of memory being used. Trying to use more memory by increasing threads or blocks resulted in a lower hashrate.
OS: Win10
GPU: GTX-1060 6GB (+150 core, +500 mem = 2000 core, 4300 mem)
CNv7 (CUDA): 520 h/s
CNv8 (CUDA): 458 h/s (88% of CNv7)
Auto-suggested config:
// gpu: GeForce GTX 1060 6GB architecture: 61
// memory: 5080/6144 MiB
// smx: 10
{ "index" : 0,
"threads" : 4, "blocks" : 80,
"bfactor" : 6, "bsleep" : 25,
"affine_to_cpu" : false, "sync_mode" : 3,
},
HASHRATE REPORT - NVIDIA
| ID | 10s | 60s | 15m |
| 0 | 276.5 | 276.1 | (na) |
Totals (NVIDIA): 276.5 276.1 0.0 H/s
-----------------------------------------------------------------
Totals (ALL): 276.5 276.1 0.0 H/s
Same config as CNv7:
{ "index" : 0,
"threads" : 32, "blocks" : 30,
"bfactor" : 8, "bsleep" : 100,
"affine_to_cpu" : false, "sync_mode" : 3,
},
HASHRATE REPORT - NVIDIA
| ID | 10s | 60s | 15m |
| 0 | 414.4 | 413.3 | (na) |
Totals (NVIDIA): 414.4 413.3 0.0 H/s
-----------------------------------------------------------------
Totals (ALL): 414.4 413.3 0.0 H/s
Best config I could find:
{ "index" : 0,
"threads" : 32, "blocks" : 20,
"bfactor" : 8, "bsleep" : 100,
"affine_to_cpu" : false, "sync_mode" : 3,
},
HASHRATE REPORT - NVIDIA
| ID | 10s | 60s | 15m |
| 0 | 457.6 | 458.2 | (na) |
Totals (NVIDIA): 457.6 458.2 0.0 H/s
-----------------------------------------------------------------
Totals (ALL): 457.6 458.2 0.0 H/s
@psychocrypt it appears the auto-config for GTX-10xx GPUs doesn't work very well. In both of my tests, the best hashrate was found by setting "blocks" to SMX times 2 and "threads" : 32. Perhaps this should be the auto-config for CNv8?
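The suggestion above can be written down as a tiny helper. This is only a sketch of the heuristic as described in this comment (32 threads, blocks equal to twice the SMX count); the function name and its defaults are hypothetical and it is not xmr-stak's actual auto-config logic.

```python
def suggest_cuda_launch(smx: int, threads: int = 32, blocks_per_smx: int = 2) -> dict:
    """Heuristic from this comment: "threads" : 32, "blocks" : SMX * 2."""
    return {"threads": threads, "blocks": smx * blocks_per_smx}


# SMX counts taken from the auto-generated config comments in this thread.
for name, smx in [("GTX 1050", 5), ("GTX 1060 6GB", 10)]:
    print(name, suggest_cuda_launch(smx))
```

For these two cards the helper reproduces the "best config I could find" entries (32/10 and 32/20); other architectures may behave differently.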
Thanks for the tests. With the auto adjustment I am currently not sure what is best. I will try your suggestion. Since the memory access patterns used in cn8 are different from cn7, I first need enough feedback like yours to decide what we choose for the auto config.
OS: Win7
GPU: GTX-970 (+200 core, +200 mem = 1470 core, 3700 mem)
CNv7 (CUDA): 480 h/s
CNv8 (CUDA): 383 h/s (79.7%)
Best config:
// gpu: GeForce GTX 970 architecture: 52
// memory: 3884/4096 MiB
// smx: 13
{ "index" : 0,
"threads" : 4, "blocks" : 104,
"bfactor" : 10, "bsleep" : 100,
"affine_to_cpu" : false, "sync_mode" : 3,
},
@psychocrypt With this GPU, it appears your auto-config gave the best performance. Actually, the auto-config had "bfactor" : 6, "bsleep" : 25 and gave 397 h/s, but I bumped it up to 10/100 to reduce lag in Windows (the monitor runs off this GPU).
The whole rig is:
0: GTX970
1: GTX1050
2: GTX1050
CNv7 (CUDA): 1100 h/s
CNv8 (CUDA): 912 h/s
Config:
// gpu: GeForce GTX 970 architecture: 52
// memory: 3884/4096 MiB
// smx: 13
{ "index" : 0,
"threads" : 4, "blocks" : 104,
"bfactor" : 10, "bsleep" : 100,
"affine_to_cpu" : false, "sync_mode" : 3,
},
// gpu: GeForce GTX 1050 architecture: 61
// memory: 1913/2048 MiB
// smx: 5
{ "index" : 1,
"threads" : 32, "blocks" : 10,
"bfactor" : 8, "bsleep" : 100,
"affine_to_cpu" : false, "sync_mode" : 3,
},
// gpu: GeForce GTX 1050 architecture: 61
// memory: 1913/2048 MiB
// smx: 5
{ "index" : 2,
"threads" : 32, "blocks" : 10,
"bfactor" : 8, "bsleep" : 100,
"affine_to_cpu" : false, "sync_mode" : 3,
},
Threads:
382, 265, 265 = 912
This is actually not bad considering I couldn't get more than one gpu to work at a time with OpenCL. At least this rig won't be completely useless with CNv8 now. 😆
Actually @psychocrypt, this config gave the same performance with that GTX-970 (382 h/s):
{ "index" : 0,
"threads" : 13, "blocks" : 39,
"bfactor" : 10, "bsleep" : 100,
"affine_to_cpu" : false, "sync_mode" : 3,
},
Retested my Win10 GTX-1050 and 1060:
GTX-1050: 290.8 h/s
Config:
{ "index" : 0,
"threads" : 12, "blocks" : 40,
"bfactor" : 8, "bsleep" : 100,
"affine_to_cpu" : false, "sync_mode" : 3,
},
GTX-1060 6GB: 459.3 h/s
Config:
{ "index" : 0,
"threads" : 10, "blocks" : 80,
"bfactor" : 8, "bsleep" : 100,
"affine_to_cpu" : false, "sync_mode" : 3,
},
@psychocrypt 🤷‍♂️
New best results with Win7 rig:
CNv7 total: 1100 h/s
CNv8:
GTX-970: 393 h/s
GTX1050: 285 h/s (times 2)
Total: 963 h/s (87.5%)
// gpu: GeForce GTX 970 architecture: 52
// memory: 3884/4096 MiB
// smx: 13
{ "index" : 0,
"threads" : 5, "blocks" : 104,
"bfactor" : 10, "bsleep" : 100,
"affine_to_cpu" : false, "sync_mode" : 3,
},
// gpu: GeForce GTX 1050 architecture: 61
// memory: 1913/2048 MiB
// smx: 5
{ "index" : 1,
"threads" : 11, "blocks" : 40,
"bfactor" : 8, "bsleep" : 100,
"affine_to_cpu" : false, "sync_mode" : 3,
},
// gpu: GeForce GTX 1050 architecture: 61
// memory: 1913/2048 MiB
// smx: 5
{ "index" : 2,
"threads" : 11, "blocks" : 40,
"bfactor" : 8, "bsleep" : 100,
"affine_to_cpu" : false, "sync_mode" : 3,
},
I added asm code for double hash to my PR
newest set of CPU patches builds and runs fine on Windows (CPU only tested)
OS: Win10
CPU: i3-7350k @ 4.5 GHz (4MB L3)
CNv7: 165 h/s
CNv8 (asm : off): 139 h/s
CNv8 (asm : intel): 151+ h/s
CNv8 (asm : ryzen): ~144 h/s
"cpu_threads_conf" :
[
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 0 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 2 },
],
P.S. This is my everyday pc with a bunch of stuff running in the background (Chrome, etc...). The CNv7 hashrate is without anything running, but the CNv8 had all that running, so not a fair comparison.
Win10
CPU: AMD 1950X (15 cores, core 0 commented out)
cn7: 1290hs
cn8: 1225hs (asm:off)
cn8: 1270hs (asm:ryzen)
(corrected typos...)
Great that CPU mining will still be good with cn8 and nearly the same with the asm code (great work). Curious about Vega results. I'm seeing the similar hash drops in cn8 as with other cn-heavy variants - about 1400hs vs 1900+hs. I was hopeful since CPU mining was nearly the same that Vegas would still be strong but that doesn't seem to be the case with cn8.
@toynn
cn7: 1290hs
cn8: 1925hs (asm:off)
cn8: 1970hs (asm:ryzen)
Is there a typo in cn7 number? Was it supposed to be 1990 H/s? Vega should be fine with cn8 (at least with Cast XMR): https://github.com/SChernykh/xmr-stak-cpu/issues/1#issuecomment-425674350 - you probably didn't use the optimal config for xmr-stak. Try different combinations of worksize = 16/32 and unroll = 8/16, as well as running 1 or 2 GPU threads with different or same intensities.
OS: Ubuntu 16.04
Backend: CPU Only
CPU 4x Xeon E7-8837's in a HP DL580 G7
cryptonight_v7: 1624 H/s - 100%
cryptonight_v8: 1522 H/s - 93.7%
Miner config:
"cpu_threads_conf" :
[
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "intel", "affine_to_cpu" : 0 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "intel", "affine_to_cpu" : 1 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "intel", "affine_to_cpu" : 2 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "intel", "affine_to_cpu" : 3 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 4 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 5 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 6 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 7 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "intel", "affine_to_cpu" : 8 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "intel", "affine_to_cpu" : 9 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "intel", "affine_to_cpu" : 10 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "intel", "affine_to_cpu" : 11 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 12 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 13 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 14 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 15 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "intel", "affine_to_cpu" : 16 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "intel", "affine_to_cpu" : 17 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "intel", "affine_to_cpu" : 18 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "intel", "affine_to_cpu" : 19 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 20 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 21 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 22 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 23 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "intel", "affine_to_cpu" : 24 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "intel", "affine_to_cpu" : 25 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "intel", "affine_to_cpu" : 26 },
{ "low_power_mode" : true, "no_prefetch" : false, "asm" : "intel", "affine_to_cpu" : 27 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 28 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 29 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 30 },
{ "low_power_mode" : false, "no_prefetch" : true, "asm" : "intel", "affine_to_cpu" : 31 },
],
OS: Ubuntu 16.04
Backend: 8x Nvidia GTX 750 1GB
CUDA V7
cryptonight_v7 1x GTX 750 1GB: varies from 228 to 249 H/s for each card
cryptonight_v7 8x GTX 750 1GB: 1897 H/s - 100%
OpenCL V8
cryptonight_v8 1x GTX 750 1GB: varies from 158 to 182 H/s for each card
cryptonight_v8 8x GTX 750 1GB: 1382 H/s - 72.8%
https://github.com/fireice-uk/xmr-stak/issues/1851#issuecomment-424208972
OpenCL (AMD.TXT) Config for each GTX 750 GPU:
// gpu: GeForce GTX 750 memory:848
// compute units: 4
{ "index" : 0-7,
"intensity" : 352, "worksize" : 8,
"affine_to_cpu" : false, "strided_index" : 0, "mem_chunk" : 0,
"unroll" : 2, "comp_mode" : false
}
CUDA V8
cryptonight_v8 1x GTX 750 1GB: varies from 137 to 158 H/s for each card
cryptonight_v8 8x GTX 750 1GB: 1178 H/s - 62.1%
CUDA (NVIDIA.TXT) Config for each GTX 750 GPU:
// gpu: GeForce GTX 750 architecture: 50
// memory: 836/976 MiB
// smx: 4
{ "index" : 0-7,
"threads" : 32, "blocks" : 8,
"bfactor" : 2, "bsleep" : 0,
"affine_to_cpu" : false, "sync_mode" : 3,
},
The above CUDA Nvidia Configuration is the best that could be obtained by changing the "threads" and "blocks" numbers.
The AUTO defined numbers of "threads": 4, "blocks":32 only produces 822 H/s (98 to 106 H/s for each card)
Other configurations tried:
"threads":32, "blocks":12 produces 1059 H/s (123 to 138 H/s for each card)
"threads":40, "blocks":8 produces 1173 H/s (138 to 154 H/s for each card)
Trying T64/B4, T44/B8, T24/B8 all produced worse results.
Bathmat's suggestion of setting "blocks" to SMX times 2 and "threads" : 32 in place of the AUTO defined numbers works out best here as well.
https://github.com/fireice-uk/xmr-stak/issues/1851#issuecomment-425781831
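To make the comparison above easier to follow, here is a small sketch that tabulates the 8-card totals reported in this comment and applies the SMX times 2 rule of thumb. The function name suggested_launch is hypothetical, and the rule is only what worked on these cards, not a general recommendation.

```python
# 8-card totals reported above: (threads, blocks) -> H/s
results = {
    (4, 32): 822,    # AUTO defined numbers
    (32, 12): 1059,
    (40, 8): 1173,
    (32, 8): 1178,
}

def suggested_launch(smx, threads=32, blocks_per_smx=2):
    """Rule of thumb from this thread: blocks = SMX * 2, threads = 32."""
    return {"threads": threads, "blocks": smx * blocks_per_smx}

# Pick the configuration with the highest total hashrate.
best = max(results, key=results.get)
# Gain of the best configuration over the AUTO defined numbers, in percent.
gain = round(100.0 * (results[best] / results[(4, 32)] - 1), 1)

print(best, results[best], f"+{gain}% over AUTO")
print(suggested_launch(4))  # the GTX 750 reports smx: 4
```

For the GTX 750 (smx: 4) the rule gives the same "threads" : 32, "blocks" : 8 that came out on top in the manual tests.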
After all this the best solution for my 50+ GTX 750/750 Ti mining operation is to run OpenCL and not CUDA because CUDA V8 is 14.8% worse than OpenCL V8 currently.
Still, a loss of 27.2% for V8 on OpenCL vs V7 kinda sucks, as other GPUs seem to lose only 3-7% going from V7 to V8.
@kio3i0j9024vkoenio I also got similar performance by setting "threads" : 10-12 and using "blocks" : SMX times 8, but my oldest GPU is a 970.
Edit: actually the best result with my 970 was "threads" : 5 and "blocks" : SMX times 8
Curious about Vega results. I'm seeing hash drops in cn8 similar to those with other cn-heavy variants - about 1400hs vs 1900+hs. I was hopeful, since CPU mining was nearly the same, that Vegas would still be strong, but that doesn't seem to be the case with cn8.
I'm about to buy a Vega just so I can test it, lol. Only problem is that it might take a week to get, so I'm SOL if its performance with v8 is bad, haha.
Edit, just saw this, so I guess it's fine.... Might as well pull the trigger!
In my above posts I tested XMR-STAK V8 for CPU only (4x Xeon E7-8837's: 1522 H/s) and GPU only (8x GTX 750 1GB OpenCL: 1382 H/s). Adding them together gives 2904 H/s, or 82.5% of the 3521 H/s I was getting for V7.
However, when running CPU and GPU OpenCL together on XMR-STAK the actual total is only 2740 H/s, a drop of 164 H/s.
The best I was able to obtain running both CPU and GPU OpenCL is 2845 H/s, by disabling a single thread (core #7 on CPU #0) and then setting all of the GPUs to "affine_to_cpu" : 7 in the AMD.txt config, which gained back 105 H/s.
So my results for my HP DL580 G7 with 8x Nvidia GTX 750's are:
V7: 3521 H/s - 100%
V8: 2845 H/s - 80.8%
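For anyone who wants to check the arithmetic or plug in their own numbers, a minimal sketch using only the figures from this comment (the variable names and the percent_of helper are mine, not from xmr-stak):

```python
def percent_of(value, baseline):
    """Return value as a percentage of baseline, rounded to one decimal."""
    return round(100.0 * value / baseline, 1)

v7_total = 3521      # H/s, CPU + GPU on V7
cpu_only_v8 = 1522   # H/s, 4x Xeon E7-8837
gpu_only_v8 = 1382   # H/s, 8x GTX 750 via OpenCL
combined_v8 = 2740   # H/s, both backends running at once
tuned_v8 = 2845      # H/s, after freeing core 7 for the GPUs

expected_sum = cpu_only_v8 + gpu_only_v8
lost_when_combined = expected_sum - combined_v8
regained_by_tuning = tuned_v8 - combined_v8

print(expected_sum, percent_of(expected_sum, v7_total))
print(lost_when_combined, regained_by_tuning)
print(percent_of(tuned_v8, v7_total))
```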
@toynn
cn7: 1290hs
cn8: 1925hs (asm:off)
cn8: 1970hs (asm:ryzen)
Is there a typo in the cn7 number? Was it supposed to be 1990 H/s? Vega should be fine with cn8 (at least with Cast XMR): SChernykh/xmr-stak-cpu#1 (comment) - you probably didn't use the optimal config for xmr-stak. Try different combinations of worksize = 16/32 and unroll = 8/16, as well as running 1 or 2 GPU threads with different or the same intensities.
@SChernykh Yes, typos - too much Vega on my mind I guess.
cn8: 1225hs (asm:off)
cn8: 1270hs (asm:ryzen)
Corrected the post, and thank you for catching that. I'll test my Vegas again and report back.
I've been testing at http://killallasics.moneroworld.com:7777/ but I keep getting "Result rejected" (low difficulty). Difficulty starts at 10000 and adjusts lower. Not sure if this is expected?
@toynn It's not expected, are you sure you're running the correct algorithm? Double check your pools.txt config.
cn7 on R9 Fury stock ROCM 1.9 - "intensity" : 896, "worksize" : 32, "strided_index" : 2, "mem_chunk" : 2, "unroll" : 8: 690 H/s
cn8 on R9 Fury stock ROCM 1.9 - "intensity" : 896, "worksize" : 32, "strided_index" : 2, "mem_chunk" : 2, "unroll" : 8: 434 H/s
cn7 on Vega 56 stock ROCM 1.9 - 2 threads - "intensity" : 1792, "worksize" : 32, "strided_index" : 2, "mem_chunk" : 2, "unroll" : 8: 800 + 800 -> 1600 H/s
cn8 on Vega 56 stock ROCM 1.9 - 2 threads - "intensity" : 1792, "worksize" : 32, "strided_index" : 2, "mem_chunk" : 2, "unroll" : 8: 476 + 476 -> 953 H/s
Fury and Vega are seeing a significant drop in H/s.
@enerc Why 476+476? People managed to get much better numbers for Vega before.
@SChernykh I don't know. During the test the GPU is pulling 190W according to the rocm-smi monitor, running SCLK 1630 MHz and MCLK 945 MHz, temp 52°C. I tried different worksize/mem_chunk/unroll combinations and they have little impact.
@enerc You need much higher intensity for Vega 56/64 to achieve good numbers.
Monero is changing its POW in October 2018. Please test the implementation of the new algorithm against the test pool (http://killallasics.moneroworld.com/)
You can find the source code of xmr-stak in pull request #1850 or download the zipped source directly.
Please report here only the speed comparison between cryptonight_v7 and cryptonight_v8. If you find any bugs please report them in pull request #1850. Please also take the time to mine for a few minutes against the testnet pool to check that you do not get invalid results.
How to bench the system:
Please start the miner once with ./xmr-stak to create pools.txt and all other config files. Change cryptonight_v8 into cryptonight_v7 to measure the performance of the current Monero POW. Please do not forget to remove the backend configs if you switch the algorithm, because "strided_index" : 1 is not allowed for cryptonight_v8.
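The switch between algorithms described above could be scripted roughly as follows. This is only a sketch under assumptions: the backend config file names cpu.txt, amd.txt and nvidia.txt are assumed, the algorithm name is assumed to appear as a quoted string in pools.txt, and both helper functions are made up for illustration.

```python
from pathlib import Path

# Assumed backend config file names; adjust to what your build generates.
BACKEND_CONFIGS = ("cpu.txt", "amd.txt", "nvidia.txt")

def switch_algorithm(pools_text, old, new):
    """Replace the quoted algorithm name old with new in the pools.txt text."""
    return pools_text.replace(f'"{old}"', f'"{new}"')

def remove_backend_configs(directory):
    """Delete existing backend configs so the miner regenerates them.

    Returns the names of the files that were actually removed.
    """
    removed = []
    for name in BACKEND_CONFIGS:
        path = Path(directory) / name
        if path.exists():
            path.unlink()
            removed.append(name)
    return removed

# Example on an in-memory line instead of a real pools.txt:
line = '"currency" : "cryptonight_v8",'
print(switch_algorithm(line, "cryptonight_v8", "cryptonight_v7"))
```

Removing the backend configs forces the miner to generate fresh ones on the next start, which avoids carrying over settings that are not valid for the other algorithm.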
CPU:
CUDA/AMD OpenCL:
CUDA is currently not supported. I am currently trying to get some performance out of it.
NVIDIA via OpenCL
Template for speed reporting: