ldong commented 6 years ago

Basic information

OS: Ubuntu 16.04, AWS EC2 p2.8xlarge

CPU(s):                32
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             1

Wed Jan 10 00:35:01 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 387.26                 Driver Version: 387.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:17.0 Off |                    0 |
| N/A   76C    P0   127W / 149W |   2115MiB / 11439MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | 00000000:00:18.0 Off |                    0 |
| N/A   60C    P0   138W / 149W |   2115MiB / 11439MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | 00000000:00:19.0 Off |                    0 |
| N/A   78C    P0   118W / 149W |   2115MiB / 11439MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | 00000000:00:1A.0 Off |                    0 |
| N/A   62C    P0   120W / 149W |   2115MiB / 11439MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla K80           Off  | 00000000:00:1B.0 Off |                    0 |
| N/A   82C    P0   128W / 149W |   2115MiB / 11439MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla K80           Off  | 00000000:00:1C.0 Off |                    0 |
| N/A   61C    P0   127W / 149W |   2115MiB / 11439MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla K80           Off  | 00000000:00:1D.0 Off |                    0 |
| N/A   82C    P0   129W / 149W |   2115MiB / 11439MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla K80           Off  | 00000000:00:1E.0 Off |                    0 |
| N/A   67C    P0   129W / 149W |   2115MiB / 11439MiB |    100%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     64734      C   ./bin/xmr-stak                              2093MiB |
|    1     64734      C   ./bin/xmr-stak                              2093MiB |
|    2     64734      C   ./bin/xmr-stak                              2093MiB |
|    3     64734      C   ./bin/xmr-stak                              2093MiB |
|    4     64734      C   ./bin/xmr-stak                              2093MiB |
|    5     64734      C   ./bin/xmr-stak                              2093MiB |
|    6     64734      C   ./bin/xmr-stak                              2093MiB |
|    7     64734      C   ./bin/xmr-stak                              2093MiB |
+-----------------------------------------------------------------------------+

Is my server fully loaded? I see the memory usage is 2115MiB / 11439MiB, which is only about 1/5 of the available memory, even though utilization is 100%.

I wish there were an option, such as --cuda-parallel-hash, that could boost or max out the memory usage.

By the way, my hashrate is

HASHRATE REPORT - CPU
| ID |    10s |    60s |    15m | ID |    10s |    60s |    15m |
|  0 |   31.9 |   32.1 |   32.1 |  1 |   30.5 |   30.8 |   30.8 |
|  2 |   33.9 |   34.2 |   34.1 |  3 |   28.6 |   28.9 |   28.9 |
|  4 |   31.6 |   31.9 |   31.9 |  5 |   32.9 |   33.0 |   33.1 |
|  6 |   31.0 |   31.1 |   31.1 |  7 |   36.4 |   36.8 |   36.7 |
|  8 |   36.9 |   37.1 |   37.1 |  9 |   34.3 |   34.6 |   34.5 |
| 10 |   31.9 |   32.1 |   32.1 | 11 |   30.9 |   31.2 |   31.2 |
| 12 |   37.8 |   38.0 |   38.0 | 13 |   36.3 |   36.6 |   36.6 |
| 14 |   36.9 |   37.2 |   37.1 | 15 |   32.0 |   32.2 |   32.1 |
| 16 |   29.7 |   30.0 |   29.9 | 17 |   32.8 |   33.0 |   33.0 |
| 18 |   32.2 |   32.5 |   32.5 | 19 |   31.2 |   31.5 |   31.5 |
| 20 |   33.9 |   34.1 |   34.1 | 21 |   33.0 |   33.0 |   33.2 |
| 22 |   31.0 |   31.1 |   31.1 |
-----------------------------------------------------
HASHRATE REPORT - NVIDIA
| ID |    10s |    60s |    15m | ID |    10s |    60s |    15m |
|  0 |  470.8 |  470.8 |  470.1 |  1 |  447.6 |  446.7 |  447.0 |
|  2 |  466.6 |  465.8 |  465.9 |  3 |  474.1 |  473.2 |  473.0 |
|  4 |  471.3 |  471.2 |  470.6 |  5 |  474.2 |  472.4 |  471.5 |
|  6 |  467.8 |  465.4 |  465.4 |  7 |  466.4 |  464.5 |  464.7 |
-----------------------------------------------------
Totals:   4496.4 4492.6 4490.9 H/s
Highest:  4523.9 H/s

But I feel it could do better. Please help and clarify. Thanks @psychocrypt @fireice-uk

baldpope commented 6 years ago

I have a similar request: I'm not seeing 100% utilization on CPU mining.

HASHRATE REPORT - CPU
| ID |    10s |    60s |    15m | ID |    10s |    60s |    15m |
|  0 |   36.6 |   36.6 |   (na) |  1 |   36.8 |   36.8 |   (na) |
|  2 |   36.6 |   36.6 |   (na) |  3 |   41.4 |   41.4 |   (na) |
|  4 |   41.6 |   41.6 |   (na) |  5 |   41.7 |   41.7 |   (na) |
|  6 |   42.4 |   42.4 |   (na) |  7 |   42.4 |   42.4 |   (na) |
|  8 |   42.6 |   42.6 |   (na) |  9 |   42.6 |   42.6 |   (na) |
| 10 |   37.7 |   37.7 |   (na) | 11 |   36.4 |   36.4 |   (na) |
| 12 |   37.4 |   37.4 |   (na) | 13 |   36.2 |   36.1 |   (na) |
| 14 |   37.4 |   37.5 |   (na) | 15 |   37.5 |   37.5 |   (na) |
| 16 |   41.5 |   41.6 |   (na) | 17 |   42.2 |   42.1 |   (na) |
| 18 |   40.6 |   40.6 |   (na) | 19 |   41.0 |   41.1 |   (na) |
| 20 |   42.2 |   42.2 |   (na) | 21 |   42.2 |   42.1 |   (na) |
| 22 |   41.8 |   41.8 |   (na) | 23 |   37.8 |   37.8 |   (na) |
| 24 |   37.4 |   37.5 |   (na) | 25 |   38.1 |   38.1 |   (na) |
Totals:   1032.0 1032.3 (na) H/s
Highest:  1033.5 H/s

Appears to be using only 26 threads when 40 threads are available. Is this a coded limit or a limit reached on the box? I've added the memory summary and the output from /proc/cpuinfo for cpu 39 below:

baldpope@zcash-n2:~$ free -m
              total        used        free      shared  buff/cache   available
Mem:         257850         722      256894           9         234      256328
Swap:        262029           0      262029

baldpope@zcash-n2:~$ cat /proc/meminfo
MemTotal:       264038972 kB
MemFree:        263060020 kB
MemAvailable:   262480316 kB
Buffers:           13064 kB
Cached:           124900 kB
SwapCached:            0 kB
Active:           115900 kB
Inactive:          78312 kB
Active(anon):      59340 kB
Inactive(anon):     9016 kB
Active(file):      56560 kB
Inactive(file):    69296 kB
Unevictable:        3660 kB
Mlocked:            3660 kB
SwapTotal:      268318716 kB
SwapFree:       268318716 kB
Dirty:                 0 kB
Writeback:             0 kB
AnonPages:        113420 kB
Mapped:            40304 kB
Shmem:              9688 kB
Slab:             101664 kB
SReclaimable:      32580 kB
SUnreclaim:        69084 kB
KernelStack:        7296 kB
PageTables:         2972 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    400338200 kB
Committed_AS:     256488 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:     81920 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:      144020 kB
DirectMap2M:     3936256 kB
DirectMap1G:    266338304 kB

baldpope@zcash-n2:~$ vmstat -s
    264038976 K total memory
       739188 K used memory
       115904 K active memory
        78312 K inactive memory
    263060144 K free memory
        13072 K buffer memory
       226568 K swap cache
    268318720 K total swap
            0 K used swap
    268318720 K free swap
      3274046 non-nice user cpu ticks
            2 nice user cpu ticks
         2662 system cpu ticks
      3904994 idle cpu ticks
          136 IO-wait cpu ticks
            0 IRQ cpu ticks
           30 softirq cpu ticks
            0 stolen cpu ticks
       218741 pages paged in
         2628 pages paged out
            0 pages swapped in
            0 pages swapped out
      8716986 interrupts
       187799 CPU context switches
   1515594017 boot time
         2151 forks

baldpope@zcash-n2:~$ cat /proc/cpuinfo
processor       : 39
vendor_id       : GenuineIntel
cpu family      : 6
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
stepping        : 1
microcode       : 0xb000010
cpu MHz         : 2600.062
cache size      : 25600 KB
physical id     : 1
siblings        : 20
core id         : 12
cpu cores       : 10
apicid          : 57
initial apicid  : 57
fpu             : yes
fpu_exception   : yes
cpuid level     : 20
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single intel_pt kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
bugs            :
bogomips        : 4801.64
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

baldpope commented 6 years ago

this thread at reddit ( https://www.reddit.com/r/MoneroMining/comments/72hmxs/xmrstakcpu_only_using_60_of_cpu ) implies a limit based on L3 cache size - is this correct?

baldpope commented 6 years ago

baldpope@zcash-n2:~$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
Stepping:              1
CPU MHz:               2599.781
CPU max MHz:           3400.0000
CPU min MHz:           1200.0000
BogoMIPS:              4801.64
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-9,20-29
NUMA node1 CPU(s):     10-19,30-39
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single intel_pt kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts

mvpel commented 6 years ago

@baldpope - each E5-2640 v4 has 10 physical cores (20 Hyperthreads), so with 40 logical CPUs you've got a two-socket machine.

There's a bug in the CPU autodetect that mishandles the cache size detection, so you may not be getting the optimal setup via autodetect. Until that patch comes out, you will want to hand-tune your CPU threads.

The key is that you want 10 threads per socket to cover all of the physical CPU cores in each processor, with affinity set up so that the 25MB cache on each socket, and the NUMA nodes, all can stay as hot as physically possible.

Run the following command and examine the output:

egrep " id|processor" /proc/cpuinfo

This will show you how each Linux processor ID number maps to the physical and Hyperthread cores within your system.

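What you'll see on your system is that processors 0-9 are the ten physical cores of physical id 0, 10-19 are the ten physical cores of physical id 1, and processors 20-29 and 30-39 are the Hyperthreads for each of the previous 20, repeating the same physical/core id pairs as the processor numbers increase. The core id values themselves need not be contiguous: processor 39 in the /proc/cpuinfo output above reports core id 12.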

Looking at the lscpu output above, there's a clue to that in the NUMA node lines near the end - node0 is socket 0, and node1 is socket 1. This makes sense - each socket has a batch of memory that is closest to it.
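If reading the egrep output by eye gets tedious, the same grouping can be sketched in a few lines of Python. This is only an illustration, not part of xmr-stak: the function name `group_by_core` is made up here, the sample text is a hand-trimmed stand-in rather than real output, and it assumes an x86 Linux /proc/cpuinfo that carries `physical id` and `core id` fields.

```python
from collections import defaultdict

def group_by_core(cpuinfo_text):
    """Map (physical id, core id) -> list of Linux processor numbers."""
    cores = defaultdict(list)
    fields = {}
    # /proc/cpuinfo separates logical processors with blank lines;
    # the extra "" makes sure the last record is flushed too.
    for line in cpuinfo_text.splitlines() + [""]:
        if ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
        elif not line.strip() and "processor" in fields:
            key = (int(fields["physical id"]), int(fields["core id"]))
            cores[key].append(int(fields["processor"]))
            fields = {}
    return dict(cores)

# Hand-trimmed, illustrative sample (not real output from the box above).
SAMPLE = """\
processor : 0
physical id : 0
core id : 0

processor : 20
physical id : 0
core id : 0

processor : 39
physical id : 1
core id : 12
"""

for (socket, core), cpus in sorted(group_by_core(SAMPLE).items()):
    print(f"socket {socket} core {core}: processors {cpus}")
```

On a real machine you would pass in `open("/proc/cpuinfo").read()` instead of `SAMPLE`. Any entry that lists two processor numbers is a physical core together with its Hyperthread sibling.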

Your cache is big enough at 25MB per socket that it's not a bottleneck for the 2MB scratchpads needed by the Cryptonight algorithm. If it were 16MB, for example, you wouldn't be able to fit 10 of them in the fast L3 cache, and so you'd only want to run 7 threads per socket, not 10 (you need to leave some L3 room for the network stack, etc.)
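The arithmetic behind those numbers can be written down as a rough rule of thumb. The sketch below is not how xmr-stak's autodetect works; `threads_per_socket` is a hypothetical helper, and holding back exactly one scratchpad's worth of L3 for everything else is an assumption chosen because it reproduces the 7-thread figure for a 16MB cache.

```python
SCRATCHPAD_KIB = 2048  # 2MB Cryptonight scratchpad per hashing thread

def threads_per_socket(l3_kib, physical_cores, reserved_scratchpads=1):
    """Rough thread count for one socket.

    l3_kib: L3 cache size of the socket in KiB (lscpu's "L3 cache" value).
    physical_cores: physical cores on that socket.
    reserved_scratchpads: L3 held back for other work (an assumption).
    """
    fit = l3_kib // SCRATCHPAD_KIB  # whole scratchpads that fit in L3
    return max(0, min(physical_cores, fit - reserved_scratchpads))

print(threads_per_socket(25600, 10))  # E5-2640 v4: limited by cores
print(threads_per_socket(16384, 10))  # 16MB example: limited by cache
```

With a 25600K L3 the limit is the 10 physical cores; with 16MB the cache becomes the limit instead. Treat the result as a starting point and confirm it against the measured hashrate.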

So since you don't have to worry about cache size, this means you are going to want 10 threads on each socket, mapped to each of the physical cores on the system. Try this configuration and see if your hashrate increases:

"cpu_threads_conf" :
[
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 0 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 1 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 2 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 3 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 4 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 5 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 6 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 7 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 8 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 9 },

    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 10 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 11 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 12 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 13 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 14 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 15 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 16 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 17 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 18 },
    { "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 19 },
],

The two sections of the config represent NUMA nodes 0 and 1. The mapping would be different under Windows, and I think on Dell servers under Linux - odd/even rather than sequential - so the above grouping is for illustrative purposes only. Everyone needs to evaluate their own hardware and OS layout.
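Since the right CPU numbers differ from machine to machine, it can be easier to generate the block than to type it. Here is a small sketch that prints entries in the same shape as the config above; the helper name `cpu_threads_conf` is invented for this example, and the three field names are simply copied from the config in this post.

```python
def cpu_threads_conf(cpu_ids, low_power_mode=False, no_prefetch=True):
    """Return a cpu_threads_conf block with one thread per CPU id given."""
    lines = ['"cpu_threads_conf" :', "["]
    for cpu in cpu_ids:
        lines.append(
            '    { "low_power_mode" : %s, "no_prefetch" : %s, "affine_to_cpu" : %d },'
            % (str(low_power_mode).lower(), str(no_prefetch).lower(), cpu)
        )
    lines.append("],")
    return "\n".join(lines)

# Processors 0-19: the 20 physical cores in the sequential layout above.
print(cpu_threads_conf(range(0, 20)))
```

For an odd/even layout you would pass a different list, for example `range(0, 40, 2)`, after checking the mapping on your own system.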

With a 25MB cache, you can fit up to 12 Cryptonight scratchpads in L3 cache before they start stepping on each other. So it could be that xmr-stak Hyperthreads might be interleave-able enough while waiting on memory access, even from L3, so as to not jostle each other around on the physical core too much. If you want to try that, just add four more threads with affinity to cores 20 and 21 for socket 0's physical 0 and 1, and 30 and 31 for socket 1's physical 0 and 1.

The principle here is basically the same as on @ldong's GPUs: your limiting factor is the number of physical processor cores on the system rather than the size of the L3 cache, while @ldong has far more than enough GPU memory to satisfy all of the available computing threads on the GPU, with plenty of memory left over.

So for @ldong - yes, your GPUs are fully utilized - there's no more room for additional hashrate in the processors even though there's plenty of memory left over. And you should go through the same evaluation of your CPU affinities as I describe above for your own processor, looking at your L3 cache size and NUMA nodes.

psychocrypt commented 6 years ago

You cannot use more than 2 GB of the K80's main memory. Kepler GPUs flush the TLB if you do random access over more than 2 GB, and that will reduce your performance.
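As a back-of-the-envelope check, here is a sketch of what a 2 GB cap means in scratchpads. It assumes each in-flight hash needs one 2MB Cryptonight scratchpad, as mentioned earlier in this thread; the names are made up for the example, and it does not claim to describe how the miner actually sizes its kernels.

```python
SCRATCHPAD_MIB = 2      # Cryptonight scratchpad size mentioned earlier in this thread
KEPLER_CAP_MIB = 2048   # the 2 GB limit described above

def max_scratchpads(cap_mib=KEPLER_CAP_MIB, scratchpad_mib=SCRATCHPAD_MIB):
    """Number of whole scratchpads that fit under the memory cap."""
    return cap_mib // scratchpad_mib

reported_mib = 2093  # per-GPU usage from the nvidia-smi process list above
pads = max_scratchpads()
print(pads, "scratchpads =", pads * SCRATCHPAD_MIB, "MiB;",
      reported_mib - pads * SCRATCHPAD_MIB, "MiB left for everything else")
```

Under that assumption, the roughly 2.1 GB per GPU shown by nvidia-smi is about what you would expect from a miner already sitting at the cap, which fits the answer that the GPUs are fully loaded.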

ldong commented 6 years ago

Thank you both, @psychocrypt and @mvpel, for the clarification and analysis.

wilbyang commented 6 years ago

@ldong I get exactly the same result as you did. Would you mind telling me whether you've managed to improve it yet, and if so, how?

fireice-uk / xmr-stak

how to max-out memory usage and fully utilized? #851
