I have a similar request: I'm not seeing 100% utilization on CPU mining.
Totals: 1032.0 1032.3 (na) H/s Highest: 1033.5 H/s
It appears to be using only 26 threads when 40 are available. Is this a hard-coded limit, or a limit imposed by the box? I've added the memory summary and the output from /proc/cpuinfo for CPU 39 below:
baldpope@zcash-n2:~$ free -m
              total        used        free      shared  buff/cache   available
Mem:         257850         722      256894           9         234      256328
Swap:        262029           0      262029
baldpope@zcash-n2:~$ cat /proc/meminfo
MemTotal:        264038972 kB
MemFree:         263060020 kB
MemAvailable:    262480316 kB
Buffers:             13064 kB
Cached:             124900 kB
SwapCached:              0 kB
Active:             115900 kB
Inactive:            78312 kB
Active(anon):        59340 kB
Inactive(anon):       9016 kB
Active(file):        56560 kB
Inactive(file):      69296 kB
Unevictable:          3660 kB
Mlocked:              3660 kB
SwapTotal:       268318716 kB
SwapFree:        268318716 kB
Dirty:                   0 kB
Writeback:               0 kB
AnonPages:          113420 kB
Mapped:              40304 kB
Shmem:                9688 kB
Slab:               101664 kB
SReclaimable:        32580 kB
SUnreclaim:          69084 kB
KernelStack:          7296 kB
PageTables:           2972 kB
NFS_Unstable:            0 kB
Bounce:                  0 kB
WritebackTmp:            0 kB
CommitLimit:     400338200 kB
Committed_AS:       256488 kB
VmallocTotal:    34359738367 kB
VmallocUsed:             0 kB
VmallocChunk:            0 kB
HardwareCorrupted:       0 kB
AnonHugePages:       81920 kB
CmaTotal:                0 kB
CmaFree:                 0 kB
HugePages_Total:         0
HugePages_Free:          0
HugePages_Rsvd:          0
HugePages_Surp:          0
Hugepagesize:         2048 kB
DirectMap4k:        144020 kB
DirectMap2M:       3936256 kB
DirectMap1G:     266338304 kB
baldpope@zcash-n2:~$ vmstat -s
    264038976 K total memory
       739188 K used memory
       115904 K active memory
        78312 K inactive memory
    263060144 K free memory
        13072 K buffer memory
       226568 K swap cache
    268318720 K total swap
            0 K used swap
    268318720 K free swap
      3274046 non-nice user cpu ticks
            2 nice user cpu ticks
         2662 system cpu ticks
      3904994 idle cpu ticks
          136 IO-wait cpu ticks
            0 IRQ cpu ticks
           30 softirq cpu ticks
            0 stolen cpu ticks
       218741 pages paged in
         2628 pages paged out
            0 pages swapped in
            0 pages swapped out
      8716986 interrupts
       187799 CPU context switches
   1515594017 boot time
         2151 forks
baldpope@zcash-n2:~$ cat /proc/cpuinfo
processor       : 39
vendor_id       : GenuineIntel
cpu family      : 6
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
stepping        : 1
microcode       : 0xb000010
cpu MHz         : 2600.062
cache size      : 25600 KB
physical id     : 1
siblings        : 20
core id         : 12
cpu cores       : 10
apicid          : 57
initial apicid  : 57
fpu             : yes
fpu_exception   : yes
cpuid level     : 20
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single intel_pt kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
bugs            :
bogomips        : 4801.64
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:
This thread on Reddit (https://www.reddit.com/r/MoneroMining/comments/72hmxs/xmrstakcpu_only_using_60_of_cpu) implies a limit based on L3 cache size. Is this correct?
baldpope@zcash-n2:~$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-39
Thread(s) per core:    2
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz
Stepping:              1
CPU MHz:               2599.781
CPU max MHz:           3400.0000
CPU min MHz:           1200.0000
BogoMIPS:              4801.64
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-9,20-29
NUMA node1 CPU(s):     10-19,30-39
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single intel_pt kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
@baldpope - each E5-2640 v4 has 10 physical cores, and you've got a two-socket machine, so 20 physical cores in total (40 logical CPUs with Hyper-Threading).
There's a bug in the CPU autodetect that mishandles cache size detection, so you may not be getting the optimal setup automatically. Until a fix is released, you will want to hand-tune your CPU thread configuration.
The key is that you want 10 threads per socket, one per physical core, with affinity set so that the 25 MB L3 cache on each socket and the memory on each NUMA node stay as hot as possible.
Run the following command and examine the output:
egrep " id|processor" /proc/cpuinfo
This will show you how each Linux processor ID number maps to the physical and Hyperthread cores within your system.
What you'll see on your system is that processors 0-9 sit on physical id 0, processors 10-19 sit on physical id 1, and processors 20-29 and 30-39 are the Hyperthread siblings of those first twenty, repeating the same physical id and core id values as the processor numbers increase.
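For example, a trimmed, hypothetical excerpt of that output on a box like this would look something like the following (the exact core id values on your system will differ):

processor       : 0
physical id     : 0
core id         : 0
processor       : 1
physical id     : 0
core id         : 1
...
processor       : 10
physical id     : 1
core id         : 0
...
processor       : 20
physical id     : 0
core id         : 0
...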
Looking at the lscpu output above, there's a clue to that in the NUMA node lines near the end: node0 is socket 0, and node1 is socket 1. This makes sense - each socket has a bank of memory that is closest to it.
Your cache is big enough at 25 MB per socket that it's not a bottleneck for the 2 MB scratchpads needed by the Cryptonight algorithm. If it were 16 MB, for example, you wouldn't be able to fit 10 of them in the fast L3 cache, so you'd only want to run 7 threads per socket, not 10 (you need to leave some L3 room for the network stack, etc.).
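As a quick back-of-the-envelope check (a sketch only, using the lscpu numbers above and the roughly 2 MB Cryptonight scratchpad size):

# 25600 KB of L3 per socket / 2048 KB per scratchpad
$ echo $(( 25600 / 2048 ))
12

So roughly 12 scratchpads fit in each socket's L3, comfortably more than the 10 physical cores per socket.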
So since you don't have to worry about cache size, you are going to want 10 threads on each socket, one mapped to each physical core on the system. Try this configuration and see if your hashrate increases:
"cpu_threads_conf" :
[
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 0 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 1 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 2 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 3 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 4 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 5 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 6 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 7 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 8 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 9 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 10 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 11 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 12 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 13 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 14 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 15 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 16 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 17 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 18 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 19 },
],
The first ten and last ten entries of the config represent NUMA nodes 0 and 1, respectively. The mapping would be different under Windows, and I think on Dell servers under Linux - odd/even rather than sequential - so the above grouping is for illustrative purposes only. Everyone needs to evaluate their own hardware and OS layout.
With a 25 MB cache you can fit up to 12 Cryptonight scratchpads in L3 before they start stepping on each other, so it could be that xmr-stak threads on the Hyperthread siblings interleave well enough while waiting on memory access, even from L3, to avoid jostling each other on the physical core too much. If you want to try that, just add four more threads with affinity to CPUs 20 and 21 for socket 0's physical cores 0 and 1, and 30 and 31 for socket 1's physical cores 0 and 1, as sketched below.
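The four extra entries inside "cpu_threads_conf" would look something like this (illustrative only - double-check the processor-to-core mapping on your own box first):

{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 20 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 21 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 30 },
{ "low_power_mode" : false, "no_prefetch" : true, "affine_to_cpu" : 31 },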
The principle here is basically the same as with @ldong 's GPUs: in your case the limiting factor is the number of physical processor cores on the system rather than the size of the L3 cache, while @ldong has far more than enough GPU memory to satisfy all of the available compute threads on the GPU, with plenty of memory left over.
So for @ldong - yes, your GPUs are fully utilized; there's no more room for additional hashrate in the processors even though there's plenty of memory left over. You should also go through the same evaluation of CPU affinities that I describe above for your own processor, looking at your L3 cache size and NUMA nodes.
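If it helps, one quick way to pull just the relevant numbers on a Linux box (assuming lscpu is available) is:

$ lscpu | egrep 'Socket|Core|L3|NUMA'

which shows the socket count, cores per socket, L3 size, and the NUMA-node-to-CPU mapping you'd use when setting the affinities.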
You cannot use more than 2 GB of the K80's main memory: Kepler GPUs flush the TLB if you do random access across more than 2 GB, and this will reduce your performance.
Thank you both @psychocrypt @mvpel for the clarification and analysis.
@ldong I get exactly the same result as you did. Would you mind telling me whether you've managed to improve it yet, and if so, how?
Basic information
OS: Ubuntu 16.04, AWS EC2 p2.8xlarge
Is my server fully loaded? I see the memory usage is 2115MiB / 11439MiB, which is only about 1/5 of the full memory, though utilization is 100%. I wish there were an option such as --cuda-parallel-hash that could boost or max out the memory usage. By the way, I felt my hashrate could have been better. Please help and clarify. Thanks @psychocrypt @fireice-uk