Open hvesalai opened 6 years ago
That's not news. For some tasks virtual HT threads do provide gains; for others they don't.
Also, in some cases the CPU cache benefits are bigger than the gains from starting an additional thread.
If I run my operations with the default OMP_NUM_THREADS (i.e. not being set), operations take up to 10 times more clock time than when using just 1 core.
What's your box, actually? Our default number of threads is usually 1/2 of Runtime.availableProcessors(). Mind adding some details?
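For reference, the default described above can be sketched like this (a minimal, hedged sketch of the heuristic; the actual libnd4j/ND4J logic may differ):

```java
public class DefaultThreads {
    public static void main(String[] args) {
        // Number of logical processors visible to the JVM (includes HT siblings).
        int logical = Runtime.getRuntime().availableProcessors();

        // Heuristic default described in this thread: half the logical count,
        // i.e. roughly one thread per physical core on a hyper-threaded CPU.
        int defaultThreads = Math.max(1, logical / 2);

        System.out.println(defaultThreads);
    }
}
```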
I'll have to ask the admins. Here's the last entry from cpuinfo if that is enough.
processor : 11
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
stepping : 2
microcode : 0x3a
cpu MHz : 1200.775
cache size : 15360 KB
physical id : 0
siblings : 12
core id : 5
cpu cores : 6
apicid : 11
initial apicid : 11
fpu : yes
fpu_exception : yes
cpuid level : 15
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb intel_ppin tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts
bugs :
bogomips : 6984.43
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
And the other host has
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 158
model name : Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
stepping : 9
microcode : 0x84
cpu MHz : 800.000
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 3
cpu cores : 4
apicid : 7
initial apicid : 7
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips : 7200.00
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
If 1/2 is the default, what could cause OMP_NUM_THREADS=4 to behave differently from having nothing defined on the box with 8 cores?
So, your i7 has 4 cores, and your Xeon has 6 cores. You don't have an 8-core box.
Now please tell me, what are the actual ops you're referring to?
Also, please tell me: are you able to run snapshots on your box?
I must say I'm not a hardware guy, so I'll take your word for the number of cores. My knowledge of the CPUs is limited to what cat /proc/cpuinfo gives.
The computation I'm running consists of approximately 1200 steps composed of the following operations:
Yes, I think I've been able to run snapshots before. I use sbt so that might limit my ability somewhat.
Can you please exclude mmul from your operations and run your tests again? mmul is the only thing in your list that should use MKL, I think.
I'll replace them with Nd4j.ones.
Better: Nd4j.createUninitialized(). It'll be just an allocation, without anything else.
Also, please tell me what's your data dimensionality. Maybe you could also post the code you're using, so i could profile it in java and c++?
I changed mmul to createUninitialized. Now the i7 machine with the default OMP_NUM_THREADS fluctuates between 8-25ms per cycle, clearly faster than the 100ms it was with mmul. If I set OMP_NUM_THREADS=1, the cycle time is 9-10ms, which is the same as with mmul.
As for the dimensionality of mmul operations: here's a breakdown of the counts and dimensions. As you can see, there are some basic matrix vector multiplications and then some vector vector outer products.
count result a b
32 (128, 1) <- (128, 155) * (155, 1)
24 (155, 1) <- (155, 128) * (128, 1)
24 (128, 165) <- (128, 1) * (1, 165)
24 (128, 1) <- (128, 165) * (165, 1)
20 (165, 1) <- (165, 128) * (128, 1)
17 (128, 155) <- (128, 1) * (1, 155)
15 (128, 155) <- (128, 1) * (1, 155)
15 (128, 1) <- (128, 384) * (384, 1)
14 (384, 1) <- (384, 128) * (128, 1)
14 (128, 384) <- (128, 1) * (1, 384)
6 (37, 1) <- (37, 128) * (128, 1)
6 (37, 128) <- (37, 1) * (1, 128)
6 (128, 1) <- (128, 37) * (37, 1)
1 (384, 1) <- (384, 128) * (128, 1)
1 (128, 384) <- (128, 1) * (1, 384)
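As a rough sanity check (my own back-of-the-envelope sketch, not from the thread): each of these products is tiny, on the order of tens of kFLOPs, small enough that OpenMP thread start-up and synchronization overhead can easily dominate the actual arithmetic.

```java
public class MmulFlops {
    // Approximate FLOP count for an (m,k) x (k,n) product: one multiply
    // and one add per inner-dimension step for each output element.
    static long flops(long m, long k, long n) {
        return 2L * m * k * n;
    }

    public static void main(String[] args) {
        // Largest mat-vec in the list: (128, 384) * (384, 1)
        System.out.println(flops(128, 384, 1)); // 98304 FLOPs
        // A typical outer product: (128, 1) * (1, 155)
        System.out.println(flops(128, 1, 155)); // 39680 FLOPs
    }
}
```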
I'm sorry that I can't share the actual computation code that you could profile. However, to give you some background, the ND4J interfacing part of the code is basically a simple loop executing instructions from a list of operations provided by an external system. Very similar to what I imagine DL4J's autodiff is doing.
As pseudocode the input list could look something like this:
0 argument(1)
1 argument(2)
2 mmul(0,1)
3 argument(3)
4 add(2,3)
Where I then map the instructions as:
argument(n) -> Nd4j.create(....)
mmul(a,b) -> a.mmul(b)
add(a,b) -> a.add(b)
(fixed typos in above comment... most notably mmu -> mmul)
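A dependency-free sketch of such an interpreter loop (my own illustration: plain double[][] arrays stand in for INDArray so it runs without ND4J, and the slot numbers mirror the pseudocode above):

```java
import java.util.ArrayList;
import java.util.List;

public class OpInterpreter {
    // mmul(a, b): naive (m,k) x (k,n) matrix product.
    static double[][] mmul(double[][] a, double[][] b) {
        int m = a.length, k = a[0].length, n = b[0].length;
        double[][] c = new double[m][n];
        for (int i = 0; i < m; i++)
            for (int p = 0; p < k; p++)
                for (int j = 0; j < n; j++)
                    c[i][j] += a[i][p] * b[p][j];
        return c;
    }

    // add(a, b): element-wise sum of two same-shaped matrices.
    static double[][] add(double[][] a, double[][] b) {
        int m = a.length, n = a[0].length;
        double[][] c = new double[m][n];
        for (int i = 0; i < m; i++)
            for (int j = 0; j < n; j++)
                c[i][j] = a[i][j] + b[i][j];
        return c;
    }

    public static void main(String[] args) {
        List<double[][]> slots = new ArrayList<>();
        slots.add(new double[][]{{1.0, 2.0}});          // 0 argument(1): 1x2 row
        slots.add(new double[][]{{3.0}, {4.0}});        // 1 argument(2): 2x1 column
        slots.add(mmul(slots.get(0), slots.get(1)));    // 2 mmul(0,1)
        slots.add(new double[][]{{5.0}});               // 3 argument(3): 1x1
        slots.add(add(slots.get(2), slots.get(3)));     // 4 add(2,3)
        System.out.println(slots.get(4)[0][0]);         // 1*3 + 2*4 + 5 = 16.0
    }
}
```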
So, in other words, something is wrong with the MKL configuration at runtime. We had a few lines of code for that purpose in libnd4j; I'm not sure if we still use them or not. Let me add some logging and get back to you.
@saudet back in the day we had a few custom calls in libnd4j to fine-tune MKL at runtime, but now I can't find them. Were they removed?
Yes, they're part of the presets for OpenBLAS here: https://github.com/bytedeco/javacpp-presets/blob/master/openblas/src/main/java/org/bytedeco/javacpp/presets/openblas_nolapack.java#L155
We should investigate them once again, probably something changed in MKL, so we should update those methods.
While debugging this I noticed that using ND4J results in a wild number of very short-lived threads being created at the OS level.
That's OpenMP; the actual implementation depends on your OS etc., and with MKL installed it's probably Intel OpenMP. Usually these threads are cached and reused by OpenMP. We create exactly zero pthreads/winthreads ourselves.
OMP_PROC_BIND=true fixes it. I have absolutely no idea what it does, but while googling I hit this Stack Overflow question: https://stackoverflow.com/questions/24368576/poor-performance-due-to-hyper-threading-with-openmp-how-to-bind-threads-to-core
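For reference, the two environment settings discussed in this thread can be combined like this (a hedged sketch; OMP_PROC_BIND and OMP_NUM_THREADS are standard OpenMP environment variables, and the value 6 is just an example matching the Xeon's physical core count from this thread):

```shell
# Pin OpenMP worker threads to cores so the scheduler does not migrate
# them across hyper-threaded siblings between parallel regions.
export OMP_PROC_BIND=true

# Cap the worker count at the number of *physical* cores (example value);
# the default of one thread per logical CPU oversubscribes HT siblings.
export OMP_NUM_THREADS=6
```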
Fixes what, original issue?
Yes, original issue with mmul being slow with default number of threads.
Not the large number of threads being created and thrashed.
Strange. What's your MKL version?
I'm using mkl-2018.3-1.4.2-linux-x86_64.jar so I guess it's 2018.3 then
So, in both cases you have single-CPU systems. Are you 100% sure you're not running the app in a VM environment?
Well... that would be the hoax of the century. We are renting machines from Hetzner, so I don't have physical access to them and thus I cannot be 100% sure. But let's say 99%.
> facter|grep virtual
is_virtual => false
virtual => physical
I see. Weird.
I have two hosts, one with 12 cores, one with 8.
If I run my operations with the default OMP_NUM_THREADS (i.e. not being set), operations take up to 10 times more clock time than when using just 1 core.
Here is a comparison from the 12 core host:
OMP_NUM_THREADS and rough times per cycle:
1 = 10ms
2 = 8ms
3 = 7ms
4 = 7-10ms (starting to see a lot of variance)
5 = 8-25ms
6 = 40-80ms
7 = 100-200ms
8 = 900-1000ms
On the 8-core host I'm seeing similar times for OMP_NUM_THREADS=1..6, but with the defaults a cycle takes roughly 100ms.
What is especially strange is that when no OMP_NUM_THREADS is set, the cycle time is ~100ms and the INFO at the start is
When setting OMP_NUM_THREADS=4, the cycle time is 9ms and the INFO at the start is
I would have expected the performance to be the same for these two cases.