deeplearning4j / deeplearning4j

Suite of tools for deploying and training deep learning models using the JVM. Highlights include model import for Keras, TensorFlow, and ONNX/PyTorch, a modular and tiny C++ library for running math code, and a Java-based math library on top of the core C++ library. Also includes SameDiff: a PyTorch/TensorFlow-like library for running deep learn...
http://deeplearning4j.konduit.ai
Apache License 2.0

Using more cores slows down MKL #6183

Open hvesalai opened 6 years ago

hvesalai commented 6 years ago

I have two hosts, one with 12 cores, one with 8.

If I run my operations with the default OMP_NUM_THREADS (i.e. not set at all), they take up to 10 times more wall-clock time than when using just one core.

Here is a comparison from the 12 core host:

OMP_NUM_THREADS and rough times per cycle:

1 = 10ms
2 = 8ms
3 = 7ms
4 = 7-10ms (starting to see a lot of variance)
5 = 8-25ms
6 = 40-80ms
7 = 100-200ms
8 = 900-1000ms

On the 8-core host I'm seeing similar times for OMP_NUM_THREADS=1..6, but with the defaults a cycle takes roughly 100ms.

What is especially strange is that when OMP_NUM_THREADS is not set, the cycle time is ~100ms and the INFO output at startup is

2018-08-16 14:11:46.576 [run-main-0] INFO  org.nd4j.linalg.factory.Nd4jBackend - Loaded [CpuBackend] backend
2018-08-16 14:11:46.905 [run-main-0] INFO  org.nd4j.nativeblas.NativeOpsHolder - Number of threads used for NativeOps: 4
2018-08-16 14:11:46.959 [run-main-0] INFO  org.nd4j.nativeblas.Nd4jBlas - Number of threads used for BLAS: 4
2018-08-16 14:11:46.961 [run-main-0] INFO  o.n.l.a.o.e.DefaultOpExecutioner - Backend used: [CPU]; OS: [Linux]
2018-08-16 14:11:46.961 [run-main-0] INFO  o.n.l.a.o.e.DefaultOpExecutioner - Cores: [8]; Memory: [1.8GB];
2018-08-16 14:11:46.961 [run-main-0] INFO  o.n.l.a.o.e.DefaultOpExecutioner - Blas vendor: [MKL]

When setting OMP_NUM_THREADS=4, the cycle time is 9ms and the INFO output at startup is

2018-08-16 14:09:25.467 [run-main-0] INFO  org.nd4j.linalg.factory.Nd4jBackend - Loaded [CpuBackend] backend
2018-08-16 14:09:25.796 [run-main-0] INFO  org.nd4j.nativeblas.NativeOpsHolder - Number of threads used for NativeOps: 4
2018-08-16 14:09:25.849 [run-main-0] INFO  org.nd4j.nativeblas.Nd4jBlas - Number of threads used for BLAS: 4
2018-08-16 14:09:25.852 [run-main-0] INFO  o.n.l.a.o.e.DefaultOpExecutioner - Backend used: [CPU]; OS: [Linux]
2018-08-16 14:09:25.852 [run-main-0] INFO  o.n.l.a.o.e.DefaultOpExecutioner - Cores: [8]; Memory: [1.8GB];
2018-08-16 14:09:25.852 [run-main-0] INFO  o.n.l.a.o.e.DefaultOpExecutioner - Blas vendor: [MKL]

I would have expected the performance to be the same for these two cases.

raver119 commented 6 years ago

That's not news. For some tasks virtual HT threads do provide gains; for others they don't.

Also, in some cases the CPU cache benefits are bigger than the gains from starting an additional thread.

raver119 commented 6 years ago

> If I run my operations with the default OMP_NUM_THREADS (i.e. not set at all), they take up to 10 times more wall-clock time than when using just one core.

What's your box, actually? The default number of threads for us is usually 1/2 of Runtime.availableProcessors(). Mind adding some details?
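For reference, that default works out to something like this (a minimal sketch, not the actual libnd4j code; note that availableProcessors() counts HT siblings, not physical cores):

// Sketch only: the default described above - half of the JVM-reported
// logical processor count, which includes hyper-threading siblings.
public class DefaultOmpThreads {
    public static void main(String[] args) {
        int logical = Runtime.getRuntime().availableProcessors(); // 12 on the Xeon, 8 on the i7
        int defaultThreads = Math.max(1, logical / 2);
        System.out.println("logical processors = " + logical
                + ", default OpenMP threads = " + defaultThreads);
    }
}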

hvesalai commented 6 years ago

I'll have to ask the admins. Here's the last entry from /proc/cpuinfo, if that is enough.

processor       : 11
vendor_id       : GenuineIntel
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz
stepping        : 2
microcode       : 0x3a
cpu MHz         : 1200.775
cache size      : 15360 KB
physical id     : 0
siblings        : 12
core id         : 5
cpu cores       : 6
apicid          : 11
initial apicid  : 11
fpu             : yes
fpu_exception   : yes
cpuid level     : 15
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb intel_ppin tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts
bugs            :
bogomips        : 6984.43
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

And the other host has

processor       : 7
vendor_id       : GenuineIntel
cpu family      : 6
model           : 158
model name      : Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
stepping        : 9
microcode       : 0x84
cpu MHz         : 800.000
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 3
cpu cores       : 4
apicid          : 7
initial apicid  : 7
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips        : 7200.00
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

hvesalai commented 6 years ago

If 1/2 is the default, what could cause OMP_NUM_THREADS=4 to behave differently from leaving it unset on the box with 8 cores?

raver119 commented 6 years ago

So, your i7 has 4 cores and your Xeon has 6 cores. You don't have an 8-core box.

Now please tell me: what are the actual ops you're referring to?

raver119 commented 6 years ago

Also, please tell me - are you able to run snapshots on your box?

hvesalai commented 6 years ago

I must say I'm not a hardware guy, so I'll take your word for the number of cores. My knowledge of the CPUs is limited to what cat /proc/cpuinfo gives.

The computation I'm running consists of approximately 1200 steps composed of the following operations:

Yes, I think I've been able to run snapshots before. I use sbt, so that might limit my ability somewhat.

raver119 commented 6 years ago

Can you please exclude mmul from your operations and run your tests again? mmul is the only thing in your list that should use MKL, I think.

hvesalai commented 6 years ago

I'll replace them with Nd4j.ones.

raver119 commented 6 years ago

Better: Nd4j.createUninitialized(). It'll be just an allocation, nothing else.

Also, please tell me what your data dimensionality is. Maybe you could also post the code you're using, so I could profile it in Java and C++?
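A sketch of the substitution (assuming the beta-era ND4J API; the shape here is arbitrary):

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class MmulVsAlloc {
    public static void main(String[] args) {
        INDArray a = Nd4j.rand(128, 155);
        INDArray b = Nd4j.rand(155, 1);

        // Original op: matrix-vector product, dispatched to MKL (and its OpenMP threads)
        INDArray viaMmul = a.mmul(b);

        // Suggested stand-in: allocate the result shape without computing anything,
        // which takes MKL/OpenMP out of the picture entirely
        INDArray viaAlloc = Nd4j.createUninitialized(new int[]{128, 1});
    }
}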

hvesalai commented 6 years ago

I changed mmul to createUninitialized. Now the i7 machine with the default OMP_NUM_THREADS fluctuates between 8-25ms per cycle, clearly faster than the 100ms it was with mmul. If I set OMP_NUM_THREADS=1, the cycle time is 9-10ms, which is the same as with mmul.

As for the dimensionality of the mmul operations: here's a breakdown of the counts and dimensions. As you can see, there are some basic matrix-vector multiplications and some vector-vector outer products.

count result        a            b
   32 (128, 1)   <- (128, 155) * (155, 1)
   24 (155, 1)   <- (155, 128) * (128, 1)
   24 (128, 165) <- (128, 1)   * (1, 165)
   24 (128, 1)   <- (128, 165) * (165, 1)
   20 (165, 1)   <- (165, 128) * (128, 1)
   17 (128, 155) <- (128, 1)   * (1, 155)
   15 (128, 155) <- (128, 1)   * (1, 155)
   15 (128, 1)   <- (128, 384) * (384, 1)
   14 (384, 1)   <- (384, 128) * (128, 1)
   14 (128, 384) <- (128, 1)   * (1, 384)
    6 (37, 1)    <- (37, 128)  * (128, 1)
    6 (37, 128)  <- (37, 1)    * (1, 128)
    6 (128, 1)   <- (128, 37)  * (37, 1)
    1 (384, 1)   <- (384, 128) * (128, 1)
    1 (128, 384) <- (128, 1)   * (1, 384)
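
Not the real workload, but a minimal stand-in that times one of these shapes under whatever OMP_NUM_THREADS the JVM is launched with (names and iteration counts are just illustrative):

import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.factory.Nd4j;

public class MmulTiming {
    public static void main(String[] args) {
        INDArray a = Nd4j.rand(128, 155); // matrix-vector case from the breakdown above
        INDArray b = Nd4j.rand(155, 1);

        // Warm-up so JIT compilation and MKL/OpenMP thread pools are initialized before timing
        for (int i = 0; i < 100; i++) {
            a.mmul(b);
        }

        int iterations = 1000;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            a.mmul(b);
        }
        long perCallMicros = (System.nanoTime() - start) / iterations / 1000;
        System.out.println("mmul (128x155 * 155x1): ~" + perCallMicros + " us per call");
    }
}

Running that with OMP_NUM_THREADS unset versus =1 should show the same gap as the cycle times above.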

I'm sorry that I can't share the actual computation code that you could profile. However, to give you some background, the ND4J-interfacing part of the code is basically a simple loop executing instructions from a list of operations provided by an external system. It's very similar to what I imagine DL4J's autodiff is doing.

As pseudocode the input list could look something like this:

0 argument(1)
1 argument(2)
2 mmul(0,1)
3 argument(3)
4 add(2,3)

Where I then map the instructions as:

argument(n) -> Nd4j.create(....)
mmul(a,b) -> a.mmul(b)
add(a,b) -> a.add(b)
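
In plain Java the loop is roughly of this shape (hypothetical names, not the real code; argument(n) here just fetches a pre-built array instead of calling Nd4j.create):

import java.util.ArrayList;
import java.util.List;
import org.nd4j.linalg.api.ndarray.INDArray;

public class OpLoop {
    // Hypothetical instruction record: an op name plus indices of earlier results it consumes
    static class Instruction {
        final String op;
        final int[] inputs;
        Instruction(String op, int... inputs) { this.op = op; this.inputs = inputs; }
    }

    static List<INDArray> run(List<Instruction> program, List<INDArray> arguments) {
        List<INDArray> results = new ArrayList<>();
        for (Instruction ins : program) {
            switch (ins.op) {
                case "argument": // argument(n) -> externally supplied data (Nd4j.create(...) in the real code)
                    results.add(arguments.get(ins.inputs[0]));
                    break;
                case "mmul":     // mmul(a,b) -> a.mmul(b)
                    results.add(results.get(ins.inputs[0]).mmul(results.get(ins.inputs[1])));
                    break;
                case "add":      // add(a,b) -> a.add(b)
                    results.add(results.get(ins.inputs[0]).add(results.get(ins.inputs[1])));
                    break;
                default:
                    throw new IllegalArgumentException("unknown op: " + ins.op);
            }
        }
        return results;
    }
}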

hvesalai commented 6 years ago

(fixed typos in above comment... most notably mmu -> mmul)

raver119 commented 6 years ago

So, in other words, something is wrong with the MKL configuration at runtime. We had a few lines of code for that purpose in libnd4j; not sure if we still use them or not. Let me add some logging and get back to you.

raver119 commented 6 years ago

@saudet, back in the day we had a few custom calls in libnd4j to fine-tune MKL at runtime, but now I can't find those. Were they removed?

saudet commented 6 years ago

Yes, they're part of the presets for OpenBLAS here: https://github.com/bytedeco/javacpp-presets/blob/master/openblas/src/main/java/org/bytedeco/javacpp/presets/openblas_nolapack.java#L155

raver119 commented 6 years ago

We should investigate them once again; probably something changed in MKL, so we should update those methods.

hvesalai commented 6 years ago

While debugging this I noticed that using ND4J results in a wild number of very short-lived threads being created at the OS level.

raver119 commented 6 years ago

That's OpenMP. The actual implementation depends on your OS etc., and with MKL installed it's probably Intel OpenMP. Usually these threads are cached and reused by OpenMP. We create exactly 0 pthreads/winthreads ourselves.

hvesalai commented 6 years ago

OMP_PROC_BIND=true fixes it. I have absolutely no idea what it does, but while googling I hit this Stack Overflow question: https://stackoverflow.com/questions/24368576/poor-performance-due-to-hyper-threading-with-openmp-how-to-bind-threads-to-core

raver119 commented 6 years ago

Fixes what, the original issue?

hvesalai commented 6 years ago

Yes, the original issue with mmul being slow with the default number of threads.

Not the large number of threads being created and thrashed.

raver119 commented 6 years ago

Strange. What's your MKL version?

hvesalai commented 6 years ago

I'm using mkl-2018.3-1.4.2-linux-x86_64.jar, so I guess it's 2018.3 then.

raver119 commented 6 years ago

So, in both cases you have single-CPU systems. Are you 100% sure you're not running the app in a VM environment?

hvesalai commented 6 years ago

Well... that would be the hoax of the century. We are renting the machines from Hetzner, so I don't have physical access to them and thus cannot be 100% sure. But let's say 99%.

> facter|grep virtual
is_virtual => false
virtual => physical

raver119 commented 6 years ago

I see. Weird.