debasish83 opened 8 years ago
Hi Debasish, I am not sure that I understand the question, but I will try to address some of the things you mentioned. With OpenBLAS, I don't set the number of threads and leave it to decide how many to use. With this, I observe almost 100% CPU usage during matrix multiplication.
With regard to the reduce time in Spark treeReduce, the total time is proportional to the number of partitions. The summation of two vectors that you are performing should have the same performance with both JVM and native BLAS. According to the netlib-java benchmarks, native BLAS provides a substantial speedup only for BLAS level 3, which includes matrix multiplication. Please refer to https://github.com/fommil/netlib-java
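If it helps, here is a minimal sketch (mine, not from the netlib-java docs) of how to check which BLAS implementation netlib-java has loaded and to exercise a level 3 routine such as dgemm; the matrix sizes are arbitrary:

import com.github.fommil.netlib.BLAS

val blas = BLAS.getInstance()
// Prints e.g. com.github.fommil.netlib.NativeSystemBLAS if OpenBLAS was picked up,
// or the F2J fallback otherwise.
println(blas.getClass.getName)

// dgemm is BLAS level 3 (C := alpha*A*B + beta*C), the case where native BLAS
// gives a substantial speedup over f2j. Arrays are column-major.
val m = 512; val k = 512; val n = 512
val a = Array.fill(m * k)(1.0)
val b = Array.fill(k * n)(1.0)
val c = new Array[Double](m * n)
blas.dgemm("N", "N", m, n, k, 1.0, a, m, b, k, 0.0, c, m)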
If I don't set export OPENBLAS_NUM_THREADS=N, isn't it the case that OpenBLAS will run with 1 thread? I will double-check...
I figured out that the treeReduce bottleneck is the number of partitions, so I run treeReduce with NUM_EXECUTORS partitions, but I am guessing my combOp and seqOp code can be further optimized with OPENBLAS_NUM_THREADS=4/8. I was all set to experiment but got stuck due to gcc 4.4.7 on our cluster; the netlib-java .so files are compiled with gcc 4.8.2. I am trying to fix it, and I will report my findings for the ANN benchmark as well.
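For reference, here is how I believe the thread count can be propagated to the executors, using Spark's documented spark.executorEnv.* configuration (the value 8 is just an example, and the app name is made up):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: pass OPENBLAS_NUM_THREADS into every executor's environment.
val conf = new SparkConf()
  .setAppName("openblas-thread-experiment")
  .set("spark.executorEnv.OPENBLAS_NUM_THREADS", "8")
val sc = new SparkContext(conf)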
I got OpenBLAS operational with gcc 4.4 on CentOS 6.5. For dpotrs and dpotrf I am getting a good runtime improvement with multiple threads on the Spark master (around 2x compared to single-threaded OpenBLAS). There were some issues in dgetrf_parallel and I opened an issue for them. I will experiment with 20 partitions + multithreaded BLAS on 16 cores vs. 320 partitions + single-threaded BLAS. Fewer partitions should optimize the treeAggregate combOp, but the idea here is not to degrade the compute by restricting the number of partitions. I checked that CPU utilization grows to > 100%, which means multiple threads are being used by OpenBLAS as I expected. It should help improve the ANN benchmark as well, since BLAS level 3 operations should benefit the most.
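For anyone reproducing the dpotrf/dpotrs timings, the calls go through netlib-java's LAPACK interface roughly like this (a toy 3x3 SPD system, not the actual benchmark sizes):

import com.github.fommil.netlib.LAPACK
import org.netlib.util.intW

val lapack = LAPACK.getInstance()
val n = 3
// Symmetric positive definite matrix in column-major order.
val a = Array[Double](4, 1, 1,  1, 3, 0,  1, 0, 2)
val b = Array[Double](1, 2, 3)  // right-hand side
val info = new intW(0)

lapack.dpotrf("U", n, a, n, info)           // Cholesky factorization of A in place
lapack.dpotrs("U", n, 1, a, n, b, n, info)  // b now holds the solution of A x = b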
Hi Alex,
I am trying to reproduce the benchmark results, and I have a quick question: how many OpenBLAS threads did you use, and what runtime scalability did you get? I am expecting that with N threads, compute runtime should improve from M secs to roughly M/(N/2) secs, i.e. about half of ideal linear scaling.
Here is what I am trying:
I have 20 nodes and 16 cores on each node.
SparkContext: 20 nodes, 16 cores, sc.defaultParallelism 320
// Packed size of the upper triangle of an n x n symmetric (Gram) matrix;
// e.g. gramSize(4096) = 8390656 floats, ~32 MB per vector.
def gramSize(n: Int) = n * (n + 1) / 2
// I have not used saxpy with f2jBLAS and NativeBLAS yet, but that will be used here
// for comparisons (see the saxpy sketch after the setup below).
// I am not sure whether f2jBLAS can run on multiple threads, but OpenBLAS should run fine.
val combOp = (v1: Array[Float], v2: Array[Float]) => {
  var i = 0
  while (i < v1.length) {
    v1(i) += v2(i)
    i += 1
  }
  v1
}
val n = gramSize(4096)
val vv = sc.parallelize(0 until sc.defaultParallelism).map(i => Array.fill[Float](n)(0.0f))
vv.persist
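And here is the saxpy-based combOp mentioned in the comments above: an untested sketch that routes the summation through whichever BLAS implementation netlib-java loads (f2j or native):

import com.github.fommil.netlib.BLAS

// v1 := 1.0f * v2 + v1, delegated to BLAS level 1 saxpy.
val blasCombOp = (v1: Array[Float], v2: Array[Float]) => {
  BLAS.getInstance().saxpy(v1.length, 1.0f, v2, 1, v1, 1)
  v1
}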
Option 1: 320 partitions, 1 thread on combOp per partition
val start = System.nanoTime()
vv.treeReduce(combOp, 2)
val reduceTime = (System.nanoTime() - start) * 1e-9
reduceTime: Double = 5.6390302430000006
Option 2: 20 partitions, 1 thread on combOp per partition
val coalescedvv = vv.coalesce(20)
coalescedvv.count
val start = System.nanoTime()
coalescedvv.treeReduce(combOp, 2)
val reduceTime = (System.nanoTime() - start) * 1e-9
reduceTime: Double = 3.9140685640000004
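To avoid repeating the timing boilerplate, a small helper like this (hypothetical, not part of the original runs) could wrap both options:

// Runs body once and returns its result together with wall-clock seconds.
def timed[A](body: => A): (A, Double) = {
  val start = System.nanoTime()
  val result = body
  (result, (System.nanoTime() - start) * 1e-9)
}

// e.g. val (_, reduceTime) = timed(coalescedvv.treeReduce(combOp, 2))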
Option 3: 20 partitions, OpenBLAS numThread=16 per partition
I am setting up OpenBLAS on the cluster; I will update soon.
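The planned shape of Option 3 (untested) assumes the executors are launched with OPENBLAS_NUM_THREADS=16, e.g. via spark.executorEnv as sketched earlier, and uses the saxpy-based blasCombOp so the reduction actually goes through OpenBLAS, though per the netlib-java notes above a level 1 routine may not gain much from threading:

val start = System.nanoTime()
coalescedvv.treeReduce(blasCombOp, 2)
val reduceTime = (System.nanoTime() - start) * 1e-9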
Let me know your thoughts. I think if the underlying operations are dense BLAS level 1, level 2, or level 3, running with more OpenBLAS threads and fewer partitions should help decrease the cross-partition shuffle.