debasish83 opened 8 years ago
Hi Debasish, I am not sure that I understand the question, but I will try to address some of the things you mentioned. With OpenBLAS, I don't set the number of threads and leave it to decide how many to use. With this, I observe almost 100% CPU usage during matrix multiplication.
With regard to the reduce time in Spark treeReduce, the total time is proportional to the number of partitions. The summation of two vectors that you are performing should have the same performance with both JVM and native BLAS. According to the netlib-java benchmarks, native BLAS provides a substantial speedup only for BLAS level 3, which includes matrix multiplication. Please refer to https://github.com/fommil/netlib-java
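If it helps, here is a minimal sketch (mine, not from the netlib-java docs) of how to check which BLAS implementation netlib-java has loaded and to exercise a level 3 routine such as dgemm; the matrix sizes are arbitrary:

import com.github.fommil.netlib.BLAS

val blas = BLAS.getInstance()
// Prints e.g. com.github.fommil.netlib.NativeSystemBLAS if OpenBLAS was picked up,
// or the F2J fallback otherwise.
println(blas.getClass.getName)

// dgemm is BLAS level 3 (C := alpha*A*B + beta*C), the case where native BLAS
// gives a substantial speedup over f2j. Arrays are column-major.
val m = 512; val k = 512; val n = 512
val a = Array.fill(m * k)(1.0)
val b = Array.fill(k * n)(1.0)
val c = new Array[Double](m * n)
blas.dgemm("N", "N", m, n, k, 1.0, a, m, b, k, 0.0, c, m)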
If I don't set export OPENBLAS_NUM_THREADS=N, isn't it the case that OpenBLAS will run with 1 thread? I will double-check...
I figured out that the treeReduce bottleneck is the number of partitions, so I run treeReduce with NUM_EXECUTORS partitions, but I am guessing my combOp and seqOp code can be further optimized with OPENBLAS_NUM_THREADS=4/8. I was all set to experiment but got stuck due to gcc 4.4.7 on our cluster; the netlib-java .so files are compiled with gcc 4.8.2. I am trying to fix it, and I will report my findings for the ANN benchmark as well.
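For reference, here is how I believe the thread count can be propagated to the executors, using Spark's documented spark.executorEnv.* configuration (the value 8 is just an example, and the app name is made up):

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: pass OPENBLAS_NUM_THREADS into every executor's environment.
val conf = new SparkConf()
  .setAppName("openblas-thread-experiment")
  .set("spark.executorEnv.OPENBLAS_NUM_THREADS", "8")
val sc = new SparkContext(conf)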
I got OpenBLAS operational with gcc 4.4 on CentOS 6.5. For dpotrs and dpotrf I am getting a good runtime improvement with multiple threads on the Spark master (around 2x compared to single-threaded OpenBLAS). There were some issues in dgetrf_parallel and I opened an issue for them. I will experiment with 20 partitions + multithreaded BLAS on 16 cores vs. 320 partitions + single-threaded BLAS. Fewer partitions should optimize the treeAggregate combOp, but the idea here is not to degrade the compute by restricting the number of partitions. I checked that CPU utilization grows to > 100%, which means multiple threads are being used by OpenBLAS as I expected. It should help improve the ANN benchmark as well, since BLAS level 3 operations should benefit the most.
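For anyone reproducing the dpotrf/dpotrs timings, the calls go through netlib-java's LAPACK interface roughly like this (a toy 3x3 SPD system, not the actual benchmark sizes):

import com.github.fommil.netlib.LAPACK
import org.netlib.util.intW

val lapack = LAPACK.getInstance()
val n = 3
// Symmetric positive definite matrix in column-major order.
val a = Array[Double](4, 1, 1,  1, 3, 0,  1, 0, 2)
val b = Array[Double](1, 2, 3)  // right-hand side
val info = new intW(0)

lapack.dpotrf("U", n, a, n, info)           // Cholesky factorization of A in place
lapack.dpotrs("U", n, 1, a, n, b, n, info)  // b now holds the solution of A x = b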
Hi Alex,
I am trying to reproduce the benchmark results, and I have a quick question: how many OpenBLAS threads did you use, and what runtime scalability did you get? I am expecting that with N threads, compute runtime should improve from M secs to roughly M/(N/2) secs, i.e. about half of ideal linear scaling.
Here is what I am trying:
I have 20 nodes and 16 cores on each node.
SparkContext: 20 nodes, 16 cores, sc.defaultParallelism 320
// Packed size of the upper triangle of an n x n symmetric (Gram) matrix;
// e.g. gramSize(4096) = 8390656 floats, ~32 MB per vector.
def gramSize(n: Int) = n * (n + 1) / 2
// I have not used saxpy with f2jBLAS and NativeBLAS yet, but that will be used here
// for comparisons (see the saxpy sketch after the setup below).
// I am not sure whether f2jBLAS can run on multiple threads, but OpenBLAS should run fine.
val combOp = (v1: Array[Float], v2: Array[Float]) => {
  var i = 0
  while (i < v1.length) {
    v1(i) += v2(i)
    i += 1
  }
  v1
}
val n = gramSize(4096)
val vv = sc.parallelize(0 until sc.defaultParallelism).map(i => Array.fill[Float](n)(0.0f))
vv.persist
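And here is the saxpy-based combOp mentioned in the comments above: an untested sketch that routes the summation through whichever BLAS implementation netlib-java loads (f2j or native):

import com.github.fommil.netlib.BLAS

// v1 := 1.0f * v2 + v1, delegated to BLAS level 1 saxpy.
val blasCombOp = (v1: Array[Float], v2: Array[Float]) => {
  BLAS.getInstance().saxpy(v1.length, 1.0f, v2, 1, v1, 1)
  v1
}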
Option 1: 320 partitions, 1 thread on combOp per partition
val start = System.nanoTime()
vv.treeReduce(combOp, 2)
val reduceTime = (System.nanoTime() - start) * 1e-9
reduceTime: Double = 5.6390302430000006
Option 2: 20 partitions, 1 thread on combOp per partition
val coalescedvv = vv.coalesce(20)
coalescedvv.count
val start = System.nanoTime()
coalescedvv.treeReduce(combOp, 2)
val reduceTime = (System.nanoTime() - start) * 1e-9
reduceTime: Double = 3.9140685640000004
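To avoid repeating the timing boilerplate, a small helper like this (hypothetical, not part of the original runs) could wrap both options:

// Runs body once and returns its result together with wall-clock seconds.
def timed[A](body: => A): (A, Double) = {
  val start = System.nanoTime()
  val result = body
  (result, (System.nanoTime() - start) * 1e-9)
}

// e.g. val (_, reduceTime) = timed(coalescedvv.treeReduce(combOp, 2))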
Option 3: 20 partitions, OpenBLAS numThread=16 per partition
I am setting up OpenBLAS on the cluster; I will update soon.
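The planned shape of Option 3 (untested) assumes the executors are launched with OPENBLAS_NUM_THREADS=16, e.g. via spark.executorEnv as sketched earlier, and uses the saxpy-based blasCombOp so the reduction actually goes through OpenBLAS, though per the netlib-java notes above a level 1 routine may not gain much from threading:

val start = System.nanoTime()
coalescedvv.treeReduce(blasCombOp, 2)
val reduceTime = (System.nanoTime() - start) * 1e-9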
Let me know your thoughts. I think if the underlying operations are dense BLAS level 1, level 2, or level 3, running with more OpenBLAS threads and fewer partitions should help decrease the cross-partition shuffle.