I have a problem when I run DNN prediction: I call SyncCopyFromCPU and then Forward. Both batch_size and fea_num are 40. The default BLAS is OpenBLAS (I also tried Intel MKL, but it did not help). The CPU is Broadwell, with 58 logical cores in total.

I run 32-58 worker threads, each with only 1 OpenMP thread, because I worry that opening too many OpenMP threads would decrease performance.

In my tests, the whole predict takes 13.9 ms in total: SyncCopyFromCPU takes 275 us, but Forward takes 11 ms. How can I reduce the Forward time?