Closed wizardk closed 4 years ago
Hi! When I was developing Simd, my priority was to achieve maximal single-thread performance. I assumed that multi-threading would be performed at the task level, which is the best approach when there is a large number of small tasks. Unfortunately, I had no time to optimize the case of a big task that can be parallelized internally. So this issue is still open.
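To illustrate the task-level approach described above, here is a minimal sketch: each worker thread runs an independent task through a single-threaded routine, so no synchronization is needed inside a task. The `process_task` function is a hypothetical stand-in for a single-threaded SIMD-accelerated call, not part of the Simd API:

```python
from concurrent.futures import ThreadPoolExecutor

def process_task(task):
    # Hypothetical stand-in for a single-threaded SIMD-accelerated
    # routine; here it just sums the task's data.
    return sum(task)

# Many small, independent tasks: parallelism comes from running
# whole tasks concurrently, not from splitting one task.
tasks = [list(range(i, i + 100)) for i in range(0, 1000, 100)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(process_task, tasks))
```

This pattern scales well precisely because the tasks are numerous and small; a single big task gains nothing from it, which is the unoptimized case mentioned above.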
@ermig1979 Hi, thanks for your reply. Maybe I didn't make it clear. You can see https://github.com/microsoft/onnxruntime/issues/2512 and https://github.com/pytorch/pytorch/issues/26948. I'm using multi-processing and multi-threading at the task level, and I wonder whether there is performance degradation when many parallel task-level threads each use SIMD to accelerate their work.
Oh, I am sorry that I didn't understand your question - maybe I was too hasty to answer.
Returning to your question, I want to note that a CPU has two main kinds of resources: computational units and memory bandwidth.
Thanks for your helpful answer. I know the average performance should get worse as the number of parallel tasks increases, but I saw a very large deterioration of performance in my experiments. Do you think that makes sense?
It's all about RAM bandwidth. Thank you again.
How about using mmap? Does it change the nature of how the memory bandwidth is affected?
Hi. The restriction of memory bandwidth is fundamental for all modern architectures (both CPU and GPU). Since increasing the computational capability of a compute device is easier than increasing its memory bandwidth, all manufacturers have shifted the balance toward the former. In other words, it does not matter how you allocate memory at the software level if the restriction is in the hardware.
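The compute/bandwidth imbalance can be made concrete with a simple roofline-style estimate. The peak-FLOPS and bandwidth figures below are illustrative assumptions, not measurements of any particular CPU:

```python
def attainable_gflops(flops_per_byte, peak_gflops=500.0, bandwidth_gbs=40.0):
    # Roofline model: performance is capped either by the compute
    # units (peak_gflops) or by how fast memory can feed them
    # (bandwidth * arithmetic intensity). Both caps are assumed values.
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)

# A streaming kernel such as y[i] = a * x[i] + y[i] does ~2 flops per
# 12 bytes moved (read x[i], read y[i], write y[i] with 4-byte floats),
# so its attainable rate is a small fraction of the compute peak:
saxpy_gflops = attainable_gflops(2.0 / 12.0)
```

With these assumed numbers the streaming kernel reaches only about 6.7 GFLOPS out of a 500 GFLOPS peak, which is why allocation strategy (mmap or otherwise) cannot help: the bytes still have to cross the same memory bus.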
@ermig1979 Hi, great work! I wonder whether there is any performance variation with multithreaded parallel calls? I mean there are one or more processes, each process has one or more threads, and each thread calls SIMD-based code. Meanwhile, the total number of threads is less than the number of CPU cores.
I have tested libtorch and onnxruntime (Eigen or MKL-DNN), and their multithreaded parallel inference performance deteriorated rapidly as the number of threads increased. So I asked this question, wanting to find a way to optimize the speed of NN inference in a parallel environment.
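The rapid deterioration reported here is consistent with a simple bandwidth-saturation model: per-thread performance is flat until the threads' combined memory demand hits the shared bandwidth, then it falls as 1/n. The bandwidth and per-thread demand figures below are made-up illustrations, not measurements:

```python
def per_thread_throughput(n_threads, bandwidth_gbs=40.0, demand_gbs=15.0):
    # Saturation model: each thread wants demand_gbs of memory
    # bandwidth; once the combined demand exceeds the shared
    # bandwidth, the threads split it evenly. Both figures are
    # assumed, illustrative values.
    total = min(n_threads * demand_gbs, bandwidth_gbs)
    return total / n_threads

for n in (1, 2, 4, 8):
    print(n, per_thread_throughput(n))
```

Under these assumptions one or two threads each get their full 15 GB/s, but at four threads each gets only 10 GB/s and at eight only 5 GB/s, so inference latency per request grows sharply even though total core count is not exhausted.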