ermig1979 / Simd

C++ image processing and machine learning library using SIMD: SSE, AVX, AVX-512, and AMX for x86/x64, VMX (Altivec) and VSX (Power7) for PowerPC, and NEON for ARM.
http://ermig1979.github.io/Simd
MIT License

What is the performance of SIMD in multithreading calling? #113

Closed wizardk closed 4 years ago

wizardk commented 4 years ago

@ermig1979 Hi, great work! I wonder whether there is any performance variation when the library is called from multiple threads in parallel. I mean there are one or more processes, each process has one or more threads, and each thread calls SIMD-based code. Meanwhile, the total number of threads is less than the number of CPU cores.

I have tested libtorch and onnxruntime (Eigen or MKL-DNN backends), and their multi-threaded parallel inference performance deteriorated rapidly as the number of threads increased. So I am asking this question because I want to find a way to optimize the speed of NN inference in a parallel environment.

ermig1979 commented 4 years ago

Hi! When I was developing Simd, my priority was to achieve maximal single-thread performance. I assumed that multi-threading would be performed at the task level, which is the best approach when we have a large number of small tasks (see the sketch below). Unfortunately, I had no time to optimize the case of a single big task that can be parallelized internally. So this issue is still open.
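
For concreteness, here is a minimal sketch (not part of the Simd API) of what task-level parallelism means here: each worker thread owns one complete, independent task and calls single-threaded routines inside it. `ProcessFrame` is a hypothetical stand-in for any pipeline built from Simd calls, reduced to a byte-wise copy so the example compiles on its own.

```cpp
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical stand-in for a pipeline built from single-threaded Simd calls
// (here just a byte-wise copy, so the example is self-contained).
static void ProcessFrame(const uint8_t* src, uint8_t* dst, size_t size)
{
    for (size_t i = 0; i < size; ++i)
        dst[i] = src[i];
}

// Task-level parallelism: each thread owns one complete, independent task.
static void ProcessBatch(const std::vector<std::vector<uint8_t>>& srcs,
                         std::vector<std::vector<uint8_t>>& dsts)
{
    std::vector<std::thread> workers;
    for (size_t i = 0; i < srcs.size(); ++i)
        workers.emplace_back(ProcessFrame, srcs[i].data(), dsts[i].data(), srcs[i].size());
    for (std::thread& w : workers)
        w.join();
}

int main()
{
    const size_t frames = 4, size = 1024 * 1024;
    std::vector<std::vector<uint8_t>> srcs(frames, std::vector<uint8_t>(size, 1));
    std::vector<std::vector<uint8_t>> dsts(frames, std::vector<uint8_t>(size, 0));
    ProcessBatch(srcs, dsts);
    return 0;
}
```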

wizardk commented 4 years ago

@ermig1979 Hi, thanks for your reply. Maybe I didn't make it clear. You can see https://github.com/microsoft/onnxruntime/issues/2512 and https://github.com/pytorch/pytorch/issues/26948. I'm using multi-processing and multi-threading at the task level, and I wonder whether there is performance degradation when many parallel threads at the task level use SIMD to accelerate their work.

ermig1979 commented 4 years ago

Oh, I am sorry that I didn't understand your question - maybe I was too hasty with my answer.

Returning to your question, I want to note that a CPU has two main kinds of resources:

  1. Computing resources.
  2. Memory bandwidth.

The first one is much greater than the second one, so an efficient algorithm must minimize access to main memory. For example, matrix multiplication has this property (required memory bandwidth ~O(N^2), required computation ~O(N^3)). Unfortunately, the structure of newer neural networks such as MobileNetV2 minimizes the required computation while slightly increasing the required memory bandwidth. When we increase the number of tasks, we are limited first by memory bandwidth and only then by the number of CPU cores, and so we see the performance deteriorate. The rough estimate below illustrates the difference.
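
A back-of-the-envelope sketch of this argument in terms of arithmetic intensity (FLOPs per byte of main-memory traffic). The sizes and the assumptions (float32 data, every tensor crossing main memory exactly once) are illustrative, not taken from any particular model:

```cpp
#include <cstdio>

// Rough arithmetic-intensity estimates (FLOPs per byte of main-memory traffic),
// assuming float32 data and that every tensor crosses main memory exactly once.
int main()
{
    // Square matrix multiplication C = A * B, all matrices N x N.
    double N = 1024;
    double gemmFlops = 2 * N * N * N;      // one multiply-add per inner-product term
    double gemmBytes = 3 * N * N * 4;      // read A and B, write C
    std::printf("GEMM intensity:      %.1f FLOP/byte\n", gemmFlops / gemmBytes);

    // 3x3 depthwise convolution (MobileNet-style layer) over an H x W x C tensor.
    double H = 56, W = 56, C = 128;
    double dwFlops = 2 * 9 * H * W * C;    // 9 multiply-adds per output value
    double dwBytes = 2 * H * W * C * 4;    // read input, write output (weights negligible)
    std::printf("Depthwise intensity: %.1f FLOP/byte\n", dwFlops / dwBytes);
    return 0;
}
```

Under these assumptions the GEMM does roughly N/6 ≈ 170 FLOPs per byte of memory traffic, while the depthwise layer does only a few, so the depthwise layer hits the memory-bandwidth ceiling long before it saturates the cores.
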
wizardk commented 4 years ago

Thanks for your helpful answer. I know the average performance should get worse as the number of parallel tasks increases, but I saw a very large deterioration of performance in my experiments. Do you think that makes sense?

wizardk commented 4 years ago

It's all about RAM bandwidth. Thank you again.

Dyl777 commented 1 year ago

How about using mmap? Does it change how memory bandwidth is affected?

ermig1979 commented 1 year ago

Hi. The memory bandwidth restriction is fundamental to all modern architectures (both CPU and GPU). Since increasing the compute capability of a device is easier than increasing its memory bandwidth, all manufacturers have shifted the balance toward the former. In other words, it does not matter how you allocate memory at the software level if the restriction is in hardware. The sketch below illustrates the point.
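
A minimal way to check this yourself: stream through a large buffer allocated with `malloc` and through one allocated with anonymous `mmap`, and compare the measured bandwidth. This sketch assumes a Linux/POSIX system and omits error handling; the allocation method is the only thing that differs between the two runs.

```cpp
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <sys/mman.h>

// Streaming read bandwidth over a buffer, in GB/s. Touching one byte per
// 64-byte cache line still pulls the whole line from memory, so bytes moved
// is approximately size * repeats.
static double Bandwidth(const char* buf, size_t size, int repeats)
{
    volatile long long sink = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        for (size_t i = 0; i < size; i += 64)
            sink += buf[i];
    auto t1 = std::chrono::steady_clock::now();
    (void)sink;
    double seconds = std::chrono::duration<double>(t1 - t0).count();
    return double(size) * repeats / seconds / 1e9;
}

int main()
{
    const size_t size = size_t(1) << 30;   // 1 GiB, larger than any CPU cache
    const int repeats = 4;

    char* heap = static_cast<char*>(std::malloc(size));
    char* mapped = static_cast<char*>(mmap(nullptr, size, PROT_READ | PROT_WRITE,
                                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0));
    std::memset(heap, 1, size);            // touch pages so both buffers are resident
    std::memset(mapped, 1, size);

    std::printf("malloc buffer: %.1f GB/s\n", Bandwidth(heap, size, repeats));
    std::printf("mmap buffer:   %.1f GB/s\n", Bandwidth(mapped, size, repeats));

    std::free(heap);
    munmap(mapped, size);
    return 0;
}
```

Once the pages are resident, both buffers are plain physical memory behind the same memory controller, so the measured numbers should be essentially the same.
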