Improve lock performance with SpinMutex and OpenMP static scheduling

kks32 commented 4 years ago

Describe the PR

Mutex locks take a significant amount of time in the OpenMP parallel versions of the code. This update reduces the locking overhead by trying a spinlock, and reduces wait times across threads with static scheduling.

The chunk size of the OpenMP schedule can now be set at run time using

export OMP_SCHEDULE="static, 4"

This allows finer control without recompiling for different chunk sizes when running different problems. If the variable is not set, the code still runs.
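As an illustration of how a run-time chunk size is usually wired up: OMP_SCHEDULE only affects loops declared with `schedule(runtime)`. The sketch below assumes the PR's loops follow that pattern; the function name `axpy` and its arguments are hypothetical and do not come from the mpm code base.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper (not from the PR): computes y[i] += a * x[i].
// schedule(runtime) defers the schedule kind and chunk size to the
// OMP_SCHEDULE environment variable, e.g. OMP_SCHEDULE="static, 4".
// If the variable is unset, the OpenMP implementation uses its own default.
// If the file is compiled without OpenMP enabled, the pragma is ignored and
// the loop simply runs serially.
void axpy(double a, const std::vector<double>& x, std::vector<double>& y) {
  assert(x.size() == y.size());
  const long n = static_cast<long>(x.size());
#pragma omp parallel for schedule(runtime)
  for (long i = 0; i < n; ++i) {
    y[i] += a * x[i];
  }
}
```

With this pattern the same binary can be run with different values of OMP_SCHEDULE to compare chunk sizes, which matches the workflow described above.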

Additional context

Mutex locks are a computational bottleneck, which is why this PR tries spinlocks instead. We cannot use std::atomic, since it lacks support for Eigen and Vector containers, so a spinlock is a good workaround. The run time improves by about 4-5% for the 3D hydrostatic column.

OMP Schedule static chunk size	Time (s)	Speedup
no schedule	742	1
4	725.3	1.023
16	721.5	1.028
128	735.2	1.009
512	736.7	1.007
1000	760	0.976
SpinMutex with Chunk-4	711	1.044
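
For readers unfamiliar with the spinlock approach mentioned above, here is a minimal sketch built on `std::atomic_flag`. It is an assumption about the general technique, not the PR's actual SpinMutex; the `Node` struct, `add_force`, and `accumulate` are hypothetical stand-ins for a shared vector quantity that std::atomic cannot wrap directly.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <mutex>
#include <vector>

// Minimal spinlock sketch (assumed design, not the PR's implementation).
// It provides lock()/unlock(), so std::lock_guard can be used with it.
class SpinMutex {
 public:
  void lock() {
    // test_and_set returns the previous value: keep spinning while the
    // flag was already set by another thread.
    while (flag_.test_and_set(std::memory_order_acquire)) {
    }
  }
  void unlock() { flag_.clear(std::memory_order_release); }

 private:
  std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Hypothetical node that accumulates a 3D vector from many threads.
struct Node {
  std::array<double, 3> force{{0.0, 0.0, 0.0}};
  SpinMutex mutex;

  void add_force(double fx, double fy, double fz) {
    std::lock_guard<SpinMutex> guard(mutex);  // released at end of scope
    force[0] += fx;
    force[1] += fy;
    force[2] += fz;
  }
};

// Adds contributions stored as [x0, y0, z0, x1, y1, z1, ...] to one node.
// Without OpenMP enabled the pragma is ignored and the loop runs serially.
void accumulate(Node& node, const std::vector<double>& flat) {
  assert(flat.size() % 3 == 0);
  const long n = static_cast<long>(flat.size() / 3);
#pragma omp parallel for schedule(runtime)
  for (long i = 0; i < n; ++i) {
    node.add_force(flat[3 * i], flat[3 * i + 1], flat[3 * i + 2]);
  }
}
```

A spinlock avoids putting the waiting thread to sleep, which tends to help when the critical section is as short as three additions. It can hurt when locks are held for long or when there are more threads than cores.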

kks32 commented 4 years ago

Thanks for this @kks32! I've left my comments, but I don't quite understand your table. What do the numbers represent? (I thought they were how you divide your chunks, but apparently not, given the last row.)

So if you divide into chunks of 4, you get 4-5% speedup, but did you try with 2 or 8? Any thoughts on using "dynamic" or "auto"?

The table shows two things: 1. OpenMP performance with different chunk sizes, and 2. the SpinMutex implementation with a chunk size of 4. If you let OpenMP decide, that's our baseline (742 s). Dynamic will be slower. The best chunk size and number of threads must be decided by the user running the code. These results are only for the 3D hydrostatic column.
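
To make the meaning of the chunk size concrete: under `schedule(static, chunk)`, OpenMP splits the iterations into blocks of `chunk` and hands them to the threads round-robin in thread-number order. The sketch below reproduces that mapping in plain C++; the function names are hypothetical and nothing here is taken from the mpm code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Thread that executes iteration i under schedule(static, chunk):
// chunk number (i / chunk), dealt round-robin across num_threads.
int static_owner(long i, long chunk, int num_threads) {
  assert(i >= 0 && chunk > 0 && num_threads > 0);
  return static_cast<int>((i / chunk) % num_threads);
}

// Number of iterations each thread receives for a loop of n iterations.
std::vector<long> iterations_per_thread(long n, long chunk, int num_threads) {
  std::vector<long> counts(static_cast<std::size_t>(num_threads), 0L);
  for (long i = 0; i < n; ++i) {
    ++counts[static_cast<std::size_t>(static_owner(i, chunk, num_threads))];
  }
  return counts;
}
```

For example, 10 iterations with a chunk size of 4 on 2 threads give thread 0 six iterations (0-3 and 8-9) and thread 1 four (4-7). A chunk size close to the loop length leaves most threads idle, which is one possible reason a very large chunk can be slower than the default; whether that applies to the 3D hydrostatic column depends on its loop sizes, which are not stated here.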

kks32 commented 4 years ago

https://app.circleci.com/pipelines/github/kks32/mpm?branch=performance%2Flock CI passed.

cb-geo / mpm

Improve lock performance with SpinMutex and OpenMP static scheduling #676