cb-geo / mpm

CB-Geo High-Performance Material Point Method
https://www.cb-geo.com/research/mpm

Improve lock performance with SpinMutex and OpenMP static scheduling #676

Closed kks32 closed 4 years ago

kks32 commented 4 years ago

Describe the PR: Mutex locks take a significant amount of time in the OpenMP parallel versions of the code. This update reduces lock overhead by using a spinlock, and reduces wait times across threads with static scheduling.
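The spinlock idea can be sketched with std::atomic_flag: a waiting thread busy-waits on the flag instead of going to sleep, which tends to pay off when the critical section is very short. This is a minimal sketch under assumptions, not the code from this PR. The SpinMutex shown here may differ from the class in the repository, and Node and accumulate_momentum are hypothetical names used only for illustration.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <mutex>

// Minimal spinlock built on std::atomic_flag (illustrative sketch).
// It provides lock()/unlock(), so it can be used with std::lock_guard.
class SpinMutex {
 public:
  void lock() {
    // Busy-wait until test_and_set reports the flag was previously clear.
    while (flag_.test_and_set(std::memory_order_acquire)) {
    }
  }
  void unlock() { flag_.clear(std::memory_order_release); }

 private:
  std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Hypothetical node that accumulates a 3D vector quantity from many particles.
struct Node {
  std::array<double, 3> momentum{{0.0, 0.0, 0.0}};
  SpinMutex mutex;

  void add_momentum(const std::array<double, 3>& value) {
    // The whole vector update is guarded by one short critical section.
    std::lock_guard<SpinMutex> guard(mutex);
    for (std::size_t d = 0; d < 3; ++d) momentum[d] += value[d];
  }
};

// Hypothetical driver: every "particle" adds the same contribution to one node.
// If OpenMP is not enabled at compile time, the pragma is ignored and the
// loop runs serially with the same result.
std::array<double, 3> accumulate_momentum(int nparticles) {
  Node node;
  const std::array<double, 3> contribution = {{1.0, 2.0, 3.0}};
#pragma omp parallel for
  for (int i = 0; i < nparticles; ++i) node.add_momentum(contribution);
  return node.momentum;
}
```

Because a spinning thread keeps a core busy while it waits, this only helps when locks are held briefly and contention is moderate. With long critical sections, a regular std::mutex is usually the safer choice.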

The chunk size of the OpenMP schedule can now be set at run time using:

export OMP_SCHEDULE="static, 4"

This allows finer control without having to recompile for different chunk sizes when running different problems. If the variable is not set, the code will still run.
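In OpenMP, the OMP_SCHEDULE variable only affects loops declared with schedule(runtime). The runtime then reads the schedule kind and chunk size from the environment when the program starts. A minimal sketch, assuming a hypothetical per-particle loop (scale_values is an illustrative name, not a function in mpm):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a per-particle update; not part of mpm.
// schedule(runtime) defers the schedule kind and chunk size to OMP_SCHEDULE,
// e.g. OMP_SCHEDULE="static,4". If OpenMP is not enabled at compile time,
// the pragma is ignored and the loop runs serially.
std::vector<double> scale_values(const std::vector<double>& values,
                                 double factor) {
  std::vector<double> result(values.size());
  const long n = static_cast<long>(values.size());
#pragma omp parallel for schedule(runtime)
  for (long i = 0; i < n; ++i) {
    // Each iteration writes its own slot, so no lock is needed here.
    result[i] = factor * values[i];
  }
  return result;
}
```

With a static schedule and chunk size 4, iterations are handed out to threads in fixed blocks of four in round-robin order, so changing the environment variable changes the work distribution without a rebuild.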

Additional context: Mutex locks are a computational bottleneck. We cannot use std::atomic here because it does not support Eigen and vector containers, so a spinlock is a good workaround. The speed improves by about 4-5% for the 3D hydrostatic column.

| OMP schedule static chunk size | Time (s) | Speedup |
|---|---|---|
| no schedule | 742 | 1 |
| 4 | 725.3 | 1.023 |
| 16 | 721.5 | 1.028 |
| 128 | 735.2 | 1.009 |
| 512 | 736.7 | 1.007 |
| 1000 | 760 | 0.976 |
| SpinMutex with Chunk-4 | 711 | 1.044 |

kks32 commented 4 years ago

Thanks for this @kks32! I left my comments, but I don't seem to understand your table. What do the numbers represent? (I thought they were how you divide your chunks, but apparently not, because of the last row.)

So if you divide into chunks of 4, you get a 4-5% speedup, but did you try 2 or 8? Any thoughts on using "dynamic" or "auto"?

There are two things in the table: 1. OpenMP performance with different chunk sizes, and 2. the SpinMutex implementation with a chunk size of 4. If you let OpenMP decide, that's our baseline (742 s). Dynamic will be slower. The best chunk size and number of threads must be decided by the user running the code. These results are only for the 3D hydrostatic column.

kks32 commented 4 years ago

https://app.circleci.com/pipelines/github/kks32/mpm?branch=performance%2Flock CI passed.