Building on the GPU implementation of `gsrb_shared`, which uses shared memory, this PR brings the same improvement to CPU runs with OpenMP by caching `phi` in a local array.
With 2047^2 cells and 48 OpenMP threads, HPMG is 78% faster with this PR than on `development`.
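The kernel itself is not shown in this description. Below is a minimal C++ sketch of the general idea of a cached red-black Gauss-Seidel (GSRB) sweep on CPU. It is not the PR's code and makes several assumptions. It solves the 5-point discretization of -Δφ = f with Dirichlet values stored in one ghost layer. `Grid`, `gsrb_reference`, `gsrb_cached` and the `tile` parameter are hypothetical names, not HPMG API. Each tile copies `phi` plus a 2-cell halo into a local array. It then runs the red sweep on the tile and a 1-cell ring, runs the black sweep on the tile, and writes only the tile back. The result matches a global red-then-black sweep, while each tile makes a single pass over memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical 2D grid: n x n interior cells plus one ghost layer per side
// (boundary values live in the ghost layer). Row-major, indices 0..n+1.
struct Grid {
    int n;
    std::vector<double> v;
    explicit Grid(int n_) : n(n_), v(static_cast<std::size_t>(n_ + 2) * (n_ + 2), 0.0) {}
    double& operator()(int i, int j) { return v[static_cast<std::size_t>(j) * (n + 2) + i]; }
    double operator()(int i, int j) const { return v[static_cast<std::size_t>(j) * (n + 2) + i]; }
};

// Reference: one red sweep ((i+j) even), then one black sweep, in place.
void gsrb_reference(Grid& phi, const Grid& rhs, double h2) {
    for (int color = 0; color < 2; ++color)
        for (int j = 1; j <= phi.n; ++j)
            for (int i = 1; i <= phi.n; ++i)
                if ((i + j) % 2 == color)
                    phi(i, j) = 0.25 * (phi(i - 1, j) + phi(i + 1, j) + phi(i, j - 1) +
                                        phi(i, j + 1) + h2 * rhs(i, j));
}

// Cached sketch: reads only from `in` and writes to a separate `out`, so no
// thread reads a halo cell that another thread has already overwritten.
Grid gsrb_cached(const Grid& in, const Grid& rhs, double h2, int tile) {
    const int n = in.n;
    Grid out = in;  // keeps the ghost (boundary) values
    const int nt = (n + tile - 1) / tile;
    #pragma omp parallel for collapse(2)
    for (int tj = 0; tj < nt; ++tj) {
        for (int ti = 0; ti < nt; ++ti) {
            const int i0 = 1 + ti * tile, i1 = std::min(i0 + tile, n + 1);
            const int j0 = 1 + tj * tile, j1 = std::min(j0 + tile, n + 1);
            // Cached region: tile plus a 2-cell halo, clipped to the grid.
            const int ilo = std::max(i0 - 2, 0), ihi = std::min(i1 + 2, n + 2);
            const int jlo = std::max(j0 - 2, 0), jhi = std::min(j1 + 2, n + 2);
            const int w = ihi - ilo;
            std::vector<double> loc(static_cast<std::size_t>(w) * (jhi - jlo));
            auto L = [&](int i, int j) -> double& { return loc[(j - jlo) * w + (i - ilo)]; };
            for (int j = jlo; j < jhi; ++j)
                for (int i = ilo; i < ihi; ++i) L(i, j) = in(i, j);
            auto relax = [&](int i, int j) {
                L(i, j) = 0.25 * (L(i - 1, j) + L(i + 1, j) + L(i, j - 1) + L(i, j + 1) +
                                  h2 * rhs(i, j));
            };
            // Red on the tile and a 1-cell ring; its black neighbors are still original.
            for (int j = std::max(j0 - 1, 1); j < std::min(j1 + 1, n + 1); ++j)
                for (int i = std::max(i0 - 1, 1); i < std::min(i1 + 1, n + 1); ++i)
                    if ((i + j) % 2 == 0) relax(i, j);
            // Black on the tile only; all of its red neighbors are now updated.
            for (int j = j0; j < j1; ++j)
                for (int i = i0; i < i1; ++i)
                    if ((i + j) % 2 == 1) relax(i, j);
            // Write back the tile interior only (tiles are disjoint).
            for (int j = j0; j < j1; ++j)
                for (int i = i0; i < i1; ++i) out(i, j) = L(i, j);
        }
    }
    return out;
}
```

The `out = in` copy keeps the sketch simple and race-free. A production kernel would more likely avoid that full copy, for example by separating the load and compute phases with a barrier.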
[ ] Small enough (< a few hundred lines), otherwise it should probably be split into smaller PRs
[ ] Tested (describe the tests in the PR description)
[ ] Runs on GPU (basic: the code compiles and runs well with the new module)
[ ] Contains an automated test (checksum and/or comparison with theory)
[ ] Documented: all elements (classes and their members, functions, namespaces, etc.) are documented