Faster multigrid solve with OMP

Based on the GPU implementation of gsrb_shared, which uses shared memory, this PR brings the same optimization to CPU runs with OpenMP (OMP) by caching phi in a local array.
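
For readers unfamiliar with the idea, here is a minimal sketch of tile-local caching for a red-black Gauss-Seidel sweep, not the code in this PR. All names (`gsrb_tiled`, `gsrb_reference`, `tile`, `halo`) are hypothetical, the tile size is an arbitrary assumption, and the problem is simplified to `laplace(phi) = rhs` with unit grid spacing and phi = 0 outside the domain. Each tile copies phi plus a two-cell halo into a small local array, runs the red and black half-sweeps there, and writes its interior to a separate output array so that tiles never read values another thread is writing.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical sketch; not the HiPACE++ gsrb_shared implementation.
constexpr int tile = 8;                // interior cells per tile edge (assumed value)
constexpr int halo = 2;                // one halo layer per half-sweep (red, then black)
constexpr int edge = tile + 2 * halo;  // edge length of the local array

// Reference: one in-place red-black Gauss-Seidel iteration on an n x n grid.
void gsrb_reference(std::vector<double>& phi, const std::vector<double>& rhs, int n)
{
    auto at = [&](int i, int j) {
        return (i < 0 || j < 0 || i >= n || j >= n) ? 0.0 : phi[j * n + i];
    };
    for (int color = 0; color < 2; ++color)
        for (int j = 0; j < n; ++j)
            for (int i = 0; i < n; ++i)
                if ((i + j) % 2 == color)
                    phi[j * n + i] = 0.25 * (at(i - 1, j) + at(i + 1, j)
                                           + at(i, j - 1) + at(i, j + 1) - rhs[j * n + i]);
}

// Tiled version: phi_in and phi_out must be different arrays.
void gsrb_tiled(const std::vector<double>& phi_in, std::vector<double>& phi_out,
                const std::vector<double>& rhs, int n)
{
    assert(phi_in.size() == phi_out.size() && &phi_in != &phi_out);
    const int ntiles = (n + tile - 1) / tile;
    #pragma omp parallel for collapse(2)
    for (int tj = 0; tj < ntiles; ++tj) {
        for (int ti = 0; ti < ntiles; ++ti) {
            const int i0 = ti * tile - halo;  // global index of local cell (0,0)
            const int j0 = tj * tile - halo;
            std::array<double, edge * edge> loc;  // private to the executing thread

            // Load phi and its halo once; cells outside the domain are zero.
            for (int lj = 0; lj < edge; ++lj)
                for (int li = 0; li < edge; ++li) {
                    const int i = i0 + li, j = j0 + lj;
                    const bool inside = i >= 0 && j >= 0 && i < n && j < n;
                    loc[lj * edge + li] = inside ? phi_in[j * n + i] : 0.0;
                }

            // Red sweep skips the outermost layer (m = 1); black sweep covers
            // only the tile interior (m = 2), so every neighbor is up to date.
            for (int color = 0; color < 2; ++color) {
                const int m = color + 1;
                for (int lj = m; lj < edge - m; ++lj)
                    for (int li = m; li < edge - m; ++li) {
                        const int i = i0 + li, j = j0 + lj;
                        if (i < 0 || j < 0 || i >= n || j >= n) continue;
                        if ((i + j) % 2 != color) continue;
                        const int c = lj * edge + li;
                        loc[c] = 0.25 * (loc[c - 1] + loc[c + 1]
                                       + loc[c - edge] + loc[c + edge] - rhs[j * n + i]);
                    }
            }

            // Write back only the cells this tile owns.
            for (int lj = halo; lj < edge - halo; ++lj)
                for (int li = halo; li < edge - halo; ++li) {
                    const int i = i0 + li, j = j0 + lj;
                    if (i < n && j < n) phi_out[j * n + i] = loc[lj * edge + li];
                }
        }
    }
}
```

In this sketch the red cells in the first halo layer are recomputed by neighboring tiles, trading a little redundant work for the absence of synchronization between threads. The code also compiles without OpenMP, in which case the pragma is ignored and the tiles run serially.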

With 2047^2 cells and 48 OMP threads, this PR gives a 78% speedup of HPMG compared to development.

[ ] Small enough (< few 100s of lines), otherwise it should probably be split into smaller PRs
[ ] Tested (describe the tests in the PR description)
[ ] Runs on GPU (basic: the code compiles and runs well with the new module)
[ ] Contains an automated test (checksum and/or comparison with theory)
[ ] Documented: all elements (classes and their members, functions, namespaces, etc.) are documented
[ ] Constified (All that can be const is const)
[ ] Code is clean (no unwanted comments)
[ ] Style and code conventions listed at the bottom of https://github.com/Hi-PACE/hipace are respected
[ ] Proper label and GitHub project, if applicable

Faster multigrid solve with OMP #1160