In this PR, the temporary density arrays that were used for the plasma current deposition are removed. Instead, thread safety is ensured by splitting the tiles into four groups such that tiles within a group are never adjacent, so their deposition regions (including guard cells) don't overlap. Shown below is the chi array after the first group of tiles has been deposited. This cleans up the code, since the temporary densities no longer have to be allocated and managed, and it gives a small performance improvement because the lockAdd into the main array is no longer necessary.
```C++
// Loop over the four tile groups: within a group, tiles are at least one
// full tile apart in x and y, so their depositions cannot race.
for (int tile_perm_x=0; tile_perm_x<2; ++tile_perm_x) {
    for (int tile_perm_y=0; tile_perm_y<2; ++tile_perm_y) {
#pragma omp parallel for collapse(2) if(do_tiling)
        for (int itilex=tile_perm_x; itilex<ntilex; itilex+=2) {
            for (int itiley=tile_perm_y; itiley<ntiley; itiley+=2) {
                // the index is transposed to be the same as in amrex::DenseBins::build
                const int tile_index = itilex * ntiley + itiley;
                // Deposit one tile at tile_index
            }
        }
    }
}
```
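Below is a minimal, standalone sketch (not the actual HiPACE++ deposition code) illustrating why this four-color loop is thread safe. The grid size, tile size, the array name `rho`, and the unit "deposit" are all hypothetical, and a guard width of one cell is assumed: tiles in the same group are a full tile apart, so their write regions (tile cells plus guard ring) are disjoint and plain additions into the shared array cannot race.

```C++
// Hypothetical sketch of four-color tile deposition into a shared array.
#include <algorithm>
#include <cstdio>
#include <vector>

int main () {
    const int ncellx = 64, ncelly = 64;   // hypothetical grid size
    const int tile_size = 8;              // hypothetical tile size (>= 2)
    const int ntilex = ncellx / tile_size;
    const int ntiley = ncelly / tile_size;
    std::vector<double> rho(ncellx * ncelly, 0.0);  // shared main array

    for (int tile_perm_x = 0; tile_perm_x < 2; ++tile_perm_x) {
    for (int tile_perm_y = 0; tile_perm_y < 2; ++tile_perm_y) {
        #pragma omp parallel for collapse(2)
        for (int itilex = tile_perm_x; itilex < ntilex; itilex += 2) {
        for (int itiley = tile_perm_y; itiley < ntiley; itiley += 2) {
            // Write region of this tile: its own cells plus one guard cell.
            // Tiles in the same group are tile_size cells apart, so these
            // index ranges never overlap between threads.
            const int ilo = std::max(itilex * tile_size - 1, 0);
            const int ihi = std::min((itilex + 1) * tile_size, ncellx - 1);
            const int jlo = std::max(itiley * tile_size - 1, 0);
            const int jhi = std::min((itiley + 1) * tile_size, ncelly - 1);
            for (int i = ilo; i <= ihi; ++i) {
                for (int j = jlo; j <= jhi; ++j) {
                    rho[i * ncelly + j] += 1.0;  // plain add, no atomics/lockAdd
                }
            }
        }}
    }}

    // Simple checksum to show the program ran and the result is deterministic.
    double sum = 0.0;
    for (double v : rho) sum += v;
    std::printf("checksum = %g\n", sum);
    return 0;
}
```

Compiled with or without OpenMP (e.g. `g++ -fopenmp sketch.cpp`), the checksum should be identical for any thread count, because no two threads within a group ever write to the same cell.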
Performance for a 2047*2047*300 grid, with exactly one tile per thread, on dual 48-core CPUs:
- [ ] Small enough (< few 100s of lines), otherwise it should probably be split into smaller PRs
- [ ] Tested (describe the tests in the PR description)
- [ ] Runs on GPU (basic: the code compiles and runs well with the new module)
- [ ] Contains an automated test (checksum and/or comparison with theory)
- [ ] Documented: all elements (classes and their members, functions, namespaces, etc.) are documented