eth-cscs / DLA-Future

DLA-Future
https://eth-cscs.github.io/DLA-Future/master/
BSD 3-Clause "New" or "Revised" License

Revert "boost priority threads for mpi" (Partially revert #1036) #1071

Closed rasolca closed 9 months ago

rasolca commented 9 months ago

This partially reverts commit 510ccd54cbaee2fb23ce9c01c64ae18cacedccd6.

rasolca commented 9 months ago

cscs-ci run

codecov-commenter commented 9 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (414374d) 94.05% compared to head (75b2ed2) 94.06%.


Additional details and impacted files

```diff
@@           Coverage Diff           @@
##           master    #1071   +/-   ##
=======================================
  Coverage   94.05%   94.06%
=======================================
  Files         148      148
  Lines        9201     9199    -2
  Branches     1164     1164
=======================================
- Hits         8654     8653    -1
+ Misses        324      323    -1
  Partials      223      223
```


msimberg commented 9 months ago

On clariden, exporting `FI_CXI_RDZV_THRESHOLD=131072` also appears to work around the hang (`FI_CXI_RDZV_THRESHOLD=65536` is not enough), at least on a run like this: `OMP_NUM_THREADS=1 srun -p nvgpu -N 2 --time 00:02:00 -u --uenv-file=$UENV_FILE --cpu-bind=mask_cpu:ffff000000000000ffff000000000000,ffff000000000000ffff00000000,ffff000000000000ffff0000,ffff000000000000ffff -n 8 gpu2ranks_slurm_cuda miniapp/miniapp_bt_band_to_tridiag --type d --m 30097 --n 512 --mb 1024 --nb 128 --b 128 --grid-rows 4 --grid-cols 2 --nruns 1 --dlaf:bt-band-to-tridiag-hh-apply-group-size=128 --dlaf:band-to-tridiag-1d-block-size-base=2048 --nwarmups 0`.
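
For anyone trying the workaround above, a minimal sketch of setting the variable in the launching shell might look like the following. It assumes the default `srun` behaviour of propagating the caller's environment to the job steps; the `env | grep` check is only an illustration and is not part of the original run.

```shell
# Sketch of the workaround: raise the libfabric CXI rendezvous threshold
# in the launching shell before calling srun.
# 131072 is the value reported to avoid the hang; 65536 was reported as not enough.
export FI_CXI_RDZV_THRESHOLD=131072

# Illustrative check (assumption, not from the original run): confirm the
# variable is exported so that child processes, including srun, inherit it.
env | grep '^FI_CXI_RDZV_THRESHOLD='
```

With the variable exported this way, the `srun` command from the comment can be run unchanged in the same shell.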