Without a barrier, any thread that arrives here early can spend a comparatively long time waiting, and that wait is counted as part of the exchange, so the exchange time shows up as quite long.
In the baseline implementation the pack/unpack can be very slow, so even minor variability in that time can show up as a very long exchange iteration.
https://github.com/cwpearson/tempi/blob/72811c815de085bbd8db566f0ad95fb15d3b65ab/bin/bench_halo_exchange.cpp#L583-L592
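To make the effect concrete, here is a toy timing model rather than anything taken from the benchmark itself. The function names, the `arrival` vector and `exchangeCost` are all hypothetical, and the model assumes the exchange cannot complete until every participant has arrived, which is a simplification.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy model, not code from the benchmark.
// arrival[i]   : time at which participant i finishes pack and reaches the exchange
// exchangeCost : time the exchange itself takes once everyone is present (assumed)

// No barrier: participant i starts its timer on arrival, so the measured
// time also includes waiting for the slowest participant.
double measuredWithoutBarrier(const std::vector<double>& arrival,
                              std::size_t i, double exchangeCost) {
    assert(i < arrival.size());
    const double last = *std::max_element(arrival.begin(), arrival.end());
    const double start = arrival[i];        // timer starts when i arrives
    const double end = last + exchangeCost; // exchange finishes after the last arrival
    return end - start;
}

// Barrier before the timer: every participant starts timing at the last
// arrival, so only the exchange itself is measured.
double measuredWithBarrier(const std::vector<double>& arrival,
                           std::size_t i, double exchangeCost) {
    assert(i < arrival.size());
    const double last = *std::max_element(arrival.begin(), arrival.end());
    const double start = last;              // barrier releases everyone together
    const double end = last + exchangeCost;
    return end - start;
}
```

With arrivals of `{10.0, 10.5, 12.0}` and an exchange cost of `1.0`, the earliest participant measures `3.0` without the barrier and `1.0` with it: the extra `2.0` is pack-time variability, not exchange time. In an MPI benchmark the barrier would typically be an `MPI_Barrier` call placed just before the timer is read.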