In the Turek-Hron benchmark, global_to_overlap_finish() takes up 1.59% of total runtime (per callgrind), and about half of that is spent doing binary search. We could probably make this a lot faster by doing all of the index translation ahead of time (that way we could also use local_element()).
There are similar performance issues in overlap_to_global_start(), which uses index_within_set_binary_search, though that's only about half as bad (0.3% of total runtime).
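To make the "translate ahead of time" idea concrete, here is a minimal sketch. It does not use the actual classes involved: SortedIndices, local_index_of(), build_translation(), and scatter_cached() are hypothetical names, and a sorted std::vector stands in for the index set. The point is only that the binary search runs once per index at setup, and the hot path then uses cached local indices.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an index set: a sorted list of the global
// indices stored on this process. This is not the real library type.
using SortedIndices = std::vector<std::size_t>;

// One binary search per lookup. This is the cost we only want to pay once.
std::size_t local_index_of(const SortedIndices &set, const std::size_t global)
{
  const auto it = std::lower_bound(set.begin(), set.end(), global);
  assert(it != set.end() && *it == global); // index must be in the set
  return static_cast<std::size_t>(it - set.begin());
}

// Setup: translate every global index to a local index ahead of time.
std::vector<std::size_t>
build_translation(const SortedIndices            &set,
                  const std::vector<std::size_t> &globals)
{
  std::vector<std::size_t> translation;
  translation.reserve(globals.size());
  for (const std::size_t g : globals)
    translation.push_back(local_index_of(set, g));
  return translation;
}

// Hot path: copy entries using the cached local indices, with no searching.
void scatter_cached(const std::vector<double>      &src,
                    const std::vector<std::size_t> &translation,
                    std::vector<double>            &dst)
{
  assert(translation.size() == dst.size());
  for (std::size_t i = 0; i < translation.size(); ++i)
    dst[i] = src[translation[i]];
}
```

The translation would be built once, when the scatter object is set up, and reused on every call; the tradeoff is one extra array of indices per scatter in exchange for dropping the per-entry search.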
Another thing worth checking is the ghost update in overlap_to_global_finish(): in the same benchmark it takes up about 1% of total runtime, all of which is spent in MPI_Waitall.