drwells / fiddle

Investigate performance of Scatter #142

Closed drwells closed 1 year ago

drwells commented 1 year ago

In the Turek-Hron benchmark, global_to_overlap_finish() takes up 1.59% of total runtime (per callgrind), and about half of that is spent doing binary search. We could probably make this a lot faster by doing all of the index translation ahead of time, which would also let us use local_element().

There is a similar performance issue in overlap_to_global_start(), which uses index_within_set_binary_search, though that is only about half as bad (0.3% of total runtime).
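
A minimal sketch of the "translate indices ahead of time" idea, not the actual fiddle implementation. Everything here is a hypothetical stand-in: `GlobalIndex`, `index_within_set()`, and `PrecomputedScatter` are illustrative names, a sorted `std::vector` plays the role of the index set, and plain vector indexing plays the role of local_element(). The point is only that the binary search runs once per index at setup, so each later scatter is a direct lookup.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a global degree-of-freedom index.
using GlobalIndex = std::size_t;

// Position of `global` within a sorted set of locally stored indices,
// found by binary search (the per-entry cost we want to pay only once).
std::size_t index_within_set(const std::vector<GlobalIndex> &sorted_set,
                             const GlobalIndex               global)
{
  const auto it =
    std::lower_bound(sorted_set.begin(), sorted_set.end(), global);
  assert(it != sorted_set.end() && *it == global);
  return static_cast<std::size_t>(it - sorted_set.begin());
}

// Hypothetical scatter helper: translates global indices to local ones in
// the constructor so that the scatter itself does no searching.
class PrecomputedScatter
{
public:
  PrecomputedScatter(const std::vector<GlobalIndex> &locally_stored,
                     const std::vector<GlobalIndex> &overlap_dofs)
  {
    local_indices.reserve(overlap_dofs.size());
    for (const GlobalIndex dof : overlap_dofs)
      local_indices.push_back(index_within_set(locally_stored, dof));
  }

  // Copy values into overlap ordering using only direct local indexing
  // (the analogue of calling local_element() on a distributed vector).
  void global_to_overlap(const std::vector<double> &local_values,
                         std::vector<double>       &overlap_values) const
  {
    overlap_values.resize(local_indices.size());
    for (std::size_t i = 0; i < local_indices.size(); ++i)
      overlap_values[i] = local_values[local_indices[i]];
  }

private:
  // One precomputed local index per overlap entry.
  std::vector<std::size_t> local_indices;
};
```

The trade-off is one extra index array per scatter object in exchange for removing an O(log n) search per entry from every scatter call.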

drwells commented 1 year ago

Another thing worth checking is the ghost update in overlap_to_global_finish(): in the same benchmark it takes up about 1% of total runtime, all of which is spent in MPI_Waitall.