Closed JiakunYan closed 1 year ago
More details:
I also implemented tracing for HPX. The blue bars show the number of messages sent in each 0.1-second interval. The orange line shows the total bytes sent in each 0.1-second interval.
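For readers who want to reproduce this kind of plot, here is a minimal sketch of how per-interval message counts and byte totals could be computed from a list of send events. It is not the tracing code used for this PR: `SendEvent`, `Bin`, and `bin_events` are hypothetical names, and the only thing taken from the description above is the 0.1-second interval width.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical trace record: when a message was sent and how large it was.
struct SendEvent {
    double time_s;      // seconds since the start of the run
    std::size_t bytes;  // message size in bytes
};

// Totals for one fixed-width time interval.
struct Bin {
    std::size_t messages = 0;  // what a bar would show
    std::size_t bytes = 0;     // what the line would show
};

// Groups events into consecutive intervals of width_s seconds.
// Bin i covers the half-open range [i * width_s, (i + 1) * width_s).
std::vector<Bin> bin_events(const std::vector<SendEvent>& events,
                            double width_s = 0.1) {
    assert(width_s > 0.0);
    std::vector<Bin> bins;
    for (const SendEvent& e : events) {
        if (e.time_s < 0.0) continue;  // ignore events before the start
        const auto idx = static_cast<std::size_t>(e.time_s / width_s);
        if (idx >= bins.size()) bins.resize(idx + 1);  // empty bins stay zero
        bins[idx].messages += 1;
        bins[idx].bytes += e.bytes;
    }
    return bins;
}
```

Intervals with no traffic keep zero counts, so quiet periods show up as gaps in the bars rather than being dropped from the output.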
Rank 0, before the change:
Rank 0, after the change:
Rank 21, before the change:
Rank 21, after the change:
Rank 31, before the change:
Rank 31, after the change:
Before the change, there are periods when Octo-Tiger sends almost no messages. With some prints, I found that Octo-Tiger was performing “form tree” during those periods. There are also small spikes of messages within these “form tree” periods, and the timing of the spikes varies from rank to rank. Therefore, I think Octo-Tiger was parallelizing the “form tree” task poorly across ranks, and this PR greatly improves that.
I also tested with HPX MPI parcelport with max_level=6 (max_level=7 didn't finish within 5 minutes). The "total time" improved from ~14 seconds to 11.7 seconds.
Thanks for all your work on this! I think we can merge this (the one failing test should be unrelated to this PR).
@hkaiser Actually I am curious: what does it mean to have a sync launch policy for dataflow? Based on my understanding, the dataflow just creates a thread that will be ready to run once all the input futures are ready?
This PR changes the HPX launch policy of the dataflow in form_tree from sync to async. This change greatly improves the scalability of Octo-Tiger.
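As a rough illustration of why the launch policy matters, the sketch below uses the C++ standard library as an analogy. It is not HPX code and not the actual `form_tree` change. The assumption is that a sync policy runs the task body inline on a thread that is already busy, while an async policy hands it to a separate thread, which loosely corresponds to `std::launch::deferred` versus `std::launch::async`. The helper names `thread_that_runs` and `runs_inline` are made up for this example.

```cpp
#include <cassert>
#include <future>
#include <thread>

// Runs a trivial task under the given policy and reports which thread
// actually executed the task body.
std::thread::id thread_that_runs(std::launch policy) {
    std::future<std::thread::id> f =
        std::async(policy, [] { return std::this_thread::get_id(); });
    assert(f.valid());
    // With std::launch::deferred the task body runs here, inside get(),
    // on the calling thread. With std::launch::async it runs on a new thread.
    return f.get();
}

// True when the task body executed on the calling thread.
bool runs_inline(std::launch policy) {
    return thread_that_runs(policy) == std::this_thread::get_id();
}
```

In this analogy, `runs_inline(std::launch::deferred)` is true and `runs_inline(std::launch::async)` is false. If many tree-building steps run inline, they serialize on whichever thread triggers them; spawning them as separate tasks lets the scheduler overlap them, which would be consistent with the improvement reported here.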
Experiment setting: SDSC Expanse, AMD EPYC 7742 (128 core/node), 32 nodes, max level is 7, stop step is 5, rotating star, HPX LCI parcelport.
Before this change, the execution time is
In particular
The time spent on forming tree is 14.5 s.
After the change, the execution time is
In particular