Greatly improve the performance of form_tree when regridding.

JiakunYan commented 1 year ago

This PR changes the HPX launch policy of the dataflow in form_tree from sync to async. This change greatly improves the scalability of Octo-Tiger.
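For readers less familiar with launch policies, here is a minimal sketch that uses only the C++ standard library, not HPX. It is an analogy, under the assumption that the key difference is where the work runs: inline on an existing thread, or as a separately scheduled task. It does not claim that `hpx::launch::sync` behaves like `std::launch::deferred`, and the helper names `ran_on_calling_thread` and `check_policies` are hypothetical.

```cpp
#include <cassert>
#include <future>
#include <thread>

// Returns true if a task started with the given std::launch policy
// executed on the same thread that later calls get() on its future.
bool ran_on_calling_thread(std::launch policy) {
    const std::thread::id caller = std::this_thread::get_id();
    std::future<std::thread::id> f =
        std::async(policy, [] { return std::this_thread::get_id(); });
    return f.get() == caller;
}

// Usage sketch: both checks hold for a conforming standard library.
void check_policies() {
    // deferred: the task runs lazily, inline on the thread that calls get()
    assert(ran_on_calling_thread(std::launch::deferred));
    // async: the task runs as if on a new thread of execution
    assert(!ran_on_calling_thread(std::launch::async));
}
```

With inline execution, work is serialized on whichever thread happens to trigger it. With asynchronous execution, a scheduler is free to run independent tasks concurrently.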

Experiment setting: SDSC Expanse, AMD EPYC 7742 (128 cores/node), 32 nodes, max level 7, stop step 5, rotating star, HPX LCI parcelport.

Before this change, the execution times were:

   Total: 61.0584
   Computation: 42.8371 (70.1575 %)
   Regrid: 34.1035 (55.8538 %)
   Computation + Regrid: 76.9405 (126.011 %)

In particular, the log for a regrid step shows:

checking for refinement
regridding
Regridded tree in 0.048255 seconds
rebalancing 196809 nodes with 172208 leaves
Rebalanced tree in 0.127481 seconds
forming tree connections
32248 amr boundaries
Formed tree in 14.557350 seconds
solving gravity
regrid done in 15.936172 seconds

Forming the tree takes about 14.56 s of the 15.94 s regrid.

After the change, the execution times are:

   Total: 51.4378
   Computation: 43.717 (84.9899 %)
   Regrid: 15.2881 (29.7216 %)
   Computation + Regrid: 59.0051 (114.712 %)

In particular, the log for a regrid step shows:

regridding
Regridded tree in 0.049329 seconds
rebalancing 196809 nodes with 172208 leaves
Rebalanced tree in 0.120191 seconds
forming tree connections
32248 amr boundaries
Formed tree in 6.408102 seconds
solving gravity
regrid done in 7.681241 seconds
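The before/after numbers above can be turned into speedups with a few lines of arithmetic. The sketch below copies the timings from the logs in this comment. The helper names `speedup` and `saved` are only illustrative and are not part of Octo-Tiger.

```cpp
#include <cassert>

// Speedup of a phase: time before the change divided by time after it.
double speedup(double before_s, double after_s) {
    assert(before_s > 0.0 && after_s > 0.0);
    return before_s / after_s;
}

// Seconds saved by the change for one phase.
double saved(double before_s, double after_s) {
    return before_s - after_s;
}

// Timings copied from the logs above (seconds).
constexpr double form_tree_before = 14.557350, form_tree_after = 6.408102;
constexpr double regrid_before    = 15.936172, regrid_after    = 7.681241;
constexpr double total_before     = 61.0584,   total_after     = 51.4378;
```

For example, `speedup(form_tree_before, form_tree_after)` is roughly 2.27, `speedup(regrid_before, regrid_after)` is roughly 2.07, and `speedup(total_before, total_after)` is roughly 1.19.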

JiakunYan commented 1 year ago

More details:

I also implemented tracing for HPX. In the traces, the blue bars show the number of messages sent in each 0.1-second interval, and the orange line shows the total bytes sent in the same interval.

[Figures: message and byte traces for ranks 0, 21, and 31, each shown before and after the change.]

Before the change, there are periods when Octo-Tiger sends almost no messages. By adding some print statements, I found that Octo-Tiger was performing “form tree” during those periods. There are also small spikes of messages within these “form tree” periods, and the timing of the spikes differs from rank to rank. I therefore think Octo-Tiger was parallelizing the “form tree” task poorly across ranks, and this PR greatly improves that.
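The kind of analysis described above can be sketched as follows. This is a minimal illustration, not the tracing code used for this PR: it assumes a trace is available as a list of (timestamp, bytes) send events. The names `SendEvent`, `Bucket`, `bin_events`, and `longest_quiet_run` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// One traced send: when it happened and how many bytes it carried (assumed format).
struct SendEvent { double time_s; std::size_t bytes; };

// Aggregate for one time interval: message count (bars) and byte total (line).
struct Bucket { std::size_t messages = 0; std::size_t bytes = 0; };

// Bin events into fixed-width intervals covering [0, duration_s).
std::vector<Bucket> bin_events(const std::vector<SendEvent>& events,
                               double duration_s, double width_s = 0.1) {
    assert(duration_s > 0.0 && width_s > 0.0);
    const auto n = static_cast<std::size_t>(std::ceil(duration_s / width_s));
    std::vector<Bucket> buckets(n);
    for (const SendEvent& e : events) {
        if (e.time_s < 0.0 || e.time_s >= duration_s) continue;  // outside the window
        auto i = static_cast<std::size_t>(e.time_s / width_s);
        if (i >= n) i = n - 1;  // guard against floating-point rounding at the edge
        ++buckets[i].messages;
        buckets[i].bytes += e.bytes;
    }
    return buckets;
}

// Length, in buckets, of the longest run with at most `threshold` messages each.
std::size_t longest_quiet_run(const std::vector<Bucket>& buckets,
                              std::size_t threshold) {
    std::size_t best = 0, current = 0;
    for (const Bucket& b : buckets) {
        current = (b.messages <= threshold) ? current + 1 : 0;
        if (current > best) best = current;
    }
    return best;
}
```

A long quiet run on one rank while other ranks show spikes would correspond to the poorly overlapped “form tree” periods described above.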

JiakunYan commented 1 year ago

I also tested with the HPX MPI parcelport at max_level=6 (max_level=7 didn't finish within 5 minutes). The "total time" improved from ~14 seconds to 11.7 seconds.

G-071 commented 1 year ago

Thanks for all your work on this! I think we can merge this (the one failing test should be unrelated to this PR).

JiakunYan commented 1 year ago

@hkaiser Actually I am curious: what does it mean to have a sync launch policy for dataflow? Based on my understanding, the dataflow just creates a thread that will be ready to run once all the input futures are ready?

STEllAR-GROUP / octotiger

Greatly improve the performance of form_tree when regridding. #443