DARMA-tasking / LB-analysis-framework

Analysis framework for exploring, testing, and comparing load balancing strategies

#414 modify traversal of ranks to better emulate asynchrony #423

Closed ppebay closed 1 year ago

ppebay commented 1 year ago

Resolves #414

Also improved the code as a result of the review of the control flow, by factoring out commonalities between the 2 transfer strategies and moving them to the abstract base strategy.

Resolves #375 as well, as discovered by @cwschilly
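To illustrate the kind of refactoring described above, here is a minimal sketch of shared bookkeeping hoisted into an abstract base class with two concrete strategies. All class names, method names, and the toy acceptance rules are hypothetical and chosen for illustration only; they are not the actual LBAF code.

```python
from abc import ABC, abstractmethod


class TransferStrategyBase(ABC):
    """Hypothetical base class: holds the logic common to all strategies."""

    def __init__(self):
        self.n_transfers = 0
        self.n_rejects = 0

    def record(self, accepted: bool) -> None:
        """Shared bookkeeping, previously duplicated in each strategy."""
        if accepted:
            self.n_transfers += 1
        else:
            self.n_rejects += 1

    def rejection_rate(self) -> float:
        """Percentage of rejected transfers (0.0 if none were attempted)."""
        total = self.n_transfers + self.n_rejects
        return 100.0 * self.n_rejects / total if total else 0.0

    @abstractmethod
    def execute(self, candidates, limit):
        """Strategy-specific transfer step."""


class RecursiveStrategy(TransferStrategyBase):
    def execute(self, candidates, limit):
        # Toy rule (assumption): consider each object load individually.
        for load in candidates:
            self.record(load <= limit)


class ClusteringStrategy(TransferStrategyBase):
    def execute(self, candidates, limit):
        # Toy rule (assumption): consider each cluster of loads as a whole.
        for cluster in candidates:
            self.record(sum(cluster) <= limit)
```

With this layout, each concrete strategy only implements `execute`, while counting and reporting live in one place.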

ppebay commented 1 year ago

Randomizing the traversal order typically results in different behavior of the LB iterations (below with the Recursive strategy), although we note that an optimal configuration is nonetheless attained:

Screen Shot 2023-08-03 at 12 51 25 AM Screen Shot 2023-08-03 at 12 51 32 AM Screen Shot 2023-08-03 at 12 51 40 AM

In contrast, when deterministic_transfer is set to True, it is serendipitous but nonetheless observable that always starting with rank 0 and continuing in standard index order allows for faster convergence:

Screen Shot 2023-08-03 at 1 45 29 AM Screen Shot 2023-08-03 at 1 45 35 AM
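As a rough sketch of the two traversal modes compared above: the function name `rank_traversal_order` and its signature are assumptions made for illustration, not the PR's actual code. Only the `deterministic_transfer` flag comes from the discussion.

```python
import random


def rank_traversal_order(n_ranks, deterministic_transfer, rng=None):
    """Return the order in which ranks are visited during the transfer phase.

    Hypothetical helper: index order (0, 1, 2, ...) when deterministic,
    a random permutation otherwise, to better emulate asynchrony.
    """
    order = list(range(n_ranks))
    if not deterministic_transfer:
        # Use the supplied generator if given, else the module-level one.
        (rng or random).shuffle(order)
    return order
```

Passing a seeded `random.Random` instance as `rng` makes a randomized run reproducible, which can help when comparing LB iterations across runs.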

ppebay commented 1 year ago

In the case of the Clustering strategy, by adding another "layer" of randomization (in cluster selection, as discussed with @lifflander) we can indeed sometimes reproduce less-than-optimal results:

Screen Shot 2023-08-03 at 2 17 38 AM Screen Shot 2023-08-03 at 2 18 04 AM

In contrast, when deterministic_transfer is set to True, we always find the known optimal value of 90.

ppebay commented 1 year ago

@cwschilly I am leaving the PR in this state. When you get a chance, can you please look into the remaining CI failures? Thanks

cwschilly commented 1 year ago

@ppebay I just had to rebase on develop to get @tlamonthezie's CI fix (#420) -- the PR now passes all tests

cwschilly commented 1 year ago

@lifflander @ppebay This PR is ready for review

ppebay commented 1 year ago

As discussed, I will now let @cwschilly test the CI with ordered sets for the targets when the transfer phase is deterministic
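A minimal sketch of the idea, assuming targets are hashable rank objects kept in a plain `set`: Python's built-in `set` does not guarantee an iteration order, so sorting by rank ID gives a reproducible traversal when determinism is requested. The `Rank` class and `iterate_targets` helper are hypothetical names, not the PR's actual code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rank:
    """Hypothetical stand-in for a rank object, identified by its ID."""
    rank_id: int


def iterate_targets(targets, deterministic_transfer):
    """Return targets as a list, ordered by rank ID when deterministic."""
    if deterministic_transfer:
        return sorted(targets, key=lambda r: r.rank_id)
    # Non-deterministic case: keep whatever order the set yields.
    return list(targets)
```

For example, `iterate_targets({Rank(3), Rank(1), Rank(2)}, True)` yields the ranks with IDs 1, 2, 3 in that order, regardless of how the set was built.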

cwschilly commented 1 year ago

@ppebay @lifflander I have run the acceptance test on this PR several times and analyzed the output, comparing with other runs that have failed. This PR consistently passes the test. I suspect it has to do with changing deterministic_transfer to True in conf.yaml. I believe this resolves #375