DARMA-tasking / LB-analysis-framework

Analysis framework for exploring, testing, and comparing load balancing strategies

#446: Resolve performance issue with subclustering improvements #467

Closed ppebay closed 10 months ago

ppebay commented 10 months ago

Resolves #446

Related to (and a sequel of) #464 and PR #465

ppebay commented 10 months ago

Improvements to sub-clustering subsampling validated; the too-complex case now passes in reasonable time, even on a minimal platform (1.6 GHz dual-core i5):

[Images: out_00, out_16]

nlslatt commented 10 months ago

@ppebay I tried using this PR and noticed the imbalance getting worse from one iteration to the next at the subcluster stage (0.0481416 to 0.0514804 to 0.0737504):

[lbsInformAndTransferAlgorithm] Starting iteration 11 with total work of 362751.7461443505
[lbsInformAndTransferAlgorithm] Sent 512 initial information messages with fanout=4
[lbsInformAndTransferAlgorithm] Average number of peers known to ranks: 127.9921875 (99.99% of 128)
[lbsTransferStrategyBase] Executing transfer phase with average load: 2833.9980167527383
[lbsClusteringTransferStrategy] Built 90 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Transferring subcluster with 16 objects to rank 19
[lbsClusteringTransferStrategy] Built 82 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Transferring subcluster with 12 objects to rank 19
[lbsClusteringTransferStrategy] Built 89 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Transferring subcluster with 12 objects to rank 19
[lbsClusteringTransferStrategy] Swapped 157 cluster pairs amongst 166068 tries (0.09%)
[lbsClusteringTransferStrategy] Transferred 3 subcluster amongst 193 tries (1.55%)
[lbsInformAndTransferAlgorithm] Transferred 3253 objects amongst 2226401 proposed (99.85%)
[lbsInformAndTransferAlgorithm] Iteration 11 completed (0 skipped ranks) in 338.700 seconds
[lbsStatistics] Descriptive statistics of iteration 11 rank work:
[lbsStatistics]     cardinality: 128 sum: 362752 imbalance: 0.0481416
[lbsStatistics]     minimum: 2702.6 average: 2834 maximum: 2970.43
[lbsStatistics]     standard deviation: 63.2745 variance: 4003.66
[lbsStatistics]     skewness: 0.412394 kurtosis: 1.94129
[lbsInformAndTransferAlgorithm] Starting iteration 12 with total work of 362751.7461443505
[lbsInformAndTransferAlgorithm] Sent 512 initial information messages with fanout=4
[lbsInformAndTransferAlgorithm] Average number of peers known to ranks: 127.90625 (99.93% of 128)
[lbsTransferStrategyBase] Executing transfer phase with average load: 2833.9980167527383
[lbsClusteringTransferStrategy] Built 59 subclusters from 7 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Transferring subcluster with 15 objects to rank 19
[lbsClusteringTransferStrategy] Built 82 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Transferring subcluster with 12 objects to rank 19
[lbsClusteringTransferStrategy] Built 86 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Transferring subcluster with 12 objects to rank 19
[lbsClusteringTransferStrategy] Built 87 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Transferring subcluster with 9 objects to rank 19
[lbsClusteringTransferStrategy] Built 81 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Transferring subcluster with 12 objects to rank 19
[lbsClusteringTransferStrategy] Swapped 145 cluster pairs amongst 245216 tries (0.06%)
[lbsClusteringTransferStrategy] Transferred 5 subcluster amongst 361 tries (1.39%)
[lbsInformAndTransferAlgorithm] Transferred 2882 objects amongst 3277546 proposed (99.91%)
[lbsInformAndTransferAlgorithm] Iteration 12 completed (0 skipped ranks) in 499.786 seconds
[lbsStatistics] Descriptive statistics of iteration 12 rank work:
[lbsStatistics]     cardinality: 128 sum: 362752 imbalance: 0.0514804
[lbsStatistics]     minimum: 2751.11 average: 2834 maximum: 2979.89
[lbsStatistics]     standard deviation: 54.891 variance: 3013.02
[lbsStatistics]     skewness: 0.485104 kurtosis: 1.87609
[lbsInformAndTransferAlgorithm] Starting iteration 13 with total work of 362751.7461443505
[lbsInformAndTransferAlgorithm] Sent 512 initial information messages with fanout=4
[lbsInformAndTransferAlgorithm] Average number of peers known to ranks: 128.0 (100.00% of 128)
[lbsTransferStrategyBase] Executing transfer phase with average load: 2833.9980167527383
[lbsClusteringTransferStrategy] Built 80 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Transferring subcluster with 10 objects to rank 19
[lbsClusteringTransferStrategy] Built 86 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Transferring subcluster with 10 objects to rank 19
[lbsClusteringTransferStrategy] Built 83 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Transferring subcluster with 7 objects to rank 19
[lbsClusteringTransferStrategy] Built 84 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Transferring subcluster with 11 objects to rank 19
[lbsClusteringTransferStrategy] Built 79 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Transferring subcluster with 9 objects to rank 19
[lbsClusteringTransferStrategy] Built 84 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Transferring subcluster with 11 objects to rank 19
[lbsClusteringTransferStrategy] Built 82 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Transferring subcluster with 10 objects to rank 19
[lbsClusteringTransferStrategy] Built 76 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Transferring subcluster with 10 objects to rank 19
[lbsClusteringTransferStrategy] Built 83 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Transferring subcluster with 10 objects to rank 19
[lbsClusteringTransferStrategy] Swapped 133 cluster pairs amongst 334936 tries (0.04%)
[lbsClusteringTransferStrategy] Transferred 9 subcluster amongst 716 tries (1.26%)
[lbsInformAndTransferAlgorithm] Transferred 3086 objects amongst 4427700 proposed (99.93%)
[lbsInformAndTransferAlgorithm] Iteration 13 completed (0 skipped ranks) in 745.310 seconds
[lbsStatistics] Descriptive statistics of iteration 13 rank work:
[lbsStatistics]     cardinality: 128 sum: 362752 imbalance: 0.0737504
[lbsStatistics]     minimum: 2768.71 average: 2834 maximum: 3043.01
[lbsStatistics]     standard deviation: 50.0139 variance: 2501.39
[lbsStatistics]     skewness: 1.08659 kurtosis: 4.17963

I noticed that all subcluster transfers are to rank 19 for all three subcluster iterations. I wonder if something isn't being updated correctly. I'm going to re-run with additional iterations to see if the pattern holds.
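One way to check whether the pattern holds over a longer run is to tally the destination ranks straight from the log. The following is a minimal sketch, assuming the log lines keep the "Transferring subcluster with N objects to rank R" wording shown above; the function name and the sample list are hypothetical and not part of LBAF.

```python
import re
from collections import Counter

# Matches the transfer lines as they appear in the log excerpt above
TRANSFER_RE = re.compile(r"Transferring subcluster with (\d+) objects to rank (\d+)")

def tally_destinations(log_lines):
    """Count how many subcluster transfers went to each destination rank."""
    counts = Counter()
    for line in log_lines:
        match = TRANSFER_RE.search(line)
        if match:
            counts[int(match.group(2))] += 1
    return counts

# Lines copied from iteration 11 of the log above
sample = [
    "[lbsClusteringTransferStrategy] Built 90 subclusters from 9 clusters in 0.001 seconds",
    "[lbsClusteringTransferStrategy] Transferring subcluster with 16 objects to rank 19",
    "[lbsClusteringTransferStrategy] Transferring subcluster with 12 objects to rank 19",
    "[lbsClusteringTransferStrategy] Transferring subcluster with 12 objects to rank 19",
]
print(tally_destinations(sample))
```

A tally with a single key across many iterations would confirm that every transfer targets the same rank.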

ppebay commented 10 months ago

@nlslatt this clearly looks like a bug, thanks for the finding. Is it with a dataset that I already have?

nlslatt commented 10 months ago

> @nlslatt this clearly looks like a bug, thanks for the finding. Is it with a dataset that I already have?

No, this is with the dataset I used to evaluate the cluster swaps improvement. This isn't easily reproducible even with my dataset. Perhaps we need a deterministically non-deterministic mode for debugging. :)
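A "deterministically non-deterministic" mode could be as simple as drawing every random choice from one seeded generator. This is only a sketch of the idea using Python's standard `random.Random`; the function `pick_targets` and its parameters are hypothetical, and it is not known from this thread whether LBAF exposes a seed option.

```python
import random

def pick_targets(candidates, n_picks, seed=None):
    """Draw n_picks pseudo-random targets; a fixed seed makes the draw repeatable."""
    # A private generator avoids interference from other users of the global RNG
    rng = random.Random(seed)
    return [rng.choice(candidates) for _ in range(n_picks)]

ranks = list(range(128))
first = pick_targets(ranks, 5, seed=42)
second = pick_targets(ranks, 5, seed=42)
print(first == second)  # same seed, same sequence of picks
```

With the seed recorded in the log, a run that goes wrong could be replayed exactly.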

This time, I'm seeing an attempt to use subclusters on iteration 10. It does not transfer any of the built subclusters but does swap cluster pairs, improving the imbalance. Then, on iteration 11, it builds many more subclusters (taking a lot longer) than in my previous run but only transfers one of them, worsening the imbalance. It still seems that things go wrong on the second iteration that considers subclusters. I'm still leaning toward something not being updated correctly from iteration to iteration, but I haven't looked at your code. I may have to let it run for another hour or two before I have more to report.

[lbsInformAndTransferAlgorithm] Starting iteration 9 with total work of 362751.74614435044
[lbsInformAndTransferAlgorithm] Sent 512 initial information messages with fanout=4
[lbsInformAndTransferAlgorithm] Average number of peers known to ranks: 127.8125 (99.85% of 128)
[lbsTransferStrategyBase] Executing transfer phase with average load: 2833.998016752738
[lbsClusteringTransferStrategy] Swapped 195 cluster pairs amongst 26587 tries (0.73%)
[lbsInformAndTransferAlgorithm] Transferred 5830 objects amongst 404312 proposed (98.56%)
[lbsInformAndTransferAlgorithm] Iteration 9 completed (0 skipped ranks) in 49.973 seconds
[lbsStatistics] Descriptive statistics of iteration 9 rank work:
[lbsStatistics]     cardinality: 128 sum: 362752 imbalance: 0.0575055
[lbsStatistics]     minimum: 2494.39 average: 2834 maximum: 2996.97
[lbsStatistics]     standard deviation: 96.1849 variance: 9251.53
[lbsStatistics]     skewness: -0.248876 kurtosis: 2.80818
[lbsInformAndTransferAlgorithm] Starting iteration 10 with total work of 362751.74614435044
[lbsInformAndTransferAlgorithm] Sent 512 initial information messages with fanout=4
[lbsInformAndTransferAlgorithm] Average number of peers known to ranks: 128.0 (100.00% of 128)
[lbsTransferStrategyBase] Executing transfer phase with average load: 2833.998016752738
[lbsClusteringTransferStrategy] Built 90 subclusters from 9 clusters in 0.002 seconds
[lbsClusteringTransferStrategy] Built 89 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 90 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 88 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 59 subclusters from 7 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Swapped 156 cluster pairs amongst 144315 tries (0.11%)
[lbsClusteringTransferStrategy] Transferred 0 subcluster amongst 416 tries (0.00%)
[lbsInformAndTransferAlgorithm] Transferred 3614 objects amongst 2088115 proposed (99.83%)
[lbsInformAndTransferAlgorithm] Iteration 10 completed (0 skipped ranks) in 357.724 seconds
[lbsStatistics] Descriptive statistics of iteration 10 rank work:
[lbsStatistics]     cardinality: 128 sum: 362752 imbalance: 0.0434878
[lbsStatistics]     minimum: 2718.87 average: 2834 maximum: 2957.24
[lbsStatistics]     standard deviation: 69.3736 variance: 4812.69
[lbsStatistics]     skewness: 0.265948 kurtosis: 1.59925
[lbsInformAndTransferAlgorithm] Starting iteration 11 with total work of 362751.74614435044
[lbsInformAndTransferAlgorithm] Sent 512 initial information messages with fanout=4
[lbsInformAndTransferAlgorithm] Average number of peers known to ranks: 127.9921875 (99.99% of 128)
[lbsTransferStrategyBase] Executing transfer phase with average load: 2833.998016752738
[lbsClusteringTransferStrategy] Built 87 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 90 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 72 subclusters from 8 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 90 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 88 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 70 subclusters from 8 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 88 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 90 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 87 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 90 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 61 subclusters from 7 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 90 subclusters from 9 clusters in 0.002 seconds
[lbsClusteringTransferStrategy] Built 89 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 86 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 90 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 60 subclusters from 7 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 79 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 89 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 90 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 81 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 86 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 90 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 77 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 86 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 84 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 60 subclusters from 7 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 84 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 89 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 85 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 61 subclusters from 7 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 86 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 89 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 90 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 90 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 90 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 85 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 77 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 81 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 90 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 81 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Built 87 subclusters from 9 clusters in 0.001 seconds
[lbsClusteringTransferStrategy] Transferring subcluster with 11 objects to rank 71
[lbsClusteringTransferStrategy] Swapped 101 cluster pairs amongst 612770 tries (0.02%)
[lbsClusteringTransferStrategy] Transferred 1 subcluster amongst 3410 tries (0.03%)
[lbsInformAndTransferAlgorithm] Transferred 1380 objects amongst 8617477 proposed (99.98%)
[lbsInformAndTransferAlgorithm] Iteration 11 completed (0 skipped ranks) in 1685.609 seconds
[lbsStatistics] Descriptive statistics of iteration 11 rank work:
[lbsStatistics]     cardinality: 128 sum: 362752 imbalance: 0.0621801
[lbsStatistics]     minimum: 2758.01 average: 2834 maximum: 3010.22
[lbsStatistics]     standard deviation: 65.9965 variance: 4355.54
[lbsStatistics]     skewness: 0.643645 kurtosis: 1.79914

ppebay commented 10 months ago

Thanks for the additional info @nlslatt. Can you please confirm that if you entirely skip the sub-clustering stage, this problem does not occur?

nlslatt commented 10 months ago

> Thanks for the additional info @nlslatt. Can you please confirm that if you entirely skip the sub-clustering stage, this problem does not occur?

Is there an option that turns off sub-clustering completely? I see that it's still trying to do cluster swaps after sub-clustering begins, so I'd like to run lots of iterations without sub-clustering activating.

nlslatt commented 10 months ago

> I may have to let it run for another hour or two before I have more to report.

This run is only transferring subclusters to rank 71, never anywhere else. It's hard to say whether it makes sense to do that because the imbalance is improving once again, but it's the result of a combination of cluster swaps and subcluster transfers to rank 71. It's still running so I may have more to say later.

nlslatt commented 10 months ago

> > I may have to let it run for another hour or two before I have more to report.

> This run is only transferring subclusters to rank 71, never anywhere else. It's hard to say whether it makes sense to do that because the imbalance is improving once again, but it's the result of a combination of cluster swaps and subcluster transfers to rank 71. It's still running so I may have more to say later.

The imbalance is increasing again, so continuing to transfer to the same rank (71 in this case) is not appropriate.
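The imbalance figures printed by lbsStatistics in the logs above are numerically consistent with maximum / average - 1. That relation is inferred here from the printed numbers, not taken from the LBAF source, so treat the sketch below as an assumption checked against the first run's iterations 11 to 13.

```python
def imbalance(maximum, average):
    """Relative excess of the most loaded rank over the average load (assumed definition)."""
    return maximum / average - 1.0

# Average load as printed by lbsTransferStrategyBase in the first run
AVERAGE = 2833.9980167527383

# (iteration, printed maximum, printed imbalance) from the first run's logs
printed = [
    (11, 2970.43, 0.0481416),
    (12, 2979.89, 0.0514804),
    (13, 3043.01, 0.0737504),
]

for iteration, maximum, logged in printed:
    computed = imbalance(maximum, AVERAGE)
    print(f"iteration {iteration}: computed {computed:.7f}, logged {logged}")
```

Under this reading the imbalance depends only on the single most loaded rank, which would explain why it rises in those iterations even though the standard deviation falls (63.2745 to 54.891 to 50.0139).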

ppebay commented 10 months ago

@nlslatt if you put a continue statement just above line 151:

            # Iterate over subclusters only when no swaps were possible

like so:

            continue
            # Iterate over subclusters only when no swaps were possible

this will skip the subclustering stage for each rank. I am looking at it right now and can't reproduce the bug (but it is definitely one, no doubt about it).

nlslatt commented 10 months ago

> > Thanks for the additional info @nlslatt. Can you please confirm that if you entirely skip the sub-clustering stage, this problem does not occur?

> Is there an option that turns off sub-clustering completely? I see that it's still trying to do cluster swaps after sub-clustering begins, so I'd like to run lots of iterations without sub-clustering activating.

I added continue before line 152 of lbsClusterTransferStrategy.py, which I think should avoid sub-clustering. Even without sub-clustering, iterations 10 and on are taking a long time (740 seconds for iteration 10, compared to 70 seconds for iteration 9). It would seem that the sub-clustering is not what is slowing it down. I will post again when I see if the imbalance keeps decreasing or not.

nlslatt commented 10 months ago

> I added continue before line 152 of lbsClusterTransferStrategy.py, which I think should avoid sub-clustering. Even without sub-clustering, iterations 10 and on are taking a long time (740 seconds for iteration 10, compared to 70 seconds for iteration 9). It would seem that the sub-clustering is not what is slowing it down. I will post again when I see if the imbalance keeps decreasing or not.

The iterations are slowing down because many more cluster swaps are being attempted (393k for iteration 10 compared to 39k for iteration 9). On iteration 11, it got worse still with 668k tries in 1337 seconds. Is there a way to speed this up that won't undermine the improvement in imbalance? Although there were swaps on iteration 11, the imbalance and maximum did not improve at the precision at which they were printed. The imbalance and maximum did not get worse (at least so far).
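A quick back-of-the-envelope check supports this reading: dividing the quoted iteration times by the quoted swap tries gives a roughly constant cost per try. The figures below are the rounded ones from these comments (39k, 393k, and 668k tries), so the result is only approximate.

```python
# (seconds, swap tries) per iteration, as quoted in the comments above
runs = {
    9: (70.0, 39_000),
    10: (740.0, 393_000),
    11: (1337.0, 668_000),
}

def ms_per_try(seconds, tries):
    """Average wall-clock cost of one swap attempt, in milliseconds."""
    return 1000.0 * seconds / tries

for iteration, (seconds, tries) in runs.items():
    print(f"iteration {iteration}: {ms_per_try(seconds, tries):.2f} ms per try")
```

Each try costs about 2 ms in all three iterations, so the slowdown tracks the number of tries rather than a more expensive individual try.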

ppebay commented 10 months ago

Bug located. It was in the pseudo-random selection of the subcluster target in non-deterministic mode. There was a mismatch between the selected target and the corresponding criterion value (an error I introduced on 08-16-2023).
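To illustrate this class of bug, here is a minimal sketch of a pseudo-random target selection that returns the target and its criterion value as one pair, so the two cannot drift apart. It is not the LBAF code: the function name, the dictionary input, and the weighting by criterion value are all assumptions made for the example.

```python
import random

def select_target(criterion_by_rank, rng):
    """Pick a target rank at random and return it together with its own criterion value."""
    # Keep only candidates whose criterion is strictly positive (assumed acceptance rule)
    candidates = [(rank, c) for rank, c in criterion_by_rank.items() if c > 0.0]
    if not candidates:
        return None
    ranks, values = zip(*candidates)
    # Draw one index, then read rank and value from the same position
    idx = rng.choices(range(len(ranks)), weights=values, k=1)[0]
    return ranks[idx], values[idx]

rng = random.Random(0)
criteria = {19: 0.5, 71: 1.5, 3: -0.2}
target, value = select_target(criteria, rng)
print(target, value, criteria[target] == value)
```

Returning the pair from a single index avoids reporting the criterion of one candidate while transferring to another.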

ppebay commented 10 months ago

@nlslatt @lifflander please take a look at this PR if you get a chance.

@cwschilly when approved, please prepare a 1.0.2 micro-release to integrate these latest improvements. Thanks.

nlslatt commented 10 months ago

@ppebay @cwschilly I'd like this PR to be properly rebased on the modified develop before reviewing.

cwschilly commented 10 months ago

@nlslatt @ppebay This PR is ready for review