SSoelvsten / adiar

An I/O-efficient implementation of (Binary) Decision Diagrams
https://ssoelvsten.github.io/adiar/
MIT License

Improve Memory Distribution for Levelized Priority Queue #386

Open SSoelvsten opened 1 year ago

SSoelvsten commented 1 year ago

When going from v1.0.1 to v1.1.0, we changed the way the levelized priority queue distributes its memory. Specifically, for M being the total amount of memory given to the levelized priority queue, the share given to the overflow priority queue was changed as follows:

| Version | Memory to Overflow |
|---------|--------------------|
| v1.0 | max(5 MiB, (M - 5 MiB ⋅ (#sorters + 1)) / 10) |
| v1.1 | max(8 MiB, M / (4 ⋅ (#sorters + 1))) |

That is, since #sorters = 2, the amount of memory given to the overflow priority queue changed from roughly M/10 to M/12. The sorter's memory distribution was also changed as follows:

| Version | Phase 1 | Phase 2 | Phase 3 |
|---------|---------|---------|---------|
| v1.0 | 5 MiB | ... | = Phase 1 |
| v1.1 | max(5 MiB, M / (16 ⋅ (#sorters - 1))) | ... | = Phase 1 |

This has essentially increased the phase 1 memory in the v1.1 experiments from 5 MiB to 3 GiB (on a machine with 300 GiB of RAM, half of which is given to the levelized priority queue).
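For concreteness, here is a small sketch of the two overflow shares side by side. This is illustrative code only, with made-up names, not adiar's actual implementation:

```cpp
#include <algorithm>
#include <cstddef>

constexpr std::size_t MiB = 1024 * 1024;

// Illustrative only (names made up): the overflow share per the table above.
// Assumes M is large enough that the subtraction does not underflow.
std::size_t overflow_share_v1_0(std::size_t M, std::size_t sorters)
{
  return std::max<std::size_t>(5 * MiB, (M - 5 * MiB * (sorters + 1)) / 10);
}

std::size_t overflow_share_v1_1(std::size_t M, std::size_t sorters)
{
  return std::max<std::size_t>(8 * MiB, M / (4 * (sorters + 1)));
}

// With sorters == 2, these are roughly M/10 and M/12, respectively.
```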

Based on my current experiments on Grendel, this change has led to up to a 2× increase in performance on the Tic-Tac-Toe benchmarks but also up to a 2× decrease on N-Queens.

The difference between these benchmarks with regards to the levelized priority queue is shown below:

| Benchmark | N | Bucket hits | Overflow hits |
|-----------|---|-------------|---------------|
| Queens | 13 | 91.13% | 8.87% |
| Queens | 14 | 92.46% | 7.54% |
| Queens | 15 | 93.30% | 6.70% |
| Tic-Tac-Toe | 22 | 100.00% | 0.00% |
| Tic-Tac-Toe | 24 | 100.00% | 0.00% |

Quite a lot of this difference in performance probably has to do with the initialization time of the merge sorter. That is, v1.2.0 has probably fixed a lot of this issue. Specifically, v1.2 upper-bounds the phase 1 and 3 sizes based on the maximum number of elements to be sorted.
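As a sketch of the idea behind that v1.2 change (not the actual implementation; the function and parameter names are made up): never hand a phase more memory than its maximum number of elements could ever occupy.

```cpp
#include <algorithm>
#include <cstddef>

// Illustrative only: cap the memory suggested for phase 1/3 by the space the
// maximum number of elements to be sorted could ever occupy.
template <typename elem_t>
std::size_t bounded_phase_memory(std::size_t suggested_bytes, std::size_t max_elements)
{
  const std::size_t needed_bytes = max_elements * sizeof(elem_t);
  return std::min(suggested_bytes, needed_bytes);
}
```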

Introductory Task: Change the memory distribution for the external memory sorter and overflow queue to test how/whether performance changes.

Based on the insights above, we may be able to squeeze out even more performance by also optimising the external memory case.

Plan A: Meta-information during Sweeps

Suppose we know the 'average length' of the arcs, i.e. how many levels each arc spans on average (excluding the source level but including the target level, or vice versa). Using this number, we can change the weighted_share computation inside the bucketed levelized priority queue to increase the overflow's share if the average distance is large.

Since we are talking about an average, we care about the sum of all lengths divided by the number of arcs. If an arc spans three levels, then 3 = 1 + 1 + 1, and since addition is commutative (and associative), it does not matter whether we account for all three levels at once. Hence, we only need to add 1 for every arc in the levelized priority queue before processing a level; any nodes/arcs that are merged can be subtracted separately.
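A minimal sketch of this accounting, with made-up names and no claim about where in adiar it would live: bump a running sum by the number of arcs still in the priority queue before each level and divide by the total number of arcs pushed.

```cpp
#include <cstddef>

// Illustrative bookkeeping for the 'average arc length' (names made up).
struct arc_length_stats
{
  std::size_t length_sum = 0; // sum over all arcs of the number of levels they span
  std::size_t total_arcs = 0; // number of arcs pushed in total

  // Call once per pushed arc.
  void on_push() { total_arcs += 1; }

  // Call before processing a level: every arc still alive in the priority
  // queue spans (at least) one more level, so add 1 for each of them.
  void on_next_level(std::size_t arcs_in_pq) { length_sum += arcs_in_pq; }

  // Call when merged nodes/arcs disappear; their over-counted levels are
  // subtracted here.
  void on_merge(std::size_t overcounted_levels) { length_sum -= overcounted_levels; }

  double average() const
  { return total_arcs == 0 ? 0.0 : double(length_sum) / double(total_arcs); }
};
```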

We can use this number to find a (linear?) value between the minimum of 8 MiB and some maximum. One such maximum would be everything that is left after the buckets take only the minimal amount of space they require.
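As a sketch (again with made-up names), the overflow share could be interpolated linearly between the current 8 MiB minimum and that maximum; the normalisation of the average arc length by the total number of levels is my own assumption.

```cpp
#include <algorithm>
#include <cstddef>

constexpr std::size_t MiB = 1024 * 1024;

// Illustrative only (names made up): interpolate the overflow's share between
// the 8 MiB minimum and whatever remains once the buckets are given only the
// minimal space they require.
std::size_t overflow_share(std::size_t M, std::size_t buckets_min_bytes,
                           double avg_arc_length, double total_levels)
{
  const std::size_t min_share = 8 * MiB;
  const std::size_t max_share =
    std::max(min_share, M > buckets_min_bytes ? M - buckets_min_bytes : std::size_t(0));

  // Normalise the average arc length into [0,1]; longer arcs => bigger share.
  const double t =
    std::clamp(total_levels > 0.0 ? avg_arc_length / total_levels : 0.0, 0.0, 1.0);

  return min_share + static_cast<std::size_t>(t * double(max_share - min_share));
}
```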

Plan B: Simpler meta-information during Sweeps

Since all bucketed LPQs only have a LOOK_AHEAD of 1, we can also just care about how many arcs go to the immediate next level (easy to take into account, since we can just store the label of the previously finished level); see the sketch after the list below.

This has the following added benefits:

  1. It is only dependent on the shape of the output, not on the input. Hence, we can easily compute the exact value. This means it can be used as part of the trivial O(1) negative cases in equality checking.
  2. This information could perhaps be combined with the over-approximation of the i-level cuts to make these even more precise?
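A minimal sketch of the Plan B bookkeeping, with made-up names and no claim about where in adiar it would live: store how many arcs target each level, so that the count for the immediate next level is available when that level is set up.

```cpp
#include <cstddef>
#include <unordered_map>

// Illustrative only (names made up): count, per level, how many arcs have that
// level as their target. With a LOOK_AHEAD of 1, the count for the immediate
// next level decides how much space the single bucket needs.
class level_arc_counts
{
  std::unordered_map<std::size_t, std::size_t> _count; // target level -> #arcs

public:
  void on_push(std::size_t target_level) { _count[target_level] += 1; }

  std::size_t arcs_to(std::size_t level) const
  {
    const auto it = _count.find(level);
    return it == _count.end() ? std::size_t(0) : it->second;
  }
};
```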

Plan C: (Machine) Learning

Always have some form of "statistics" turned on and use the data to "learn" a good value based on the problem at hand.

SSoelvsten commented 1 year ago

Looking at the experiments, the slowdown was primarily for smaller instances. Yet, this has been fixed with v1.2. There is still merit in trying to improve the memory distribution, but I do not think it can be as significant as first thought.