I am currently running the 8x8 Knight's Tour problem (Bryant version) on the #128 pull request, and after several days it has written ~45 TiB to disk but only read ~5 TiB. This difference is probably due to file system caching, which hints that there was no need to send most of this data to disk in the first place.
EDIT: This difference may not be as problematic as it first seems. For example, here is the data from the EPFL/Arbiter (size) benchmark: even though there was ~50 GiB of free memory for the file system cache, there is no major difference between the amounts read and written.
```
Max Mem used   : 124.96G (s95n31)
Max Disk Write :  32.52T (s95n31)
Max Disk Read  :  37.25T (s95n31)
```
As I am rerunning experiments in the BDD Benchmark repository, I notice that Adiar exhibits a considerable slowdown for smaller instances depending on the amount of memory M it was initialised with: the more memory it is given, the more time it uses. For example, for the 8-Queens problem we get the following timings.
Memory | Construction time | Counting time |
---|---|---|
128 MiB | 633 ms | 10 ms |
6144 MiB | 1509 ms | 226 ms |
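As a point of reference, here is a minimal sketch of how such timings can be gathered. The construction below (a conjunction of a handful of `adiar::bdd_ithvar` variables combined via `adiar::bdd_apply`) is only a stand-in for the benchmark's actual 8-Queens construction, and the `adiar::and_op` operator is assumed to be available; the relevant part is initialising Adiar with M MiB and timing the two phases separately.

```cpp
#include <adiar/adiar.h>

#include <chrono>
#include <cstdint>
#include <iostream>
#include <string>

int main(int argc, char* argv[]) {
  // Memory (in MiB) handed to Adiar; this is the M varied in the table above.
  const size_t M = argc > 1 ? std::stoul(argv[1]) : 128;
  adiar::adiar_init(M * 1024 * 1024);
  {
    auto start = std::chrono::high_resolution_clock::now();

    // Placeholder construction: a conjunction of a few variables
    // (substitute the real 8-Queens construction here).
    adiar::bdd f = adiar::bdd_ithvar(0);
    for (int x = 1; x < 8; x++) {
      f = adiar::bdd_apply(f, adiar::bdd_ithvar(x), adiar::and_op);
    }

    auto mid = std::chrono::high_resolution_clock::now();

    const uint64_t count = adiar::bdd_satcount(f);

    auto stop = std::chrono::high_resolution_clock::now();

    std::cout << "construction: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(mid - start).count() << " ms, "
              << "counting: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(stop - mid).count() << " ms "
              << "(#sat = " << count << ")" << std::endl;
  }
  adiar::adiar_deinit();
}
```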
I have tested the benchmarks with Adiar at different commits, and this always seems to have been an issue. But I found the following interesting commits:

Commits f2355720c3e83dc7cb474e0488b91f98a71e64aa, afbe7b191d18f937a9cd12b7946ae103deb6ca8d, b6fa1e533b4ba9bc1e4a11b2a75e9d561e6d2796, and 0762d19895fddbf774c8deb4fec9a3e0e84229c8 have resulted in a considerable performance increase for `adiar::bdd_apply` on smaller instances. Larger instances seem to be unaffected.

The `adiar::bdd_satcount` function exhibits a major slowdown ever since commits 7220df916f3a8fb0d884f4c7328d970d827aeb19 and 2d05ecfac0400b220daa92ff14833d99dc887854. Prior to these commits the difference was 0 ms at 128 MiB versus 6 ms at 6144 MiB, but after these commits it goes from 9 ms to ~200 ms.
I tested the original `tpie::merge_sorter` by sorting a small set of elements in `src/main.cpp`:
```cpp
const int MAX = 1000;

auto start = std::chrono::high_resolution_clock::now();

tpie::merge_sorter<int, false, std::less<int>> sorter;
sorter.set_available_memory(tpie::get_memory_manager().available());

tpie::dummy_progress_indicator pi;

sorter.begin();
// Add MAX elements in descending order (all even values, then all odd values)
for (int i = MAX; i > 0; i = i-2)     { sorter.push(i); }
for (int i = MAX - 1; i > 0; i = i-2) { sorter.push(i); }
sorter.end();
sorter.calc(pi);

// Pull them back out and check they come out sorted
for (int i = 1; i <= MAX; i++) { assert(sorter.pull() == i); }

auto stop = std::chrono::high_resolution_clock::now();

auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
std::cout << "Time taken by function: " << duration.count() << " microseconds" << std::endl;
```
For different values of MAX we see the following timings (numbers are quick estimates of the average)
MAX | 128 MiB | 6144 MiB |
---|---|---|
10 | 90 us | 130 us |
100 | 110 us | 150 us |
1000 | 180 us | 220 us |
10000 | 1200 us | 1200 us |
100000 | 14000 us | 13500 us |
Similar issues may also exist for `tpie::priority_queue` and `tpie::file_stream`. Due to the nature of the levelized priority queue, we have very few elements in each sorter. Throughout the entire Apply-Reduce cycle during construction of the 8-Queens board there are 8²·8·(1+1+2) = 18432 uses of the `tpie::merge_sorter`. The above-mentioned slowdown of 876 ms therefore amounts to an average slowdown of roughly 47 µs per sorter (876 000 µs / 18 432 ≈ 47.5 µs); this matches what we see for 10 elements above.
I redid the above experiment with TPIE's internal memory data structures. Using a `tpie::internal_vector` with `tpie::parallel_sort` is indeed faster than using the `tpie::merge_sorter`, and its running time is independent of the amount of memory supplied. But the speed gain is underwhelming: the main difference is the guarantee that we do not get a 10x slowdown from increasing the amount of memory from 128 MiB to 300 GiB.
MAX (n) | 10 | 100 | 1000 | 10000 | 100000 |
---|---|---|---|---|---|
Time (us) | 20 us | 25 us | 100 us | 970 us | 11000 us |
Improvement (%) | 77.8 | 77.3 | 44.4 | 19.2 | 18.5 |
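For completeness, below is a rough sketch of this internal-memory variant of the earlier `tpie::merge_sorter` test. It assumes the free function `tpie::parallel_sort(begin, end, comp)` and the headers `<tpie/parallel_sort.h>` and `<tpie/internal_vector.h>`; the element generation and the sortedness check mirror the snippet above.

```cpp
#include <tpie/tpie.h>
#include <tpie/internal_vector.h>
#include <tpie/parallel_sort.h>

#include <cassert>
#include <chrono>
#include <iostream>

void sort_internal(const int MAX) {
  auto start = std::chrono::high_resolution_clock::now();

  // Same two descending sequences as in the merge_sorter test above.
  tpie::internal_vector<int> v(MAX);
  for (int i = MAX; i > 0; i = i-2)     { v.push_back(i); }
  for (int i = MAX - 1; i > 0; i = i-2) { v.push_back(i); }

  // Sort in internal memory.
  tpie::parallel_sort(v.begin(), v.end(), std::less<int>());

  // Check the result is sorted.
  int latest_value = 0;
  for (const auto i : v) { assert(latest_value <= i); latest_value = i; }

  auto stop = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
  std::cout << MAX << " elements: " << duration.count() << " microseconds" << std::endl;
}
```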
I also ran the same test with a `tpie::internal_priority_queue` to sort this set of numbers.
MAX (n) | 10 | 100 | 1000 | 10000 | 100000 |
---|---|---|---|---|---|
Time (us) | 1 us | 9 us | 70 us | 1000 us | 10500 us |
This is much more akin to what I was hoping for, even surpassing what I expected. More experiments are needed to verify that this is indeed the case in general. But if we can decrease the time spent on very small instances to 1/90, as for n = 10, then our algorithms will be on par with the other recursive BDD packages. This would eliminate one of the biggest arguments against adopting this BDD package.
I just ran `perf record --call-graph dwarf build/src/adiar_queens` on the benchmark repository to see a breakdown of the performance. Almost all of the running time seems to be spent on operations within TPIE. This is a good thing, since we can then expect most of our performance difference to be due to these data structures rather than some overhead we introduce. In other words, we can assume that the performance increases measured above translate 1:1 into Adiar.
From the experiment above we see that the parallel quicksort is as slow as the binary heap. This is most likely due to the overhead of managing and spawning threads. From various papers we know that quicksort should be faster than heapsort. Our previous experiment contradicts this, so I rewrote the test program to compare the `tpie::internal_priority_queue` (heap-sort) with `std::sort`. The number of µs spent on each sorting is as follows (rough estimates of the average over several runs).
N | 10 | 100 | 1000 | 10000 | 100000 | 1000000
---|---|---|---|---|---|---
`tpie::internal_priority_queue` | 10 | 170 | 2050 | 20100 | 285000 | 3350000
`std::sort` | 6 | 80 | 1000 | 9900 | 150000 | 1600000
That is, `std::sort` is roughly twice as fast as the heap-sort. This is exactly what we would expect.
Here is the full version of the `src/main.cpp` file I created for this experiment:
```cpp
// TPIE Imports
#include <tpie/tpie.h>

// ADIAR Imports
#include <adiar/adiar.h>

#include <algorithm>
#include <cassert>
#include <chrono>
#include <iostream>
#include <limits>

#include <stdlib.h>     /* srand, rand */
#include <time.h>       /* time */

int main(int argc, char* argv[]) {
  size_t M = 1024;

  try {
    if (argc > 1) {
      M = std::stoi(argv[1]);
    }
  } catch (std::invalid_argument const &ex) {
    tpie::log_info() << "Invalid number: " << argv[1] << std::endl;
  } catch (std::out_of_range const &ex) {
    tpie::log_info() << "Number out of range: " << argv[1] << std::endl;
  }

  adiar::adiar_init(M * 1024 * 1024);
  {
    // ===== Your code starts here =====

    // Generate Inputs
    const size_t MAX = 1000000;

    tpie::internal_vector<int> input(MAX);

    srand(time(NULL));
    for (size_t i = 0; i < MAX; i++) {
      input.push_back(rand() % MAX);
    }

    // Sort with tpie::internal_priority_queue, i.e. heap-sort
    {
      auto start = std::chrono::high_resolution_clock::now();

      tpie::internal_priority_queue<int> pq(MAX);
      for (const auto i : input) { pq.push(i); }

      int latest_value = std::numeric_limits<int>::min();
      for (size_t i = 0; i < MAX; i++) {
        assert(latest_value <= pq.top());
        latest_value = pq.top();
        pq.pop();
      }

      auto stop = std::chrono::high_resolution_clock::now();
      auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
      std::cout << "tpie::internal_priority_queue: " << duration << "us" << std::endl;
    }

    // Sort with std::sort, i.e. quicksort
    {
      auto start = std::chrono::high_resolution_clock::now();

      tpie::internal_vector<int> v(MAX);
      for (const auto i : input) { v.push_back(i); }

      std::sort(v.begin(), v.end(), std::less<int>());

      int latest_value = std::numeric_limits<int>::min();
      for (const auto i : v) {
        assert(latest_value <= i);
        latest_value = i;
      }

      auto stop = std::chrono::high_resolution_clock::now();
      auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
      std::cout << "std::sort                    : " << duration << "us" << std::endl;
    }

    // ===== Your code ends here =====
  }
  adiar::adiar_deinit();

  exit(0);
}
```
With #169 and #168 we added support for the `levelized_priority_queue` switching to internal memory if its max_size fits into the given amount of main memory. We now similarly want to derive a max_size for the levelized priority queues in all other algorithms and then switch to an internal memory variant whenever possible.
We will start with the simple bounds below and later ( #165 ) replace them with (most often much) better bounds. So, it is a good idea to already now add an (inline) helper function to compute each max_size below; a sketch of what such a helper could look like follows the list.
Worst-case bounds for top-down Levelized Priority Queue (and the others)
The following are some simple O(1)-derivable bounds on the levelized priority queues. Based on these, we can then switch between the different versions.
- [x] `count`: N (Steffan)
  There are a total of 2N arcs to forward the count, but to create two requests from a single node at least one request has to be consumed.
- [x] `product_construction`: (Nf+2) · (Ng+2) + 2 (Anna)
  At worst, every node is paired with every other node (or leaf). Yet again, even though this totals 2·(Nf+2)·(Ng+2) arcs, at least one arc is consumed for every two created. Unlike for Path/SAT Count, we push the requests for the children before popping their parent.
- [x] `quantification`: Nf² + 2 (Anna)
  At worst, every node is paired with every other node and we are also traversing the entire original BDD. Notice that due to the use of `is_negating` and `is_commutative` we never have tuples with boolean values, so there is no +2 inside of the quadratic.
- [x] `bdd_ite`: Nf · (Ng+2) · (Nh+2) + Ng + Nh + 2 (Anna)
  Same argumentation as for the product construction, but similar to Quantification we can abuse the collapsing into only one or two parts of the input.
- [x] `substitution`: N+2 (Anna)
  There are a total of 2N arcs, of which again at most half can be present at the same time, plus the two arcs for the current node.
- [x] `compare_check`: N1 · N2 (Steffan or Anna)
  This depends on the early-termination strategy for the level. The one used for `zdd_subset`, `zdd_subseteq`, and `zdd_disjoint` has the above quadratic bound. For `is_isomorphic` the bound is N: the early termination on too many tuples treated on the same level results in exactly 2N requests placed in the queue, of which at most N can be present at the same time, since in-going arcs are consumed before new arcs are pushed. The best thing is probably to place this derivation inside of a policy.
- [x] `intercut`: (Anna)
  This algorithm has two priority queues, each of which has its own responsibilities.
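As a starting point, here is a sketch of what the proposed (inline) helper functions could look like. The function and variable names are placeholders and not (yet) part of Adiar; the bound shown is the one for the product construction from the list above, and the internal/external switch only sketches the intended logic.

```cpp
#include <cstddef>

// Worst-case max_size of the product construction's levelized priority
// queue for inputs with nodes_f and nodes_g nodes: (Nf+2) * (Ng+2) + 2.
// (Overflow is ignored in this sketch.)
inline std::size_t prod_pq_max_size(std::size_t nodes_f, std::size_t nodes_g)
{
  return (nodes_f + 2u) * (nodes_g + 2u) + 2u;
}

// Switch to the internal-memory levelized priority queue if 'max_size'
// elements of 'elem_size' bytes fit within the memory given to it.
inline bool use_internal_pq(std::size_t max_size, std::size_t elem_size,
                            std::size_t memory_bytes)
{
  return max_size <= memory_bytes / elem_size;
}
```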