SSoelvsten / adiar

An I/O-efficient implementation of (Binary) Decision Diagrams
https://ssoelvsten.github.io/adiar/
MIT License

Switch to internal memory levelized priority queue for top-down algorithms based on input size #98

Closed SSoelvsten closed 2 years ago

SSoelvsten commented 3 years ago

With #169 and #168 we added support for the levelized_priority_queue to switch to internal memory if its max_size fits into the given amount of main memory. We now similarly want to derive a max_size for the levelized priority queues in all other algorithms and then switch to an internal memory variant whenever possible.

We will start with the simple bounds below and later ( #165 ) replace them with some (most often) much better ones. Hence, it is a good idea to already now add an (inline) helper function that computes each max_size below.

Worst-case bounds for top-down Levelized Priority Queue (and the others)

The following are some simple bounds for the levelized priority queues that are derivable in O(1). Based on these, we can then switch between the different versions.
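To make the intent concrete, the dispatch could look roughly like the sketch below. All names here are hypothetical and only for illustration (this is not Adiar's actual API); as a stand-in bound it uses the product of the two inputs' sizes for Apply, since at most every pair of nodes from the two inputs can give rise to a request.

    // Hypothetical sketch only; names do not reflect Adiar's actual API.
    #include <cstddef>

    // Simple O(1)-derivable worst-case bound for Apply: at most every pair
    // of nodes from f and g gives rise to a request in the priority queue.
    inline std::size_t apply_pq_max_size(std::size_t f_nodes, std::size_t g_nodes)
    {
      return f_nodes * g_nodes;
    }

    // Pick the internal memory variant if max_size elements fit within the
    // amount of memory granted to this priority queue.
    inline bool use_internal_pq(std::size_t max_size,
                                std::size_t memory_bytes,
                                std::size_t bytes_per_element)
    {
      return max_size <= memory_bytes / bytes_per_element;
    }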

SSoelvsten commented 3 years ago

Waste of I/Os

I am currently running the 8x8 Knight's Tour problem (Bryant version) on the #128 pull request, and after several days it has written ~45 TiB to disk but only read ~5 TiB. The difference is probably due to the file system cache serving most of the reads, which hints that there was no need to send that data to disk in the first place.

EDIT: This difference may not be so problematic after all. For example, here is the data from the EPFL/Arbiter (size) benchmark: even though there was ~50 GiB of memory free for the file system cache, there is no major difference between the amounts read and written.

    Max Mem used        : 124.96G (s95n31)
    Max Disk Write      : 32.52T (s95n31)
    Max Disk Read       : 37.25T (s95n31)
SSoelvsten commented 3 years ago

Running time of external memory merge sorter

As I am rerunning experiments on the BDD Benchmark repository, I notice that Adiar exhibits a considerable slowdown on smaller instances depending on the amount of memory M it was initialised with: the more memory it is given, the more time it uses. For example, for the 8-Queens problem we get the following timings.

| Memory   | Construction time | Counting time |
|----------|-------------------|---------------|
| 128 MiB  | 633 ms            | 10 ms         |
| 6144 MiB | 1509 ms           | 226 ms        |

I have tested the benchmarks with Adiar at different commits, and it always seems to have been an issue. But I did find the following interesting.

I tested the original tpie::merge_sorter by letting it sort a small set of elements in src/main.cpp:

    const int MAX = 1000;

    auto start = std::chrono::high_resolution_clock::now();

    tpie::merge_sorter<int, false, std::less<int>> sorter;
    sorter.set_available_memory(tpie::get_memory_manager().available());

    tpie::dummy_progress_indicator pi;

    sorter.begin();

    // Add MAX elements out of order: even values descending, then odd values descending
    for (int i = MAX; i > 0; i -= 2) { sorter.push(i); }
    for (int i = MAX - 1; i > 0; i -= 2) { sorter.push(i); }

    sorter.end();
    sorter.calc(pi);

    for (int i = 1; i <= MAX; i++) { assert(sorter.pull() == i); }

    auto stop = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);

    std::cout << "Time taken by function: " << duration.count() << " microseconds" << std::endl;

For different values of MAX we see the following timings (numbers are quick estimates of the average):

| MAX    | 128 MiB  | 6144 MiB |
|--------|----------|----------|
| 10     | 90 us    | 130 us   |
| 100    | 110 us   | 150 us   |
| 1000   | 180 us   | 220 us   |
| 10000  | 1200 us  | 1200 us  |
| 100000 | 14000 us | 13500 us |

Similar issues may also exist for tpie::priority_queue and tpie::file_stream. Due to the nature of the levelized priority queue, we have very few elements in each sorter. Throughout the entire Apply-Reduce cycle of constructing the 8-Queens board we have 82·8·(1+1+2) = 18432 uses of the tpie::merge_sorter. The above-mentioned slowdown of 876 ms hence is an average slowdown of 47 us per sorter; this matches what we see for 10 elements above.
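Spelling out that estimate with the numbers from the table above:

    1509 ms - 633 ms            =  876 ms  of extra construction time
    876 000 us / 18 432 sorters ≈ 47.5 us  of extra time per sorter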

SSoelvsten commented 3 years ago

Running time of TPIE's internal memory data structures

I redid the above experiment with TPIE's internal memory data structures. Using a tpie::internal_vector together with tpie::parallel_sort is indeed faster than using tpie::merge_sorter, and its running time is independent of the amount of memory supplied. But the speed gain is underwhelming: the main difference is the guarantee that we do not get a 10x slowdown by increasing the amount of memory from 128 MiB to 300 GiB.

| MAX (n)         | 10   | 100  | 1000 | 10000 | 100000 |
|-----------------|------|------|------|-------|--------|
| Time (us)       | 20   | 25   | 100  | 970   | 11000  |
| Improvement (%) | 77.8 | 77.3 | 44.4 | 19.2  | 18.5   |
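For reference, the sorting part of that test looks roughly as follows (a sketch rather than the verbatim test; it reuses the same out-of-order input as the tpie::merge_sorter snippet above and assumes tpie::parallel_sort's iterator-based interface):

    const int MAX = 1000;

    tpie::internal_vector<int> v(MAX);

    // Same out-of-order input as before: evens descending, then odds descending
    for (int i = MAX; i > 0; i -= 2) { v.push_back(i); }
    for (int i = MAX - 1; i > 0; i -= 2) { v.push_back(i); }

    // Sort in-place with TPIE's parallel quicksort
    tpie::parallel_sort(v.begin(), v.end(), std::less<int>());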

I also ran the same test on tpie::internal_priority_queue to sort this set of numbers.

| MAX (n)   | 10 | 100 | 1000 | 10000 | 100000 |
|-----------|----|-----|------|-------|--------|
| Time (us) | 1  | 9   | 70   | 1000  | 10500  |

This is much more akin to what I was hoping for; it even surpasses what I expected. More experiments are needed to verify that this indeed is the case in general. But, if we can decrease the time spent on very small instances to 1/90th, as for n = 10 above, then our algorithms will be on par with the other recursive BDD packages. This would eliminate one of the biggest arguments against adopting this BDD package.

SSoelvsten commented 3 years ago

I just ran perf record --call-graph dwarf build/src/adiar_queens on the benchmark repository to get a breakdown of the performance. Almost all of the running time is spent on operations within TPIE. This is a good thing, since we can then expect most of our performance difference to stem from these data structures rather than from some overhead we introduce ourselves. In other words, we can assume that the performance increases measured above translate 1:1 into Adiar.

SSoelvsten commented 2 years ago

tpie::internal_priority_queue vs std::sort

From the experiment above we see that the parallel quicksort is as slow as the binary heap, most likely due to the overhead of spawning and managing threads. From various papers we know that quicksort should be faster than heapsort. Since our previous experiment contradicts this, I rewrote the test program such that:

  1. We generate randomized input
  2. We compare tpie::internal_priority_queue with std::sort.

The number of us spent on each sort is as follows (rough estimates of the average over several runs).

| N                             | 10 | 100 | 1000 | 10000 | 100000 | 1000000 |
|-------------------------------|----|-----|------|-------|--------|---------|
| tpie::internal_priority_queue | 10 | 170 | 2050 | 20100 | 285000 | 3350000 |
| std::sort                     | 6  | 80  | 1000 | 9900  | 150000 | 1600000 |

That is, std::sort is roughly twice as fast as the heap-sort. This is exactly what we would expect.

Here is the full version of the src/main.cpp file I created for this experiment.

// TPIE Imports
#include <tpie/tpie.h>
#include <tpie/internal_priority_queue.h>
#include <tpie/internal_vector.h>

// ADIAR Imports
#include <adiar/adiar.h>

#include <algorithm>
#include <cassert>
#include <chrono>
#include <iostream>
#include <limits>
#include <string>

#include <stdlib.h>     /* srand, rand */
#include <time.h>       /* time */

int main(int argc, char* argv[]) {
  size_t M = 1024;

  try {
    if (argc > 1) {
      M = std::stoi(argv[1]);
    }
  } catch (std::invalid_argument const &ex) {
    tpie::log_info() << "Invalid number: " << argv[1] << std::endl;
  } catch (std::out_of_range const &ex) {
    tpie::log_info() << "Number out of range: " << argv[1] << std::endl;
  }

  adiar::adiar_init(M * 1024 * 1024);

  {
    // ===== Your code starts here =====

    // Generate Inputs
    const size_t MAX = 1000000;

    tpie::internal_vector<int> input(MAX);

    srand (time(NULL));

    for (size_t i = 0; i < MAX; i++) {
      input.push_back(rand() % MAX);
    }

    // Sort with tpie::internal_priority_queue, i.e. heap-sort
    {
      auto start = std::chrono::high_resolution_clock::now();

      // NB: with its default std::less comparator, TPIE's internal_priority_queue
      // is (like std::priority_queue) a max-heap; its elements hence come out in
      // descending order.
      tpie::internal_priority_queue<int> pq(MAX);
      for (const auto i : input) { pq.push(i); }

      int latest_value = std::numeric_limits<int>::max();
      for (size_t i = 0; i < MAX; i++) {
        assert(pq.top() <= latest_value);
        latest_value = pq.top();
        pq.pop();
      }

      auto stop = std::chrono::high_resolution_clock::now();
      auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();

      std::cout << "tpie::internal_priority_queue: " << duration << "us" << std::endl;
    }

    // Sort with std::sort, i.e. introsort (a quicksort variant)
    {
      auto start = std::chrono::high_resolution_clock::now();

      tpie::internal_vector<int> v(MAX);
      for (const auto i : input) { v.push_back(i); }

      std::sort(v.begin(), v.end(), std::less<int>());

      int latest_value = std::numeric_limits<int>::min();
      for (const auto i : v) {
        assert(latest_value <= i);
        latest_value = i;
      }

      auto stop = std::chrono::high_resolution_clock::now();
      auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();

      std::cout << "std::sort : " << duration << "us" << std::endl;
    }

    // =====  Your code ends here  =====
  }

  adiar::adiar_deinit();
  exit(0);
}