I am currently running the 8x8 Knight's Tour problem (Bryant version) on the #128 pull request, and after several days it has written ~45 TiB to disk but only read ~5 TiB. This difference is probably due to file system caching, which hints that there was no need to send most of this data to disk in the first place.
EDIT: This difference may not be as problematic as it first seems. For example, here is the data from the EPFL/Arbiter (size) benchmark: even though there was ~50 GiB of free memory for the file system cache, there is no major difference between the amounts read and written.
```
Max Mem used   : 124.96G (s95n31)
Max Disk Write :  32.52T (s95n31)
Max Disk Read  :  37.25T (s95n31)
```
As I am rerunning experiments in the BDD Benchmark repository, I notice that Adiar exhibits a considerable slowdown for smaller instances depending on the amount of memory M it was initialised with: the more memory it is given, the more time it uses. For example, for the 8-Queens problem we get the following timings.
Memory | Construction time | Counting time |
---|---|---|
128 MiB | 633 ms | 10 ms |
6144 MiB | 1509 ms | 226 ms |
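As a point of reference, here is a minimal sketch of how such timings can be gathered. The construction below (a conjunction of a handful of `adiar::bdd_ithvar` variables combined via `adiar::bdd_apply`) is only a stand-in for the benchmark's actual 8-Queens construction, and the `adiar::and_op` operator is assumed to be available; the relevant part is initialising Adiar with M MiB and timing the two phases separately.

```cpp
#include <adiar/adiar.h>

#include <chrono>
#include <cstdint>
#include <iostream>
#include <string>

int main(int argc, char* argv[]) {
  // Memory (in MiB) handed to Adiar; this is the M varied in the table above.
  const size_t M = argc > 1 ? std::stoul(argv[1]) : 128;
  adiar::adiar_init(M * 1024 * 1024);
  {
    auto start = std::chrono::high_resolution_clock::now();

    // Placeholder construction: a conjunction of a few variables
    // (substitute the real 8-Queens construction here).
    adiar::bdd f = adiar::bdd_ithvar(0);
    for (int x = 1; x < 8; x++) {
      f = adiar::bdd_apply(f, adiar::bdd_ithvar(x), adiar::and_op);
    }

    auto mid = std::chrono::high_resolution_clock::now();

    const uint64_t count = adiar::bdd_satcount(f);

    auto stop = std::chrono::high_resolution_clock::now();

    std::cout << "construction: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(mid - start).count() << " ms, "
              << "counting: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(stop - mid).count() << " ms "
              << "(#sat = " << count << ")" << std::endl;
  }
  adiar::adiar_deinit();
}
```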
I have tested the benchmarks with Adiar at different commits, and this always seems to have been an issue. But I found the following interesting commits:

Commits f2355720c3e83dc7cb474e0488b91f98a71e64aa, afbe7b191d18f937a9cd12b7946ae103deb6ca8d, b6fa1e533b4ba9bc1e4a11b2a75e9d561e6d2796, and 0762d19895fddbf774c8deb4fec9a3e0e84229c8 have resulted in a considerable performance increase for `adiar::bdd_apply` on smaller instances. Larger instances seem to be unaffected.

The `adiar::bdd_satcount` function exhibits a major slowdown ever since commits 7220df916f3a8fb0d884f4c7328d970d827aeb19 and 2d05ecfac0400b220daa92ff14833d99dc887854. Prior to these commits the difference was 0 ms at 128 MiB versus 6 ms at 6144 MiB, but after these commits it goes from 9 ms to ~200 ms.
I tested the original `tpie::merge_sorter` by sorting a small set of elements in `src/main.cpp`:
```cpp
const int MAX = 1000;

auto start = std::chrono::high_resolution_clock::now();

tpie::merge_sorter<int, false, std::less<int>> sorter;
sorter.set_available_memory(tpie::get_memory_manager().available());

tpie::dummy_progress_indicator pi;

sorter.begin();
// Add MAX elements in descending order (all even values, then all odd values)
for (int i = MAX; i > 0; i = i-2)     { sorter.push(i); }
for (int i = MAX - 1; i > 0; i = i-2) { sorter.push(i); }
sorter.end();
sorter.calc(pi);

// Pull them back out and check they come out sorted
for (int i = 1; i <= MAX; i++) { assert(sorter.pull() == i); }

auto stop = std::chrono::high_resolution_clock::now();

auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
std::cout << "Time taken by function: " << duration.count() << " microseconds" << std::endl;
```
For different values of MAX we see the following timings (numbers are quick estimates of the average)
MAX | 128 MiB | 6144 MiB |
---|---|---|
10 | 90 us | 130 us |
100 | 110 us | 150 us |
1000 | 180 us | 220 us |
10000 | 1200 us | 1200 us |
100000 | 14000 us | 13500 us |
Similar issues may also exist for `tpie::priority_queue` and `tpie::file_stream`. Due to the nature of the levelized priority queue, we have very few elements in each sorter. Throughout the entire Apply-Reduce cycle during construction of the 8-Queens board there are 8²·8·(1+1+2) = 18432 uses of the `tpie::merge_sorter`. The above-mentioned slowdown of 876 ms therefore amounts to an average slowdown of roughly 47 µs per sorter (876 000 µs / 18 432 ≈ 47.5 µs); this matches what we see for 10 elements above.
I redid the above experiment with TPIE's internal memory data structures. Using a `tpie::internal_vector` with `tpie::parallel_sort` is indeed faster than using the `tpie::merge_sorter`, and its running time is independent of the amount of memory supplied. But the speed gain is underwhelming: the main difference is the guarantee that we do not get a 10x slowdown from increasing the amount of memory from 128 MiB to 300 GiB.
MAX (n) | 10 | 100 | 1000 | 10000 | 100000 |
---|---|---|---|---|---|
Time (us) | 20 us | 25 us | 100 us | 970 us | 11000 us |
Improvement (%) | 77.8 | 77.3 | 44.4 | 19.2 | 18.5 |
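For completeness, below is a rough sketch of this internal-memory variant of the earlier `tpie::merge_sorter` test. It assumes the free function `tpie::parallel_sort(begin, end, comp)` and the headers `<tpie/parallel_sort.h>` and `<tpie/internal_vector.h>`; the element generation and the sortedness check mirror the snippet above.

```cpp
#include <tpie/tpie.h>
#include <tpie/internal_vector.h>
#include <tpie/parallel_sort.h>

#include <cassert>
#include <chrono>
#include <iostream>

void sort_internal(const int MAX) {
  auto start = std::chrono::high_resolution_clock::now();

  // Same two descending sequences as in the merge_sorter test above.
  tpie::internal_vector<int> v(MAX);
  for (int i = MAX; i > 0; i = i-2)     { v.push_back(i); }
  for (int i = MAX - 1; i > 0; i = i-2) { v.push_back(i); }

  // Sort in internal memory.
  tpie::parallel_sort(v.begin(), v.end(), std::less<int>());

  // Check the result is sorted.
  int latest_value = 0;
  for (const auto i : v) { assert(latest_value <= i); latest_value = i; }

  auto stop = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
  std::cout << MAX << " elements: " << duration.count() << " microseconds" << std::endl;
}
```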
I also ran the same test with a `tpie::internal_priority_queue` to sort this set of numbers.
MAX (n) | 10 | 100 | 1000 | 10000 | 100000 |
---|---|---|---|---|---|
Time (us) | 1 us | 9 us | 70 us | 1000 us | 10500 us |
This is much more akin to what I was hoping for, even surpassing what I expected. More experiments are needed to verify that this is indeed the case in general. But if we can decrease the time spent on very small instances to 1/90, as for n = 10, then our algorithms will be on par with the other recursive BDD packages. This would eliminate one of the biggest arguments against adopting this BDD package.
I just ran `perf record --call-graph dwarf build/src/adiar_queens` on the benchmark repository to see a breakdown of the performance. Almost all of the running time seems to be spent on operations within TPIE. This is a good thing, since we can then expect most of our performance difference to be due to these data structures rather than some overhead we introduce. In other words, we can assume that the performance increases measured above translate 1:1 into Adiar.
From the experiment above we see that the parallel quicksort is as slow as the binary heap. This is most likely due to the overhead of managing and spawning threads. From various papers we know that quicksort should be faster than heapsort. Our previous experiment contradicts this, so I rewrote the test program to compare the `tpie::internal_priority_queue` (heap-sort) with `std::sort`. The number of µs spent on each sorting is as follows (rough estimates of the average over several runs).
N | 10 | 100 | 1000 | 10000 | 100000 | 1000000
---|---|---|---|---|---|---
`tpie::internal_priority_queue` | 10 | 170 | 2050 | 20100 | 285000 | 3350000
`std::sort` | 6 | 80 | 1000 | 9900 | 150000 | 1600000
That is, `std::sort` is roughly twice as fast as the heap-sort. This is exactly what we would expect.
Here is the full version of the `src/main.cpp` file I created for this experiment:
```cpp
// TPIE Imports
#include <tpie/tpie.h>

// ADIAR Imports
#include <adiar/adiar.h>

#include <algorithm>
#include <cassert>
#include <chrono>
#include <iostream>
#include <limits>

#include <stdlib.h>     /* srand, rand */
#include <time.h>       /* time */

int main(int argc, char* argv[]) {
  size_t M = 1024;

  try {
    if (argc > 1) {
      M = std::stoi(argv[1]);
    }
  } catch (std::invalid_argument const &ex) {
    tpie::log_info() << "Invalid number: " << argv[1] << std::endl;
  } catch (std::out_of_range const &ex) {
    tpie::log_info() << "Number out of range: " << argv[1] << std::endl;
  }

  adiar::adiar_init(M * 1024 * 1024);
  {
    // ===== Your code starts here =====

    // Generate Inputs
    const size_t MAX = 1000000;

    tpie::internal_vector<int> input(MAX);

    srand(time(NULL));
    for (size_t i = 0; i < MAX; i++) {
      input.push_back(rand() % MAX);
    }

    // Sort with tpie::internal_priority_queue, i.e. heap-sort
    {
      auto start = std::chrono::high_resolution_clock::now();

      tpie::internal_priority_queue<int> pq(MAX);
      for (const auto i : input) { pq.push(i); }

      int latest_value = std::numeric_limits<int>::min();
      for (size_t i = 0; i < MAX; i++) {
        assert(latest_value <= pq.top());
        latest_value = pq.top();
        pq.pop();
      }

      auto stop = std::chrono::high_resolution_clock::now();
      auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
      std::cout << "tpie::internal_priority_queue: " << duration << "us" << std::endl;
    }

    // Sort with std::sort, i.e. quicksort
    {
      auto start = std::chrono::high_resolution_clock::now();

      tpie::internal_vector<int> v(MAX);
      for (const auto i : input) { v.push_back(i); }

      std::sort(v.begin(), v.end(), std::less<int>());

      int latest_value = std::numeric_limits<int>::min();
      for (const auto i : v) {
        assert(latest_value <= i);
        latest_value = i;
      }

      auto stop = std::chrono::high_resolution_clock::now();
      auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
      std::cout << "std::sort                    : " << duration << "us" << std::endl;
    }

    // ===== Your code ends here =====
  }
  adiar::adiar_deinit();

  exit(0);
}
```
With #169 and #168 we added support for the `levelized_priority_queue` switching to internal memory if its max_size fits into the given amount of main memory. We now similarly want to derive a max_size for the levelized priority queues in all other algorithms and then switch to an internal memory variant whenever possible.
We will start with the simple bounds below and later ( #165 ) replace them with (most often much) better bounds. So, it is a good idea to already now add an (inline) helper function to compute each max_size below; a sketch of what such a helper could look like follows the list.
Worst-case bounds for top-down Levelized Priority Queue (and the others)
The following are some simple O(1)-derivable bounds on the levelized priority queues. Based on these, we can then switch between the different versions.
- [x] `count`: N (Steffan)
  There are a total of 2N arcs to forward the count, but to create two requests from a single node at least one request has to be consumed.
- [x] `product_construction`: (Nf+2) · (Ng+2) + 2 (Anna)
  At worst, every node is paired with every other node (or leaf). Yet again, even though this totals 2·(Nf+2)·(Ng+2) arcs, at least one arc is consumed for every two created. Unlike for Path/SAT Count, we push the requests for the children before popping their parent.
- [x] `quantification`: Nf² + 2 (Anna)
  At worst, every node is paired with every other node and we are also traversing the entire original BDD. Notice that due to the use of `is_negating` and `is_commutative` we never have tuples with boolean values, so there is no +2 inside of the quadratic.
- [x] `bdd_ite`: Nf · (Ng+2) · (Nh+2) + Ng + Nh + 2 (Anna)
  Same argumentation as for the product construction, but similar to Quantification we can abuse the collapsing into only one or two parts of the input.
- [x] `substitution`: N+2 (Anna)
  There are a total of 2N arcs, of which again at most half can be present at the same time, plus the two arcs for the current node.
- [x] `compare_check`: N1 · N2 (Steffan or Anna)
  This depends on the early-termination strategy for the level. The one used for `zdd_subset`, `zdd_subseteq`, and `zdd_disjoint` has the above quadratic bound. For `is_isomorphic` the bound is N: the early termination on too many tuples treated on the same level results in exactly 2N requests placed in the queue, of which at most N can be present at the same time, since in-going arcs are consumed before new arcs are pushed. The best thing is probably to place this derivation inside of a policy.
- [x] `intercut`: (Anna)
  This algorithm has two priority queues, each of which has its own responsibilities.
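As a starting point, here is a sketch of what the proposed (inline) helper functions could look like. The function and variable names are placeholders and not (yet) part of Adiar; the bound shown is the one for the product construction from the list above, and the internal/external switch only sketches the intended logic.

```cpp
#include <cstddef>

// Worst-case max_size of the product construction's levelized priority
// queue for inputs with nodes_f and nodes_g nodes: (Nf+2) * (Ng+2) + 2.
// (Overflow is ignored in this sketch.)
inline std::size_t prod_pq_max_size(std::size_t nodes_f, std::size_t nodes_g)
{
  return (nodes_f + 2u) * (nodes_g + 2u) + 2u;
}

// Switch to the internal-memory levelized priority queue if 'max_size'
// elements of 'elem_size' bytes fit within the memory given to it.
inline bool use_internal_pq(std::size_t max_size, std::size_t elem_size,
                            std::size_t memory_bytes)
{
  return max_size <= memory_bytes / elem_size;
}
```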