SSoelvsten / bdd-benchmark

Benchmarking Suite for BDD packages
MIT License

Where can I find the latest results of the benchmark? #145

Open sirandreww opened 2 days ago

sirandreww commented 2 days ago

Edit (Steffan): Tasks

SSoelvsten commented 2 days ago

What kind of "results" are you thinking of? As far as I know, this set of benchmarks has not yet been used as a survey of BDD packages. If that were done, more BDD packages should be added (see also #12). I hadn't thought of maintaining a document with such information on this repository.

The best I can do is create a list of papers that have used this set of benchmarks. But those papers each focus on a specific BDD package; that is, they often give a relative rather than an absolute view.

sirandreww commented 2 days ago

Adding a list of papers that contain benchmark results would be excellent. I think it would be great if one could have a graph in the README comparing the different implementations by speed and memory usage. That way, the performance differences between these libraries would be known, which would drive their improvement further.

SSoelvsten commented 1 day ago

With respect to: List of Publications

I have pushed a list of publications in dc57f16.

@nhusung, I assume Configuring BDD Compilation Techniques for Feature Models didn't use this benchmarking suite, but rather that the newly added CNF benchmark is a generalisation of it for future work? Hence, the paper shouldn't be added to the list.

nhusung commented 1 day ago

Yes, for that paper, we directly used oxidd-cli. The recently added CNF benchmark is based on our findings in Configuring BDD Compilation Techniques for Feature Models, and corresponds to the “balanced” construction.
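
If it helps to make this concrete: assuming "balanced" here means conjoining the per-clause BDDs level by level in a (roughly) balanced binary tree of depth O(log n), rather than in a left-to-right chain of depth O(n), a minimal, package-agnostic sketch could look as follows. The `balanced_reduce` helper and the string-based demo are purely illustrative and are not part of this repository's adapter interface.

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Combine the elements of `xs` pairwise, level by level, so that the
// combination tree is (roughly) balanced: depth O(log n) instead of the
// O(n) depth of a left-to-right fold. Assumes `xs` is non-empty.
template <typename T, typename Op>
T balanced_reduce(std::vector<T> xs, Op op) {
  while (xs.size() > 1) {
    std::vector<T> next;
    next.reserve((xs.size() + 1) / 2);
    for (std::size_t i = 0; i + 1 < xs.size(); i += 2) {
      next.push_back(op(xs[i], xs[i + 1]));
    }
    if (xs.size() % 2 == 1) { next.push_back(xs.back()); }
    xs = std::move(next);
  }
  return xs.front();
}

int main() {
  // Stand-in demo with strings; the parenthesisation shows the tree shape.
  // For a CNF construction, `xs` would instead hold one BDD per clause and
  // `op` the package's conjunction.
  const std::vector<std::string> clauses { "c1", "c2", "c3", "c4", "c5" };
  std::cout << balanced_reduce(clauses,
                               [](const std::string &a, const std::string &b) {
                                 return "(" + a + " & " + b + ")";
                               })
            << "\n";  // prints (((c1 & c2) & (c3 & c4)) & c5)
}
```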

SSoelvsten commented 1 day ago

With respect to: Results

There are a lot of different variables one can tweak with respect to (1) the machine of choice, (2) the common settings used, and (3) the settings of each individual BDD package.

There exist the following comparative studies of BDD packages. Most of them are (unsurprisingly) from the 90s, so a more up-to-date one would be beneficial to the community.

Again, note that there are quite a lot of choices to be made when trying to make a "fair" comparison.

Things to work out

I'd have to think a bit about how I/we could do something like this (and how to keep it somewhat up to date). It very quickly turns into me (1) conducting a survey and/or (2) creating a BDD competition. I'm not entirely sure I am keen on taking on either right now.

These are just a few of the things one would have to make decisions on.

For this to be of any use, one would then also need to include more BDD packages (see also #12).

sirandreww commented 7 hours ago

I think the best approach would be to provide all the relevant data and let the user decide. And since BDDs might be used in similar ways to SAT solvers, we can take inspiration from the SAT competition.

  1. A timeout of 1 hour should be enforced (see the sketch after this list for how items 1–5 could be enforced and measured).
  2. The time to solve should also be recorded.
  3. Initialization time should be recorded separately, but only if this is easy to do.
  4. A memory limit of, say, 16GB should be enforced? Or alternatively, several different memory limits should be used.
  5. Memory consumption should be measured in GB to remain implementation-independent.
  6. Maybe BDD libraries that use the disk should have their own track?
  7. Multi-threaded libraries should participate in two tracks: single-core and multi-core.
  8. Thread safety is an implementation detail in my opinion, so it shouldn't matter when it comes to performance comparisons.
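
To make items 1–5 a bit more concrete, here is a rough, Linux-only sketch of how a harness could enforce the limits and record the measurements around a single run. The constants and structure are purely illustrative and not something the repository does today.

```cpp
#include <sys/resource.h>  // setrlimit, getrusage (POSIX / Linux)
#include <chrono>
#include <cstdlib>
#include <iostream>

// Illustrative per-run limits, mirroring items 1 and 4 above.
constexpr rlim_t MEM_LIMIT_BYTES = 16ull * 1024 * 1024 * 1024;  // 16 GiB
constexpr rlim_t CPU_LIMIT_SECS  = 60 * 60;                     // 1 hour

int main() {
  // Cap the address space: allocations beyond the limit fail instead of swapping.
  rlimit mem { MEM_LIMIT_BYTES, MEM_LIMIT_BYTES };
  if (setrlimit(RLIMIT_AS, &mem) != 0) { return EXIT_FAILURE; }

  // Cap CPU time: the process receives SIGXCPU once the limit is exceeded.
  rlimit cpu { CPU_LIMIT_SECS, CPU_LIMIT_SECS };
  if (setrlimit(RLIMIT_CPU, &cpu) != 0) { return EXIT_FAILURE; }

  const auto t_start = std::chrono::steady_clock::now();

  // ... initialise the BDD package and run the benchmark instance here ...

  const auto t_end  = std::chrono::steady_clock::now();
  const double secs = std::chrono::duration<double>(t_end - t_start).count();

  // Peak resident set size of this process; ru_maxrss is in KiB on Linux.
  rusage usage {};
  getrusage(RUSAGE_SELF, &usage);
  const double peak_gib = usage.ru_maxrss / (1024.0 * 1024.0);

  std::cout << "time-to-solve: " << secs     << " s\n"
            << "peak memory:   " << peak_gib << " GiB\n";
  return EXIT_SUCCESS;
}
```

A wall-clock timeout (rather than the CPU-time cap shown above) would typically be enforced by the parent process killing the child, e.g. by running each instance under `timeout 1h`.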

In conclusion, 3 tracks: