Compute Hash Values for Decision Diagrams

To use BDDs, resp. ZDDs, in any hash-related context (e.g. std::unordered_map and std::unordered_set), one needs to derive a hash value for each BDD, resp. ZDD.

Notice that we only care about the hash value at the root. So, we do not need to store a hash value within each and every node. Instead, similar to canonicity in #127, we can store the hash of the root (and of its negation) as two numbers in the shared_levelized_file<node> and merely propagate the hash values in the priority queue of reduce.

It is important to note that, until we have #433, we need the hash of both the unnegated and the negated BDD. Yet, both can be computed in parallel by also negating all nodes during the hash computation.

Hash Functions

What follows are three hash function approaches of increasing complexity (and quality). They can also potentially be combined.

Meta Hash

Currently, we already provide quite a few bits of meta-information, such as the BDD's and ZDD's width and the number of arcs to terminals. We can also derive the size and the top variable by using a single I/O to load the first block from disk.

We can combine all of these numbers into a single 64-bit number for a relatively decent hash function.
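A minimal sketch of such a combination is shown below. The struct `dd_meta`, its field names, and the helper `combine` are hypothetical and only stand in for the meta-information listed above; they are not Adiar's API. The mixing step is a 64-bit variant of the well-known Boost-style hash_combine, chosen here purely as an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bundle of the meta-information; names are illustrative only.
struct dd_meta
{
  uint64_t width;         // width of the decision diagram
  uint64_t terminal_arcs; // number of arcs to terminals
  uint64_t size;          // number of nodes
  uint64_t top_var;       // top variable
};

// 64-bit variant of the Boost-style hash_combine (an assumed mixing step).
inline uint64_t combine(uint64_t seed, uint64_t v)
{
  return seed ^ (v + 0x9e3779b97f4a7c15ULL + (seed << 6) + (seed >> 2));
}

// Fold all meta values into a single 64-bit number.
inline uint64_t meta_hash(const dd_meta& m)
{
  uint64_t h = 0;
  h = combine(h, m.width);
  h = combine(h, m.terminal_arcs);
  h = combine(h, m.size);
  h = combine(h, m.top_var);
  return h;
}
```

Two diagrams with identical meta-information necessarily collide, so this would mostly serve as a cheap first filter.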

Linear Hash

In Adiar, a BDD is a list of nodes. Each node can be thought of as a character and hence the entire BDD as a string. If we can somehow hash a node into 64 bits, then we can accumulate all of them. This can be done easily by merely extending node_writer::unsafe_push(const node& n).

Hashing a single Node

Each BDD node is a triple ((x, id), low, high) of 64-bit integers.

Requirements

The hash of the false terminal is 0.
The hash of the true terminal is 1.
The hash of every other node is 2 or greater.
The uid level identifier probably should not impact the hash. That is, the hash of ((x,42), f, g) is the same as the one for ((x,21), f, g).
The children cannot just be XOR'ed, since the two (identical) children of a suppressible BDD node would otherwise cancel out. A node like this would be part of.

node

If it is a terminal:
1. Use .value() as the hash
For any other node:
1. Abuse the fact that the low arc is never flagged for attributed edges [Brace90, Minato01]. So, we can right-shift the raw value of low by one. Relatively speaking, this is (almost) equivalent to the high child being multiplied by two.
2. Obtain the uid.level() and multiply it by some odd prime (probably 3 is a good choice). This ignores the level identifier.
3. Add all three values together.
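The steps above can be sketched as follows. The struct `node_sketch` is a simplified stand-in and not Adiar's node class; in particular, the raw child encodings are just plain 64-bit integers here. The final clamp to values of 2 or more is an assumption added to satisfy the requirements, since the three steps alone do not guarantee it.

```cpp
#include <cassert>
#include <cstdint>

// Simplified stand-in for a BDD node; not Adiar's actual `node` class.
struct node_sketch
{
  bool     is_terminal; // true for the false/true terminals
  bool     value;       // terminal value (only used if is_terminal)
  uint64_t level;       // uid.level(), i.e. the variable x
  uint64_t id;          // level identifier (deliberately unused below)
  uint64_t low_raw;     // raw 64-bit encoding of the low child
  uint64_t high_raw;    // raw 64-bit encoding of the high child
};

inline uint64_t hash_node(const node_sketch& n)
{
  // Terminals hash to their value: false -> 0, true -> 1.
  if (n.is_terminal) { return n.value ? 1u : 0u; }

  const uint64_t low = n.low_raw >> 1; // step 1: drop the unused flag bit
  const uint64_t lvl = 3u * n.level;   // step 2: level times a small prime
  const uint64_t sum = low + n.high_raw + lvl; // step 3: wraps modulo 2^64

  // Assumed clamp: keep 0 and 1 reserved for the terminals.
  return sum < 2u ? sum + 2u : sum;
}
```

Since low is shifted but high is not, a node with two identical children contributes a non-zero value rather than cancelling out.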

Considering that most BDD nodes have large level identifiers, this seems prone to overflow. It might be useful to invert the level identifier of both children; simply bit-wise negating all of the level-identifier bits (if .is_node() is true) may suffice.

Accumulating Hashes

Based on [Thorup15], the mathematically sound solution is to compute the sum of hash(n_i) aⁱ modulo p, where a is a seed and p is a prime number. The index i increases in the order in which the nodes are pushed. To make the modulo operation fast, we would want to use a Mersenne prime such as 2³¹-1 or 2⁶¹-1.
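As a rough sketch of this accumulation, the class below keeps a running sum and the current power of a. It uses the Mersenne prime 2³¹-1 so that all products fit in 64 bits using only standard C++; a larger prime would need wider multiplication. The names `poly_hash`, `mod_p31` and `P31` are made up for this illustration.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t P31 = (uint64_t{1} << 31) - 1; // Mersenne prime 2^31 - 1

// Reduce x modulo 2^31 - 1 using shifts and masks only. Requires x < 2^62.
inline uint64_t mod_p31(uint64_t x)
{
  x = (x & P31) + (x >> 31); // now x < 2^32
  x = (x & P31) + (x >> 31); // now x <= 2^31
  return x >= P31 ? x - P31 : x;
}

// Accumulates sum_i hash(n_i) * a^i (mod p), with i = 0 for the first push.
class poly_hash
{
  uint64_t a_;       // seed, reduced into [1, p)
  uint64_t pow_ = 1; // a^i for the next node to be pushed
  uint64_t sum_ = 0; // running hash value

public:
  explicit poly_hash(uint64_t seed) : a_(seed % P31)
  {
    assert(a_ != 0); // a zero seed would ignore all but the first node
  }

  void push(uint64_t node_hash)
  {
    const uint64_t h = node_hash % P31;   // bring the 64-bit hash below p
    sum_ = mod_p31(sum_ + mod_p31(h * pow_));
    pow_ = mod_p31(pow_ * a_);
  }

  uint64_t value() const { return sum_; }
};
```

A call to push would then sit next to each node_writer::unsafe_push, with the seed a chosen once for the whole program.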

Recursive

Note: this depends on #412 .

As Randal E. Bryant was thinking about ways to improve the performance of Adiar's equality checking, he was reminded of the work of [Blum80] on hashing BDDs and ZDDs. Since the equality checking in #127 already resolved this in a much better way, I have not pursued this further.

Implementation in Reduce

Let p be a prime number (though the math may work out even when doing all computations with the non-prime p = 2^k, i.e. by abusing the overflow of unsigned integers). Consider a hash function H (all numbers computed modulo p) defined as follows

Leaves hash to their value, i.e. H(0) = 0 and H(1) = 1
Variables x_i hash to a random value in [0;p)
Internal nodes hash as follows: H((x_i, v₀, v₁)) = H(x_i) H(v₁) + (1 - H(x_i)) H(v₀)

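Then, for a prime p, the probability that two different BDDs share the same hash value is small: on the order of 1/p (by the Schwartz-Zippel lemma, at most n/p for n variables).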
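To make the definition of H concrete, here is a small sketch that evaluates it bottom-up over a list of nodes, which is the order Reduce would see them in. The node layout (`rnode`, with children given as indices), the function name and the choice p = 2³¹-1 are assumptions for illustration only. The per-variable values are passed in explicitly here; in practice they would be drawn at random from [0;p).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr uint64_t PRIME = (uint64_t{1} << 31) - 1; // p = 2^31 - 1

// Hypothetical node layout: child index 0 is the false terminal, 1 is the
// true terminal, and k + 2 refers to nodes[k].
struct rnode
{
  std::size_t var;  // index i of the variable x_i
  std::size_t low;  // v0
  std::size_t high; // v1
};

// Nodes must be listed children-first; the last node is the root.
inline uint64_t recursive_hash(const std::vector<rnode>& nodes,
                               const std::vector<uint64_t>& var_hash)
{
  assert(!nodes.empty());
  std::vector<uint64_t> H(nodes.size() + 2);
  H[0] = 0; // H(0) = 0
  H[1] = 1; // H(1) = 1

  for (std::size_t k = 0; k < nodes.size(); ++k) {
    const rnode& n = nodes[k];
    assert(n.low < k + 2 && n.high < k + 2); // children already hashed

    const uint64_t hx        = var_hash[n.var] % PRIME;
    const uint64_t one_minus = (PRIME + 1 - hx) % PRIME; // (1 - H(x_i)) mod p
    // H(x_i) H(v1) + (1 - H(x_i)) H(v0); each product is below 2^62.
    H[k + 2] = ((hx * H[n.high]) % PRIME + (one_minus * H[n.low]) % PRIME) % PRIME;
  }
  return H.back();
}
```

Note that a node with two identical children hashes to the hash of that child, and the negation of f hashes to 1 - H(f), so the hash follows the Boolean function rather than the graph.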

Testing

It seems hard to unit test a hash function. We may just want to create a benchmark to evaluate empirically the number of collisions?

Applications

[ ] Provide a std::hash for Adiar's BDDs and ZDDs that use these hash values as keys in a hash table.
[ ] Use the hash value as another constant-time fail-fast check in the equality checking algorithm (probably only if canonical).
[ ] Use the hash value(s) as a check-sum when scanning the BDD with a node_stream. This way, we can identify soft data corruptions.
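The first item could look roughly like the sketch below. The type `sketch::bdd` and its members are hypothetical placeholders: `root_hash` stands for the hash stored alongside the file, and `file_id` merely stands in for Adiar's real equality check. Only the std::hash specialization pattern itself is standard C++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_set>

namespace sketch
{
  // Hypothetical placeholder for a decision diagram handle.
  struct bdd
  {
    int      file_id;   // stand-in for the underlying file of nodes
    uint64_t root_hash; // precomputed hash of the root

    // Stand-in for the actual equality checking algorithm.
    bool operator==(const bdd& o) const { return file_id == o.file_id; }
  };
}

namespace std
{
  // Specializing std::hash for a user-defined type is permitted.
  template <>
  struct hash<sketch::bdd>
  {
    size_t operator()(const sketch::bdd& f) const noexcept
    {
      // Constant time: only the stored number is read.
      return static_cast<size_t>(f.root_hash);
    }
  };
}
```

With this in place, std::unordered_set<sketch::bdd> and std::unordered_map keyed on sketch::bdd work without passing a custom hasher; diagrams whose hashes collide are still told apart by operator==.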

References

[Blum80] Manuel Blum, Ashok K. Chandra, and Mark N. Wegman. “Equivalence of free boolean graphs can be decided probabilistically in polynomial time”. In: Information Processing Letters 10(2). pp. 80 – 82 (1980)
[Brace90] Karl S. Brace, Richard L. Rudell, Randal E. Bryant. “Efficient implementation of a BDD package”. In: 27th ACM/IEEE Design Automation Conference. pp. 40 – 45 (1990)
[Minato01] Shin-ichi Minato. “Zero-suppressed BDDs and their applications”. In: International Journal on Software Tools for Technology Transfer pp. 156 – 170 (2001)
[Thorup15] Mikkel Thorup. “High Speed Hashing for Integers and Strings”. In: arXiv. (2015)
