Implement common subexpression elimination

Yes, what you're describing sounds exactly like common subexpression elimination (CSE). You already have the right idea: instead of recalculating the same expression multiple times, you reference the previously computed result. Here's how you can approach it efficiently:

Expression Hashing:

Assign a unique identifier (hash) to each subtree as it's generated. The hash would represent the structure and contents of the subtree.

As you traverse the tree, store these hashes in a lookup table (hash map). If a subtree with the same structure is encountered again, you'll know it's a repeated sub-expression.

Memoization Table:

Keep a table where each hash points to the first computation of that subtree. When you encounter a repeated sub-expression (based on the hash), instead of generating new code, you emit a reference to the result of the first computation.

In practice, this would be as simple as replacing the node with a pointer to the result of the initial computation.

Inserting Result References:

As you generate the code for evaluating the expression, if you detect that a subtree has already been computed, emit a reference to the previously computed result. Essentially, the node remains in place but is flagged with a reference to the stored result, so when the generator reaches that node, it doesn’t generate redundant code.

Steps:

Traversal: Perform a depth-first traversal of the expression tree.
Hashing: For each node, compute a hash based on its operator, operands, and any associated data.
Lookup: Check if this hash is already in the memoization table.

If yes, mark the current node to reference the earlier computation.

If no, insert the hash into the table with a reference to the current node's result.

Code Generation: When emitting code, check if the node is flagged as a previously computed sub-expression. If so, emit a reference to the stored result instead of re-evaluating it.

This method ensures you only compute any repeated sub-expression once and reuse the result, saving computational time during execution. Your approach of tagging the node to point to the initial computation result is exactly the right way to manage this.

This is essentially an optimization pass during the expression compilation, often applied after an intermediate representation (IR) has been generated. Let me know if you need more details or clarification.

Yes, Java's HashMap automatically handles collisions using chaining. When two keys hash to the same value (a collision), Java stores both keys in a bucket associated with that hash value. In this case, the HashMap internally uses a linked list or tree to store all the key-value pairs that share the same hash.

Here's what happens during a collision:

If multiple keys produce the same hash, they are placed in the same bucket.

Each bucket can store multiple entries as a list of key-value pairs.

Java then checks the key's equality using equals() to differentiate between entries in the same bucket.

In more recent versions (Java 8+), once a bucket becomes large (i.e., more than a threshold number of elements), Java optimizes the lookup by converting the bucket from a linked list to a balanced tree (red-black tree) for faster access, reducing the complexity from O(n) to O(log n) for lookups in the case of heavy collisions.

So, you don’t need to worry about collisions manually. The HashMap is designed to handle them efficiently as long as your hashCode() and equals() methods are implemented correctly for your objects (like expression nodes).

If you encounter too many collisions, you might want to look into improving the way your expressions are hashed, but Java's built-in handling is generally robust.

crowlogic / arb4j

Implement common subexpression elimination #518