fix: 1.9x improvement to median q-error by fixing multi-equality join selectivity

Summary: Previously, we computed multi-equality join selectivity by building an MST of the join graph. However, the correct method is to take the N-1 nodes with the highest n-distinct values.

Demo: This fix causes us to beat Postgres on median q-error for the first time ever. We also now beat Postgres on p90 q-error for the first time ever. Overall, it improves our median q-error by 1.9x, p90 q-error by 3.4x, p95 q-error by 42.1x, p99 q-error by 2.6x, and lets us beat Postgres on 9 queries we previously didn't beat them on.

Before (after changing DEFAULT_PRECISION and DEFAULT_K_TO_TRACK but before multi-equality fix): Screenshot 2024-04-27 at 16 00 18

After: Screenshot 2024-04-27 at 20 02 00

Details:

To see the problem, consider joining three tables where T1 has 2 distinct values, T2 has 3, and T3 has 4. Assume all values only occur once per table, and all values in the smaller tables appear in the larger tables. The cartesian product of all three tables is 24 and the join result is 2, so the overall selectivity should be 1/12. However, the old method would sometimes give an overall selectivity of 1/16 because there are two edges in the join graph (T1-T3 and T2-T3) which both have a selectivity of 1/4.
Added rigorous unit tests to test all possible permutations of three-table joins. Before the fix, some of these tests were failing. After the fix, these tests passed.
Properly handling cases where we are adding a predicate that either extends an existing connected component or merges two existing connected components. Added unit tests for both of these cases as well.

cmu-db / optd

fix: 1.9x improvement to median q-error by fixing multi-equality join selectivity #171