MaterializeInc / materialize

The data warehouse for operational workloads.
https://materialize.com
Other
5.68k stars 458 forks source link

Re-enable cardinality estimation in `JoinImplementation` #25805

Open aalexandrov opened 4 months ago

aalexandrov commented 4 months ago

We had to revert the cardinality estimation code in JoinImplementation #25784 because the introduction of the new JoinFusion in #25502 caused our tests to run with slightly different intermediate plans that revealed some inefficiencies in the abstract interpretation of Variadic::Or:

https://github.com/MaterializeInc/materialize/blob/a800d2cef82abe1fe6c8e1eeafaf55a3885c0cbb/src/transform/src/attribute/cardinality.rs#L206-L227

More specifically, the current speculation was that for the reproduction query reported in #25769 causes quite a large expansion on line 224 that involves a lot of clones (basically hitting the TODO on line 223). I suggest to use a trimmed down version of the closed formula for P(A[1] OR ... OR A[n]) expansion

Screenshot from 2024-03-05 22-29-45

where we expand only until the P(— AND — AND —) terms and leave out terms with more than three contributing variables or come up with another proxy for the true formula that would save us the computational cost of expanding the whole term.

A query or a plan similar to the ones #25769 should be used as a regression test (*.slt or *.spec).

mgree commented 4 months ago

Yikes! An alternative would be to change the cardinality estimation to be concrete throughout, and then we wouldn't need to do any expression copying at all, just some arithmetic this isn't quite right. We would still have a combinatorial amount of work to do, but we could probably avoid having to allocate as much as the old code did.

I had left the cardinality expressions symbolic in the hope that we would have other things to do with them, but after the weak results from initial work with cardinality estimation, we never found another use for them.

aalexandrov commented 4 months ago

@frankmcsherry recently introduced a new analysis framework (based on the Analysis trait) which is meant to replace Attribute. I'm not 100% sure, but it feels that holding a &'a dyn StatsOracle within the CardinalityEstiamtion implementation of the inference algorithm might not be 100% compatible with the constraints of the framework, as it currently only has an Analysis: 'static bound.

mgree commented 4 months ago

I see. We ought to be able to collect stats in advance---a bit of a refactor, but we can just have an owned map from IDs to cardinalities. IIRC we had talked about caching this kind of information in the controller/optimizer(s?) as part of platform v2.