Troubleshooting Materialize's database functionality, particularly its resource usage, is very very difficult. The dataflow technology is very different and "distant from" the SQL paradigm through which we have users interacting with the system.
Product outcome
Over a long period of time, working incrementally, we'd like to bridge the paradigm gap between dataflow technology and standard relational database mental models, enabling both Field Engineering (initially) and maybe some day end users (this is a very stretch goal and, frankly, with low probability of happening) to understand and troubleshoot why a particular query -- or change to a well-working query -- results in significant memory use changes or run time.
### Tasks
- [ ] Replace "arity" with "number of columns"
- [ ] https://github.com/MaterializeInc/materialize/issues/25732
- [ ] https://github.com/MaterializeInc/materialize/issues/22231
The Q2 Roadmap for this longer-term Epic is to do the following changes in the MIR-based plan:
Replace "arity" with "number of columns"
Change the "#
" indicator with the name of the column. If the same column name shows up from different tables, prepend with the "
." syntax
If we can get an estimate of the number of rows that will be queried from table/view/etc, add that information to the plan.
Expose when differential joins are part of the plan, possibly with some kind of syntax highlighting or outright explanation that additional memory use is involved in this kind of join
Eliminate showing the ArrangeBy that is part of a ReadIndex
Come up with a design that will convey long plans in a much shortened form, possibly with the short form being default, and the long form exposed with use of a "verbose" option
Mark min/max and topk that it's expensive. (maybe surround with ***, maybe collect them at the top or end, or explain insights )
Creating index on something that's filtered and/or projected. Take this into account in plan insights and/or write it down somewhere how to figure out whether you should create the index.
Change the "#" indicator with the name of the column. If the same column name shows up from different tables, prepend with the "." syntax
This was the case before, but it was changed to the current syntax in #22959 in order to address some internal feedback from Frank. The main reason is that column names (even if they are fully qualified) can be ambiguous.
If we can get an estimate of the number of rows that will be queried from table/view/etc, add that information to the plan.
We can, but this requires some additional plumbing of storage statistics through the optimizer. Also, currently our cardinality estimation is quite poor. Overall I think investing in this might take disproportionate time from the allocated three weeks with questionable utility.
Expose when differential joins are part of the plan, possibly with some kind of syntax highlighting or outright explanation that additional memory use is involved in this kind of join
I suggest to add a "Insights" section at the end of the plan (after the "Notices" section) that includes this information. We should exclude insights for differential joins with two inputs (they cannot be delta joins so we can't do better). For example:
Insights:
- Potentially memory-expensive differential join with and 4 inputs.
Alternatively, we can make this an optimizer notice and render it under Notices.
Eliminate showing the ArrangeBy that is part of a ReadIndex
Can do!
Come up with a design that will convey long plans in a much shortened form, possibly with the short form being default, and the long form exposed with use of a "verbose" option.
I'm not entirely sure how to do that. The short form is EXPLAIN RAW PLAN (which renders an HIR plan). Once we lower to MIR we lower a lot of high-level constructs into blobs of MIR nodes which makes the plans with outer joins and global aggregates much longer. There were previous attempts to add more operators to MIR but I don't think we have consensus that this is a good idea.
Mark min/max and topk that it's expensive. (maybe surround with ***, maybe collect them at the top or end, or explain insights )
As with differential joins, I propose to make this part of an "Insights" section at the end of the plan:
Alternatively, we can make this an optimizer notice and render it under Notices.
Creating index on something that's filtered and/or projected. Take this into account in plan insights and/or write it down somewhere how to figure out whether you should create the index.
Can you expand on this? Do you mean suggesting indexes that would speed up the implementation of a Filter or a Join?
est 3 weeks
I doubt that we can fit all of this work over three weeks. I'll start with the two tasks that can be immediately implemented, and we'll have to revisit the estimates of the rest once we have a better idea and broad agreement how to approach them.
There's a lot of good stuff here to respond to but at the moment what I wanted to say is that I really, really like the idea of an "Insights" section at the end. Let me run that by Field Eng, too, and see what they think.
Troubleshooting Materialize's database functionality, particularly its resource usage, is very very difficult. The dataflow technology is very different and "distant from" the SQL paradigm through which we have users interacting with the system.
Product outcome
Over a long period of time, working incrementally, we'd like to bridge the paradigm gap between dataflow technology and standard relational database mental models, enabling both Field Engineering (initially) and maybe some day end users (this is a very stretch goal and, frankly, with low probability of happening) to understand and troubleshoot why a particular query -- or change to a well-working query -- results in significant memory use changes or run time.