apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Representing statistics in logical or physical plans (or in both of them) #4003

Open · isidentical opened 1 year ago

isidentical commented 1 year ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

There seem to be a few use cases that could be unlocked by having statistics available in the logical plan.

Describe the solution you'd like
There were 3 options proposed in the discussion:

1. Represent statistics only in the logical plan
2. Represent statistics only in the physical plan
3. Represent statistics in both the logical and the physical plan

Describe alternatives you've considered
Keep statistics as they are today and do nothing else.

Additional context
This is a spin-off from the cost calculations/estimations in #3929 (also related to #3983 and #3984). The original discussion can be found here, with @mingmwang @isidentical @Dandandan @jackwener @andygrove @alamb. It also includes a lot of material on what other query engine / database systems are doing (so I recommend reading it; this issue is just a summary to continue the discussion in a more structured/public place).
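
For concreteness, here is a minimal sketch of the kind of information "statistics" means in this thread, loosely modeled on the shape of DataFusion's physical Statistics type at the time (field names are illustrative, not a proposal for an exact API):

```rust
/// Illustrative only: roughly the shape of the plan-level statistics being
/// discussed, whether attached to a logical or a physical plan node.
#[derive(Clone, Debug, Default)]
pub struct PlanStatistics {
    /// Estimated (or exact) number of output rows, if known.
    pub num_rows: Option<usize>,
    /// Estimated total output size in bytes, if known.
    pub total_byte_size: Option<usize>,
    /// Per-column statistics, if known.
    pub column_statistics: Option<Vec<ColumnStats>>,
    /// Whether the numbers above are exact or merely estimates.
    pub is_exact: bool,
}

#[derive(Clone, Debug, Default)]
pub struct ColumnStats {
    pub null_count: Option<usize>,
    pub distinct_count: Option<usize>,
    /// Min/max are strings here only to keep the sketch dependency-free;
    /// a real implementation would use a ScalarValue-like type.
    pub min_value: Option<String>,
    pub max_value: Option<String>,
}
```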

alamb commented 1 year ago

Thank you for this discussion @isidentical - the summary is quite nice

There are some cases where the physical plan will have better information available to it than the logical plan (e.g. it may have read the parquet metadata footer and so have much more accurate statistics than can be inferred from the file names alone).

Thus I think having statistics available for both logical and physical planning makes sense (aka option 3) -- that way DataFusion can take the best advantage of whatever information is available.

My preferred solution is to keep statistics in both places and keep the code that operates on them (e.g. the expression range analysis code in #3912) in the physical exprs, as it is very tightly tied to how each expression is evaluated (nulls, etc.).
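
As a concrete illustration of why the physical layer can know more: a scan can get an exact row count out of a Parquet footer without touching any data pages. A minimal sketch using the `parquet` crate (this is not DataFusion's actual code path):

```rust
use std::error::Error;
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

/// Return the exact row count recorded in a Parquet file's footer.
/// Only the footer is read; no data pages are scanned.
fn footer_row_count(path: &str) -> Result<i64, Box<dyn Error>> {
    let file = File::open(path)?;
    let reader = SerializedFileReader::new(file)?;
    Ok(reader.metadata().file_metadata().num_rows())
}
```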

Dandandan commented 1 year ago

Another benefit of keeping the CBOs at the physical level is that the costs can differ between operator implementations. E.g. a hash join and a sort merge join have different costs associated with them: for a hash join, swapping the order of the build and probe sides has a big impact, but a sort merge join (or another join type) might not benefit from this change (or at least benefits in very different ways).

If we moved the optimization to the logical plan, we would need a way to expose this cost information to the logical plan too.
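
As a hypothetical sketch of the physical-level decision described above (names are made up for illustration, not DataFusion's API): build the hash table on whichever input the statistics say is smaller, a choice that has no analogue for a sort merge join.

```rust
/// Decide whether to swap a hash join's inputs so that the hash table
/// is built on the smaller side. Without estimates, keep the plan as is.
fn should_swap_hash_join_inputs(
    left_rows: Option<usize>,
    right_rows: Option<usize>,
) -> bool {
    match (left_rows, right_rows) {
        (Some(left), Some(right)) => left > right,
        _ => false,
    }
}
```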

mingmwang commented 1 year ago

Costs might be different for different implementations like hash join vs. sort merge join, but the stats (output rows, size, min/max, ...) are the same. Even for different join orderings, [A inner join B inner join C] vs. [C inner join B inner join A], the stats are exactly the same.
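
To see why the estimates are order-independent: the textbook cardinality estimate for an inner equi-join is symmetric in its inputs, so any ordering yields the same estimated output rows even though execution costs can differ. A minimal sketch (real estimators are more elaborate):

```rust
/// Classic inner equi-join cardinality estimate:
/// |A JOIN B| ~= |A| * |B| / max(ndv(A.key), ndv(B.key)).
/// Swapping A and B leaves the result unchanged.
fn estimated_join_output_rows(
    a_rows: usize,
    b_rows: usize,
    a_key_ndv: usize,
    b_key_ndv: usize,
) -> usize {
    a_rows * b_rows / a_key_ndv.max(b_key_ndv).max(1)
}
```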

mingmwang commented 1 year ago

> Thank you for this discussion @isidentical - the summary is quite nice
>
> There are some cases where the physical plan will have better information available to it than the logical plan (e.g. it may have read the parquet metadata footer and so have much more accurate statistics than can be inferred from the file names alone).
>
> Thus I think having statistics available for both logical and physical planning makes sense (aka option 3) -- that way DataFusion can take the best advantage of whatever information is available.
>
> My preferred solution is to keep statistics in both places and keep the code that operates on them (e.g. the expression range analysis code in #3912) in the physical exprs, as it is very tightly tied to how each expression is evaluated (nulls, etc.).

In the future, we can enhance DataFusion to provide 'ANALYZE TABLE [FOR COLUMNS]' capabilities; then, in the logical planning phase, we can also get better column-level stats. And I think inferring stats directly from parquet file footers is not scalable: for example, if we hit a large table (>10k files), inferring stats will take a long time. It is quite common for Hadoop tables (Spark/Hive tables) to have more than 10k files.

alamb commented 1 year ago

> In the future, we can enhance DataFusion to provide 'ANALYZE TABLE [FOR COLUMNS]' capabilities; then, in the logical planning phase, we can also get better column-level stats. And I think inferring stats directly from parquet file footers is not scalable: for example, if we hit a large table (>10k files), inferring stats will take a long time. It is quite common for Hadoop tables (Spark/Hive tables) to have more than 10k files.

Yes, I agree -- and in general I don't think DataFusion should be handling the decision of "should the metadata be cached in some local catalog". I think that decision should be moved into the TableProvider / CatalogProvider implementations, so that each system that uses DataFusion can make the optimal choice for itself.
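
A hypothetical sketch of what that could look like on the provider side (all names here are illustrative, not DataFusion's actual API): the system embedding the engine owns a statistics cache, so planning never has to re-read thousands of file footers.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Clone, Debug)]
struct CachedTableStats {
    num_rows: Option<u64>,
    total_byte_size: Option<u64>,
}

/// A provider-owned cache: the system embedding the query engine, not
/// the engine itself, decides when statistics are (re)computed.
struct StatsCache {
    by_table: Mutex<HashMap<String, CachedTableStats>>,
}

impl StatsCache {
    /// Return cached stats for `table`, computing and storing them on a miss.
    fn get_or_compute(
        &self,
        table: &str,
        compute: impl FnOnce() -> CachedTableStats,
    ) -> CachedTableStats {
        let mut by_table = self.by_table.lock().unwrap();
        by_table
            .entry(table.to_string())
            .or_insert_with(compute)
            .clone()
    }
}
```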