apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Representing statistics in logical or physical plans (or in both of them) #4003

Open · isidentical opened 1 year ago

isidentical commented 1 year ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

There seem to be a few use cases that could be unlocked by having statistics available in the logical plan.

Describe the solution you'd like
There were 3 options proposed in the discussion:

1. Represent statistics only in the logical plan
2. Represent statistics only in the physical plan
3. Represent statistics in both the logical and the physical plan

Describe alternatives you've considered
Keep statistics as they are today and do nothing else.

Additional context
This is a spin-off from the cost calculations/estimations in #3929 (also related to #3983 and #3984). The original discussion can be found here, with @mingmwang @isidentical @Dandandan @jackwener @andygrove @alamb. It also includes a lot of material on what other query engine / database systems are doing (so I recommend reading it; this issue is just a summary to continue the discussion in a more structured/public place).
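
For concreteness, here is a minimal sketch of the kind of information "statistics" means in this thread, loosely modeled on the shape of DataFusion's physical Statistics type at the time (field names are illustrative, not a proposal for an exact API):

```rust
/// Illustrative only: roughly the shape of the plan-level statistics being
/// discussed, whether attached to a logical or a physical plan node.
#[derive(Clone, Debug, Default)]
pub struct PlanStatistics {
    /// Estimated (or exact) number of output rows, if known.
    pub num_rows: Option<usize>,
    /// Estimated total output size in bytes, if known.
    pub total_byte_size: Option<usize>,
    /// Per-column statistics, if known.
    pub column_statistics: Option<Vec<ColumnStats>>,
    /// Whether the numbers above are exact or merely estimates.
    pub is_exact: bool,
}

#[derive(Clone, Debug, Default)]
pub struct ColumnStats {
    pub null_count: Option<usize>,
    pub distinct_count: Option<usize>,
    /// Min/max are strings here only to keep the sketch dependency-free;
    /// a real implementation would use a ScalarValue-like type.
    pub min_value: Option<String>,
    pub max_value: Option<String>,
}
```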

alamb commented 1 year ago

Thank you for this discussion @isidentical - the summary is quite nice

There are some cases where the physical plan will have better information available to it than the logical plan (e.g. it may have read the parquet metadata footer and so have much more accurate statistics than can be inferred from the file names alone).

Thus I think having statistics available for both logical and physical planning makes sense (aka option 3) -- that way DataFusion can take the best advantage of whatever information is available.

My preferred solution is to keep statistics in both places and keep the code that operates on them (e.g. the expression range analysis code in #3912) in the physical exprs, as it is very tightly tied to how each expression is evaluated (nulls, etc.).
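
As a concrete illustration of why the physical layer can know more: a scan can get an exact row count out of a Parquet footer without touching any data pages. A minimal sketch using the `parquet` crate (this is not DataFusion's actual code path):

```rust
use std::error::Error;
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

/// Return the exact row count recorded in a Parquet file's footer.
/// Only the footer is read; no data pages are scanned.
fn footer_row_count(path: &str) -> Result<i64, Box<dyn Error>> {
    let file = File::open(path)?;
    let reader = SerializedFileReader::new(file)?;
    Ok(reader.metadata().file_metadata().num_rows())
}
```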

Dandandan commented 1 year ago

Another benefit of keeping the CBOs at the physical level is that the costs can differ between operator implementations. E.g. a hash join and a sort merge join have different costs associated with them: for a hash join, swapping the order of the build and probe sides has a big impact, but a sort merge join (or another join type) might not benefit from this change (or at least benefits in very different ways).

If we moved the optimization to the logical plan, we would need a way to expose this cost information to the logical plan too.
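
As a hypothetical sketch of the physical-level decision described above (names are made up for illustration, not DataFusion's API): build the hash table on whichever input the statistics say is smaller, a choice that has no analogue for a sort merge join.

```rust
/// Decide whether to swap a hash join's inputs so that the hash table
/// is built on the smaller side. Without estimates, keep the plan as is.
fn should_swap_hash_join_inputs(
    left_rows: Option<usize>,
    right_rows: Option<usize>,
) -> bool {
    match (left_rows, right_rows) {
        (Some(left), Some(right)) => left > right,
        _ => false,
    }
}
```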

mingmwang commented 1 year ago

Costs might be different for different implementations like hash join vs. sort merge join, but the stats (output rows, size, min/max, ...) are the same. Even for different join orderings, [A inner join B inner join C] vs. [C inner join B inner join A], the stats are exactly the same.
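
To see why the estimates are order-independent: the textbook cardinality estimate for an inner equi-join is symmetric in its inputs, so any ordering yields the same estimated output rows even though execution costs can differ. A minimal sketch (real estimators are more elaborate):

```rust
/// Classic inner equi-join cardinality estimate:
/// |A JOIN B| ~= |A| * |B| / max(ndv(A.key), ndv(B.key)).
/// Swapping A and B leaves the result unchanged.
fn estimated_join_output_rows(
    a_rows: usize,
    b_rows: usize,
    a_key_ndv: usize,
    b_key_ndv: usize,
) -> usize {
    a_rows * b_rows / a_key_ndv.max(b_key_ndv).max(1)
}
```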

mingmwang commented 1 year ago

> Thank you for this discussion @isidentical - the summary is quite nice
>
> There are some cases where the physical plan will have better information available to it than the logical plan (e.g. it may have read the parquet metadata footer and so have much more accurate statistics than can be inferred from the file names alone).
>
> Thus I think having statistics available for both logical and physical planning makes sense (aka option 3) -- that way DataFusion can take the best advantage of whatever information is available.
>
> My preferred solution is to keep statistics in both places and keep the code that operates on them (e.g. the expression range analysis code in #3912) in the physical exprs, as it is very tightly tied to how each expression is evaluated (nulls, etc.).

In the future, we can enhance DataFusion to provide 'ANALYZE TABLE [FOR COLUMNS]' capabilities; then, in the logical planning phase, we can also get better column-level stats. And I think inferring stats directly from parquet file footers is not scalable: for example, if we hit a large table (>10k files), inferring stats will take a long time. It is quite common for Hadoop tables (Spark/Hive tables) to have more than 10k files.

alamb commented 1 year ago

> In the future, we can enhance DataFusion to provide 'ANALYZE TABLE [FOR COLUMNS]' capabilities; then, in the logical planning phase, we can also get better column-level stats. And I think inferring stats directly from parquet file footers is not scalable: for example, if we hit a large table (>10k files), inferring stats will take a long time. It is quite common for Hadoop tables (Spark/Hive tables) to have more than 10k files.

Yes, I agree -- and in general I don't think DataFusion should be handling the decision of "should the metadata be cached in some local catalog". I think that decision should be moved into the TableProvider / CatalogProvider implementations, so that each system that uses DataFusion can make the optimal choice for itself.
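
A hypothetical sketch of what that could look like on the provider side (all names here are illustrative, not DataFusion's actual API): the system embedding the engine owns a statistics cache, so planning never has to re-read thousands of file footers.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

#[derive(Clone, Debug)]
struct CachedTableStats {
    num_rows: Option<u64>,
    total_byte_size: Option<u64>,
}

/// A provider-owned cache: the system embedding the query engine, not
/// the engine itself, decides when statistics are (re)computed.
struct StatsCache {
    by_table: Mutex<HashMap<String, CachedTableStats>>,
}

impl StatsCache {
    /// Return cached stats for `table`, computing and storing them on a miss.
    fn get_or_compute(
        &self,
        table: &str,
        compute: impl FnOnce() -> CachedTableStats,
    ) -> CachedTableStats {
        let mut by_table = self.by_table.lock().unwrap();
        by_table
            .entry(table.to_string())
            .or_insert_with(compute)
            .clone()
    }
}
```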