Open isidentical opened 1 year ago
Thank you for this discussion @isidentical - the summary is quite nice
There are some cases where the physical plan will have better information available to it (e.g. it may have read the parquet metadata header and have much more accurate statistics than just the file names) than the logical plan.
Thus I think having statistics available for both logical and physical planning makes sense (aka option 3) -- that way DataFusion can take best advantage of what information is available
My preferred solution is to keep statistics in both places, and to keep the code that operates on them (e.g. the expression range analysis code from #3912) in the physical exprs, as it is very tightly tied to how each expression is evaluated (nulls, etc).
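A minimal sketch of what such range analysis looks like, assuming a hypothetical `Interval` type (illustrative names only, not DataFusion's actual API). Propagating min/max bounds and null-ness through an expression is exactly the kind of logic that is tied to how each expression is evaluated:

```rust
// Hypothetical interval (range) analysis over column statistics.
// `Interval`, `add_intervals`, and `may_exceed` are illustrative names.

#[derive(Debug, Clone, Copy, PartialEq)]
struct Interval {
    min: i64,
    max: i64,
    contains_null: bool,
}

// Range of `a + b` given the ranges of `a` and `b`.
fn add_intervals(a: Interval, b: Interval) -> Interval {
    Interval {
        min: a.min + b.min,
        max: a.max + b.max,
        // A null on either input makes the sum null.
        contains_null: a.contains_null || b.contains_null,
    }
}

// Can `a + b > threshold` ever be true? If not, a filter on it can be pruned.
fn may_exceed(a: Interval, b: Interval, threshold: i64) -> bool {
    add_intervals(a, b).max > threshold
}
```

The null-tracking line is the point: getting it right requires knowing the evaluation semantics of each expression, which is why this code naturally lives next to the physical exprs.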
Another benefit of keeping the CBOs at the physical level is that the costs can differ between implementations. E.g. hash join and sort merge join have different costs associated with them. For a hash join, swapping the order of the build and probe sides has a big impact, but a sort merge join (or other types of join) might not benefit from this change (or at least in very different ways).
If we moved the optimization to logical plans, we would need a way to expose this information to the logical plan too.
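The build/probe asymmetry can be sketched with a toy cost model (the per-row constants and the n·log n sort model below are assumptions for illustration, not DataFusion's cost functions):

```rust
// Toy cost model showing why build/probe order matters for a hash join
// but not for a sort-merge join. Constants are assumed, for illustration.

// A hash join builds a table from one side and streams the other, so
// building is modeled as ~2x the per-row cost of probing.
fn hash_join_cost(build_rows: u64, probe_rows: u64) -> u64 {
    2 * build_rows + probe_rows
}

// Rough n * log2(n) sort model.
fn sort_cost(rows: u64) -> u64 {
    rows * (64 - rows.leading_zeros() as u64)
}

// A sort-merge join sorts both inputs; its cost is symmetric in them,
// so swapping sides does not help.
fn sort_merge_join_cost(left_rows: u64, right_rows: u64) -> u64 {
    sort_cost(left_rows) + sort_cost(right_rows)
}
```

Under this model, building on a 1k-row side and probing with a 1M-row side is roughly half the cost of the reverse, while the sort-merge cost is identical either way.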
Costs might be different for different implementations like hash join vs sort merge join, but the stats (output rows, size, min/max, ...) are the same. Even for different join orderings, [A inner join B inner join C] vs [C inner join B inner join A], the stats are exactly the same.
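For example, the textbook cardinality estimate |A ⋈ B| = |A| * |B| / max(ndv(A.k), ndv(B.k)) is symmetric in its inputs, so a sketch like the following (illustrative, not DataFusion's estimator) gives the same answer regardless of join order:

```rust
// Textbook equi-join cardinality estimate: symmetric in its inputs, so
// the estimated output stats are identical for A JOIN B and B JOIN A,
// even though the execution cost of a given implementation is not.
fn join_output_rows(left_rows: u64, right_rows: u64, left_ndv: u64, right_ndv: u64) -> u64 {
    // ndv = number of distinct values of the join key; guard against 0.
    left_rows * right_rows / left_ndv.max(right_ndv).max(1)
}
```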
In the future, we can enhance DataFusion to provide 'ANALYZE TABLE [FOR COLUMNS]' capabilities; then in the logical planning phase we can also get better column-level stats. And I think inferring stats directly from Parquet file headers is not scalable: if we hit a large table (>10k files), inferring stats will take a long time. It is quite common for Hadoop tables (Spark/Hive tables) to have more than 10k files.
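A back-of-the-envelope sketch of the scalability concern (the per-footer latency here is an assumed figure for illustration, not a measurement):

```rust
// Rough estimate of time spent reading Parquet footers at plan time,
// assuming a fixed (hypothetical) per-footer latency.
fn footer_scan_seconds(num_files: u64, ms_per_footer: u64) -> f64 {
    (num_files * ms_per_footer) as f64 / 1000.0
}
```

At an assumed 20 ms per footer, a 10k-file table would spend roughly 200 seconds just gathering statistics before planning can even start, which is why precomputed ANALYZE-style stats are attractive.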
Yes I agree -- and in general I don't think DataFusion should be handling the decision of "should the metadata be cached in some local catalog" -- I think that decision should be moved into the TableProvider / CatalogProvider implementations so that each system that uses DataFusion can make the optimal choice for it
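A hypothetical sketch of what pushing that decision into the provider layer could look like. `StatsSource`, `Stats`, and `CachingProvider` are illustrative names, not DataFusion's actual `TableProvider`/`CatalogProvider` API:

```rust
// Sketch: the caching policy lives in the provider, not in DataFusion.
// All names here are illustrative, not the real DataFusion traits.
use std::cell::{OnceCell, RefCell};

#[derive(Clone, Debug, PartialEq)]
struct Stats {
    num_rows: u64,
}

trait StatsSource {
    fn statistics(&self) -> Stats;
}

// A provider that runs the expensive metadata scan once and serves the
// cached result afterwards. Another system could implement `StatsSource`
// with no caching at all, or back it with an ANALYZE-style catalog.
struct CachingProvider {
    cached: OnceCell<Stats>,
    compute_calls: RefCell<u32>, // counts scans, just to show caching works
}

impl CachingProvider {
    fn new() -> Self {
        Self { cached: OnceCell::new(), compute_calls: RefCell::new(0) }
    }

    fn expensive_scan(&self) -> Stats {
        *self.compute_calls.borrow_mut() += 1;
        Stats { num_rows: 42 } // stand-in for reading many file footers
    }
}

impl StatsSource for CachingProvider {
    fn statistics(&self) -> Stats {
        self.cached.get_or_init(|| self.expensive_scan()).clone()
    }
}
```

The planner only ever calls `statistics()`; whether that answer comes from a cache, a catalog, or a fresh scan is entirely the provider's choice.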
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
There seem to be a few use cases that could be unlocked by having statistics available in the logical plan:
Describe the solution you'd like
There were 3 options proposed in the discussion:
Describe alternatives you've considered
Keep them as is, and do nothing else.
Additional context
This is a spin-off from the cost calculations/estimations in #3929 (also related to #3983 and #3984). The original discussion, by @mingmwang @isidentical @Dandandan @jackwener @andygrove @alamb, can be found here. It also includes a lot of material on what other query engine / database systems are doing (so I recommend reading it; this is just a summary to continue the discussion in a more structured/public place).