apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.3k stars 1.19k forks source link

Poor documentation and/or test coverage of `TableProvider::statistics` #13439

Open rroelke opened 2 hours ago

rroelke commented 2 hours ago

The extent of the documentation for TableProvider::statistics in version 43.0.0 is:

Get statistics for this table, if available

This offers no explanation as to how the statistics will or will not be used.

A user with experience in analytical database engines writing a custom TableProvider implementation may suspect that TableProvider::statistics is used by the datafusion query optimizer to determine join orders, perhaps among other things.

However, this conclusion is apparently incorrect, which I deduce from the following pieces of evidence: 1) I am a user fitting that description and found that my custom TableProvider::statistics was not called in the presence of a join query 2) cargo check --workspace --tests runs with no errors if I remove the fn statistics declaration from the trait TableProvider definition 3) having found the source code for the rule which changes join orders it is clear that it calls ExecutionPlan::statistics instead.

Expectation

The documentation should set appropriate expectations for what TableProvider::statistics is used for, so that developers can make informed choices about whether or not to implement it.

Additional context

The apparent answer to what TableProvider::statistics is used for is "nothing" based on the cargo check --workspace --tests comment above, but removing the trait method is a breaking change. Based on the slack discussion prior to filing this issue, at least one user is depending on TableProvider::statistics for their custom optimizer rules and removing it would require them to find a workaround.

Short of deprecating or removing the trait method, I would personally be satisfied just with updates to the method documentation.

alamb commented 2 hours ago

If we don't do this, we run the risk of deleting this API by accident in a future release

maybe someone who is relying on this API could write some tests / additonal documentation. @avantgardnerio do you know of anyone who might have some time?