apache / arrow-adbc

Database connectivity API standard and libraries for Apache Arrow
https://arrow.apache.org/adbc/
Apache License 2.0
381 stars 94 forks source link

format: add statistics for tables, columns, queries, etc. #685

Closed lidavidm closed 1 year ago

lidavidm commented 1 year ago

More research is needed on what systems typically support.

This would make ADBC more useful in situations where it supplies data to other systems, since then those systems could query statistics using a standard interface and integrate them into query planning. (Interestingly, Spark at least doesn't seem to have this in DataSourceV2 - I suppose the smarts are directly in their JDBC support.)

Examples:

lidavidm commented 1 year ago

Possibly: add ConnectionGetStatistics whose result set schema is something like

- catalog
- db_schema
- table_name
- statistic_type: null percentage, row count, ndv
- column_name: str (null if table-wide statistics)
- value: double

so for a table, you would have a row per column per statistic type, and a row per statistic type (for table-wide statistics: row count only)

(do we want to/how would we account for partitions, in the sense of Flight SQL, etc.?)

lidavidm commented 1 year ago

TODO: need to cross-compare against JDBC, PostgreSQL, SQL Server, etc.

lidavidm commented 1 year ago

Apache Hive

https://cwiki.apache.org/confluence/display/Hive/StatsDev https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive https://subscription.packtpub.com/book/data/9781782161080/8/ch08lvl1sec88/column-statistics-in-hive https://github.com/apache/hive/blob/d0a06239b09396d1b7a6414d85011f9a20f8486a/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/model/MTableColumnStatistics.java https://github.com/apache/hive/blob/d0a06239b09396d1b7a6414d85011f9a20f8486a/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/model/MPartitionColumnStatistics.java https://issues.apache.org/jira/browse/HIVE-1362

It's hard to find a definitive reference from just documentation.

What is 'bitVector' statistic? Appears to be the serialization of a NumDistinctValueEstimator which is either a Flajolet-Martin sketch or HyperLogLog. So it appears to be something fairly internal that gets exposed as a statistic.

Poking at a Hive Metastore instance, it seems bitVector, histogram are never set, and string columns don't record min/max.

JDBC

getIndexInfo

=> We may want a statistic for "abstract size"? (But the values wouldn't be comparable between drivers.) => Is ordering useful? => Exact/approximate may also be useful to indicate

ODBC

SQLStatistics

Microsoft SQL Server

https://learn.microsoft.com/en-us/sql/t-sql/database-console-commands/dbcc-show-statistics-transact-sql?view=sql-server-ver16

PostgreSQL

https://www.postgresql.org/docs/current/planner-stats.html and https://www.postgresql.org/docs/current/view-pg-stats.html

=> How should we define ndv? => Do we want to be able to map through "most common elements" etc. or should we leave that alone? (Probably leave it alone) => We may want to define a column width statistic

Snowflake

https://docs.snowflake.com/en/sql-reference/info-schema/tables

lidavidm commented 1 year ago

So proposal is for:

AdbcConnectionGetStatistics(struct AdbcConnection*, const char* catalog, const char* db_schema, const char* table_name, bool approximate, struct ArrowArrayStream* out);

- Parameters allow filtering down to an individual table, or you can request data for multiple tables at once
- "approximate" is an enum allowing you to request exact statistics, or just get approximate/best-effort/out of date statistics

The result set has schema:

- catalog: str
- db_schema: str
- table_name: str
- statistic_type: str (one of null percentage, row count, ndv, byte_width or a database-specific value)
- column_name: str (null if table-wide statistics)
- value: double
- is_approximate: bool

unknown values should be null, or the whole row should simply be omitted

Questions:

lidavidm commented 1 year ago

Other potential statistics:

adamkennedy commented 1 year ago

In addition to Calcite, Hive contains a fairly decent statistics set, especially since it has column statistics as well as table statistics.

lidavidm commented 1 year ago

Ah, thanks for the pointer. It seems Hive also stores min/max N values, histograms, percentiles, and average/sum of numeric columns.

This overlaps somewhat with PostgreSQL, so maybe we should try to support them. That said, encoding polymorphic types (if we want min/max of say a string column) and list types is a bit of a pain in Arrow (for min/max N, histograms, etc.) but it's doable via a union.

lidavidm commented 1 year ago

The proposal was updated to include min/max value and max byte width as standardized statistics. Digging into the Hive code, histograms are implemented but top/bottom K never were. The proposal allows for backends to return custom statistics so Hive/Postgres could still encode histograms (the encoding with Arrow gets very messy, however, given the lack of an 'any' type; they would have to pack the histogram values into a binary column).

lidavidm commented 1 year ago

Looking at JDBC drivers:

It seems we shouldn't expect much here from JDBC (and to be fair, getIndexInfo was meant to get info about indices, not really get us detailed statistics), so if we want detailed statistics we'll have to do it per database.