format: add statistics for tables, columns, queries, etc.

lidavidm commented 1 year ago

More research is needed on what systems typically support.

This would make ADBC more useful in situations where it supplies data to other systems, since then those systems could query statistics using a standard interface and integrate them into query planning. (Interestingly, Spark at least doesn't seem to have this in DataSourceV2 - I suppose the smarts are directly in their JDBC support.)

Examples:

Estimated result set size for queries (basically, a semi-standardized EXPLAIN (ANALYZE))
Parquet-style min/max/null % statistics per column

lidavidm commented 1 year ago

Possibly: add ConnectionGetStatistics whose result set schema is something like

- catalog
- db_schema
- table_name
- statistic_type: null percentage, row count, ndv
- column_name: str (null if table-wide statistics)
- value: double

so for a table, you would have a row per column per statistic type, and a row per statistic type (for table-wide statistics: row count only)

(do we want to/how would we account for partitions, in the sense of Flight SQL, etc.?)

lidavidm commented 1 year ago

TODO: need to cross-compare against JDBC, PostgreSQL, SQL Server, etc.

lidavidm commented 1 year ago

Apache Hive

https://cwiki.apache.org/confluence/display/Hive/StatsDev https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive https://subscription.packtpub.com/book/data/9781782161080/8/ch08lvl1sec88/column-statistics-in-hive https://github.com/apache/hive/blob/d0a06239b09396d1b7a6414d85011f9a20f8486a/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/model/MTableColumnStatistics.java https://github.com/apache/hive/blob/d0a06239b09396d1b7a6414d85011f9a20f8486a/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/model/MPartitionColumnStatistics.java https://issues.apache.org/jira/browse/HIVE-1362

It's hard to find a definitive reference from just documentation.

Provides partition-level statistics (but we don't really expose a concept of partitions, except during query execution time but those would be 'different' partitions)
Min/max values for numeric columns
Null count (not percentage)
NDV
Average and maximum column sizes
Number of true/false values
Histograms ('height-balanced' so the encoding is presumably a list of values dividing the range into equal-sized buckets)
Top/bottom K values (proposed, never implemented? HIVE-3421)
Thrift API supports typed statistics

What is 'bitVector' statistic? Appears to be the serialization of a NumDistinctValueEstimator which is either a Flajolet-Martin sketch or HyperLogLog. So it appears to be something fairly internal that gets exposed as a statistic.

Poking at a Hive Metastore instance, it seems bitVector, histogram are never set, and string columns don't record min/max.

JDBC

getIndexInfo

Oriented around database indices
Provides row count and ndv
Also provides page count and ordering
Differentiates between exact/approximate; this can affect performance of the call

=> We may want a statistic for "abstract size"? (But the values wouldn't be comparable between drivers.) => Is ordering useful? => Exact/approximate may also be useful to indicate

ODBC

SQLStatistics

Effectively the same as JDBC

Microsoft SQL Server

https://learn.microsoft.com/en-us/sql/t-sql/database-console-commands/dbcc-show-statistics-transact-sql?view=sql-server-ver16

Row count
"density" which is basically 1/ndv, except it looks deprecated?

PostgreSQL

https://www.postgresql.org/docs/current/planner-stats.html and https://www.postgresql.org/docs/current/view-pg-stats.html

Row count (may be out of date), page count
null percentage
a very odd ndv estimate
histograms, most common elements, etc.
average column width (~= average size of a row in a column; i.e. an estimate of string length, etc?)

=> How should we define ndv? => Do we want to be able to map through "most common elements" etc. or should we leave that alone? (Probably leave it alone) => We may want to define a column width statistic

Snowflake

https://docs.snowflake.com/en/sql-reference/info-schema/tables

Row count
Byte size

lidavidm commented 1 year ago

So proposal is for:

AdbcConnectionGetStatistics(struct AdbcConnection*, const char* catalog, const char* db_schema, const char* table_name, bool approximate, struct ArrowArrayStream* out);

- Parameters allow filtering down to an individual table, or you can request data for multiple tables at once
- "approximate" is an enum allowing you to request exact statistics, or just get approximate/best-effort/out of date statistics

The result set has schema:

- catalog: str
- db_schema: str
- table_name: str
- statistic_type: str (one of null percentage, row count, ndv, byte_width or a database-specific value)
- column_name: str (null if table-wide statistics)
- value: double
- is_approximate: bool

column_name is null if the statistic applies to the whole table
null_percentage is a value in [0, 1] representing the % of rows in the column that are null
ndv is the number of distinct values in the column (I'm tempted to take the PostgreSQL definition: positive means a fixed number of distinct values, negative means a percentage of distinct values. Or, break this into two separate statistic types)
row count is a value in [0, inf)
byte_width is a value in [0, inf) representing the average size in bytes of a row in the column (e.g. for a string column, this would be the average string size)

unknown values should be null, or the whole row should simply be omitted

Questions:

Do we care about min, max, etc.? IMO no, this complicates the encoding of the 'value' in the result, and utility is questionable. (But maybe we do want value to be at least union[double, string] to perhaps allow for this?)
Do we encode the statistic names as strings, or requiring dictionary encoding, or specifying an enumeration? (I would prefer dictionary encoding, but this complicates implementation a bit. The benefit is that if we specify some fixed dictionary values, we can save space on the common values and avoid lots of string comparisons while still allowing self-describing extensibility by vendors)

lidavidm commented 1 year ago

Other potential statistics:

Calcite defines these: https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/BuiltInMetadata.html

If we were to provide a direct equivalent of getIndexInfo, we could possibly also map ColumnUniqueness, Collation. It may work to allow column_name to be list<str>.

adamkennedy commented 1 year ago

In addition to Calcite, Hive contains a fairly decent statistics set, especially since it has column statistics as well as table statistics.

lidavidm commented 1 year ago

Ah, thanks for the pointer. It seems Hive also stores min/max N values, histograms, percentiles, and average/sum of numeric columns.

This overlaps somewhat with PostgreSQL, so maybe we should try to support them. That said, encoding polymorphic types (if we want min/max of say a string column) and list types is a bit of a pain in Arrow (for min/max N, histograms, etc.) but it's doable via a union.

lidavidm commented 1 year ago

The proposal was updated to include min/max value and max byte width as standardized statistics. Digging into the Hive code, histograms are implemented but top/bottom K never were. The proposal allows for backends to return custom statistics so Hive/Postgres could still encode histograms (the encoding with Arrow gets very messy, however, given the lack of an 'any' type; they would have to pack the histogram values into a binary column).

lidavidm commented 1 year ago

Looking at JDBC drivers:

PostgreSQL basically just gives you row count of columns with indices on them (including primary key). The debugger shows it executes this query internally:

SELECT 
tmp.TABLE_CAT, 
tmp.TABLE_SCHEM, 
tmp.TABLE_NAME, 
tmp.NON_UNIQUE, 
tmp.INDEX_QUALIFIER, 
tmp.INDEX_NAME, 
tmp.TYPE, 
tmp.ORDINAL_POSITION, 
trim(
  both '"' 
  from 
    pg_catalog.pg_get_indexdef(
      tmp.CI_OID, tmp.ORDINAL_POSITION, 
      false
    )
) AS COLUMN_NAME, 
CASE tmp.AM_NAME WHEN 'btree' THEN CASE tmp.I_INDOPTION[tmp.ORDINAL_POSITION - 1] & 1 :: smallint WHEN 1 THEN 'D' ELSE 'A' END ELSE NULL END AS ASC_OR_DESC, 
tmp.CARDINALITY, 
tmp.PAGES, 
tmp.FILTER_CONDITION 
FROM 
(
  SELECT 
    NULL AS TABLE_CAT, 
    n.nspname AS TABLE_SCHEM, 
    ct.relname AS TABLE_NAME, 
    NOT i.indisunique AS NON_UNIQUE, 
    NULL AS INDEX_QUALIFIER, 
    ci.relname AS INDEX_NAME, 
    CASE i.indisclustered WHEN true THEN 1 ELSE CASE am.amname WHEN 'hash' THEN 2 ELSE 3 END END AS TYPE, 
    (
      information_schema._pg_expandarray(i.indkey)
    ).n AS ORDINAL_POSITION, 
    ci.reltuples AS CARDINALITY, 
    ci.relpages AS PAGES, 
    pg_catalog.pg_get_expr(i.indpred, i.indrelid) AS FILTER_CONDITION, 
    ci.oid AS CI_OID, 
    i.indoption AS I_INDOPTION, 
    am.amname AS AM_NAME 
  FROM 
    pg_catalog.pg_class ct 
    JOIN pg_catalog.pg_namespace n ON (ct.relnamespace = n.oid) 
    JOIN pg_catalog.pg_index i ON (ct.oid = i.indrelid) 
    JOIN pg_catalog.pg_class ci ON (ci.oid = i.indexrelid) 
    JOIN pg_catalog.pg_am am ON (ci.relam = am.oid) 
  WHERE 
    true 
    AND ct.relname = 'mytable'
) AS tmp 
ORDER BY 
NON_UNIQUE, 
TYPE, 
INDEX_NAME, 
ORDINAL_POSITION

SQL Server gives you the row count for the table, and tells you about columns with a UNIQUE constraint, but doesn't give you much else. Internally it runs EXEC sp_statistics_100 ?,?,?,?,?, ?

It seems we shouldn't expect much here from JDBC (and to be fair, getIndexInfo was meant to get info about indices, not really get us detailed statistics), so if we want detailed statistics we'll have to do it per database.

apache / arrow-adbc