Possibly: add ConnectionGetStatistics, whose result set schema is something like:
- catalog
- db_schema
- table_name
- statistic_type: null percentage, row count, ndv
- column_name: str (null if table-wide statistics)
- value: double
So for a table, you would have one row per column per statistic type, plus one row per table-wide statistic type (initially, row count only).
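For illustration (all names and values here are made up), a hypothetical table `orders` with a single column `price` would produce rows like:

| catalog | db_schema | table_name | statistic_type  | column_name | value |
|---------|-----------|------------|-----------------|-------------|-------|
| main    | public    | orders     | row count       | NULL        | 10000 |
| main    | public    | orders     | ndv             | price       | 9500  |
| main    | public    | orders     | null percentage | price       | 0.02  |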
(Do we want to, and how would we, account for partitions in the sense of Flight SQL, etc.?)
TODO: need to cross-compare against JDBC, PostgreSQL, SQL Server, etc.
- https://cwiki.apache.org/confluence/display/Hive/StatsDev
- https://cwiki.apache.org/confluence/display/Hive/Column+Statistics+in+Hive
- https://subscription.packtpub.com/book/data/9781782161080/8/ch08lvl1sec88/column-statistics-in-hive
- https://github.com/apache/hive/blob/d0a06239b09396d1b7a6414d85011f9a20f8486a/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/model/MTableColumnStatistics.java
- https://github.com/apache/hive/blob/d0a06239b09396d1b7a6414d85011f9a20f8486a/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/model/MPartitionColumnStatistics.java
- https://issues.apache.org/jira/browse/HIVE-1362
It's hard to find a definitive reference from just documentation.
What is the 'bitVector' statistic? It appears to be the serialization of a NumDistinctValueEstimator, which is either a Flajolet-Martin sketch or HyperLogLog. So it seems to be something fairly internal that gets exposed as a statistic.
Poking at a Hive Metastore instance, it seems bitVector and histogram are never set, and string columns don't record min/max.
=> We may want a statistic for "abstract size"? (But the values wouldn't be comparable between drivers.)
=> Is ordering useful?
=> Exact/approximate may also be useful to indicate.
https://www.postgresql.org/docs/current/planner-stats.html and https://www.postgresql.org/docs/current/view-pg-stats.html
=> How should we define ndv? (PostgreSQL's n_distinct, for instance, is a distinct-value count when positive but a negated fraction of the row count when negative.)
=> Do we want to be able to map through "most common elements" etc., or should we leave that alone? (Probably leave it alone.)
=> We may want to define a column width statistic.
https://docs.snowflake.com/en/sql-reference/info-schema/tables
So the proposal is:

```c
AdbcConnectionGetStatistics(struct AdbcConnection*, const char* catalog,
                            const char* db_schema, const char* table_name,
                            bool approximate, struct ArrowArrayStream* out);
```
- Parameters allow filtering down to an individual table, or you can request data for multiple tables at once
- "approximate" is an enum allowing you to request exact statistics, or just get approximate/best-effort/out of date statistics
The result set has schema:
- catalog: str
- db_schema: str
- table_name: str
- statistic_type: str (one of null percentage, row count, ndv, byte_width, or a database-specific value)
- column_name: str (null if table-wide statistics)
- value: double
- is_approximate: bool
Unknown values should be null, or the whole row should simply be omitted.
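For concreteness, here is a minimal usage sketch of the proposed function. It assumes the signature above, an AdbcStatusCode return like other ADBC functions, and an already-initialized connection; the table name and most error handling are placeholders.

```c
#include <stdbool.h>
#include <adbc.h>  // also defines the Arrow C stream interface

void print_table_statistics(struct AdbcConnection* conn) {
  struct ArrowArrayStream stream;
  // Filter down to one table; accept approximate/best-effort values.
  if (AdbcConnectionGetStatistics(conn, /*catalog=*/NULL, /*db_schema=*/NULL,
                                  /*table_name=*/"mytable",
                                  /*approximate=*/true,
                                  &stream) != ADBC_STATUS_OK) {
    return;
  }

  struct ArrowArray batch;
  // Each batch follows the schema above: statistic_type/column_name identify
  // the statistic, and value/is_approximate carry the measurement.
  while (stream.get_next(&stream, &batch) == 0 && batch.release != NULL) {
    // ... inspect the batch here ...
    batch.release(&batch);
  }
  stream.release(&stream);
}
```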
Questions:
=> Some statistics, like min/max of a string column, aren't naturally doubles. (Do we want value to be at least union[double, string] to perhaps allow for this?)

Other potential statistics:
Calcite defines these: https://calcite.apache.org/javadocAggregate/org/apache/calcite/rel/metadata/BuiltInMetadata.html
If we were to provide a direct equivalent of getIndexInfo, we could possibly also map ColumnUniqueness and Collation. It may work to allow column_name to be list<str> (e.g. a uniqueness statistic over a composite key (a, b) would have column_name ['a', 'b']).
In addition to Calcite, Hive contains a fairly decent statistics set, especially since it has column statistics as well as table statistics.
Ah, thanks for the pointer. It seems Hive also stores min/max N values, histograms, percentiles, and average/sum of numeric columns.
This overlaps somewhat with PostgreSQL, so maybe we should try to support them. That said, encoding polymorphic types (if we want min/max of say a string column) and list types is a bit of a pain in Arrow (for min/max N, histograms, etc.) but it's doable via a union.
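For concreteness, here is a rough sketch of what such a union might look like at the Arrow C Data Interface level. The two-child float64/utf8 layout is an assumption for illustration, not part of the proposal.

```c
// A rough sketch using the raw Arrow C Data Interface. The schemas have
// static lifetime, so release is a no-op that just marks them as released.
#include <adbc.h>  // defines struct ArrowSchema (arrow/c/abi.h also works)

static void no_release(struct ArrowSchema* schema) { schema->release = NULL; }

static struct ArrowSchema kFloat64Child = {
    .format = "g",  // float64
    .name = "float64",
    .release = &no_release,
};
static struct ArrowSchema kUtf8Child = {
    .format = "u",  // utf8
    .name = "string",
    .release = &no_release,
};
static struct ArrowSchema* kValueChildren[] = {&kFloat64Child, &kUtf8Child};
static struct ArrowSchema kValueColumn = {
    .format = "+ud:0,1",  // dense union with type ids 0 (float64), 1 (utf8)
    .name = "value",
    .n_children = 2,
    .children = kValueChildren,
    .release = &no_release,
};
```

A min/max row would then populate the utf8 child for string columns and the float64 child otherwise.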
The proposal was updated to include min/max value and max byte width as standardized statistics. Digging into the Hive code, histograms are implemented but top/bottom K never were. The proposal allows for backends to return custom statistics so Hive/Postgres could still encode histograms (the encoding with Arrow gets very messy, however, given the lack of an 'any' type; they would have to pack the histogram values into a binary column).
Looking at JDBC drivers:
PostgreSQL basically just gives you the row count for columns that have indices on them (including the primary key). The debugger shows it executes this query internally:
```sql
SELECT
tmp.TABLE_CAT,
tmp.TABLE_SCHEM,
tmp.TABLE_NAME,
tmp.NON_UNIQUE,
tmp.INDEX_QUALIFIER,
tmp.INDEX_NAME,
tmp.TYPE,
tmp.ORDINAL_POSITION,
trim(
both '"'
from
pg_catalog.pg_get_indexdef(
tmp.CI_OID, tmp.ORDINAL_POSITION,
false
)
) AS COLUMN_NAME,
CASE tmp.AM_NAME WHEN 'btree' THEN CASE tmp.I_INDOPTION[tmp.ORDINAL_POSITION - 1] & 1 :: smallint WHEN 1 THEN 'D' ELSE 'A' END ELSE NULL END AS ASC_OR_DESC,
tmp.CARDINALITY,
tmp.PAGES,
tmp.FILTER_CONDITION
FROM
(
SELECT
NULL AS TABLE_CAT,
n.nspname AS TABLE_SCHEM,
ct.relname AS TABLE_NAME,
NOT i.indisunique AS NON_UNIQUE,
NULL AS INDEX_QUALIFIER,
ci.relname AS INDEX_NAME,
CASE i.indisclustered WHEN true THEN 1 ELSE CASE am.amname WHEN 'hash' THEN 2 ELSE 3 END END AS TYPE,
(
information_schema._pg_expandarray(i.indkey)
).n AS ORDINAL_POSITION,
ci.reltuples AS CARDINALITY,
ci.relpages AS PAGES,
pg_catalog.pg_get_expr(i.indpred, i.indrelid) AS FILTER_CONDITION,
ci.oid AS CI_OID,
i.indoption AS I_INDOPTION,
am.amname AS AM_NAME
FROM
pg_catalog.pg_class ct
JOIN pg_catalog.pg_namespace n ON (ct.relnamespace = n.oid)
JOIN pg_catalog.pg_index i ON (ct.oid = i.indrelid)
JOIN pg_catalog.pg_class ci ON (ci.oid = i.indexrelid)
JOIN pg_catalog.pg_am am ON (ci.relam = am.oid)
WHERE
true
AND ct.relname = 'mytable'
) AS tmp
ORDER BY
NON_UNIQUE,
TYPE,
INDEX_NAME,
  ORDINAL_POSITION
```
Microsoft SQL Server's JDBC driver, meanwhile, appears to just delegate to a system stored procedure:

```sql
EXEC sp_statistics_100 ?,?,?,?,?, ?
```
It seems we shouldn't expect much here from JDBC (and to be fair, getIndexInfo was meant to describe indices, not to provide detailed statistics), so if we want detailed statistics we'll have to implement them per database.
More research is needed on what systems typically support.
This would make ADBC more useful in situations where it supplies data to other systems, since then those systems could query statistics using a standard interface and integrate them into query planning. (Interestingly, Spark at least doesn't seem to have this in DataSourceV2 - I suppose the smarts are directly in their JDBC support.)
Examples: