apache / arrow-adbc

Database connectivity API standard and libraries for Apache Arrow
https://arrow.apache.org/adbc/
Apache License 2.0
360 stars 89 forks source link

format: document and define acceptable values and formats for GetObjects column metadata #1906

Open birschick-bq opened 3 months ago

birschick-bq commented 3 months ago

What feature or improvement would you like to see?

As we develop drivers for various data sources, we find that consumers of the driver not only need a reliable API, but also reliable metadata results to make consuming different drivers less data source specific.

While consumers can use GetTableSchema, it may not provide enough information about the data source's unique column properties.

So consumers will use GetObjects to get more information about the native metadata of the data source. However, there is a large amount of flexibility afforded the values in the COLUMN_SCHEMA structure.

/// 3. Optional value.  Should be null if not supported by the driver.
///    xdbc_ values are meant to provide JDBC/ODBC-compatible metadata
///    in an agnostic manner.

I'd like to propose a more restrictive or suggestive description of the field contents so that consuming this information can be more portable. I believe the "agnostic manner" intention is to use JDBC/ODBC values, if possible. Or reading into this more, values that can be reliable understood by the consumer of the call. The other possibility is to add new fields to the structure which would follow more restrictive specifications.

Examples: xdbc_type_name should contain string values taken from either or both the JDBC JDBCType enumeration or ODBC identifiers - in a ADBC defined list of acceptable values.

xdbc_data_type and xdbc_sql_data_type are not clearly defined nor is their difference (if any). It could be that xdbc_data_type is defined by the JDBC values and xdbc_sql_data_type could be defined by the ODBC values.

Still, carrying around these legacy value is not ideal and we should likely associate an ADBC-defined value to one or both of these two fields

xdbc_nullable as int16 - should be explicitly defined as 0 (not nullable) and 1 (nullable) and 2 (unknown) or null (unsupported by data source)

xdbc_is_nullable - should be explicitly defined as "NO", "YES", "" (unknown) or null (unsupported by data source)

The result of this discussion should be

  1. Improved documentation on allowable values
  2. Tests in each driver to confirm values are in scope.
CurtHagenlocher commented 3 months ago

This seems to overlap partially with #1704 in that returning the Arrow types for each column would let us know definitively what to expect when fetching the data, including the type and nullability of the columns.

lidavidm commented 3 months ago

data_type/sql_data_type were taken from Flight SQL which in turn inherited from JDBC (@jduo correct me if I'm wrong): https://docs.oracle.com/en/java/javase/11/docs/api/java.sql/java/sql/DatabaseMetaData.html#getColumns(java.lang.String,java.lang.String,java.lang.String,java.lang.String)

DATA_TYPE int => SQL type from java.sql.Types SQL_DATA_TYPE int => unused

jduo commented 3 months ago

The type codes are based on JDBC Types constant and ODBC SQL_* types (which are usually the same)

There are a few cases where they aren't such as JDBC having ARRAY and ODBC having interval types.

The xdbc_sql_data_type field differs in that it can store a database-specific type code to give a more specific type (for example a DECIMAL that is actually a currency vs. an arbitrary high precision number).