Closed ryan-syed closed 8 months ago
To elaborate when a connection is being made using the current driver without a database name, no USE Database
call is made and therefore any call to INFORMATION_SCHEMA needs to be prefixed with the database name. Without the prefix as implemented currently they will fail with the following error: Cannot perform SELECT. This session does not have a current database. Call 'USE DATABASE', or use a qualified name
We were able to reduce the time it takes for the GetObjects
call by removing the cursor implementation, however, this inadvertently relies on the fact that a initial database name is provided so that INFORMATION_SCHEMA
calls can be made. There is also a bug, where INFORMATION_SCHEMA.Tables
without the db name prefixed will only return tables in the intial database name that is set. Same is the case for INFORMATION_SCHEMA.Columns
.
I was looking at adding the cursor back to have INFORMATION_SCHEMA
calls with the db name prefix and it would slow the calls down, however, there are still other optimizations that can be made as discussed in the proposal below.
It fetches the catalogs, schemas, tables, table constraints (currently not implemented for Snowflake Go Driver), and columns. It also does this by making the following calls:
SQL | Notes | Depth |
---|---|---|
SHOW TERSE DATABASES | This is used to get the matching catalogNames and the driver does the filtering using PatternsToRegexp and catalogPatternMatch | ObjectDepthCatalogs |
SELECT CATALOG_NAME, SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA | This is used to get the catalogName and corresponding schemaNames | ObjectDepthDBSchemas |
SELECT table_catalog, table_schema, table_name, table_type FROM INFORMATION_SCHEMA.TABLES | This is used to get the tableName and tableType for the corresponding catalogs, and schemas. | ObjectDepthTables |
SELECT table_catalog, table_schema, table_name, column_name, ordinal_position, is_nullable::boolean, data_type, numeric_precision, numeric_precision_radix, numeric_scale, is_identity::boolean, identity_generation, identity_increment, character_maximum_length, character_octet_length, datetime_precision, comment FROM dbname.INFORMATION_SCHEMA.COLUMNS | This is used to get the columns metadata. | ObjectDepthAll or ObjectDepthColumns |
and missing implementation:
SQL | Notes |
---|---|
SELECT * FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS |
This will be used to get the table constraints like primary key and foreign key |
The following analysis is based on SIMBA ODBC Snowflake driver, ODBC Test and QueryHistory for client generated calls in Snowflake.
I believe the GetObjects
call is equivalent to the following ODBC APIs:
ODBC API | SQL | Notes | Equivalent ADBC |
---|---|---|---|
SQLTables | show objects /* ODBC:TableMetadataSource */in account |
Gets the names, catalogName, schemaName, kind (VIEW or TABLE), comment, and filters within the driver for the given patterns for catalog, schema, and tables | GetObjects with ObjectDepthTables |
SQLColumns | show /* ODBC:ColumnMetadataSource */ columns in database "foo_db" or show /* ODBC:ColumnMetadataSource */ columns in schema "foo_db"."bar_schema" |
The call was based on whether the schema name was a pattern | It is different from ADBC as catalogName for SQLColumns can't be a pattern. However, it is roughly equivalent to GetObjects with ObjectDepthAll or ObjectDepthColumns |
SQLPrimaryKeys | show primary keys /* ODBC:PrimaryKeysMetadataSource */ in table "foo_db"."bar_schema" |
Gets the primary keys | GetObjects with ObjectDepthTables , though not currently implemented for Snowflake Go Driver. |
SQLForeignKeys | show exported keys /* ODBC:ForeignKeysMetadataSource */ in account |
Get the foreign keys | GetObjects with ObjectDepthTables , though not currently implemented for Snowflake Go Driver. |
Therefore, the GetObjects
API masquerades as a single API and performs the functionality of four ODBC APIs based on the depth provided. However, due to this if you wish to only get the foreign keys using the GetObjects
you would end up making several other calls.
Instead of making four separate calls, the design could be changed to just making one SQL call that is for the highest depth and inferring all the other information for the previous depths from it.
Example: For the worst case scenario with the depth as ObjectDepthAll
or ObjectDepthColumns
we can make only SELECT table_catalog, table_schema, table_name, column_name, ordinal_position, is_nullable::boolean, data_type, numeric_precision, numeric_precision_radix, numeric_scale, is_identity::boolean, identity_generation, identity_increment, character_maximum_length, character_octet_length, datetime_precision, comment FROM dbname.INFORMATION_SCHEMA.COLUMNS call and derive information for tables, schemas and catalogs based on the data from this single call.
Now, to decide the underlying API, there are two ways we can go about this:
Make show
calls similar to ODBC Driver
SHOW
calls are faster
SHOW
to get all the information till the ObjectDepthTables
depth whereas to get the same information using SELECT
takes ~ 5sSHOW
to get all the information till the ObjectDepthColumns
depth whereas to get the same information using SELECT
takes ~ 8s show objects in account
returned only ~600 entries. However, I need to test for show columns in account
which return ~8K entries.Continue with the SELECT
calls
SELECT
calls are slower than SHOW
calls and metadata fetch will be slower than ODBC.
GetObjects
is equivalent to four APIs in ODBC as previously mentioned.Please let me know your thoughts about the changes in design and which API would be better among SHOW
and SELECT
for metadata calls.
Hybrid approach, where we let the caller pick which API should be used based on the tradeoffs.
GetObjects
API to be some dense in terms of data being fetched?@davidhcoe @zeroshade @jduo @lidavidm Please review and suggest your thoughts about the proposal.
TL;DR
GetObjects
makes fours SQL calls at the highest depth of ObjectDepthAll
or ObjectDepthColumns
. Reduce that call to a single call for the depth provided and infer data for the previous depths based on that call. Also, considering that we need to refactor the SQL calls, should we consider using SHOW
instead of SELECT
for getting the metadata similar to what ODBC does? We could also have a hybrid approach to allow the caller to toggle between the implementations (though that adds more complexity and maintenance overhead). I have mentioned the trade-offs associated with each.
I think proposal 3 is probably best, allowing the consumer to choose the pros and cons. I'm not a fan of the fact that it would truncate the results if we go with the first proposal, but the performance benefits seem significant. So it's a tough call.
@lidavidm thoughts?
Created a draft pull request for reducing the SQL calls: https://github.com/apache/arrow-adbc/pull/1351
I will create another draft PR with removing the cursor call and generating the necessary SQL in Go.
So long as the maintenance burden is not excessive. We should also default to the 'correct' method and only use the 'fast' method if desired by the caller.
Created a draft PR with reduced SQL calls (though this makes 2 and doesn't use cursor): #1352
So long as the maintenance burden is not excessive. We should also default to the 'correct' method and only use the 'fast' method if desired by the caller.
Yeah it makes sense to choose the one that's better in the long run. The code without cursor makes an extra SQL call but is slightly faster in the conditions that I have tested. Need to see which would be better consider all scenarios.
@ryan-syed - I think this can be closed now, yes?
Closing this as the number of SQL calls have been reduced and the performance is reasonable. Approach 3 is used. Show is used for getting the databases and then only one SELECT call is used to get all the relevant data.
Currently the Snowflake Go driver needs the database name to make the
GetObjects
call. However, ODBC based driver just requires the server name and warehouse. To make the experience more inline with other existing snowflake drivers, should theGetObjects
be able get the information without needing a database being passed as a default.