go/adbc/driver/snowflake: Without the DB name the GetObjects call fails

ryan-syed commented 11 months ago

Currently the Snowflake Go driver needs the database name to make the GetObjects call. However, ODBC based driver just requires the server name and warehouse. To make the experience more inline with other existing snowflake drivers, should the GetObjects be able get the information without needing a database being passed as a default.

ryan-syed commented 10 months ago

To elaborate when a connection is being made using the current driver without a database name, no USE Database call is made and therefore any call to INFORMATION_SCHEMA needs to be prefixed with the database name. Without the prefix as implemented currently they will fail with the following error: Cannot perform SELECT. This session does not have a current database. Call 'USE DATABASE', or use a qualified name

We were able to reduce the time it takes for the GetObjects call by removing the cursor implementation, however, this inadvertently relies on the fact that a initial database name is provided so that INFORMATION_SCHEMA calls can be made. There is also a bug, where INFORMATION_SCHEMA.Tables without the db name prefixed will only return tables in the intial database name that is set. Same is the case for INFORMATION_SCHEMA.Columns.

I was looking at adding the cursor back to have INFORMATION_SCHEMA calls with the db name prefix and it would slow the calls down, however, there are still other optimizations that can be made as discussed in the proposal below.

A bit about the current GetObjects API:

It fetches the catalogs, schemas, tables, table constraints (currently not implemented for Snowflake Go Driver), and columns. It also does this by making the following calls:

SQL	Notes	Depth
SHOW TERSE DATABASES	This is used to get the matching catalogNames and the driver does the filtering using PatternsToRegexp and catalogPatternMatch	`ObjectDepthCatalogs`
SELECT CATALOG_NAME, SCHEMA_NAME FROM INFORMATION_SCHEMA.SCHEMATA	This is used to get the *catalogName* and corresponding *schemaNames*	`ObjectDepthDBSchemas`
SELECT table_catalog, table_schema, table_name, table_type FROM INFORMATION_SCHEMA.TABLES	This is used to get the tableName and tableType for the corresponding catalogs, and schemas.	`ObjectDepthTables`
SELECT table_catalog, table_schema, table_name, column_name, ordinal_position, is_nullable::boolean, data_type, numeric_precision, numeric_precision_radix, numeric_scale, is_identity::boolean, identity_generation, identity_increment, character_maximum_length, character_octet_length, datetime_precision, comment FROM dbname.INFORMATION_SCHEMA.COLUMNS	This is used to get the columns metadata.	`ObjectDepthAll` or `ObjectDepthColumns`

and missing implementation:

SQL	Notes
`SELECT * FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS`	This will be used to get the table constraints like primary key and foreign key

Constrast to ODBC:

The following analysis is based on SIMBA ODBC Snowflake driver, ODBC Test and QueryHistory for client generated calls in Snowflake. I believe the GetObjects call is equivalent to the following ODBC APIs:

ODBC API	SQL	Notes	Equivalent ADBC
SQLTables	`show objects /* ODBC:TableMetadataSource */in account`	Gets the names, catalogName, schemaName, kind (VIEW or TABLE), comment, and filters within the driver for the given patterns for catalog, schema, and tables	`GetObjects` with `ObjectDepthTables`
SQLColumns	`show /* ODBC:ColumnMetadataSource / columns in database "foo_db"` or `show / ODBC:ColumnMetadataSource */ columns in schema "foo_db"."bar_schema"`	The call was based on whether the schema name was a pattern	It is different from ADBC as catalogName for `SQLColumns` can't be a pattern. However, it is roughly equivalent to `GetObjects` with `ObjectDepthAll` or `ObjectDepthColumns`
SQLPrimaryKeys	`show primary keys /* ODBC:PrimaryKeysMetadataSource */ in table "foo_db"."bar_schema"`	Gets the primary keys	`GetObjects` with `ObjectDepthTables`, though not currently implemented for Snowflake Go Driver.
SQLForeignKeys	`show exported keys /* ODBC:ForeignKeysMetadataSource */ in account`	Get the foreign keys	`GetObjects` with `ObjectDepthTables`, though not currently implemented for Snowflake Go Driver.

Therefore, the GetObjects API masquerades as a single API and performs the functionality of four ODBC APIs based on the depth provided. However, due to this if you wish to only get the foreign keys using the GetObjects you would end up making several other calls.

Proposal:

Instead of making four separate calls, the design could be changed to just making one SQL call that is for the highest depth and inferring all the other information for the previous depths from it.

Example: For the worst case scenario with the depth as ObjectDepthAll or ObjectDepthColumns we can make only SELECT table_catalog, table_schema, table_name, column_name, ordinal_position, is_nullable::boolean, data_type, numeric_precision, numeric_precision_radix, numeric_scale, is_identity::boolean, identity_generation, identity_increment, character_maximum_length, character_octet_length, datetime_precision, comment FROM dbname.INFORMATION_SCHEMA.COLUMNS call and derive information for tables, schemas and catalogs based on the data from this single call.

Now, to decide the underlying API, there are two ways we can go about this:

Underlying API Proposal 1:

Make show calls similar to ODBC Driver

Pros:

SHOW calls are faster
- For my instance, It takes 1-2s for SHOW to get all the information till the ObjectDepthTables depth whereas to get the same information using SELECT takes ~ 5s
- For my instance, It is takes 1-2s for SHOW to get all the information till the ObjectDepthColumns depth whereas to get the same information using SELECT takes ~ 8s
The command does not require a running warehouse to execute.. Therefore probably no cost is incurred for metadata calls

Cons:

we need to filter (patterns) and sort within the driver
- filtering is already done though using PatternsToRegexp and catalogPatternMatch
- This will have to be done in-memory, and there will be a performance hit for consuming all the data then filtering/sorting. This may not be that significant when I did some testing as show objects in account returned only ~600 entries. However, I need to test for show columns in account which return ~8K entries.
The command returns a maximum of 10K records for the specified object type, as dictated by the access privileges for the role used to execute the command; any records above the 10K limit are not returned, even with a filter applied.
- this may become a limitation where there are way too many objects in the DB and something to be aware of.

Underlying API Proposal 2:

Continue with the SELECT calls

Pros:

Existing implementation
No limit for the data returned
- Not sure, why this is not an issue for the ODBC Driver (probably a known limitation)

Cons:

SELECT calls are slower than SHOW calls and metadata fetch will be slower than ODBC.
- In the current API spec anyways ADBC calls will be slower than ODBC because GetObjects is equivalent to four APIs in ODBC as previously mentioned.

Please let me know your thoughts about the changes in design and which API would be better among SHOW and SELECT for metadata calls.

Underlying API Proposal 3:

Hybrid approach, where we let the caller pick which API should be used based on the tradeoffs.

Additional Questions:

What is the use case for GetObjects API to be some dense in terms of data being fetched?
Should we break it up to several APIs similar to ODBC? I notice another thread on it: https://github.com/apache/arrow-adbc/issues/621.

ryan-syed commented 10 months ago

@davidhcoe @zeroshade @jduo @lidavidm Please review and suggest your thoughts about the proposal.

TL;DR

GetObjects makes fours SQL calls at the highest depth of ObjectDepthAll or ObjectDepthColumns. Reduce that call to a single call for the depth provided and infer data for the previous depths based on that call. Also, considering that we need to refactor the SQL calls, should we consider using SHOW instead of SELECT for getting the metadata similar to what ODBC does? We could also have a hybrid approach to allow the caller to toggle between the implementations (though that adds more complexity and maintenance overhead). I have mentioned the trade-offs associated with each.

zeroshade commented 10 months ago

I think proposal 3 is probably best, allowing the consumer to choose the pros and cons. I'm not a fan of the fact that it would truncate the results if we go with the first proposal, but the performance benefits seem significant. So it's a tough call.

@lidavidm thoughts?

ryan-syed commented 10 months ago

Created a draft pull request for reducing the SQL calls: https://github.com/apache/arrow-adbc/pull/1351

I will create another draft PR with removing the cursor call and generating the necessary SQL in Go.

lidavidm commented 10 months ago

So long as the maintenance burden is not excessive. We should also default to the 'correct' method and only use the 'fast' method if desired by the caller.

ryan-syed commented 10 months ago

Created a draft PR with reduced SQL calls (though this makes 2 and doesn't use cursor): #1352

ryan-syed commented 10 months ago

So long as the maintenance burden is not excessive. We should also default to the 'correct' method and only use the 'fast' method if desired by the caller.

Yeah it makes sense to choose the one that's better in the long run. The code without cursor makes an extra SQL call but is slightly faster in the conditions that I have tested. Need to see which would be better consider all scenarios.

davidhcoe commented 9 months ago

@ryan-syed - I think this can be closed now, yes?

ryan-syed commented 8 months ago

Closing this as the number of SQL calls have been reduced and the performance is reasonable. Approach 3 is used. Show is used for getting the databases and then only one SELECT call is used to get all the relevant data.

apache / arrow-adbc