airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

[destination-snowflake] executes excessive metadata queries #37311

Open · barakavra opened this issue 4 months ago

barakavra commented 4 months ago

Connector Name

destination-snowflake

Connector Version

3.7.0

What step the error happened?

None

Relevant information

I was monitoring the Snowflake queries issued by the Airbyte platform, and there is a significant performance improvement that could be made in the Snowflake destination code. Three queries, all related to the destination tables' metadata, are executed many times during a sync even though their results could be cached. For example, instead of executing the following query 409,860 times a week, the result could be cached at the beginning of each refresh, per connector:

SELECT table_schema, table_name, column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_catalog = ?
  AND table_schema IN (?)
  AND table_name IN (?)
ORDER BY table_schema, table_name, ordinal_position;
by altering it to:
SELECT table_schema, table_name, column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_catalog = ?
  AND table_schema IN (?)
ORDER BY table_schema, table_name, ordinal_position;

As this is only a small dataset, the extra cost is worth it, and wrapping the lookup in exception handling guarantees that even if a schema change happens during the refresh, the load will still execute as planned.
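To make the suggestion concrete, here is a minimal sketch of the caching idea (hypothetical class and names, not the actual connector code): warm the cache once per refresh with the schema-wide query above, then serve all per-table lookups from memory.

import java.sql.Connection

// Hypothetical types; the real connector's structures differ.
data class ColumnMeta(val name: String, val dataType: String, val isNullable: Boolean)

class TableMetadataCache(private val conn: Connection, private val catalog: String) {
    // (schema, table) -> columns, filled once at the start of a refresh.
    private val cache = mutableMapOf<Pair<String, String>, MutableList<ColumnMeta>>()

    fun warm(schema: String) {
        val sql = """
            SELECT table_schema, table_name, column_name, data_type, is_nullable
            FROM information_schema.columns
            WHERE table_catalog = ? AND table_schema = ?
            ORDER BY table_schema, table_name, ordinal_position
        """.trimIndent()
        conn.prepareStatement(sql).use { stmt ->
            stmt.setString(1, catalog)
            stmt.setString(2, schema)
            stmt.executeQuery().use { rs ->
                while (rs.next()) {
                    val key = rs.getString("TABLE_SCHEMA") to rs.getString("TABLE_NAME")
                    cache.getOrPut(key) { mutableListOf() }.add(
                        ColumnMeta(
                            rs.getString("COLUMN_NAME"),
                            rs.getString("DATA_TYPE"),
                            rs.getString("IS_NULLABLE") == "YES"
                        )
                    )
                }
            }
        }
    }

    // An empty result means the table is unknown to the cache (for example it
    // was created after warm-up), so callers can fall back to a live query,
    // preserving the behaviour described above when the schema changes mid-refresh.
    fun columns(schema: String, table: String): List<ColumnMeta> =
        cache[schema to table] ?: emptyList()
}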

[Screenshots attached: image (27), image (26), image (25)]

Relevant log output

No response


marcosmarxm commented 4 months ago

Thanks for reporting this, @barakavra. I added the suggestion to the team backlog for further discussion.

barakavra commented 3 months ago

Thanks @marcosmarxm. Just to stress this: altering this behaviour could spare Snowflake users around 30 Snowflake credits daily (roughly $30K per year).

marcosmarxm commented 3 months ago

Thanks for the info, @barakavra. @airbytehq/destinations, can you take a look at this issue?

pcorbel commented 2 months ago

Hello @marcosmarxm, do you have any news regarding this issue? It prevents us from upgrading to the latest version of the Snowflake destination (we have ~2k connectors relying on it).

evantahler commented 2 months ago

Since this issue was filed, we did reduce the number of 'metadata queries' to one-per-stream per sync. This can probably be reduced even further to a single query for all streams in the sync.
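As a sketch of what that could look like (a hypothetical helper, not the connector's actual code), the IN lists of the query quoted at the top of the issue can be expanded to cover every stream in the sync. Note that the two IN lists form a cross product, so results for unintended schema/table combinations must be filtered out client-side.

// Hypothetical: build one information_schema query covering all streams.
fun buildBatchedColumnsQuery(schemas: Collection<String>, tables: Collection<String>): String {
    val schemaPlaceholders = schemas.joinToString(", ") { "?" }
    val tablePlaceholders = tables.joinToString(", ") { "?" }
    return """
        SELECT table_schema, table_name, column_name, data_type, is_nullable
        FROM information_schema.columns
        WHERE table_catalog = ?
          AND table_schema IN ($schemaPlaceholders)
          AND table_name IN ($tablePlaceholders)
        ORDER BY table_schema, table_name, ordinal_position
    """.trimIndent()
}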

barakavra commented 2 months ago

Indeed, the following query is still running for hours (and we have only ±20 connectors):

SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_catalog = ?
  AND table_schema = ?
  AND table_name = ?
ORDER BY ordinal_position;

Although I am green in the Airbyte platform backend, my thinking is: since Airbyte is the sole owner of these tables'/stages' metadata (as it should be), I don't see any reason not to store this data when the connection settings are saved, and query it only once, out of Airbyte's internal database (Postgres), when a connection is triggered. This is from just yesterday: [screenshot attached]
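A sketch of that idea, with an entirely hypothetical table in the platform's internal Postgres (Airbyte's actual internal schema is not structured this way): persist the destination columns when the connection settings are saved, then read them back once when a sync is triggered.

import java.sql.DriverManager

// Hypothetical internal table:
//   destination_column_metadata(connection_id, table_schema, table_name,
//                               column_name, data_type, is_nullable)
// Reuses the ColumnMeta type from the earlier sketch.
fun loadCachedColumns(pgUrl: String, connectionId: String): List<ColumnMeta> =
    DriverManager.getConnection(pgUrl).use { conn ->
        val sql = """
            SELECT column_name, data_type, is_nullable
            FROM destination_column_metadata
            WHERE connection_id = ?
        """.trimIndent()
        conn.prepareStatement(sql).use { stmt ->
            stmt.setString(1, connectionId)
            stmt.executeQuery().use { rs ->
                buildList {
                    while (rs.next()) {
                        add(ColumnMeta(rs.getString(1), rs.getString(2), rs.getString(3) == "YES"))
                    }
                }
            }
        }
    }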

evantahler commented 2 months ago

Grooming notes:

barakavra commented 2 months ago

@evantahler kindly also research the cost of metadata-only queries: they consume Snowflake's "cloud services" credits, which at a high query volume can cost more than the warehouse itself. Attached are our approximate daily Snowflake costs. For example, the JDBC Kafka Connect connector issued a select-from-dual command before each flush; removing that command saved us about 30% of the costs.

dpolman07 commented 1 month ago

Hi @evantahler @marcosmarxm - Is there any timeline on when this might be resolved? We're running into the same problem when using Snowflake as the destination. The queries below ran for over 7 hours the other day:

SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_catalog = ?
  AND table_schema = ?
  AND table_name = ?
ORDER BY ordinal_position;

SELECT SCHEMA_NAME
FROM information_schema.schemata
WHERE schema_name = ?
AND catalog_name = ?;

We're trying to finalize Airbyte as our ELT solution, and a resolution for this would be great to see! Thank you!

evantahler commented 3 weeks ago

Some additional notes from @stephane-airbyte:

Basically there are two types of compute: warehouse cost and "cloud services" cost. SELECT (almost) always uses a warehouse, while DESC/SHOW only incur cloud services cost (there's not even a need for a running warehouse). In our case, there's a significant difference between doing a SHOW TABLES and a SHOW TABLES LIKE 'xxx' IN SCHEMA 'bbb'. Think of it in terms of index utilization: the first one will bring back all the data (up to 10k rows) after doing some internal joins (in FoundationDB) and process it in cloud services, while the second one (because it's an exact match rather than a range query) should do 2-3 index lookups (one for the database, one for the schema, and one for the table) and return only that data to cloud services, which has nothing left to do after that. If we were to use a '%abc' pattern in the LIKE clause, all the filtering would be done in cloud services compute, which can add up in terms of cost.
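For illustration, a hedged JDBC sketch of the exact-match form (hypothetical function; SHOW commands cannot take bind parameters, so the identifiers are interpolated and would need validation in real code):

import java.sql.DriverManager

// Exact-match SHOW TABLES is served by cloud services via index lookups and
// needs no running warehouse, unlike a SELECT against information_schema.
fun tableExists(jdbcUrl: String, database: String, schema: String, table: String): Boolean =
    DriverManager.getConnection(jdbcUrl).use { conn ->
        conn.createStatement().use { stmt ->
            // SHOW does not accept bind parameters; sanitize these identifiers
            // before interpolating them outside of a sketch.
            val sql = "SHOW TABLES LIKE '" + table + "' IN SCHEMA " + database + "." + schema
            stmt.executeQuery(sql).use { rs -> rs.next() }
        }
    }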

evantahler commented 3 weeks ago

The code in question can be found here: https://github.com/airbytehq/airbyte/blob/224db75b2283b1b0efa2b0d21b5b8fb420beff38/airbyte-integrations/connectors/destination-snowflake/src/main/kotlin/io/airbyte/integrations/destination/snowflake/typing_deduping/SnowflakeDestinationHandler.kt#L498-L517