airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

Destination snowflake (and maybe bigquery?): Create staging area with namespace #30492

Closed: edgao closed this issue 1 year ago

edgao commented 1 year ago

Snowflake's create stage query does roughly create stage airbyte_internal.<stream_name>. If there are multiple streams with the same name but different namespaces, they will clobber each other's stages. We should give the stage the same name as the raw table, i.e. create stage airbyte_internal.<namespace>_raw__stream_<name>.
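A minimal sketch of why the current naming collides and how the raw-table-style name avoids it. The function names and the example namespaces here are illustrative only, not Airbyte's actual implementation:

```python
def old_stage_name(namespace: str, stream: str) -> str:
    # Current scheme: the namespace is ignored, so two streams with the
    # same name in different namespaces map to the same stage.
    return f"airbyte_internal.{stream}"

def new_stage_name(namespace: str, stream: str) -> str:
    # Proposed scheme: mirror the raw table name, which includes the
    # namespace, so each (namespace, stream) pair gets its own stage.
    return f"airbyte_internal.{namespace}_raw__stream_{stream}"

# Two streams both named "sales", in hypothetical namespaces "prod" and "staging":
assert old_stage_name("prod", "sales") == old_stage_name("staging", "sales")  # collision
assert new_stage_name("prod", "sales") != new_stage_name("staging", "sales")  # disambiguated
```

With the old scheme, whichever sync creates (or drops) the shared stage last wins, which is why concurrent syncs surface the "Stage ... does not exist or not authorized" error below.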

Check if bigquery has a similar problem for its GCS files.

Errors look like:

ERROR i.a.i.d.s.SnowflakeInternalStagingSqlOperations(uploadRecordsToStage):85 Failed to upload records into stage 2023/09/14/14/720F6F38-07B5-48A6-A757-D348B352174A/ net.snowflake.client.jdbc.SnowflakeSQLException: SQL compilation error:
Stage 'XXX."airbyte_internal"."sales"' does not exist or not authorized.
edgao commented 1 year ago

@pcorbel / @charbelrseif since you both upvoted this issue - are you using snowflake or bigquery? https://github.com/airbytehq/airbyte/issues/28372 exists for bigquery, but if you're both on snowflake then I might just focus on that for now

pcorbel commented 1 year ago

@edgao We are using Snowflake

edgao commented 1 year ago

got it. One more question - are these streams in the same connection? Or are they separated across connections?

(I'm trying to repro this error locally so I can validate that I've actually fixed it)

edgao commented 1 year ago

managed to repro with concurrent syncs, but please let me know if you were seeing this within a single sync (i.e. no other connections running at the same time)!

pcorbel commented 1 year ago

@edgao I am using the same destination with different namespaces for 1000+ connectors and I often have the issue. However, on my staging instance where I have 10-ish connectors, the issue does not come up because I think I do not have concurrent syncs.

edgao commented 1 year ago

got it. Then I think https://github.com/airbytehq/airbyte/pull/30599 will solve the problem, and it adds a test to prevent future regressions. I'll update here when we release a new version of destination-snowflake with this fix. Thanks for the info!

pcorbel commented 1 year ago

Thanks @edgao. In the meantime, do you know if there is a simple way for me to test the new code? For example, is there a Docker image available for this PR somewhere, and could I update something in the database to make Airbyte use the right tag?

edgao commented 1 year ago

lemme start a prerelease publish! That would be super helpful to verify that it works, thanks for the offer :)

running in https://github.com/airbytehq/airbyte/actions/runs/6263421961; will update here when it's ready for use.

edgao commented 1 year ago

try updating your destination-snowflake version to 3.1.12-dev.05cdd1566b - that will switch you to the prerelease version.

pcorbel commented 1 year ago

The fix is working great, thanks!

edgao commented 1 year ago

awesome, thanks for being the guinea pig :) I'll work with the team to get this released for real.

edgao commented 1 year ago

The latest destination-snowflake release has a regression: the Java CDK refactor pulled in an outdated version of some core code. Working on a CDK republish plus an updated destination-snowflake release.