airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

[EPIC] Support schema selection in database and warehouse source connectors #2781

Closed · davinchia closed this 1 month ago

davinchia commented 3 years ago

Tell us about the problem you're trying to solve

We currently try to discover all schemas in a database when discovering a source's schema. This can lead to issues when a source has too many schemas or tables: the catalog becomes too big and cannot be saved in our database.

This led to https://github.com/airbytehq/airbyte/issues/2619.

Definitely a nice-to-have rather than a must-have.

Describe the solution you’d like

Expose some sort of schema regex so users can specify what they want included in the discover job.
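
As a sketch of what this could look like at the JDBC level, here is a minimal, hypothetical helper that applies a user-supplied regex while enumerating schemas, so the discover job never touches tables in unrelated schemas. Class and method names are illustrative, not the connector's actual API:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public final class SchemaFilteredDiscovery {

  // Return only the schema names that match the user-supplied regex.
  // Everything else is skipped before any per-table metadata is fetched.
  public static List<String> discoverSchemas(Connection connection, String schemaRegex) throws Exception {
    Pattern pattern = Pattern.compile(schemaRegex);
    List<String> matching = new ArrayList<>();
    try (ResultSet schemas = connection.getMetaData().getSchemas()) {
      while (schemas.next()) {
        String name = schemas.getString("TABLE_SCHEM");
        if (pattern.matcher(name).matches()) {
          matching.add(name);
        }
      }
    }
    return matching;
  }
}
```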

Describe the alternative you’ve considered or used

Allow users to specify tables to sync in addition to schemas.
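
A table-level allowlist could similarly lean on the schema and table name patterns that JDBC metadata calls already accept (`%` and `_` wildcards). Again, a hypothetical sketch, not the connector's real interface:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public final class TableFilteredDiscovery {

  // getTables() natively supports schema and table name patterns,
  // so discovery can be narrowed without listing every table first.
  public static List<String> discoverTables(Connection connection,
                                            String schemaPattern,
                                            String tablePattern) throws Exception {
    List<String> matching = new ArrayList<>();
    try (ResultSet tables = connection.getMetaData()
        .getTables(null, schemaPattern, tablePattern, new String[] {"TABLE"})) {
      while (tables.next()) {
        matching.add(tables.getString("TABLE_SCHEM") + "." + tables.getString("TABLE_NAME"));
      }
    }
    return matching;
  }
}
```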

TODOs

| Source Connector | Status | Priority | Note |
| --- | --- | --- | --- |
| Postgres | - | | |
| Oracle | - | | |
| SQL Server | | High | |
| Redshift | https://github.com/airbytehq/airbyte/issues/9525 | High | |
| Snowflake | | High | |
| BigQuery | | High | The schema-equivalent concept in BigQuery is datasets. |
| DB2 | | Low | |
| CockroachDB | | Low | |
| Clickhouse | - | - | Clickhouse does not support schemas. |
| MySQL | - | - | MySQL does not support schemas. |
| MongoDB | - | - | MongoDB does not support schemas. |

Deninc commented 3 years ago

I'm evaluating Airbyte and our source DB has too many tables, which caused "io.temporal.failure.ServerFailure: Complete result exceeds size limit.".

IMO this is the key feature for us to continue using Airbyte.

cgardens commented 3 years ago

@Deninc which database are you using?

Deninc commented 3 years ago

@cgardens I'm using Oracle.

OK, so I was able to get past the issue by setting BlobSizeLimitError. However, it's still very slow and laggy to scroll past a thousand tables. I know I only need 10 tables out of 1000, so this feature will definitely help the user experience.
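
For anyone hitting the same error: the limit in question is Temporal's blob size limit, which Temporal reads from its dynamic config file. A sketch of an override with illustrative byte values; where that file lives depends on how your Temporal instance is deployed:

```yaml
# Temporal dynamic config (e.g. dynamicconfig/development.yaml)
limit.blobSize.warn:
  - value: 10485760    # start warning at 10 MiB
    constraints: {}
limit.blobSize.error:
  - value: 15728640    # hard-fail at 15 MiB
    constraints: {}
```

Raising the limit only masks the underlying problem, though: the catalog still grows with every discovered table.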

ogirardot commented 2 years ago

Hi there, I think I have the same kind of issue using the latest (0.1.7) MongoDB source: there are a lot of collections, and the 1h timeout is reached with no way to discover the schema or use any kind of fallback. Is there any countermeasure in the meantime?

cgardens commented 2 years ago

@tuliren for visibility. I'm not sure what priority this should have relative to other DB issues, but I just wanted to make sure you saw it.

alafanechere commented 2 years ago

A workaround could be to suggest that our users create a Mongo user dedicated to Airbyte, and only discover the collections that this user has read privileges on.
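
A sketch of that workaround with the MongoDB Java driver, assuming discovery only lists collections the connecting user is authorized to read; the database, collection, and credential names are placeholders:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoDatabase;
import org.bson.Document;

import java.util.List;

public final class CreateAirbyteReader {
  public static void main(String[] args) {
    try (MongoClient client = MongoClients.create("mongodb://admin:secret@localhost:27017")) {
      MongoDatabase appDb = client.getDatabase("app");

      // A custom role limited to "find" on the collections Airbyte should sync.
      appDb.runCommand(new Document("createRole", "airbyteReader")
          .append("privileges", List.of(
              new Document("resource", new Document("db", "app").append("collection", "orders"))
                  .append("actions", List.of("find")),
              new Document("resource", new Document("db", "app").append("collection", "customers"))
                  .append("actions", List.of("find"))))
          .append("roles", List.of()));

      // A dedicated user that Airbyte connects as, holding only that role.
      appDb.runCommand(new Document("createUser", "airbyte")
          .append("pwd", "change-me")
          .append("roles", List.of("airbyteReader")));
    }
  }
}
```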

maharshi-zluri commented 2 years ago

@alafanechere I absolutely agree. It would save us a lot of time in the future and make the connection setup process much smoother.

tuliren commented 2 years ago

> @cgardens I'm using Oracle.
>
> OK, so I was able to get past the issue by setting BlobSizeLimitError. However, it's still very slow and laggy to scroll past a thousand tables. I know I only need 10 tables out of 1000, so this feature will definitely help the user experience.

@Deninc, by the way, the Oracle connector has supported schema specification since version 0.3.3.
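
For reference, a hypothetical Oracle source config fragment showing schema selection; the exact property names come from the connector's spec, so treat this shape as illustrative only:

```json
{
  "host": "db.example.com",
  "port": 1521,
  "sid": "ORCL",
  "username": "airbyte",
  "password": "change-me",
  "schemas": ["SALES", "HR"]
}
```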

Amphagory commented 1 year ago

Was there a Snowflake workaround for this problem? Is there a way to increase BlobSizeLimitError via an argument or in the config file?

cgardens commented 1 month ago

done