airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

Batch File Upload to Cloud Databases #1958

Closed: cgardens closed this issue 3 years ago

cgardens commented 3 years ago

Tell us about the problem you're trying to solve

Describe the solution you’d like

Open Questions

ChristopheDuong commented 3 years ago

On BQ, the related issue is https://github.com/airbytehq/airbyte/issues/1479

smomen commented 3 years ago

+1 to this! Snowflake (as well as BQ/Redshift) provides rich functionality for bulk loading, with support for a huge variety of file encodings and file format nuances. Some of this will overlap with the configuration offered by `pandas.read_csv`, but there will be others as well.

For example, they even support loading Parquet/JSON/Avro. Personally (for my use case) I'd love it if all of Snowflake's config options (https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html) were surfaced in Airbyte in some way, and Airbyte then excelled at the scheduling, error transparency, and other concerns related to operationalizing the loads. Just my 2 cents.
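
For illustration, here is a minimal sketch of the kind of COPY INTO options being referred to, run through the snowflake-connector-python package. This is not Airbyte's implementation; the account credentials, stage, and table names are hypothetical.

```python
# Hypothetical illustration of Snowflake COPY INTO file-format options
# (see the copy-into-table docs linked above); names are made up.
import snowflake.connector  # assumes the snowflake-connector-python package

conn = snowflake.connector.connect(
    account="my_account",
    user="loader",
    password="********",
    warehouse="LOAD_WH",
    database="ANALYTICS",
    schema="RAW",
)

copy_sql = """
COPY INTO RAW.EVENTS                       -- hypothetical target table
FROM @MY_S3_STAGE/events/                  -- hypothetical external stage
FILE_FORMAT = (
    TYPE = CSV
    SKIP_HEADER = 1
    FIELD_OPTIONALLY_ENCLOSED_BY = '"'
    NULL_IF = ('', 'NULL')
)
ON_ERROR = 'ABORT_STATEMENT'
PURGE = FALSE
"""

cur = conn.cursor()
try:
    cur.execute(copy_sql)  # Snowflake ingests the staged files in bulk
finally:
    cur.close()
    conn.close()
```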

erichartono commented 3 years ago

+1 to this please!

michel-tricot commented 3 years ago

Someone else complained about it: https://airbytehq.slack.com/archives/C01A4CAP81L/p1615952521001700

I think if we can generalize the solution, it will accelerate the deltalake integration.

@chriskl is trying to add 80,000 rows from Salesforce and it is taking way too long.

jwarlander commented 3 years ago

Definitely +1 to this!

I've been following Airbyte's progress closely while considering whether we have a solid use case for it, and the only way it would feel viable is if it could replace all of our "regular" ingestion pipelines from REST APIs, Kafka & PostgreSQL extracts. Data volumes range from 60k-100k at the low end of the scale (one of our smaller Kafka topics), to a few million somewhere in the middle (a REST API with our CRM data), all the way up to around 100 million from our daily full extract of a handful of tables from a PostgreSQL backend database (which ideally could be incremental).

Right now we usually dump data into CSV files in S3, but Avro, ORC and Parquet would definitely work too. Importing into Exasol runs in parallel across all the cluster nodes.

cgardens commented 3 years ago

Redshift documents how to do the file upload, including how to handle incremental syncs using that method: docs
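
As a rough sketch of the S3-backed COPY those Redshift docs describe (the bucket, table, and IAM role ARN below are hypothetical placeholders):

```python
# Sketch of a Redshift bulk load from staged S3 files; bucket, table,
# and IAM role ARN are hypothetical.
REDSHIFT_COPY = """
COPY raw.users
FROM 's3://airbyte-staging/users/part-0001.csv'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-loader'
FORMAT AS CSV
IGNOREHEADER 1
"""

def load_batch(cursor):
    """Run the COPY on an already-open Redshift DB-API cursor (e.g. psycopg2)."""
    cursor.execute(REDSHIFT_COPY)
```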

cgardens commented 3 years ago

Same for Snowflake docs.

BigQuery doesn't specifically mention the merging use case, but I'm assuming we just use their file upload (docs) and then write a query to do the merge.
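
As an illustration of that "load to a staging table, then merge" pattern (Snowflake and BigQuery support MERGE directly; on Redshift the same effect is typically a DELETE plus INSERT in one transaction), here is a Snowflake-flavored sketch with hypothetical staging/target tables and a hypothetical primary key `id`:

```python
# Illustrative only: a generic stage-then-merge incremental sync with
# hypothetical table names and key columns.

STAGE_LOAD = """
COPY INTO RAW._AIRBYTE_TMP_USERS           -- hypothetical staging table
FROM @MY_STAGE/users/
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
"""

MERGE = """
MERGE INTO RAW.USERS AS tgt
USING RAW._AIRBYTE_TMP_USERS AS src
    ON tgt.id = src.id
WHEN MATCHED THEN UPDATE SET
    name = src.name,
    updated_at = src.updated_at
WHEN NOT MATCHED THEN INSERT (id, name, updated_at)
    VALUES (src.id, src.name, src.updated_at)
"""

def incremental_sync(cursor):
    """Load staged files, merge into the target, then clear the staging table."""
    cursor.execute(STAGE_LOAD)
    cursor.execute(MERGE)
    cursor.execute("TRUNCATE TABLE RAW._AIRBYTE_TMP_USERS")
```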

ChristopheDuong commented 3 years ago

Seems like Postgres also follows the same pattern: docs
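
For completeness, a minimal sketch of the Postgres equivalent (COPY ... FROM STDIN streamed through the client), assuming psycopg2 and hypothetical connection details, table, and file path:

```python
# Minimal sketch of Postgres bulk loading via COPY FROM STDIN using
# psycopg2's copy_expert; connection string and paths are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=analytics user=loader host=localhost")
try:
    with conn.cursor() as cur, open("/tmp/users.csv") as f:
        cur.copy_expert(
            "COPY raw.users FROM STDIN WITH (FORMAT csv, HEADER true)",
            f,
        )
    conn.commit()
finally:
    conn.close()
```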

davinchia commented 3 years ago

The Redshift portion of this ticket, #2569, ~currently being worked on~ is done.

jnoynaert commented 3 years ago

Snowflake upload throughput is definitely way too low with the default method. This also significantly impacts cost for the customer, since you end up with the ingestion warehouse running the whole time.

michel-tricot commented 3 years ago

@jnoynaert that's a very good point!

jnoynaert commented 3 years ago

@michel-tricot @jrhizor In case it helps with implementation... here are some good starting summaries of Snowflake update strategies in Rivery's docs:

As I understand it, the Snowflake ecosystem team also works directly with integration vendors to optimize load/update strategies, if you can get in touch with them and get a slot on their priority list.

jrhizor commented 3 years ago

I'm not planning on adding Azure staging for Snowflake as part of this unless we hear there's demand for it. This is mostly because we would have to set up an account.

cgardens commented 3 years ago

is redshift s3 staging part of the refactor as well?

jrhizor commented 3 years ago

yes

marcosmarxm commented 3 years ago

@cgardens could we use MinIO as a generic solution for this? It's open-source and S3-compatible (just mentioning it because of the API), and it runs on Docker, so we could use it as middleware to cache data and generate Parquet/zip files to dump into data lakes, etc. And we don't need to worry about latency:

> MinIO is the world’s fastest object storage server. With READ/WRITE speeds of 183 GB/s and 171 GB/s on standard hardware, ...
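
To illustrate the S3-compatibility point, the same boto3 client code that targets AWS S3 can point at a local MinIO by overriding the endpoint. The endpoint, bucket, and credentials below are hypothetical local-development defaults, not anything Airbyte ships:

```python
# Sketch: staging a batch file to a local MinIO through the standard S3 API.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",   # MinIO instead of AWS S3
    aws_access_key_id="minioadmin",          # hypothetical dev credentials
    aws_secret_access_key="minioadmin",
    region_name="us-east-1",
)

# Upload a batch file exactly as we would with real S3.
s3.upload_file("/tmp/users.csv", "airbyte-staging", "users/part-0001.csv")
```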

davinchia commented 3 years ago

@marcosmarxm how would using minIO as a generic solution work? do you mean always copying from minIO when delivering to cloud databases/warehouses?

marcosmarxm commented 3 years ago

> @marcosmarxm how would using minIO as a generic solution work? do you mean always copying from minIO when delivering to cloud databases/warehouses?

Yes

jrhizor commented 3 years ago

Some warehouses, like Snowflake, can't copy data from arbitrary cloud storage, even if it is S3-compatible.

We are tentatively planning on using MinIO for single-node / testing deploys on Kubernetes.

jrhizor commented 3 years ago

I'm closing this issue.

Current state:

There are definitely more speed improvements we can make (parallelizing writing streams, changing some of our batching/buffering strategies, etc.), but I think this covers the main point of this issue.

There are some feature requests mentioned here that don't really fall under the purview of the original issue.

@smomen We aren't exactly addressing the general bulk loading request you made. Our file reading source, however, allows you to read from a variety of formats (including Parquet), and that data will ultimately be batch loaded into the warehouse. This isn't as efficient as it theoretically could be, since the source downloads and processes the data locally, but it does give more functionality around selecting fields, syncing strategies, etc. Were there specific formats you were interested in supporting on the source side? Or do you have certain throughput goals for syncs that we aren't reaching yet? If so, I would recommend creating separate issues for those.

@jnoynaert / @jwarlander / @erichartono / @chriskl / anyone watching this ticket: We'd love to hear if the newest versions of these connectors are loading data quickly enough for your use cases. Also, if our changes here were somewhat orthogonal to what you were hoping for, please create an issue.

Thanks!