Closed cgardens closed 3 years ago
On BQ, the related issue is https://github.com/airbytehq/airbyte/issues/1479
+1 to this! Snowflake (as well as BQ/Redshift) provides rich functionality for bulk loading, with support for a huge variety of file encodings and file format nuances. Some of this will overlap with the configuration offered by "pandas.read_csv", but there will be others as well.
For example, they even support loading Parquet/JSON/Avro. Personally (for my use case) I'd love it if all of Snowflake's config options (https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html) were surfaced in Airbyte in some way, and Airbyte then excelled at the scheduling, error transparency, and other concerns related to operationalizing the loads. Just my two cents.
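To make the request above concrete, here is a minimal sketch of how a connector might assemble a Snowflake `COPY INTO` statement from user-supplied file-format options. The table name, stage name, and option values are hypothetical; a real connector would execute the resulting SQL through its Snowflake session.

```python
def snowflake_copy_sql(table, stage, file_format_options):
    """Build a Snowflake COPY INTO statement from a dict of FILE_FORMAT
    options (e.g. TYPE, SKIP_HEADER, FIELD_DELIMITER)."""
    opts = " ".join(f"{key} = {value}" for key, value in file_format_options.items())
    return f"COPY INTO {table} FROM @{stage} FILE_FORMAT = ({opts})"

# Hypothetical usage: load Parquet files already staged in @my_stage.
sql = snowflake_copy_sql("events", "my_stage", {"TYPE": "PARQUET"})
# A connector would then run this via its driver, e.g. cursor.execute(sql)
# with snowflake-connector-python.
```

Surfacing the `FILE_FORMAT` options as a passthrough dict is one way Airbyte could expose Snowflake's full configuration surface without re-modeling each option.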
+1 to this please!
Someone else complained about it: https://airbytehq.slack.com/archives/C01A4CAP81L/p1615952521001700
I think if we can generalize the solution, it will accelerate the deltalake integration.
@chriskl is trying to add 80,000 rows from Salesforce and it is taking way too long
Definitely +1 to this!
I've been following Airbyte progress closely while considering if we have a solid use case for it, and the only way that would feel viable is if it could replace all of our "regular" ingestion pipelines via REST APIs, Kafka & PostgreSQL extracts. Data volumes go from 60k - 100k on the low end of the scale (one of our smaller Kafka topics), a few million somewhere in the middle (from a REST API with our CRM data), and all the way to around 100 million from our daily full extract of a handful of tables from a PostgreSQL backend database (which could ideally be incremental perhaps).
Right now we usually dump data into CSV files in S3, but Avro, ORC and Parquet would definitely work too. Importing into Exasol runs in parallel across all the cluster nodes.
Redshift documents how to do the file upload, including how to handle incremental syncs using that method: docs
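The staged-upload pattern Redshift documents boils down to writing files to S3 and issuing a single `COPY`. A sketch of the statement a destination connector might generate (the table, bucket path, and IAM role ARN are placeholders):

```python
def redshift_copy_sql(table, s3_path, iam_role):
    """Build a Redshift COPY statement that bulk-loads gzipped CSV files
    from S3 in one command instead of issuing row-by-row INSERTs."""
    return f"COPY {table} FROM '{s3_path}' IAM_ROLE '{iam_role}' CSV GZIP"

# Hypothetical usage after the connector has uploaded part files to S3:
sql = redshift_copy_sql(
    "events",
    "s3://my-bucket/airbyte-staging/events/",
    "arn:aws:iam::123456789012:role/redshift-copy",
)
```

For incremental syncs the same `COPY` lands rows in a staging table first, followed by a merge/delete-insert into the target, as the linked docs describe.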
Seems like Postgres also follows same pattern: docs
The Redshift portion of this ticket (#2569) ~~currently being worked on~~ is done.
Snowflake upload throughput is definitely way too low with the default method. This also significantly impacts cost for the customer, since you end up with the ingestion warehouse running the whole time.
@jnoynaert that's a very good point!
@michel-tricot @jrhizor In case it helps with implementation... here are some good starting summaries of Snowflake update strategies in Rivery's docs:
As I understand it the Snowflake ecosystem team also works directly with integration vendors to optimize load/update strategies if you can get in touch with them and get a slot on their priority list.
COPY FROM STDIN (client-side copy) (hoping for Wednesday)
I'm not planning on adding Azure staging for Snowflake as part of this unless we hear demand for it. This is mostly because we would have to set up an account.
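For reference, the client-side copy mentioned above hinges on a single `COPY ... FROM STDIN` command; the driver then streams all rows over one connection rather than issuing an `INSERT` per record. A minimal sketch (table and column names are hypothetical):

```python
def copy_stdin_sql(table, columns):
    """Build the client-side COPY command for Postgres. A driver call such
    as psycopg2's copy_expert would then stream the CSV rows to the server
    in a single round trip instead of one INSERT per record."""
    return f"COPY {table} ({', '.join(columns)}) FROM STDIN WITH (FORMAT csv)"

sql = copy_stdin_sql("users", ["id", "email"])
# e.g. with psycopg2: cursor.copy_expert(sql, csv_file_like_object)
```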
is redshift s3 staging part of the refactor as well?
yes
@cgardens could we use MinIO as a generic solution for this? It's open-source and S3-compatible (mentioning it because of the API), and it runs on Docker, so we could use it as middleware to cache data, generate Parquet files/zips to dump into data lakes, etc. And we don't need to worry about latency:
> MinIO is the world’s fastest object storage server. With READ/WRITE speeds of 183 GB/s and 171 GB/s on standard hardware...
@marcosmarxm how would using minIO as a generic solution work? do you mean always copying from minIO when delivering to cloud databases/warehouses?
Yes
Some warehouses like Snowflake can't copy data from arbitrary cloud storage, even if it is S3-compatible.
We are tentatively planning on using minIO for single node / testing deploys on Kubernetes.
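The appeal of an S3-compatible store for testing is that the only thing that changes between AWS S3 and a local MinIO container is the endpoint and credentials. A hedged sketch of that idea (the endpoint and keys are placeholders, and the kwargs shown are the ones an S3 client library such as boto3 accepts):

```python
def s3_compatible_client_kwargs(endpoint, access_key, secret_key):
    """Connection settings for any S3-compatible object store. With boto3
    these would be passed as boto3.client("s3", **kwargs); MinIO speaks the
    same API, so one code path can target AWS S3 in production and a local
    MinIO container in single-node / test deploys."""
    return {
        "endpoint_url": endpoint,
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
    }

# Hypothetical local MinIO instance used for a test deploy:
kwargs = s3_compatible_client_kwargs("http://localhost:9000", "minio", "minio123")
```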
I'm closing this issue.
Current state:
There are definitely more speed improvements we can do (parallelize writing streams, change some of our batching/buffering strategies, etc), but I think this covers the main point of this issue.
There are some feature requests mentioned here that don't really fall under the purview of the original issue.
@smomen We aren't exactly addressing the general bulk loading request you made. Our file reading source, however, allows you to read from a variety of formats (including parquet), and that data will be ultimately batch loaded into the warehouse. This isn't as efficient as it theoretically could be since it downloads and processes it locally, but it does give more functionality around selecting fields, syncing strategies, etc. Were there specific formats you were interested in supporting on the source side? Or do you have certain throughput goals for syncs that we aren't reaching yet? If so, I would recommend creating separate issues for this.
@jnoynaert / @jwarlander / @erichartono / @chriskl / anyone watching this ticket: We'd love to hear if the newest versions of these connectors are loading data quickly enough for your use cases. Also, if our changes here were somewhat orthogonal to what you were hoping for, please create an issue.
Thanks!
Tell us about the problem you're trying to solve
`insert` statements.

Describe the solution you’d like
Open Questions