Redshift destination performance improvements

airbytehq / airbyte

Redshift destination performance improvements #4871

Open sherifnada opened 3 years ago

sherifnada commented 3 years ago

Tell us about the problem you're trying to solve

I came across this blog post the other day and realized that many of these optimizations can apply to our Redshift destination as well. We should apply them where possible.

yahu98 commented 3 years ago

I'm interested in taking on this task. Could I get assigned to this issue if no one else is currently working on it?

sherifnada commented 3 years ago

@yahu98 no one else is working on it atm! Feel free to self-assign. Is there any help or support we can offer?

yahu98 commented 3 years ago

> @yahu98 no one else is working on it atm! Feel free to self-assign. Is there any help or support we can offer?

Thank you! I don't have any questions so far, as I'm still going through the documentation and the destination code to wrap my head around how everything works, but I'll reach out if I run into any blockers or questions.

yahu98 commented 3 years ago

@sherifnada Hi Sherif, could I be added to the repo as a collaborator? Currently I don't have access to assign myself to this issue.

sherifnada commented 3 years ago

@yahu98 unfortunately I can't add you as a collaborator for security reasons. The standard workflow for community contributors is to fork the repository and then create a PR from your fork into Airbyte's repo.

archaean commented 3 years ago

I think one of the biggest improvements that can be made to the current Redshift destination connector is splitting the data into partitions and then issuing a single COPY command for multiple files (via a manifest file).

Because COPY loads multiple files in parallel across the cluster's slices, this improvement can be quite significant, depending on the number of compute nodes in your Redshift cluster.

So far, this optimization has been deliberately skipped because of its complexity. However, with this change to support multiple files for Snowflake, Redshift support for it may not be far off.
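
To make the idea concrete, here is a rough Python sketch of the partition-plus-manifest approach. It is not the connector's actual code (the connector is written in Java), and the helper names `partition_rows`, `build_manifest`, and `build_copy_statement`, along with the bucket, table, and IAM role values in the usage note, are made up for illustration. It only builds the manifest JSON and the COPY statement as strings; uploading the part files to S3 and executing the statement are left out.

```python
import json
from typing import Iterable, List


def partition_rows(rows: List[str], num_parts: int) -> List[List[str]]:
    """Split rows round-robin into num_parts roughly equal partitions.

    Hypothetical helper: a real connector would stream records into
    part files instead of holding them all in memory.
    """
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")
    return [rows[i::num_parts] for i in range(num_parts)]


def build_manifest(s3_urls: Iterable[str]) -> str:
    """Return a Redshift COPY manifest (JSON) listing every part file.

    "mandatory": true makes COPY fail if a listed file is missing.
    """
    entries = [{"url": url, "mandatory": True} for url in s3_urls]
    return json.dumps({"entries": entries}, indent=2)


def build_copy_statement(table: str, manifest_url: str, iam_role: str) -> str:
    """Build one COPY statement that loads all files named in the manifest.

    Assumes uncompressed CSV part files; values are interpolated directly,
    so this is only suitable for trusted, illustrative input.
    """
    return (
        f"COPY {table} FROM '{manifest_url}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV MANIFEST;"
    )
```

For example, with a hypothetical bucket `my-bucket`, you would write each partition to `s3://my-bucket/staging/part-N.csv`, upload the output of `build_manifest(...)` as `s3://my-bucket/staging/load.manifest`, and then run the statement from `build_copy_statement("my_table", "s3://my-bucket/staging/load.manifest", "<role-arn>")`. A number of partitions that is a multiple of the number of slices in the cluster is the usual starting point.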