airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

Destination S3: add delta lake/delta table support #16322

Open mustafa-rmd opened 2 years ago

mustafa-rmd commented 2 years ago

My current requirement is the following data pipeline: PostgreSQL (source) -> Airbyte -> MinIO S3 storage (destination) -> Apache Spark, configured with MinIO and Delta Lake formatting, since Spark by itself doesn't support ACID transactions.

The goal is to have Airbyte move data from PostgreSQL (source) to MinIO storage (destination) and save it in Delta format. Spark will then read the data from S3, expecting it to be in Delta format.

My main issue is with the output format of the Airbyte S3 connector. Currently it only supports 3 output formats: CSV, Avro and JSON Lines (JSONL).
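One possible interim arrangement, sketched here under assumptions rather than taken from the Airbyte docs: let Airbyte write JSONL to MinIO, then have a Spark job rewrite it as a Delta table. The bucket paths, app name, and helper function names below are hypothetical. The sketch also assumes that the Delta Lake and `hadoop-aws` packages are already on Spark's classpath.

```python
from typing import Dict


def delta_s3_conf(endpoint: str, access_key: str, secret_key: str) -> Dict[str, str]:
    """Spark settings for Delta Lake on an S3-compatible store such as MinIO."""
    return {
        # Enable Delta Lake's SQL extension and catalog.
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        # Point the s3a filesystem at the MinIO endpoint.
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.path.style.access": "true",
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
    }


def build_session(conf: Dict[str, str]):
    """Create a SparkSession from the settings above (needs pyspark installed)."""
    from pyspark.sql import SparkSession  # imported lazily; only needed at run time

    builder = SparkSession.builder.appName("airbyte-jsonl-to-delta")  # hypothetical name
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


def jsonl_to_delta(spark, source_path: str, target_path: str) -> None:
    """Read the JSONL files Airbyte wrote and append them to a Delta table."""
    df = spark.read.json(source_path)
    df.write.format("delta").mode("append").save(target_path)
```

A run would look roughly like `jsonl_to_delta(build_session(delta_s3_conf("http://minio:9000", key, secret)), "s3a://airbyte-raw/my_stream/", "s3a://lakehouse/my_stream/")`, where the endpoint and both paths are placeholders. This adds an extra hop and an extra copy of the data, which is why a native Delta output would still be preferable.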

What is the recommended way to solve this problem? I think many companies are trying to build this kind of data pipeline. Is there a plan to release this feature in an upcoming release? Should we implement this feature ourselves? If so, is there good documentation on how to get started? Or is there another way of going about it?

Thanks,

natalyjazzviolin commented 2 years ago

Hi @mustafa-rmd , could you please edit your request to follow our feature request template? This will ensure all details are understood clearly. I've copied it below. Thank you!

Tell us about the problem you're trying to solve

What are you trying to do, and why is it hard? A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]

Describe the solution you’d like

A clear and concise description of what you want to see happen, or the change you would like to see

Describe the alternative you’ve considered or used

A clear and concise description of any alternative solutions or features you've considered or are using today.

Additional context

Add any other context or screenshots about the feature request here.

Are you willing to submit a PR?

Remove this with your answer :-)

mustafa-rmd commented 2 years ago

Problem

The Delta Lake (Delta table) format is essential for many pipeline architectures, especially those that use Apache Spark.

Solution

I would like the Delta format to be added alongside Apache Avro, JSON, etc.

Describe the alternative you’ve considered or used

No alternatives

Additional context

When choosing a destination format, I would like to see Delta format as one of the options.

Are you willing to submit a PR?

Yes

misteryeo commented 2 years ago

@dennyglee Noted in your discussion that you're adding this to your roadmap. Just wanted to confirm that you're planning to contribute here?

dennyglee commented 2 years ago

@misteryeo Yes, we are planning to contribute here - it may or may not be me personally, but feel free to ping me on this until we figure this out :)

wkargul commented 1 year ago

Hey @dennyglee is there any update on that?

seunggs commented 1 year ago

@dennyglee @mustafa-rmd Any updates on this by any chance?

herry13 commented 1 year ago

Hi @dennyglee @mustafa-rmd Any updates on this feature request? I am using Airbyte & Delta Lake in production, so I would love to see this destination connector become available as soon as possible. I'm willing to lend a hand if needed.

NatElkins commented 1 year ago

Just want to chime in that I'm also interested in this!

Edited to add that I'm interested in writing a delta table to S3. I'm not sure I'll end up making a PR for this, but for anyone else who wants the same thing it looks like a PR would have to be made here: https://github.com/airbytehq/airbyte/tree/0e9fdba1181b2d302b81a057f6fa16a198925eaa/airbyte-integrations/bases/base-java-s3/src/main/java/io/airbyte/integrations/destination/s3

You'd also have to make a PR here: https://github.com/airbytehq/airbyte/blob/0e9fdba1181b2d302b81a057f6fa16a198925eaa/airbyte-integrations/connectors/destination-s3/src/main/resources/spec.json
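To make the `spec.json` part concrete, here is a rough sketch of what a new format option might look like. Everything in it is an assumption: the entry shape (`title`, `required`, a `format_type` enum) and the `connectionSpecification.properties.format.oneOf` path are my guess at how the existing format options are declared, and the `Delta` entry itself does not exist in the connector. The helper names are hypothetical.

```python
import json
from typing import Any, Dict


def delta_format_option() -> Dict[str, Any]:
    """Hypothetical output-format entry for Delta; not part of the real spec."""
    return {
        "title": "Delta: Delta Lake table",
        "required": ["format_type"],
        "properties": {
            "format_type": {
                "title": "Format Type",
                "type": "string",
                "enum": ["Delta"],
                "default": "Delta",
            },
        },
    }


def add_format_option(spec: Dict[str, Any], option: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the spec with the option appended to the format list.

    Assumes the format choices live under
    connectionSpecification.properties.format.oneOf.
    """
    updated = json.loads(json.dumps(spec))  # deep copy via a JSON round trip
    formats = updated["connectionSpecification"]["properties"]["format"]["oneOf"]
    # Skip the append if an entry with the same title is already present.
    if all(existing.get("title") != option["title"] for existing in formats):
        formats.append(option)
    return updated
```

The spec entry is the small part; the writer that actually produces the table and its transaction log would be the bulk of the work in the `base-java-s3` module linked above.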

arorapankaj commented 10 months ago

Do we have any update on this feature request?