airbytehq / airbyte

The leading data integration platform for ETL / ELT data pipelines from APIs, databases & files to data warehouses, data lakes & data lakehouses. Both self-hosted and Cloud-hosted.
https://airbyte.com

S3 Destination: Support partition keys #6981

Closed · sherifnada closed this 2 days ago

sherifnada commented 2 years ago

Tell us about the problem you're trying to solve

To achieve optimal performance, one often needs to partition data into S3 directories based on the value of a field in the incoming records. For example, based on a date field, records dated in October should go into the october/ directory, those dated in November into the november/ directory, and so on.

This is pretty important for performance downstream: S3-based query engines (e.g. Redshift, Hive, Trino) often leverage the partition key structure to optimize queries.
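
For illustration, a minimal sketch of the kind of routing this would enable. The helper, field name, and path layout here are all hypothetical, not part of the S3 destination:

```python
from datetime import date

def partitioned_key(record: dict, partition_field: str, prefix: str) -> str:
    """Build an S3 object key whose directory comes from a record field.

    Hypothetical helper: the month-named layout just mirrors the
    october/ and november/ example above.
    """
    value = record[partition_field]           # e.g. a datetime.date
    partition = value.strftime("%B").lower()  # "october", "november", ...
    return f"{prefix}/{partition}/data.jsonl"

print(partitioned_key({"date": date(2021, 10, 5)}, "date", "my_stream"))
# my_stream/october/data.jsonl
print(partitioned_key({"date": date(2021, 11, 2)}, "date", "my_stream"))
# my_stream/november/data.jsonl
```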

This is an interesting problem for us to solve in that it is very connection-specific: currently, no configuration can be customized on a per-connection basis, and the partition key for a particular stream differs from stream to stream and from connection to connection.

Describe the solution you’d like

I would like to be able to set the partition key for each stream in my connection.

agroh1 commented 2 years ago

We are looking into Airbyte, and this functionality would really help us.

We are in the advertising industry and will be using Airbyte to pull advertising reports over a 30-day window, daily. So each day, we will pull the last 30 days.

In order to assemble this into a longer view (say, a year's worth of data), we need to find the latest version of each day from a year's worth of 30-day reports. If these reports partitioned their data into 30 separate S3 files, one for each day in the report, this would be a fairly simple task: we would just loop over the S3 files, finding the latest S3 file for each day, as sketched below. (Hopefully this description made sense.) The reason for this is that the reports are constantly being updated for old days.
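
As a sketch of that loop (assuming boto3, a hypothetical bucket and prefix, and that the report day is encoded in each object key, e.g. .../2021-10-05/...):

```python
import re
import boto3

# Hypothetical bucket/prefix; adjust to wherever the reports land.
BUCKET = "my-ads-bucket"
PREFIX = "reports/"

s3 = boto3.client("s3")
latest_per_day: dict[str, dict] = {}

# Walk every report file and keep, for each report day, the most
# recently written object (old days are constantly re-delivered).
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        match = re.search(r"\d{4}-\d{2}-\d{2}", obj["Key"])
        if not match:
            continue  # key does not encode a report day
        day = match.group()
        best = latest_per_day.get(day)
        if best is None or obj["LastModified"] > best["LastModified"]:
            latest_per_day[day] = obj

for day, obj in sorted(latest_per_day.items()):
    print(day, obj["Key"])
```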

ameyabapat-bsft commented 2 years ago

Eagerly waiting for it. We are already maintaining incoming data in an S3 directory hierarchy such as customer/yyyy/mm/dd. The current S3 connector dumps everything in a single place, and we would need extra processing to handle that single dump location specifically for Airbyte use cases. Date-wise partitioned dumps would fit exactly into our existing flow and unify the data ingestion process. Any estimated timeline for it?

Thelin90 commented 2 years ago

I am also very keen to see some kind of roadmap/plan around support for this, if possible.

misteryeo commented 2 years ago

Issue was linked to Harvestr Discovery: Destination Amazon S3: Support partition keys

gilbertovilarunc commented 1 year ago

Veeery interested in this feature!

gilbertovilarunc commented 1 year ago

Any updates here guys?

evantahler commented 2 days ago

S3 Destination has a "filename pattern" option. From our docs:

S3 Filename pattern

The pattern allows you to set the file-name format for the S3 staging file(s). The following placeholder combinations are currently supported: {date}, {date:yyyy_MM}, {timestamp}, {timestamp:millis}, {timestamp:micros}, {part_number}, {sync_id}, {format_extension}. Please don't use spaces or unsupported placeholders, as they won't be recognized.
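
For example, a pattern built only from supported placeholders. The expansion below is mimicked in Python for illustration; the actual rendering is done by the connector, and the exact values depend on the sync:

```python
# Illustrative only: the connector, not this code, performs the real
# substitution, and the rendered values here are made up.
pattern = "{date}_{part_number}{format_extension}"
example = pattern.format(date="2021_11_02", part_number=0, format_extension=".csv")
print(example)  # 2021_11_02_0.csv
```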

...

But it is possible to further customize by using the available variables to format the bucket path:

${NAMESPACE}: Namespace the stream comes from, or as configured by the connection namespace fields.
${STREAM_NAME}: Name of the stream.
${YEAR}: Year in which the sync wrote the output data.
${MONTH}: Month in which the sync wrote the output data.
${DAY}: Day on which the sync wrote the output data.
${HOUR}: Hour in which the sync wrote the output data.
${MINUTE}: Minute in which the sync wrote the output data.
${SECOND}: Second in which the sync wrote the output data.
${MILLISECOND}: Millisecond in which the sync wrote the output data.
${EPOCH}: Milliseconds since epoch at which the sync wrote the output data.
${UUID}: Random UUID string.
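
Putting the variables together, a date-partitioned bucket path could look like the following. Again, the substitution is mimicked in Python for illustration; the connector substitutes the variables at sync time:

```python
# Hypothetical example values; the connector fills these in per sync.
bucket_path = "${NAMESPACE}/${STREAM_NAME}/${YEAR}/${MONTH}/${DAY}"
rendered = (bucket_path
            .replace("${NAMESPACE}", "prod")
            .replace("${STREAM_NAME}", "ad_reports")
            .replace("${YEAR}", "2021")
            .replace("${MONTH}", "11")
            .replace("${DAY}", "02"))
print(rendered)  # prod/ad_reports/2021/11/02
```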

I think this is complete!