Closed sherifnada closed 2 days ago
We are looking into airbyte and this functionality would really help us.
We are in the advertising industry and will use Airbyte to pull advertising reports for a 30-day window, daily. So each day, we pull the last 30 days.
To assemble this into a longer view (say, a year's worth of data), we need to find the latest version of each day from a year's worth of 30-day reports. If these reports partitioned their data into 30 separate S3 files, one for each day in the report, this would be a fairly simple task: we would just loop over the S3 files, finding the latest file for each day. (Hopefully this description made sense.) The reason for this is that the reports are constantly being updated for old days.
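The "latest file per day" step described above can be sketched as follows. This is a minimal sketch, not Airbyte functionality: it assumes a hypothetical key layout of `reports/<sync-date>/<report-day>.csv` and operates on already-listed `(key, last_modified)` pairs (in practice these would come from an S3 listing call):

```python
from datetime import datetime, timezone

def latest_file_per_day(objects):
    """Given (key, last_modified) pairs, keep only the most recently
    written file for each report day encoded in the key."""
    latest = {}
    for key, modified in objects:
        # Hypothetical layout: the last path segment is the report day.
        day = key.rsplit("/", 1)[-1].removesuffix(".csv")
        if day not in latest or modified > latest[day][1]:
            latest[day] = (key, modified)
    return latest

# Two syncs wrote a file for the same report day; the later one wins.
objects = [
    ("reports/2023-02-01/2023-01-05.csv", datetime(2023, 2, 1, tzinfo=timezone.utc)),
    ("reports/2023-02-02/2023-01-05.csv", datetime(2023, 2, 2, tzinfo=timezone.utc)),
]
print(latest_file_per_day(objects)["2023-01-05"][0])
# -> reports/2023-02-02/2023-01-05.csv
```

With per-day partitioned keys like these, the dedup is a single pass over the listing; without partitioning, every file would have to be opened and scanned.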
Eagerly waiting for it. We are already maintaining incoming data in an S3 directory hierarchy such as `customer/yyyy/mm/dd`. The current S3 connector dumps everything in a single place, and we would need extra processing to handle an Airbyte-specific single dump location. This kind of date-wise partitioned dump would fit exactly into our existing flow and unify our data-ingestion process.
Any estimated timelines for it?
I am also very keen to see some kind of roadmap/plan around the support of this, if possible?
Issue was linked to Harvestr Discovery: Destination Amazon S3: Support partition keys
Veeery interested in this feature!
Any updates here guys?
S3 Destination has a "filename pattern" option. From our docs:
S3 Filename pattern
The pattern allows you to set the file-name format for the S3 staging file(s). The following placeholder combinations are currently supported: {date}, {date:yyyy_MM}, {timestamp}, {timestamp:millis}, {timestamp:micros}, {part_number}, {sync_id}, {format_extension}. Do not use spaces or unsupported placeholders, as they won't be recognized.
...
But it is possible to further customize by using the available variables to format the bucket path:
- ${NAMESPACE}: Namespace the stream comes from, or the namespace configured on the connection.
- ${STREAM_NAME}: Name of the stream.
- ${YEAR}: Year in which the sync wrote the output data.
- ${MONTH}: Month in which the sync wrote the output data.
- ${DAY}: Day on which the sync wrote the output data.
- ${HOUR}: Hour in which the sync wrote the output data.
- ${MINUTE}: Minute in which the sync wrote the output data.
- ${SECOND}: Second in which the sync wrote the output data.
- ${MILLISECOND}: Millisecond in which the sync wrote the output data.
- ${EPOCH}: Milliseconds since epoch at which the sync wrote the output data.
- ${UUID}: Random UUID string.
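Combining these, a bucket-path template reproducing a `customer/yyyy/mm/dd`-style layout might look like the following. This is an illustrative sketch: the field labels mirror the S3 destination's configuration options, but verify them against your connector version, and note these variables reflect the sync time, not a field value inside the records:

```
S3 Bucket Path:      customer/${YEAR}/${MONTH}/${DAY}
S3 Filename pattern: {date}_{part_number}{format_extension}
```

A sync run on 2023-02-02 would then write objects under `customer/2023/02/02/`.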
I think this is complete!
Tell us about the problem you're trying to solve
To achieve optimal performance, one often needs to partition data into S3 directories based on the value of a field in the incoming records. For example, based on the `date` field, records with value `october` should go into the `october/` directory, those with value `november` into the `november/` directory, and so on. This is important for downstream performance: S3-based solutions (e.g. Redshift, Hive, Trino) often leverage the partition-key structure to optimize queries.
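The routing described above can be sketched in a few lines. This is a hypothetical helper, not the connector's implementation: `partition_path` and the field names are illustrative, and it simply maps each record to a directory named after its partition-key value:

```python
import posixpath

def partition_path(base, record, partition_key):
    """Return the directory a record should be written to, based on
    the value of its partition key (hypothetical helper)."""
    return posixpath.join(base, str(record[partition_key]))

records = [
    {"date": "october", "clicks": 10},
    {"date": "november", "clicks": 7},
]
for r in records:
    print(partition_path("s3://bucket/stream", r, "date"))
# s3://bucket/stream/october
# s3://bucket/stream/november
```

Engines like Hive or Trino can then prune whole directories when a query filters on `date`, instead of scanning every object in the stream's prefix.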
This is an interesting problem for us to solve in that it is very connection-specific. Currently, no configuration can be customized on a per-connection basis. The partition key differs from stream to stream, and also between connections.
Describe the solution you’d like
I would like to be able to set the partition key for each stream in my connection.