Tell us about the problem you're trying to solve
I want to import data in a way that is compatible with Redshift Spectrum:
data is stored (partitioned) in S3
an external table is added to Redshift, pointing to each partition of the S3 data
normalization/transformations are applied incrementally to only new data
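As a concrete illustration of the second point above, here is a minimal sketch of registering partitioned S3 data as a Redshift Spectrum external table. It assumes psycopg2 for the connection, credentials in environment variables, and the bucket, prefix, schema, table, columns and IAM role names are all hypothetical.

```python
import os

import psycopg2  # assumed Redshift/Postgres driver

# Connection details assumed to come from environment variables.
conn = psycopg2.connect(
    host=os.environ["REDSHIFT_HOST"],
    port=5439,
    dbname=os.environ["REDSHIFT_DB"],
    user=os.environ["REDSHIFT_USER"],
    password=os.environ["REDSHIFT_PASSWORD"],
)
conn.autocommit = True  # external DDL cannot run inside a transaction block

with conn.cursor() as cur:
    # External schema backed by the data catalog (database and role names are hypothetical).
    cur.execute("""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
        FROM DATA CATALOG DATABASE 'spectrum_db'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
        CREATE EXTERNAL DATABASE IF NOT EXISTS;
    """)
    # External table over the partitioned S3 location written by the sync.
    cur.execute("""
        CREATE EXTERNAL TABLE spectrum.ad_reports (
            account_id  VARCHAR(64),
            impressions BIGINT,
            spend       DOUBLE PRECISION
        )
        PARTITIONED BY (report_date DATE)
        STORED AS PARQUET
        LOCATION 's3://my-bucket/my-prefix/ad_reports/';
    """)
```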
Describe the solution you’d like
a combined S3 + Redshift (Spectrum) destination that allows me to do the above
dbt basic normalization updated to read its sources incrementally
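To illustrate what "incrementally" means here, below is a minimal sketch of loading only partitions newer than what the normalized table already contains. The schema, table and column names are hypothetical, and the snippet only stands in for the kind of incremental load dbt-based normalization would need to generate; it is not actual Airbyte normalization output.

```python
import os
from datetime import date

import psycopg2

conn = psycopg2.connect(host=os.environ["REDSHIFT_HOST"], port=5439,
                        dbname=os.environ["REDSHIFT_DB"],
                        user=os.environ["REDSHIFT_USER"],
                        password=os.environ["REDSHIFT_PASSWORD"])

with conn, conn.cursor() as cur:
    # High-water mark: the newest partition already normalized into the local table.
    cur.execute("SELECT MAX(report_date) FROM analytics.ad_reports;")
    high_water_mark = cur.fetchone()[0] or date(1970, 1, 1)

    # Pull only rows from partitions newer than the high-water mark.
    cur.execute(
        """
        INSERT INTO analytics.ad_reports (account_id, impressions, spend, report_date)
        SELECT account_id, impressions, spend, report_date
        FROM spectrum.ad_reports
        WHERE report_date > %s;
        """,
        (high_water_mark,),
    )
```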
Describe the alternative you’ve considered or used
use the S3 destination, then manually implement the rest of the pipeline
Additional context
the use case is to import reporting data that grows linearly over time, without excessive duplication in the Redshift cluster (e.g. Facebook Marketing has a lookback window of 28 days, so with a daily sync there would be 28x more rows than needed, plus the _scd table (not needed, the data is already historic) and the actual table)
the S3 destination allows me to write files with the following location pattern:
<bucket>/<prefix>/<namespace>/<source_name>/<filename>
but to partition data in Redshift, files need the following pattern:
<bucket>/<prefix>/<table_name>/<partition_column_1=value>/.../<partition_column_n=value>/<filename>
i.e. I need more folder levels after the source name, and they need to be dynamically generated (see the path-generation sketch after this list)
a manually implemented pipeline cannot be triggered at the end of a sync, and will therefore need to be scheduled independently (see the partition-registration sketch after this list)
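For the path-pattern point, here is a small sketch of how the extra, dynamically generated folder levels could be built; the prefix, table name, partition columns and filename are hypothetical examples.

```python
def partitioned_key(prefix, table_name, partition_values, filename):
    """Build a Hive-style key: <prefix>/<table_name>/<col_1=value>/.../<col_n=value>/<filename>."""
    partition_dirs = "/".join(f"{col}={val}" for col, val in partition_values.items())
    return f"{prefix}/{table_name}/{partition_dirs}/{filename}"


# e.g. "my-prefix/ad_reports/report_date=2021-06-01/part-00000.parquet"
key = partitioned_key("my-prefix", "ad_reports", {"report_date": "2021-06-01"}, "part-00000.parquet")
```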
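And for the scheduling point, a sketch of the independently scheduled step (e.g. a cron job run after each expected sync window) that registers newly written partitions with Redshift Spectrum. It assumes boto3 and psycopg2; the bucket, prefix, table and partition column names are again hypothetical.

```python
import os

import boto3
import psycopg2

BUCKET, PREFIX, TABLE = "my-bucket", "my-prefix", "ad_reports"  # hypothetical names

s3 = boto3.client("s3")
conn = psycopg2.connect(host=os.environ["REDSHIFT_HOST"], port=5439,
                        dbname=os.environ["REDSHIFT_DB"],
                        user=os.environ["REDSHIFT_USER"],
                        password=os.environ["REDSHIFT_PASSWORD"])
conn.autocommit = True  # ALTER TABLE ... ADD PARTITION cannot run inside a transaction block

with conn.cursor() as cur:
    # One CommonPrefix per partition folder, e.g. "my-prefix/ad_reports/report_date=2021-06-01/"
    # (pagination omitted for brevity).
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=f"{PREFIX}/{TABLE}/", Delimiter="/")
    for cp in resp.get("CommonPrefixes", []):
        partition_dir = cp["Prefix"]
        value = partition_dir.rstrip("/").split("=")[-1]
        cur.execute(f"""
            ALTER TABLE spectrum.{TABLE}
            ADD IF NOT EXISTS PARTITION (report_date = '{value}')
            LOCATION 's3://{BUCKET}/{partition_dir}';
        """)
```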