
Conduit Connector Amazon Redshift

General

The Amazon Redshift connector is one of the Conduit plugins. It provides both a Source and a Destination Amazon Redshift connector.

Prerequisites

How to build it

Run make build.

Testing

Run make test to run all unit and integration tests. The integration tests require the Amazon Redshift database DSN to be set in the REDSHIFT_DSN environment variable.
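
For example (the DSN below is the placeholder used throughout this document): `REDSHIFT_DSN="postgres://username:password@endpoint:5439/database" make test`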

Source

Conduit's Redshift source connector allows you to move data from Redshift tables specified with the dsn and tables configuration parameters. Upon starting, the source connector takes a snapshot of the given tables in the database, then switches into change data capture (CDC) mode. In CDC mode, the connector only detects newly added rows.

Snapshots

At the first launch of the connector, snapshot mode is enabled and the last value of the orderingColumn is stored in the position, marking the boundary of this mode. The connector reads all rows of a table in batches using keyset pagination, limiting the rows by batchSize and ordering by orderingColumn. The connector stores the last processed orderingColumn value in the position, so the snapshot process can be paused and resumed without losing data. Once all rows of the initial snapshot are read, the connector switches into CDC mode.

This behavior is enabled by default, but can be turned off by adding "snapshot": false to the Source configuration.
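
For example, a source configuration sketch with snapshotting disabled (the DSN, pipeline and connector IDs, and table name are placeholders):

```yaml
version: 2.2
pipelines:
  - id: example-no-snapshot
    status: running
    connectors:
      - id: example-source
        type: source
        plugin: redshift
        settings:
          dsn: "sample_dsn"
          # skip the initial snapshot and start directly in CDC mode
          snapshot: "false"
          batchSize: "500"
          tables.users.orderingColumn: "id"
```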

Change Data Capture

In this mode, only rows added after the first launch of the connector are moved, in batches. The connector uses keyset pagination, limiting by batchSize and ordering by orderingColumn.

Configuration

[!IMPORTANT] Parameters starting with tables.* are used to configure the orderingColumn and keyColumns for a specific table. The * in the parameter name should be replaced with the table name.

| name | description | required | example | default value |
| --- | --- | --- | --- | --- |
| `dsn` | DSN to connect to Redshift. | true | postgres://username:password@endpoint:5439/database | |
| `tables.*.orderingColumn` | Column used to order the rows. Keep in mind that the data will be sorted by this column, so the column must contain unique, consistent values suitable for sorting. | true | | |
| `tables.*.keyColumns` | Comma-separated list of column names to build the sdk.Record.Key. See more: Key handling. | false | | |
| `snapshot` | Whether the connector will take a snapshot of the entire table before starting CDC mode. | false | false | "true" |
| `batchSize` | Size of the rows batch. Min is 1 and max is 100000. | false | 100 | "1000" |
| `table` | Name of the table from which the connector reads. Deprecated: use `tables` instead. | false | table_name | |
| `orderingColumn` | Column used to order the rows. Keep in mind that the data will be sorted by this column, so the column must contain unique, consistent values suitable for sorting. Deprecated: use `tables.*.orderingColumn` instead. | false | id | |
| `keyColumns` | Comma-separated list of column names to build the sdk.Record.Key. Deprecated: use `tables.*.keyColumns` instead. | false | id,name | |

Example

Collections

The following configuration reads records from the users and orders tables:

```yaml
version: 2.2
pipelines:
  - id: example
    status: running
    connectors:
      - id: example
        type: source
        plugin: redshift
        settings:
          dsn: "sample_dsn"
          # table "users"
          tables.users.orderingColumn: "foo"
          tables.users.keyColumns: "foo,bar"
          # table "orders"
          tables.orders.orderingColumn: "id"
```

Key handling

The connector builds sdk.Record.Key as sdk.StructuredData. The keys of this field are the elements of the keyColumns configured for a specific table. If keyColumns is empty, the connector uses the table's primary key columns; if the table has no primary keys, it uses the value of the orderingColumn configured for that table. The values of sdk.Record.Key are taken from sdk.Payload.After using these keys.
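
For illustration only (the column names and values are hypothetical): with tables.users.keyColumns set to "foo,bar", a record whose payload contains foo: 1 and bar: "abc" would get a structured key equivalent to the mapping below.

```yaml
# sketch of the resulting sdk.Record.Key contents for that record
foo: 1
bar: "abc"
```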

Table Name

For each record, the connector adds an opencdc.collection field to the metadata that contains the table name.

Destination

Conduit's Redshift destination connector allows you to move data to a Redshift table with the specified dsn and table configuration parameters. It takes an sdk.Record and parses it into a valid SQL query.

Note: Redshift does not support map or slice types, so they will be stored as marshaled strings.

Configuration Options

| name | description | required | example | default |
| --- | --- | --- | --- | --- |
| `dsn` | DSN to connect to Redshift. | true | postgres://username:password@endpoint:5439/database | "" |
| `table` | Name of the table the connector writes to. It can contain a Go template that will be executed for each record to determine the table. By default, the table is the value of the opencdc.collection metadata field. | false | table_name | `{{ index .Metadata "opencdc.collection" }}` |
| `keyColumns` | Comma-separated list of column names to build the WHERE clause in case sdk.Record.Key is empty. See more: Key handling. | false | id,name | "" |
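
As a sketch, a minimal destination pipeline configuration might look like the following (the DSN, pipeline and connector IDs, table name, and key column are placeholders; the plugin name matches the source example above):

```yaml
version: 2.2
pipelines:
  - id: example-destination
    status: running
    connectors:
      - id: example-destination
        type: destination
        plugin: redshift
        settings:
          dsn: "sample_dsn"
          # explicit target table; omit to fall back to the opencdc.collection metadata field
          table: "users"
          keyColumns: "id"
```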

Key handling

Table Name

Records are written to the table specified by the opencdc.collection metadata field, if it is present. Otherwise, the connector falls back to the table set in the table configuration parameter.

Known limitations

Creating a source or destination connector will fail in the following cases:

Useful resources
