# Target Redshift

A [Singer](https://www.singer.io/) Redshift target, for use with Singer streams generated by Singer taps.

## Features

- Full stream replication via record `version` and `ACTIVATE_VERSION` messages.

## Install

```sh
pip install target-redshift
```
## Usage

1. Follow the Singer.io Best Practices for setting up separate tap and target virtualenvs to avoid version conflicts.

2. Create a config file at `~/singer.io/target_redshift_config.json` with Redshift connection information and the target Redshift schema:
```json
{
  "redshift_host": "aws.something.or.other",
  "redshift_port": 5439,
  "redshift_database": "my_analytics",
  "redshift_username": "myuser",
  "redshift_password": "1234",
  "redshift_schema": "mytapname",
  "default_column_length": 1000,
  "target_s3": {
    "aws_access_key_id": "AKIA...",
    "aws_secret_access_key": "supersecret",
    "bucket": "target_redshift_staging",
    "key_prefix": "__tmp"
  }
}
```
3. Run `target-redshift` against a Singer tap:

```sh
~/.virtualenvs/tap-something/bin/tap-something \
  | ~/.virtualenvs/target-redshift/bin/target-redshift \
    --config ~/singer.io/target_redshift_config.json
```
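A Singer tap communicates with the target over stdout/stdin as newline-delimited JSON. As a rough sketch of what flows through that pipe, here are the three core Singer message shapes (the `users` stream and its fields are hypothetical, purely for illustration):

```python
import json

# Hypothetical messages illustrating the Singer spec's SCHEMA / RECORD / STATE
# shapes: a tap writes these to stdout, one JSON object per line, and a target
# such as target-redshift reads them from stdin.
messages = [
    {"type": "SCHEMA", "stream": "users",
     "schema": {"type": "object",
                "properties": {"id": {"type": "integer"},
                               "name": {"type": ["string", "null"]}}},
     "key_properties": ["id"]},
    {"type": "RECORD", "stream": "users",
     "record": {"id": 1, "name": "alice"}},
    {"type": "STATE", "value": {"bookmarks": {"users": {"id": 1}}}},
]

# Serialize the way a tap would: one JSON object per line.
tap_output = "\n".join(json.dumps(m) for m in messages)
print(tap_output)
```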
## Config.json

The fields available in the config file are listed below.
| Field | Type | Default | Details |
| --- | --- | --- | --- |
| `redshift_host` | `["string"]` | `N/A` | |
| `redshift_port` | `["integer", "null"]` | `5432` | |
| `redshift_database` | `["string"]` | `N/A` | |
| `redshift_username` | `["string"]` | `N/A` | |
| `redshift_password` | `["string"]` | `N/A` | |
| `redshift_schema` | `["string", "null"]` | `"public"` | |
| `invalid_records_detect` | `["boolean", "null"]` | `true` | Include `false` in your config to disable `target-redshift` from crashing on invalid records |
| `invalid_records_threshold` | `["integer", "null"]` | `0` | Include a positive value `n` in your config to allow `target-redshift` to encounter at most `n` invalid records per stream before giving up. |
| `disable_collection` | `["string", "null"]` | `false` | Include `true` in your config to disable Singer Usage Logging. |
| `logging_level` | `["string", "null"]` | `"INFO"` | The level for logging. Set to `DEBUG` to get things like queries executed, timing of those queries, etc. See [Python's Logger Levels](https://docs.python.org/3/library/logging.html#levels) for information about valid values. |
| `max_batch_rows` | `["integer", "null"]` | `200000` | The maximum number of rows to buffer in memory before writing to the destination table in Redshift |
| `max_buffer_size` | `["integer", "null"]` | `104857600` (100MB in bytes) | The maximum number of bytes to buffer in memory before writing to the destination table in Redshift |
| `batch_detection_threshold` | `["integer", "null"]` | `5000`, or 1/40th `max_batch_rows` | How often, in rows received, to count the buffered rows and bytes to check if a flush is necessary. There's a slight performance penalty to checking the buffered records count or bytesize, so this controls how often this is polled in order to mitigate the penalty. This value is usually not necessary to set, as the default is dynamically adjusted to check reasonably often. |
| `persist_empty_tables` | `["boolean", "null"]` | `False` | Whether the Target should create tables which have no records present in Remote. |
| `default_column_length` | `["integer", "null"]` | `1000` | All columns with the `VARCHAR` (`CHARACTER VARYING`) type will have this length. Range: 1-65535. |
| `state_support` | `["boolean", "null"]` | `True` | Whether the Target should emit `STATE` messages to stdout for further consumption. In this mode, which is on by default, `STATE` messages are buffered in memory until all the records that occurred before them are flushed according to the batch flushing schedule the target is configured with. |
| `target_s3` | `["object"]` | `N/A` | See `S3` below |
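To see how the three batching knobs interact, here is a simplified model of the flush decision they describe. This is not the target's actual code, just a sketch: buffered rows and bytes are only counted every `batch_detection_threshold` rows received, and a flush triggers when either `max_batch_rows` or `max_buffer_size` is reached.

```python
def should_flush(rows_received, buffered_rows, buffered_bytes,
                 max_batch_rows=200_000,
                 max_buffer_size=104_857_600,  # 100MB in bytes
                 batch_detection_threshold=None):
    """Simplified sketch of the flush decision described in the table above.

    Counting buffered rows/bytes carries a slight performance penalty,
    so it is only polled every `batch_detection_threshold` rows received.
    """
    if batch_detection_threshold is None:
        # Assumed interpretation of the default: 5000, or 1/40th of
        # max_batch_rows (identical when max_batch_rows is 200,000).
        batch_detection_threshold = max(5000, max_batch_rows // 40)
    if rows_received % batch_detection_threshold != 0:
        return False  # skip the (relatively expensive) buffer check
    return buffered_rows >= max_batch_rows or buffered_bytes >= max_buffer_size
```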
### S3

| Field | Type | Default | Details |
| --- | --- | --- | --- |
| `aws_access_key_id` | `["string"]` | `N/A` | |
| `aws_secret_access_key` | `["string"]` | `N/A` | |
| `aws_session_token` | `["string"]` | `N/A` | STS session token if using temporary credentials |
| `bucket` | `["string"]` | `N/A` | Bucket where staging files should be uploaded to. |
| `key_prefix` | `["string", "null"]` | `""` | Prefix for staging file uploads to allow for better delineation of tmp files |
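The `bucket`/`key_prefix` pair only determines where staging files land. The target's actual naming scheme for staging objects is internal, so treat this as an illustrative sketch of how a prefix delineates temporary files inside the bucket:

```python
import uuid

def staging_key(key_prefix, stream_name):
    """Illustrative only: compose an S3 object key for a staging file.

    target-redshift's real staging-file naming is internal to the target;
    this just shows how key_prefix groups tmp files under one namespace.
    """
    return f"{key_prefix}{stream_name}_{uuid.uuid4().hex}.csv"

key = staging_key("__tmp/", "users")
```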
target-redshift emits `STATE` Singer messages to stdout once all records preceding them have been flushed (see `state_support` above).

## Known Limitations

- Requires a JSON Schema for every stream.
- Only `type` lists containing a single supported type, optionally accompanied by `'null'`, are supported:
  - Examples of JSON Schema types that work: `['number']`, `['string']`, `['string', 'null']`
  - Examples of JSON Schema types that do **not** work: `['string', 'integer']`, `['integer', 'number']`, `['any']`, `['null']`
- JSON Schema combinations such as `anyOf` and `allOf` are not supported.
- JSON Schema `$ref`s are only partially supported: `$ref`s must be present within the schema itself, and if a `$ref` is broken, the behaviour is considered unexpected.
- Any `string` values of `NULL` will be streamed to Redshift as the literal `null`.
- Identifiers may only contain `_`, lowercase letters, numbers, and `$`, and may not start with `$`.
- Optional fields must be `nullable`.
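The type restrictions above can be expressed as a small check. A hedged sketch (this is not the target's own validation code, and the supported-type set is taken from the limitations listed here):

```python
# Base JSON Schema types the target accepts, per the limitations above.
SUPPORTED = {"string", "integer", "number", "boolean", "object", "array"}

def type_list_supported(types):
    """Check a JSON Schema `type` list against the limitations above:
    a single supported type, optionally accompanied by 'null'."""
    non_null = [t for t in types if t != "null"]
    if not non_null:          # ['null'] alone is not supported
        return False
    if len(non_null) > 1:     # mixed types, e.g. ['string', 'integer']
        return False
    return non_null[0] in SUPPORTED
```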
## Usage Logging

Singer.io requires official taps and targets to collect anonymous usage data. This data is only used in aggregate to report on individual tap/targets, as well as the Singer community at-large. IP addresses are recorded to detect unique tap/targets users but not shared with third-parties.

To disable anonymous data collection, set `disable_collection` to `true` in the configuration JSON file.
## Developing

`target-redshift` utilizes `setup.py` for package management, and `PyTest` for testing.

### Docker

If you have Docker and Docker Compose installed, you can easily run the following to get a local env set up quickly.

First, make sure to create a `.env` file in the root of this repo (it has been `.gitignore`d, so don't worry about accidentally staging it). Therein, fill out the following information:
```sh
REDSHIFT_HOST='<your-host-name>' # Most likely 'localhost'
REDSHIFT_DATABASE='<your-db-name>' # Most likely 'dev'
REDSHIFT_SCHEMA='<your-schema-name>' # Probably 'public'
REDSHIFT_PORT='<your-port>' # Probably 5439
REDSHIFT_USERNAME='<your-user-name>'
REDSHIFT_PASSWORD='<your-password>'
DEFAULT_COLUMN_LENGTH='<your-default-column-length>'
TARGET_S3_AWS_ACCESS_KEY_ID='<AKIA...>'
TARGET_S3_AWS_SECRET_ACCESS_KEY='<secret>'
TARGET_S3_BUCKET='<bucket-string>'
TARGET_S3_KEY_PREFIX='<some-string>' # We use 'target_redshift_test'
```
```sh
$ docker-compose up -d --build
$ docker logs -tf target-redshift_target-redshift_1 # Your container names might differ
```

As soon as you see `INFO: Dev environment ready.` you can shell into the container and start running test commands:

```sh
$ docker exec -it target-redshift_target-redshift_1 bash # Your container names might differ
root@... $ source /code/venv--target-redshift/bin/activate
root@... $ pytest
```
See the PyTest commands below!
### Tests

To run the tests, you will need an actual Redshift cluster running, and a user that either:

- Has the ability to create schemas therein
  - the tests will create and drop schemas named by the `REDSHIFT_SCHEMA` envvar
- Has access to the `public` schema
  - if `REDSHIFT_SCHEMA` is seen to be the string `"public"`, the tests will ignore creating and dropping schemas
  - useful if `GRANT CREATE ON DATABASE db TO user;` is viewed as too risky

Make sure to set the following env vars for PyTest:
```sh
$ export REDSHIFT_HOST='<your-host-name>' # Most likely 'localhost'
$ export REDSHIFT_DATABASE='<your-db-name>' # Most likely 'dev'
$ export REDSHIFT_SCHEMA='<your-schema-name>' # Probably 'public'
$ export REDSHIFT_PORT='<your-port>' # Probably 5439
$ export REDSHIFT_USERNAME='<your-user-name>'
$ export REDSHIFT_PASSWORD='<your-password>' # Redshift requires passwords
$ export DEFAULT_COLUMN_LENGTH='<your-default-column-length>'
```
To run the tests, you will also need an actual S3 bucket available. Make sure to set the following env vars for PyTest:

```sh
$ export TARGET_S3_AWS_ACCESS_KEY_ID='<AKIA...>'
$ export TARGET_S3_AWS_SECRET_ACCESS_KEY='<secret>'
$ export TARGET_S3_BUCKET='<bucket-string>'
$ export TARGET_S3_KEY_PREFIX='<some-string>' # We use 'target_redshift_test'
```
To run the tests, try:

```sh
$ python setup.py pytest
```

If you've `bash`-shelled into the Docker Compose container (see above), you should be able to simply use:

```sh
$ pytest
```
## Sponsorship

Target Redshift is sponsored by Data Mill (Data Mill Services, LLC) [datamill.co](https://datamill.co/).

Data Mill helps organizations utilize modern data infrastructure and data science to power analytics, products, and services.

---

Copyright Data Mill Services, LLC 2018