datamill-co / target-postgres

A Singer.io Target for Postgres
MIT License
101 stars 76 forks source link
etl json-schema postgres singer stream

Target Postgres

CircleCI

PyPI version

A Singer postgres target, for use with Singer streams generated by Singer taps.

Features

Install

  1. Add libpq dependency
# macos
brew install postgresql
# ubuntu
sudo apt install libpq-dev
  1. install singer-target-postgres
pip install singer-target-postgres

Usage

  1. Follow the Singer.io Best Practices for setting up separate tap and target virtualenvs to avoid version conflicts.

  2. Create a config file at ~/singer.io/target_postgres_config.json with postgres connection information and target postgres schema.

    {
     "postgres_host": "localhost",
     "postgres_port": 5432,
     "postgres_database": "my_analytics",
     "postgres_username": "myuser",
     "postgres_password": "1234",
     "postgres_schema": "mytapname"
    }
  3. Run target-postgres against a Singer tap.

    ~/.virtualenvs/tap-something/bin/tap-something \
     | ~/.virtualenvs/target-postgres/bin/target-postgres \
       --config ~/singer.io/target_postgres_config.json >> state.json

    If you are running windows, the following is equivalent:

    venvs\tap-exchangeratesapi\Scripts\tap-exchangeratesapi.exe | ^
    venvs\target-postgresql\Scripts\target-postgres.exe ^
    --config target_postgres_config.json

Config.json

The fields available to be specified in the config file are specified here.

Field Type Default Details
postgres_host ["string", "null"] "localhost"
postgres_port ["integer", "null"] 5432
postgres_database ["string"] N/A
postgres_username ["string", "null"] N/A
postgres_password ["string", "null"] null
postgres_schema ["string", "null"] "public"
postgres_sslmode ["string", "null"] "prefer" Refer to the libpq docs for more information about SSL
postgres_sslcert ["string", "null"] "~/.postgresql/postgresql.crt" Only used if a SSL request w/ a client certificate is being made
postgres_sslkey ["string", "null"] "~/.postgresql/postgresql.key" Only used if a SSL request w/ a client certificate is being made
postgres_sslrootcert ["string", "null"] "~/.postgresql/root.crt" Used for authentication of a server SSL certificate
postgres_sslcrl ["string", "null"] "~/.postgresql/root.crl" Used for authentication of a server SSL certificate
invalid_records_detect ["boolean", "null"] true Include false in your config to disable target-postgres from crashing on invalid records
invalid_records_threshold ["integer", "null"] 0 Include a positive value n in your config to allow for target-postgres to encounter at most n invalid records per stream before giving up.
disable_collection ["string", "null"] false Include true in your config to disable Singer Usage Logging.
logging_level ["string", "null"] "INFO" The level for logging. Set to DEBUG to get things like queries executed, timing of those queries, etc. See Python's Logger Levels for information about valid values.
persist_empty_tables ["boolean", "null"] False Whether the Target should create tables which have no records present in Remote.
max_batch_rows ["integer", "null"] 200000 The maximum number of rows to buffer in memory before writing to the destination table in Postgres
max_buffer_size ["integer", "null"] 104857600 (100MB in bytes) The maximum number of bytes to buffer in memory before writing to the destination table in Postgres
batch_detection_threshold ["integer", "null"] 5000, or 1/40th max_batch_rows How often, in rows received, to count the buffered rows and bytes to check if a flush is necessary. There's a slight performance penalty to checking the buffered records count or bytesize, so this controls how often this is polled in order to mitigate the penalty. This value is usually not necessary to set as the default is dynamically adjusted to check reasonably often.
state_support ["boolean", "null"] True Whether the Target should emit STATE messages to stdout for further consumption. In this mode, which is on by default, STATE messages are buffered in memory until all the records that occurred before them are flushed according to the batch flushing schedule the target is configured with.
add_upsert_indexes ["boolean", "null"] True Whether the Target should create column indexes on the important columns used during data loading. These indexes will make data loading slightly slower but the deduplication phase much faster. Defaults to on for better baseline performance.
before_run_sql ["string", "null"] None Raw SQL statement(s) to execute as soon as the connection to Postgres is opened by the target. Useful for setup like SET ROLE or other connection state that is important.
after_run_sql ["string", "null"] None Raw SQL statement(s) to execute as soon as the connection to Postgres is opened by the target. Useful for setup like SET ROLE or other connection state that is important.
before_run_sql_file ["string", "null"] None Similar to before_run_sql but reads an external file instead of SQL in the JSON config file.
after_run_sql_file ["string", "null"] None Similar to after_run_sql but reads an external file instead of SQL in the JSON config file.
application_name ["string", "null"] None Set the postgresql application_name connection option to help with debugging, etc...

Supported Versions

target-postgres only supports JSON Schema Draft4. While declaring a schema is optional, any input schema which declares a version other than 4 will be rejected.

target-postgres supports all versions of PostgreSQL which are presently supported by the PostgreSQL Global Development Group. Our CI config defines all versions we are currently supporting.

Version Current minor Supported First Release Final Release
15 15.0 Yes October 13, 2022 November 11, 2027
14 14.5 Yes September 30, 2021 November 12, 2026
13 13.8 Yes September 24, 2020 November 13, 2025
12 12.12 Yes October 3, 2019 November 14, 2024
11 11.17 Yes October 18, 2018 November 9, 2023
10 10.22 Yes October 5, 2017 November 10, 2022

The above is copied from the current list of versions on Postgresql.org

Known Limitations

Indexes

If the add_upsert_indexes config option is enabled, which it is by default, target-postgres adds indexes on the tables it creates for its own queries to be more performant. Specifically, target-postgres automatically adds indexes to the _sdc_sequence column and the _sdc_level_<n>_id columns which are used heavily when inserting and upserting.

target-postgres doesn't have any facilities for adding other indexes to the managed tables, so if there are more indexes required, they should be added by another downstream tool, or can just be added by an administrator when necessary. Note that these indexes incur performance overhead to maintain as data is inserted, These indexes can also prevent target-postgres from dropping columns in the future if the schema of the table changes, in which case an administrator should drop the index so target-postgres is able to drop the columns it needs to.

Note: Index adding is new as of version 0.2.1, and target-postgres does not retroactively create indexes for tables it created before that time. If you want to add indexes to older tables target-postgres is loading data into, they should be added manually.

Usage Logging

Singer.io requires official taps and targets to collect anonymous usage data. This data is only used in aggregate to report on individual tap/targets, as well as the Singer community at-large. IP addresses are recorded to detect unique tap/targets users but not shared with third-parties.

To disable anonymous data collection set disable_collection to true in the configuration JSON file.

Developing

target-postgres utilizes poetry for package management, and PyTest for testing.

Documentation

See also:

Docker

If you have Docker and Docker Compose installed, you can easily run the following to get a local env setup quickly.

$ docker-compose up -d --build
$ docker logs -tf target-postgres_target-postgres_1 # You container names might differ

As soon as you see INFO: Dev environment ready. you can shell into the container and start running test commands:

$ docker-compose exec target-postgres bash
(target-postgres) root@...:/code# pytest

The environment inside the docker container has a virtualenv set up and activated, with an --editable install of target-postgres inside it and your local code mounted as a Docker volume. If you make changes on your host and re-run pytest any changes should be reflected inside the container.

See the PyTest commands below!

DB

To run the tests, you will need a PostgreSQL server running.

NOTE: Testing assumes that you've exposed the traditional port 5432.

Make sure to set the following env vars for PyTest:

$ EXPORT POSTGRES_HOST='<your-host-name>' # Most likely 'localhost'
$ EXPORT POSTGRES_DB='<your-db-name>'     # We use 'target_postgres_test'
$ EXPORT POSTGRES_USER='<your-user-name'  # Probably just 'postgres', make sure this user has no auth

PyTest

To run tests, try:

$ poetry run pytest

If you've bash shelled into the Docker Compose container (see above), you should be able to simply use:

$ pytest

Collaboration and Contributions

Join the conversation over at the Singer.io Slack and on the #target-postgres channel.

Try to adhere to the following for contributing:

Users can file an issue without submitting a pull request but be aware not all issues can or will be addressed.

Sponsorship

Target Postgres is sponsored by Data Mill (Data Mill Services, LLC) datamill.co.

Data Mill helps organizations utilize modern data infrastructure and data science to power analytics, products, and services.


Copyright Data Mill Services, LLC 2018