jorritsandbrink closed this 2 months ago
@rudolfix see my replies on your comments.
Regarding `pgoutput` vs `wal2json`:

- The main reason for choosing `pgoutput` first was convenience for the user (no additional setup needed)—for what it's worth, Debezium also supports `pgoutput` but not `wal2json`.
- Publications can't be used with `wal2json`, only with `pgoutput`, meaning that additional client-side filtering is needed (selecting the right table, and the right DML operation if you want to only propagate inserts for example), offsetting some of the benefits of server-side decoding—see this SO on the topic.
- Implementing `wal2json` will probably be a little simpler, but not a lot—the message objects generated by `pypgoutput` (e.g. `Insert`, `Update`, ...) are almost as easy to work with as `wal2json`'s output.

That being said, `wal2json` might be better (necessary) when data volumes are large. I'd suggest we go with `pgoutput` first, then add `wal2json` support later if needed.
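For reference, a minimal sketch of the server-side filtering that publications give the `pgoutput` path (the DSN, publication, and table names are placeholders):

```python
# Sketch: server-side filtering via a publication, as used by pgoutput.
# The DSN, publication name, and table name are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=mydb user=loader")  # placeholder DSN
conn.autocommit = True
with conn.cursor() as cur:
    # Publish only INSERTs for a single table; updates/deletes and other
    # tables are filtered out on the server before reaching the client.
    cur.execute(
        "CREATE PUBLICATION my_pub FOR TABLE my_schema.items "
        "WITH (publish = 'insert');"
    )
conn.close()
```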
Thanks for your feedback Martin, let's continue the discussion here for visibility.
Can you process individual tables in Fivetran / PeerDB / the other systems you're referring to, or do you have to process all of them simultaneously?
@jorritsandbrink it should be easy. If there is more than one table, we still have one LSN with which we track new messages, right? If so, the change is trivial: you can specify a table name which is resolved dynamically: https://dlthub.com/docs/general-usage/resource#dispatch-data-to-many-tables
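A minimal sketch of the dynamic dispatch the linked docs describe—`table_name` takes a callable that is resolved per data item, so one resource can feed many tables (the `"table"` field and sample rows are made up for illustration):

```python
import dlt

# table_name is resolved per item, dispatching rows to different tables
@dlt.resource(table_name=lambda row: row["table"])
def changes():
    # in the replication case these would be decoded WAL messages
    yield {"table": "items", "id": 1, "foo": "bar"}
    yield {"table": "orders", "id": 7, "qty": 3}

pipeline = dlt.pipeline(pipeline_name="replication", destination="duckdb")
pipeline.run(changes)
```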
drawbacks: […] `sql_table` resources

@rudolfix I've addressed all your comments, please review.

To be done: increase the `dlt` version in `requirements.txt` to include PR 1127.
@jorritsandbrink I finally enabled tests on our CI, and most of them are passing but:

- this one consistently fails:

```
    with src_pl.sql_client() as c:
        qual_name = src_pl.sql_client().make_qualified_table_name("items")
        c.execute_sql(f"UPDATE {qual_name} SET foo = 'baz' WHERE id = 2;")
        c.execute_sql(f"DELETE FROM {qual_name} WHERE id = 2;")
    extract_info = dest_pl.extract(changes)
>   assert extract_info.asdict()["job_metrics"] == []
E   AssertionError: assert [{'created': ...e': 501, ...}] == []
E   Left contains one more item: {'created': 1713123289.1342065, 'extract_idx': 1, 'file_path': '/home/rudolfix/src/pipelines/_storage/.dlt/pipelines/d...lize/a93aba1460a1d099/1713123287.6514962/new_jobs/_dlt_pipeline_state.bfcae69f3b.0.typed-jsonl', 'file_size': 501, ...}
E   Full diff:
E     [
E   - ,
E   + {'created': 1713123289.1342065,
E   +  'extract_idx': 1,
E   +  'file_path': '/home/rudolfix/src/pipelines/_storage/.dlt/pipelines/dest_pl/normalize/a93aba1460a1d099/1713123287.6514962/new_jobs/_dlt_pipeline_state.bfcae69f3b.0.typed-jsonl',...
E
E   ...Full output truncated (6 lines hidden), use '-vv' to show
```

- those require Postgres 15 while we are on 13 on CI. Maybe you could take a look? Is there a way to use the old syntax?

```
    try:
>       cur.execute(
            f"ALTER PUBLICATION {esc_pub_name} ADD TABLES IN SCHEMA {esc_schema_name};"
        )
E   psycopg2.errors.SyntaxError: syntax error at or near "TABLES"
E   LINE 1: ALTER PUBLICATION "test_pub6589482f" ADD TABLES IN SCHEMA "s...
```
@rudolfix The first issue is actually also version related. I was testing on Postgres 16 locally, but have been able to reproduce both issues on Postgres 13.

1) It seems Postgres 13 publishes "empty transactions" for updates/deletes when they are excluded from the `publish` publication parameter (e.g. when `publish = 'insert'`). Postgres 16 does not do this. As a result, we do find messages to process (the "empty transactions") when we've done an update or delete, even though we told Postgres we're not interested in them. `last_commit_lsn` in resource state needs to be updated accordingly. An item gets extracted for `_dlt_pipeline_state` because of the state update, where our test asserted nothing is extracted. Solved by making the test more specific and asserting nothing gets extracted for the `items` table: https://github.com/dlt-hub/verified-sources/pull/392/commits/34610b634f29348d6362a2f26fa946ed3b93cf37

2) Not really feasible to make this work for older Postgres versions. I could fetch all tables from the schema and add them one by one, but that wouldn't accommodate the case where a table gets added to the schema later. To keep things clean, I introduced a Postgres version requirement for schema replication instead: https://github.com/dlt-hub/verified-sources/pull/392/commits/7a070453e93bed4c7863b5ee82f681dd12e909e5
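A rough sketch of such a version gate, assuming a `psycopg2` connection (the function name, error wording, and simplified identifier quoting are illustrative, not the exact code from the commit):

```python
# Sketch: refuse schema replication on servers older than Postgres 15,
# where ALTER PUBLICATION ... ADD TABLES IN SCHEMA does not exist.
import psycopg2

def add_schema_to_publication(conn, pub_name: str, schema_name: str) -> None:
    # conn.server_version is an integer, e.g. 130004 for Postgres 13.4
    if conn.server_version < 150000:
        raise RuntimeError(
            "Schema replication requires Postgres 15 or higher."
        )
    with conn.cursor() as cur:
        # identifier quoting simplified for the sketch
        cur.execute(
            f'ALTER PUBLICATION "{pub_name}" ADD TABLES IN SCHEMA "{schema_name}";'
        )
```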
@jorritsandbrink should we spawn another postgres instance just to test replication? I do not want to make it too complicated and I'm totally fine with version 15.
@rudolfix yes, using a separate instance for replication sounds good.
Tell us what you do here
- implementing a verified source

Relevant issue
https://github.com/dlt-hub/dlt/issues/933

More PR info
Adds initial support for Postgres replication. Some things are still missing, but this is a good time to get feedback.

- Uses the `pgoutput` plugin.
- Uses `psycopg2`'s support for logical replication—this streams messages from `pgoutput` into Python in an endless loop (see the streaming sketch after this list).
- Uses `pypgoutput` to decode `pgoutput`'s binary messages—the library's functionality to consume messages and transform them into "change events" (Pydantic models) is not used because it only works on Linux.
- Converts decoded values into `dlt`-compatible Python objects, e.g. the string `"t"` becomes the boolean `True`. "Binary mode" would be faster, but less robust.
- ~~Relies on a dedicated replication slot and publication for a table. I.e. two tables means two slots and two publications. This provides granular control and does not introduce significant overhead if I'm not mistaken.~~ No longer the case, changed because of user feedback. It is now possible to replicate one table, multiple tables, or an entire schema using a single publication.
- Adds two resource types: `table_snapshot` for initial load, and `table_changes` for CDC.
  - `table_snapshot` persists the state of the table in the snapshot that gets exported when creating a replication slot into a physical table, and then uses the `sql_table` resource to do the rest.
  - `table_changes` generates and yields "data items" (`TDataItem`) and "metadata items" (`DataItemWithMeta`) from decoded replication messages. Items are first stored in-memory in a list, before they are yielded from this list.
- Adds `init_replication` to set up a replication slot and publication. This function optionally persists snapshot tables representing the state at the exact moment the replication slot got created. It then returns `sql_table` resources to enable an initial load. Users do not need to use `init_replication`—they can create a slot and publication in any other way they see fit.
- Adds `replication_resource` to create a `DltResource` that consumes a slot/publication and generates data items with metadata (`DataItemWithMeta`). It dispatches data to multiple tables if the publication publishes changes for multiple tables (a usage sketch follows at the end of this description).
- Adds an `include_columns` argument to exclude any columns not provided as input (or include all columns if not provided).
- ~~Organizes code in a subfolder under `sql_database`: `/sources/sql_database/pg_replication`.~~ Moved to its own top-level folder.
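As referenced in the list above, a minimal sketch of the `psycopg2` streaming loop, with placeholder slot/publication names and DSN; the real source decodes `msg.payload` with `pypgoutput` instead of printing it:

```python
import psycopg2
import psycopg2.extras

conn = psycopg2.connect(
    "dbname=mydb user=loader",  # placeholder DSN
    connection_factory=psycopg2.extras.LogicalReplicationConnection,
)
cur = conn.cursor()
cur.start_replication(
    slot_name="my_slot",  # placeholder slot name
    decode=False,  # pgoutput emits binary messages; decoding happens client-side
    options={"proto_version": "1", "publication_names": "my_pub"},
)

def consume(msg: psycopg2.extras.ReplicationMessage) -> None:
    # the real implementation would decode msg.payload with pypgoutput here
    print(msg.data_start, len(msg.payload))
    msg.cursor.send_feedback(flush_lsn=msg.data_start)  # acknowledge progress

cur.consume_stream(consume)  # blocks in an endless loop
```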
What's not (yet) included:

- ~~Chunking mechanism to limit batch size in `table_changes`~~ implemented
- ~~`DltSource` to handle multiple tables / an entire database~~ no longer applies—multiple tables are now handled at the resource level
- The `truncate` operation
- ~~Perhaps some more data type mapping~~ done—common types are handled and exotic types default to `text`
- ~~Deletion of snapshot table after it has been consumed~~ not implemented—couldn't find a good way to do this
- ~~Example pipeline~~ done
- ~~More tests~~ done
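To tie the pieces together, a hypothetical usage sketch of the two entry points described above; argument names and the import path are illustrative and may not match the final API exactly:

```python
import dlt
# import path is an assumption for the sketch
from pg_replication import init_replication, replication_resource

# one-time setup: create slot + publication, optionally exporting snapshot
# tables so an initial load can be done with sql_table resources
snapshots = init_replication(
    slot_name="my_slot",
    pub_name="my_pub",
    schema_name="my_schema",
    table_names=["items", "orders"],
    persist_snapshots=True,  # return sql_table resources for initial load
)

pipeline = dlt.pipeline(pipeline_name="pg_replication", destination="duckdb")
pipeline.run(snapshots)  # initial load from the exported snapshot

# ongoing CDC: consume the slot/publication; data is dispatched to multiple
# tables when the publication covers more than one table
changes = replication_resource(slot_name="my_slot", pub_name="my_pub")
pipeline.run(changes)
```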