dlt-hub / verified-sources

Contribute to dlt verified sources 🔥
https://dlthub.com/docs/walkthroughs/add-a-verified-source
Apache License 2.0

Postgres replication #392

Closed jorritsandbrink closed 2 months ago

jorritsandbrink commented 4 months ago


Relevant issue

https://github.com/dlt-hub/dlt/issues/933

More PR info

Adds initial support for Postgres replication. Some things are still missing, but this is a good time to get feedback.

What's not (yet) included:

jorritsandbrink commented 4 months ago

@rudolfix see my replies on your comments.

Regarding pgoutput vs wal2json:

That being said, wal2json might be better (or even necessary) when data volumes are large. I'd suggest we go with pgoutput first, then add wal2json support later if needed.
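For reference, a logical replication slot bound to the pgoutput plugin can be created with a single SQL call. A minimal sketch below; the slot name is made up, and the commented-out part assumes a live psycopg2 connection. Swapping the plugin name to `wal2json` would require that extension to be installed on the server, whereas pgoutput ships with Postgres 10+:

```python
# Sketch: build the SQL that creates a logical replication slot using
# the built-in pgoutput plugin (no server-side extension required).
def create_slot_sql(slot_name: str, plugin: str = "pgoutput") -> str:
    """Return the SQL statement that creates a logical replication slot."""
    return (
        "SELECT pg_create_logical_replication_slot"
        f"('{slot_name}', '{plugin}');"
    )

print(create_slot_sql("dlt_slot"))
# Against a live server this would run via psycopg2, e.g.:
#   cur.execute(create_slot_sql("dlt_slot"))
```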

jorritsandbrink commented 3 months ago

Thanks for your feedback, Martin, let's continue the discussion here for visibility.

Can you process individual tables in Fivetran / PeerDB / the other systems you're referring to, or do you have to process all of them simultaneously?

rudolfix commented 3 months ago

@jorritsandbrink it should be easy. If there is > 1 table we still have one LSN with which we track new messages, right? If so, the change is trivial: you can specify a table name that is resolved dynamically: https://dlthub.com/docs/general-usage/resource#dispatch-data-to-many-tables

drawbacks:

  1. the lambda needs the table name in the data items, but you can also remove that name in the same lambda if we do not need it in the final table.
  2. a full refresh of a table requires a reset of all tables in the publication, right? Also, the initial load will generate many sql_table resources.
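The dispatch-plus-strip idea from drawback 1 can be sketched without dlt: route each change item by a table-name key, then drop that key so it does not end up as a column. All names here are illustrative; in dlt proper this is done with `table_name=lambda row: ...` on the resource, as the linked docs describe:

```python
from collections import defaultdict

def dispatch(items):
    """Group change items into per-table buckets, dropping the
    routing key ('table') so it is not loaded as a column."""
    tables = defaultdict(list)
    for item in items:
        name = item.pop("table")  # table name resolved dynamically per item
        tables[name].append(item)
    return dict(tables)

changes = [
    {"table": "orders", "id": 1, "op": "insert"},
    {"table": "customers", "id": 7, "op": "update"},
    {"table": "orders", "id": 2, "op": "delete"},
]
result = dispatch(changes)
print(sorted(result))    # ['customers', 'orders']
print(result["orders"])  # [{'id': 1, 'op': 'insert'}, {'id': 2, 'op': 'delete'}]
```

One LSN still covers all tables, since the routing happens per item after the messages are read from the slot.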
jorritsandbrink commented 3 months ago

@rudolfix I've addressed all your comments, please review.

To be done: increase dlt version in requirements.txt to include PR 1127.

rudolfix commented 2 months ago

@jorritsandbrink I finally enabled tests on our CI, and most of them are passing but:

jorritsandbrink commented 2 months ago

> @jorritsandbrink I finally enabled tests on our CI, and most of them are passing but:

  • this one consistently fails
with src_pl.sql_client() as c:
            qual_name = src_pl.sql_client().make_qualified_table_name("items")
            c.execute_sql(f"UPDATE {qual_name} SET foo = 'baz' WHERE id = 2;")
            c.execute_sql(f"DELETE FROM {qual_name} WHERE id = 2;")
        extract_info = dest_pl.extract(changes)
>       assert extract_info.asdict()["job_metrics"] == []
E       AssertionError: assert [{'created': ...e': 501, ...}] == []
E         Left contains one more item: {'created': 1713123289.1342065, 'extract_idx': 1, 'file_path': '/home/rudolfix/src/pipelines/_storage/.dlt/pipelines/d...lize/a93aba1460a1d099/1713123287.6514962/new_jobs/_dlt_pipeline_state.bfcae69f3b.0.typed-jsonl', 'file_size': 501, ...}
E         Full diff:
E           [
E         -  ,
E         +  {'created': 1713123289.1342065,
E         +   'extract_idx': 1,
E         +   'file_path': '/home/rudolfix/src/pipelines/_storage/.dlt/pipelines/dest_pl/normalize/a93aba1460a1d099/1713123287.6514962/new_jobs/_dlt_pipeline_state.bfcae69f3b.0.typed-jsonl',...
E
E         ...Full output truncated (6 lines hidden), use '-vv' to show
  • those require Postgres 15 while we are on 13 on CI. Maybe you could take a look? Is there a way to use the old syntax?
try:
>           cur.execute(
                f"ALTER PUBLICATION {esc_pub_name} ADD TABLES IN SCHEMA {esc_schema_name};"
            )
E           psycopg2.errors.SyntaxError: syntax error at or near "TABLES"
E           LINE 1: ALTER PUBLICATION "test_pub6589482f" ADD TABLES IN SCHEMA "s...

@rudolfix The first issue is actually also version-related. I was testing on Postgres 16 locally, but have been able to reproduce both issues on Postgres 13.

1) It seems Postgres 13 publishes "empty transactions" for updates/deletes when they are excluded via the `publish` publication parameter (e.g. when `publish = 'insert'`). Postgres 16 does not do this. As a result, we do find messages to process (the "empty transactions") after an update or delete, even though we told Postgres we're not interested in them. `last_commit_lsn` in resource state needs to be updated accordingly. An item gets extracted for `_dlt_pipeline_state` because of the state update, where our test asserted nothing is extracted. Solved by making the test more specific and asserting nothing gets extracted for the items table: https://github.com/dlt-hub/verified-sources/pull/392/commits/34610b634f29348d6362a2f26fa946ed3b93cf37

2) Not really feasible to make this work for older Postgres versions. I could fetch all tables from the schema and add them one by one, but that wouldn't accommodate the case where a table gets added to the schema later. To keep things clean, I introduced a Postgres version requirement for schema replication instead: https://github.com/dlt-hub/verified-sources/pull/392/commits/7a070453e93bed4c7863b5ee82f681dd12e909e5
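That version requirement can be enforced up front, before attempting `ALTER PUBLICATION ... ADD TABLES IN SCHEMA`. A sketch, assuming the version is checked via psycopg2's `connection.server_version`, which is an integer such as 150002 for 15.2 (the function name is illustrative, not the name used in the PR):

```python
def supports_schema_replication(server_version: int) -> bool:
    """ALTER PUBLICATION ... ADD TABLES IN SCHEMA needs Postgres >= 15.

    server_version is psycopg2's integer form, e.g. 130011 for 13.11.
    """
    return server_version >= 150000

# Postgres 13 on CI: feature unavailable, so fail fast with a clear error
# instead of hitting a SyntaxError from the server.
assert not supports_schema_replication(130011)
# Postgres 15/16: schema-level publications are supported.
assert supports_schema_replication(150002)
assert supports_schema_replication(160001)
```

Failing fast with an explicit "Postgres 15 required for schema replication" error is friendlier than surfacing the raw `psycopg2.errors.SyntaxError` seen on CI.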

rudolfix commented 2 months ago

@jorritsandbrink should we spawn another Postgres instance just to test replication? I do not want to make it too complicated and I'm totally fine with version 15.

jorritsandbrink commented 2 months ago

@rudolfix yes, using a separate instance for replication sounds good.